Reinforcement Learning as Probabilistic Inference - Part 3
From the optimal action conditionals, we recover the optimal policy through backward messages, relate it to value functions in RL, and connect probabilistic inference to maximum entropy reinforcement learning.
Recovering the Optimal Policy (cont.)
We ended part 2 with an introduction to the optimal action conditional $p(a_t \mid s_t, O_{1:T})$. While it is not quite the action conditional under the optimal policy, $\pi^*(a_t \mid s_t)$, that we are ultimately interested in, it is attainable from our PGM and closely related; both $p(a_t \mid s_t, O_{1:T})$ and $\pi^*(a_t \mid s_t)$ aim to select actions that lead to optimal outcomes. We will walk through deriving the optimal action conditional from our probabilistic graphical model, which will in the end reveal the subtle difference.
The sum-product inference algorithm
We will be using the standard sum-product inference algorithm, which is analogous to inference in Hidden Markov Models (HMMs) or Dynamic Bayesian Networks (DBNs) [1]. The sum-product algorithm computes marginal probabilities in a graphical model. In a sequential model like a DBN, it involves forward and backward passes: each message represents a partial computation of a marginal probability, based on the information available in the neighborhood of the node (see [2] for more detail). In the forward pass, messages propagate forward in time, capturing how past states influence the present. In the backward pass, messages propagate backward in time, capturing how future states constrain the present.
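As a concrete refresher on these two passes, here is a minimal sum-product sketch for a tiny discrete HMM. All numbers (transition, emission, and prior probabilities) are made up purely for illustration, and the variable names are not from any particular library.

```python
import numpy as np

# A tiny hidden Markov model with 2 hidden states and 3 observation symbols.
# All numbers here are made up purely for illustration.
T_mat = np.array([[0.7, 0.3],       # p(z_{t+1} | z_t)
                  [0.4, 0.6]])
E_mat = np.array([[0.5, 0.4, 0.1],  # p(x_t | z_t)
                  [0.1, 0.3, 0.6]])
prior = np.array([0.6, 0.4])        # p(z_1)
obs = [0, 2, 1]                     # observed symbol indices x_1..x_3

# Forward pass: alpha_t(z) = p(x_{1:t}, z_t = z)
alphas = [prior * E_mat[:, obs[0]]]
for x in obs[1:]:
    alphas.append((alphas[-1] @ T_mat) * E_mat[:, x])

# Backward pass: beta_t(z) = p(x_{t+1:T} | z_t = z)
betas = [np.ones(2)]
for x in reversed(obs[1:]):
    betas.insert(0, T_mat @ (E_mat[:, x] * betas[0]))

# Marginals p(z_t | x_{1:T}) combine forward and backward messages.
for t, (a, b) in enumerate(zip(alphas, betas), start=1):
    marg = a * b
    print(f"p(z_{t} | x_1:T) =", marg / marg.sum())
```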
The backward message
In our PGM, the backward message is

$$\beta_t(s_t, a_t) = p(O_{t:T} \mid s_t, a_t).$$
That is, the probability that a trajectory can be optimal for time steps $t$ through $T$ if it begins in state $s_t$ with the action $a_t$. Alternatively, the state-only message $\beta_t(s_t) = p(O_{t:T} \mid s_t)$ indicates the probability that the trajectory from $t$ to $T$ is optimal if it begins in state $s_t$. The state-only message can be derived from the state-action message by integrating out the action:

$$\beta_t(s_t) = \int \beta_t(s_t, a_t) \, p(a_t \mid s_t) \, da_t$$
* Note that $p(a_t \mid s_t)$ is the action prior. Our PGM doesn't actually contain this factor, and for simplicity we can assume that it is a constant corresponding to a uniform distribution over the set of actions.
The recursive message passing algorithm for computing $\beta_t(s_t, a_t)$ proceeds from the last time step $t = T$ backward through time to $t = 1$:
* In the base case, note that $\beta_T(s_T, a_T)$ is simply proportional to $p(O_T \mid s_T, a_T)$, since there is only one factor to consider.
The backward message can be computed recursively:

$$\beta_t(s_t, a_t) = p(O_t \mid s_t, a_t) \int \beta_{t+1}(s_{t+1}) \, p(s_{t+1} \mid s_t, a_t) \, ds_{t+1}$$
where
- $p(O_t \mid s_t, a_t)$: the probability that step $t$ is optimal given the state-action pair $(s_t, a_t)$.
- $p(s_{t+1} \mid s_t, a_t)$: the transition probability from $s_t$ to $s_{t+1}$ under action $a_t$.
The backward message tells us how likely it is to achieve high rewards from time $t$ onward, given the current state $s_t$. Combined with forward messages (prior dynamics), this allows the agent to infer the best action at any time step $t$.
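To make the recursion concrete, the following is a minimal sketch (not code from the post or from [5]) that computes the backward messages for a small tabular problem with made-up rewards and transitions, assuming the uniform action prior noted above and the usual $p(O_t \mid s_t, a_t) = \exp(r(s_t, a_t))$ optimality model from this framework.

```python
import numpy as np

# Hypothetical tabular problem: S states, A actions, horizon T (all made up).
S, A, T = 4, 2, 5
rng = np.random.default_rng(0)
reward = rng.normal(size=(S, A))              # r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = p(s' | s, a)

# Backward messages:
#   beta_sa[t, s, a] = p(O_{t:T} | s_t = s, a_t = a)
#   beta_s[t, s]     = sum_a p(a | s) * beta_sa[t, s, a]   (uniform action prior)
beta_sa = np.zeros((T, S, A))
beta_s = np.zeros((T, S))

# Base case: beta_T(s, a) = p(O_T | s, a) = exp(r(s, a))
beta_sa[T - 1] = np.exp(reward)
beta_s[T - 1] = beta_sa[T - 1].mean(axis=1)

# Recursion: beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(.|s,a)}[beta_{t+1}(s')]
for t in range(T - 2, -1, -1):
    expected_next = P @ beta_s[t + 1]         # shape (S, A)
    beta_sa[t] = np.exp(reward) * expected_next
    beta_s[t] = beta_sa[t].mean(axis=1)

# The next section uses these messages to recover p(a_t | s_t, O_{t:T}).
print(beta_sa[0])
```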
The optimal action conditional (finally!)
From these backward messages, we can then derive the optimal action conditional $p(a_t \mid s_t, O_{1:T})$. First, note that $a_t$ is conditionally independent of $O_{1:(t-1)}$ given $s_t$, which means that:

$$p(a_t \mid s_t, O_{1:T}) = p(a_t \mid s_t, O_{t:T})$$
and we can disregard the past when considering the current action distribution. This makes sense because in a Markovian system, the optimal action does not depend on the past.
We can recover the optimal action distribution using the two backward messages:

$$p(a_t \mid s_t, O_{t:T}) = \frac{p(O_{t:T} \mid s_t, a_t) \, p(a_t \mid s_t)}{p(O_{t:T} \mid s_t)} \propto \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)} \tag{1}$$
Backward message → value function, state-action value function
We can relate the backward messages in our PGM to the Q (state-action value) and value functions in the context of RL. We know that $\beta_t(s_t, a_t)$ represents the unnormalized probability of achieving optimality from $t$ onward, given $(s_t, a_t)$, and similarly $\beta_t(s_t)$ summarizes the unnormalized probability of achieving optimality from $t$ onward, given $s_t$. The logarithm of the backward messages maps to the state-action value function $Q(s_t, a_t) = \log \beta_t(s_t, a_t)$, defined as the expected cumulative reward starting at $(s_t, a_t)$, and the value function $V(s_t) = \log \beta_t(s_t)$, which is the expected cumulative reward starting at $s_t$, marginalizing over all actions.
Substituting the definitions of $Q(s_t, a_t)$ and $V(s_t)$ into the optimal action conditional (1):

$$p(a_t \mid s_t, O_{t:T}) = \exp\big(Q(s_t, a_t) - V(s_t)\big) \tag{2}$$
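To see (2) numerically, here is a sketch under the same made-up tabular setup as above that runs the backward recursion directly in log space, so $Q_t = \log \beta_t(s_t, a_t)$ and $V_t = \log \beta_t(s_t)$ follow a log-sum-exp style backup (with the constant uniform action prior folded into the normalization) and $\exp(Q - V)$ comes out as a proper distribution over actions.

```python
import numpy as np
from scipy.special import logsumexp

# Same kind of made-up tabular setup as the backward-pass sketch above.
S, A, T = 4, 2, 5
rng = np.random.default_rng(0)
reward = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = p(s' | s, a)

# Work directly in log space: Q_t(s, a) = log beta_t(s, a), V_t(s) = log beta_t(s).
Q = np.zeros((T, S, A))
V = np.zeros((T, S))

Q[T - 1] = reward                             # log exp(r(s, a)) = r(s, a)
V[T - 1] = logsumexp(Q[T - 1], axis=1)        # constant uniform action prior absorbed
for t in range(T - 2, -1, -1):
    # Q_t(s, a) = r(s, a) + log E_{s' ~ p(.|s,a)}[exp V_{t+1}(s')]
    Q[t] = reward + np.log(P @ np.exp(V[t + 1]))
    V[t] = logsumexp(Q[t], axis=1)

# Equation (2): p(a_t | s_t, O_{t:T}) = exp(Q - V), a proper distribution over actions.
policy_t0 = np.exp(Q[0] - V[0][:, None])
print(policy_t0.sum(axis=1))                  # each row sums to 1
```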
Tying back to RL
Exponent of the advantage
The quantity $Q(s_t, a_t) - V(s_t)$ in the exponent of (2) is the advantage of action $a_t$ over the baseline value $V(s_t)$. In the context of RL, the advantage is a measure of how much better or worse an action is compared to the average performance of the policy from a given state. It is used to evaluate the relative value of an action within a specific state. The advantage function is defined as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$
- If $A(s_t, a_t) > 0$, the action is better than average.
- If $A(s_t, a_t) < 0$, the action is worse than average.
- If $A(s_t, a_t) = 0$, the action is on par with the average.
By directly comparing $Q(s_t, a_t)$ to $V(s_t)$, the algorithm prioritizes actions that lead to higher rewards relative to the average, making it more efficient at learning an optimal policy. The advantage provides a balance between exploring actions with potential (positive advantage) and avoiding suboptimal actions (negative advantage).
Exponentiating the advantage converts it into a positive, weighted score where actions with higher advantages contribute exponentially more to the total score. The term $\exp(Q(s_t, a_t))$ alone might grow very large, but subtracting $V(s_t)$ normalizes the scale relative to the value of the state.
Then, when normalized, the exponential term becomes part of a probability distribution. This is the softmax function over $Q(s_t, a_t)$ values for all actions $a_t$, which assigns higher probabilities to actions with higher $Q(s_t, a_t)$:

$$p(a_t \mid s_t, O_{t:T}) = \frac{\exp(Q(s_t, a_t))}{\sum_{a'} \exp(Q(s_t, a'))} = \exp\big(Q(s_t, a_t) - V(s_t)\big)$$
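The normalization point can also be seen numerically: exponentiating large $Q$ values directly overflows, while subtracting $V(s_t) = \log \sum_{a'} \exp Q(s_t, a')$ first (the standard log-sum-exp trick) yields the same softmax stably. A small sketch with made-up numbers:

```python
import numpy as np
from scipy.special import logsumexp

q_values = np.array([1002.0, 1001.0, 998.0])   # made-up Q(s, a) for one state

# Naive softmax: exp(Q) overflows for large Q.
naive = np.exp(q_values)
print(naive)                                    # [inf inf inf], with an overflow warning

# Subtracting V = logsumexp(Q) first keeps everything in a safe range.
v = logsumexp(q_values)
policy = np.exp(q_values - v)                   # exp(advantage-like term)
print(policy, policy.sum())                     # well-defined, sums to 1
```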
Connecting to Maximum Entropy Reinforcement Learning
To jump to the conclusion, the optimal action conditional from our PGM equates to the optimal policy in the maximum entropy reinforcement learning framework [3]:

$$\pi^*(a_t \mid s_t) = \exp\big(Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t)\big)$$
where:
- $Q^*_{\mathrm{soft}}(s_t, a_t)$ is the optimal soft Q-function.
- $V^*_{\mathrm{soft}}(s_t)$ is the optimal soft value function, defined as $V^*_{\mathrm{soft}}(s_t) = \log \int \exp\big(Q^*_{\mathrm{soft}}(s_t, a_t)\big) \, da_t$. The softmax distribution ensures that the policy explores all actions probabilistically, while still favoring high-reward actions.
In maximum entropy reinforcement learning, the goal is to maximize both the expected reward and the entropy of the policy:

$$\pi^* = \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]$$
where:
- $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a_t \sim \pi}[\log \pi(a_t \mid s_t)]$ is the entropy of the policy, which measures the uncertainty in the policy's action selection.
- $\alpha$ controls the tradeoff between reward maximization and exploration.
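For intuition, in a one-step (bandit-like) case the objective reduces to $\mathbb{E}_{a \sim \pi}[r(a)] + \alpha \, \mathcal{H}(\pi)$, whose maximizer is the softmax policy $\pi(a) \propto \exp(r(a)/\alpha)$ with optimal value $\alpha \log \sum_a \exp(r(a)/\alpha)$. A minimal numerical sketch with made-up rewards, comparing it against a greedy policy:

```python
import numpy as np

def objective(policy, rewards, alpha):
    """Expected reward plus alpha-weighted policy entropy."""
    entropy = -np.sum(policy * np.log(policy + 1e-12))
    return np.sum(policy * rewards) + alpha * entropy

rewards = np.array([1.0, 0.8, 0.1])   # made-up one-step rewards r(a)
alpha = 0.5

greedy = np.array([1.0, 0.0, 0.0])    # deterministic argmax policy
soft = np.exp(rewards / alpha)
soft /= soft.sum()                    # softmax policy proportional to exp(r / alpha)

print("greedy objective:", objective(greedy, rewards, alpha))
print("softmax objective:", objective(soft, rewards, alpha))          # higher
print("soft optimal value:", alpha * np.log(np.sum(np.exp(rewards / alpha))))
```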
Deriving the entropy maximizing policy
[4] shows how soft policy iteration, which involves the exponent of the advantage, arrives at a policy that maximizes expected reward and entropy. [5] also presents an optimization-based approximate inference approach, variational inference, to approximate $p(a_t \mid s_t, O_{1:T})$ from our PGM, since exact inference of the optimal action conditional is intractable:

$$p(a_t \mid s_t, O_{1:T}) = \frac{p(s_t, a_t \mid O_{1:T})}{p(s_t \mid O_{1:T})}$$
Both the numerator and the denominator on the right hand side involve marginalizing over all possible sequences of states and actions, whose number grows exponentially with the time horizon $T$; integrating over all possible trajectories is computationally infeasible.
To apply variational inference to this problem, the goal is to fit a parameterized policy $\pi(a_t \mid s_t)$ such that the trajectory distribution

$$\hat{p}(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t) \, \pi(a_t \mid s_t)$$
matches the optimality-conditioned trajectory distribution (restricted to deterministic dynamics):

$$p(\tau \mid O_{1:T}) \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t) \exp\left( \sum_{t=1}^{T} r(s_t, a_t) \right)$$
We can therefore view the inference process as minimizing the KL divergence between these two trajectory distributions, which is given by:

$$D_{\mathrm{KL}}\big(\hat{p}(\tau) \,\|\, p(\tau \mid O_{1:T})\big) = -\mathbb{E}_{\tau \sim \hat{p}(\tau)}\big[\log p(\tau \mid O_{1:T}) - \log \hat{p}(\tau)\big]$$
Negating both sides and substituting in the expressions for $\hat{p}(\tau)$ and $p(\tau \mid O_{1:T})$, we get

$$-D_{\mathrm{KL}}\big(\hat{p}(\tau) \,\|\, p(\tau \mid O_{1:T})\big) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \hat{p}}\big[ r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big] + \mathrm{const}$$
This shows that minimizing the KL divergence corresponds to maximizing the expected reward and the expected conditional entropy. As we hinted at early on, this is different from the standard RL objective, where we only maximize reward, and it is thus referred to separately as maximum entropy reinforcement learning.
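A one-step numerical check of this correspondence (made-up rewards, and "dynamics" trivial enough that a trajectory is just a single action): with $p(a \mid O) \propto \exp(r(a))$, the negative KL divergence from any policy $\pi$ equals $\mathbb{E}_\pi[r] + \mathcal{H}(\pi)$ up to a constant $\log Z$ that does not depend on $\pi$.

```python
import numpy as np

rewards = np.array([1.0, 0.8, 0.1])           # made-up r(a) for a one-step problem
target = np.exp(rewards)
log_Z = np.log(target.sum())
target /= target.sum()                        # p(a | O) proportional to exp(r(a))

rng = np.random.default_rng(1)
policy = rng.dirichlet(np.ones(3))            # an arbitrary stochastic policy pi(a)

neg_kl = -np.sum(policy * np.log(policy / target))
expected_reward = np.sum(policy * rewards)
entropy = -np.sum(policy * np.log(policy))

# -KL(pi || p(.|O)) = E_pi[r] + H(pi) - log Z   (log Z is independent of pi)
print(neg_kl, expected_reward + entropy - log_Z)   # the two numbers match
```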
Conclusion
We have so far walked through formulating an RL problem as one that can be solved with probabilistic inference, arriving at a special variant of RL: maximum entropy reinforcement learning. In the bigger picture, the extensibility and compositionality of graphical models can likely be leveraged to produce more sophisticated reinforcement learning methods, and the framework of probabilistic inference can offer a powerful toolkit for deriving effective and convergent learning algorithms for the corresponding models.
A particularly exciting recent development is the intersection of maximum entropy reinforcement learning and latent variable models, where the graphical model for control as inference is augmented with additional variables for modeling time-correlated stochasticity for exploration [6], [7] or higher-level control through learned latent action spaces [8], [9].
References
- “Hidden Markov Model,” Wikipedia, 2024.
- M. I. Jordan, R. Diankov, and X. Liu, “Sum-Product, Max A Posteriori, Bayesians and Frequentists (9/16/04).”
- B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning.”
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” 2018, doi: 10.48550/arXiv.1801.01290.
- S. Levine, “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review,” 2018.
- K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an Embedding Space for Transferable Robot Skills,” in International Conference on Learning Representations, 2018.
- A. Gupta, R. Mendonca, Y. X. Liu, P. Abbeel, and S. Levine, “Meta-Reinforcement Learning of Structured Exploration Strategies,” 2018, doi: 10.48550/arXiv.1802.07245.
- T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement Learning with Deep Energy-Based Policies,” in Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 1352–1361.
- T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent Space Policies for Hierarchical Reinforcement Learning,” 2018, doi: 10.48550/arXiv.1804.02808.