Reinforcement Learning as Probabilistic Inference - Part 3

From the optimal action conditionals, we recover the optimal policy through backward messages, relate it to value functions in RL, and connect probabilistic inference to maximum entropy reinforcement learning.

Recovering the Optimal Policy (cont.)

We ended part 2 with an introduction to the optimal action conditional p(a_t \mid s_t, O_{t:T} = 1). While it is not quite the action conditional given the optimal policy p(a_t \mid s_t, \theta^\star) that we are interested in, it is attainable from our PGM and closely related: both p(a_t \mid s_t, O_{t:T} = 1) and p(a_t \mid s_t, \theta^\star) aim to select actions that lead to optimal outcomes. We will walk through deriving the optimal action conditional from our probabilistic graphical model, which will in the end reveal the subtle difference.

The sum-product inference algorithm

We will be using the standard sum-product inference algorithm, which is analogous to inference in Hidden Markov Models (HMMs) or dynamic Bayesian networks (DBNs) [1]. The sum-product algorithm computes marginal probabilities in a graphical model. In a sequential model like a DBN, it involves forward and backward passes: each message represents a partial computation of a marginal probability, based on the information available in the neighborhood of the node (more here [2]). In the forward pass, messages propagate forward in time, capturing how past states influence the present. In the backward pass, messages propagate backward in time, capturing how future states constrain the present.
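To make the message-passing picture concrete, here is a minimal numpy sketch of the forward and backward passes for a toy two-state HMM. The transition matrix, emission matrix, and observation sequence are made up purely for illustration, not taken from any reference.

```python
import numpy as np

# Minimal sum-product (forward-backward) sketch for a toy 2-state discrete HMM.
# All numbers here are made up for illustration.
T_mat = np.array([[0.7, 0.3],    # transition probabilities p(z_{t+1} | z_t)
                  [0.4, 0.6]])
E_mat = np.array([[0.9, 0.1],    # emission probabilities p(x_t | z_t)
                  [0.2, 0.8]])
prior = np.array([0.5, 0.5])     # p(z_1)
obs = [0, 1, 1, 0]               # an observation sequence x_{1:T}

T = len(obs)
alpha = np.zeros((T, 2))         # forward messages:  alpha_t(z) = p(x_{1:t}, z_t = z)
beta = np.zeros((T, 2))          # backward messages: beta_t(z) = p(x_{t+1:T} | z_t = z)

# Forward pass: past observations influence the present.
alpha[0] = prior * E_mat[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ T_mat) * E_mat[:, obs[t]]

# Backward pass: future observations constrain the present.
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = T_mat @ (E_mat[:, obs[t + 1]] * beta[t + 1])

# Combining both passes gives the smoothed marginals p(z_t | x_{1:T}).
posterior = alpha * beta
posterior /= posterior.sum(axis=1, keepdims=True)
print(posterior)
```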

The backward message

In our PGM, the backward message is

\beta_t(s_t, a_t) = p(O_{t:T} \mid s_t, a_t) \quad \text{or} \quad \beta_t(s_t) = p(O_{t:T} \mid s_t).

That is, the probability that the trajectory from time step t to T can be optimal if it begins in state s_t with action a_t. Alternatively, the state-only message \beta_t(s_t) indicates the probability that the trajectory from t to T is optimal if it begins in state s_t. The state-only message can be derived from the state-action message by integrating out the action:

\beta_t(s_t) = p(O_{t:T} \mid s_t) = \int_A p(O_{t:T} \mid s_t, a_t) \, p(a_t \mid s_t) \, da_t = \int_A \beta_t(s_t, a_t) \, p(a_t \mid s_t) \, da_t.

* Note that p(a_t \mid s_t) is the action prior. Our PGM doesn’t actually contain this factor, and for simplicity we can assume it is a constant corresponding to a uniform distribution over the set of actions, p(a_t \mid s_t) = \frac{1}{|A|}.

The recursive message-passing algorithm for computing \beta_t(s_t, a_t) proceeds from the last time step t = T backward through time to t = 1:

\begin{aligned} \fcolorbox{red}{white}{$\beta_t(s_t, a_t)$} &= p(O_{t:T} \mid s_t, a_t) \\ &= \int_{S} \fcolorbox{red}{white}{$\beta_{t+1}(s_{t+1})$} \, p(s_{t+1} \mid s_t, a_t) \, p(O_t \mid s_t, a_t) \, ds_{t+1}. \end{aligned}

* In the base case, note that p(O_T \mid s_T, a_T) is simply proportional to \exp(r(s_T, a_T)), since there is only one factor to consider.

Combining the two expressions above (and dropping the constant uniform action prior), the state-only backward message can be computed recursively:

\begin{aligned} \beta_t(s_t) &= \sum_{a_t} \sum_{s_{t+1}} \fcolorbox{blue}{white}{$p(O_t = 1 \mid s_t, a_t)$} \, \fcolorbox{green}{white}{$p(s_{t+1} \mid s_t, a_t)$} \, \beta_{t+1}(s_{t+1}) \end{aligned}

where the blue factor p(O_t = 1 \mid s_t, a_t) \propto \exp(r(s_t, a_t)) is the optimality likelihood and the green factor p(s_{t+1} \mid s_t, a_t) is the transition dynamics.

The backward message \beta_t(s_t) tells us how likely it is to achieve high rewards from time t onward, given the current state s_t. Combined with forward messages (prior dynamics), this allows the agent to infer the best action a_t at any time step t.
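To make the recursion concrete, here is a minimal tabular sketch in numpy (one possible implementation, not the only one). The small MDP, its reward table, and its transition tensor are made up; following the note above, the action prior is taken to be uniform, and p(O_t = 1 \mid s_t, a_t) is taken proportional to \exp(r(s_t, a_t)).

```python
import numpy as np

# Tabular backward-message recursion for the control-as-inference PGM.
# The MDP below (2 states, 2 actions, horizon 3) is made up for illustration.
S, A, T = 2, 2, 3
rng = np.random.default_rng(0)
reward = rng.normal(size=(S, A))                  # r(s, a)
P = rng.random(size=(S, A, S))                    # p(s' | s, a)
P /= P.sum(axis=2, keepdims=True)

action_prior = np.full(A, 1.0 / A)                # uniform p(a_t | s_t), as assumed above
optimality = np.exp(reward)                       # p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t))

beta_sa = np.zeros((T, S, A))                     # beta_t(s_t, a_t) = p(O_{t:T} | s_t, a_t)
beta_s = np.zeros((T, S))                         # beta_t(s_t)      = p(O_{t:T} | s_t)

# Base case at t = T: only the final optimality factor remains.
beta_sa[T - 1] = optimality
beta_s[T - 1] = beta_sa[T - 1] @ action_prior     # marginalize out a_T

# Recurse backward from t = T-1 down to t = 1.
for t in range(T - 2, -1, -1):
    # beta_t(s, a) = p(O_t = 1 | s, a) * sum_{s'} p(s' | s, a) beta_{t+1}(s')
    beta_sa[t] = optimality * (P @ beta_s[t + 1])
    # beta_t(s) = sum_a beta_t(s, a) p(a | s)
    beta_s[t] = beta_sa[t] @ action_prior

print(beta_sa[0], beta_s[0])
```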

The optimal action conditional (finally!)

From these backward messages, we can then derive the optimal action conditional p(a_t \mid s_t, O_{1:T}). First, note that O_{1:(t-1)} is conditionally independent of a_t given s_t, which means that:

p(a_t \mid s_t, O_{1:T}) = p(a_t \mid s_t, O_{t:T}),

and we can disregard the past when considering the current action distribution. This makes sense because in a Markovian system, the optimal action does not depend on the past.

We can recover the optimal action distribution using the two backward messages:

\begin{aligned} p(a_t \mid s_t, O_{t:T}) &= \frac{p(s_t, a_t \mid O_{t:T})}{p(s_t \mid O_{t:T})} \\ &= \frac{p(O_{t:T} \mid s_t, a_t) \, p(a_t \mid s_t) \, p(s_t)}{p(O_{t:T} \mid s_t) \, p(s_t)} &\quad \textcolor{gray}{\scriptsize \text{(by Bayes' rule)}} \\ &\propto \frac{p(O_{t:T} \mid s_t, a_t)}{p(O_{t:T} \mid s_t)} &\quad \textcolor{gray}{\scriptsize \text{(assume uniform } p(a_t \mid s_t) = \frac{1}{|A|} \text{, so } p(a_t \mid s_t) \text{ cancels out)}} \\ &= \fcolorbox{red}{white}{$\frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}$} \tag{1} \end{aligned}
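Given backward messages like those from the sketch above, recovering the action conditional is just a ratio. The message values below are placeholders, and the helper optimal_action_conditional is a hypothetical name; it keeps the action prior explicit, i.e. it implements the Bayes' rule step just before the proportionality in (1).

```python
import numpy as np

def optimal_action_conditional(beta_sa_t, beta_s_t, action_prior):
    """p(a_t | s_t, O_{t:T}) = beta_t(s_t, a_t) p(a_t | s_t) / beta_t(s_t); cf. (1)."""
    return beta_sa_t * action_prior[None, :] / beta_s_t[:, None]

# Placeholder messages for 2 states and 2 actions (made up; e.g. taken at some
# step t of a backward recursion like the sketch above).
beta_sa_t = np.array([[0.9, 0.3],
                      [0.2, 0.4]])
action_prior = np.array([0.5, 0.5])        # uniform p(a_t | s_t)
beta_s_t = beta_sa_t @ action_prior        # beta_t(s) = sum_a beta_t(s, a) p(a | s)

policy_t = optimal_action_conditional(beta_sa_t, beta_s_t, action_prior)
print(policy_t)                            # each row is a valid distribution over actions
assert np.allclose(policy_t.sum(axis=1), 1.0)
```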

Backward message ↔ Value function, State-action value function

We can relate the backward messages in our PGM to the Q value (state-action value) and value functions in the context of RL. \beta_t(s_t, a_t) = p(O_{t:T} \mid s_t, a_t) represents the unnormalized probability of achieving optimality from t onward, given (s_t, a_t), and similarly \beta_t(s_t) summarizes the unnormalized probability of achieving optimality from t onward, given s_t (with the base case \beta_T(s_T, a_T) = p(O_T \mid s_T, a_T) \propto \exp(r(s_T, a_T))). The logarithm of the backward messages maps to the state-action value function Q(s_t, a_t), defined as the expected cumulative reward starting at (s_t, a_t), and the value function V(s_t), the expected cumulative reward starting at s_t, marginalizing over all actions:

Q(s_t, a_t) = \log \beta_t(s_t, a_t), \quad V(s_t) = \log \beta_t(s_t).

Substituting the definitions of Q(s_t, a_t) and V(s_t) into the optimal action conditional (1):

p(a_t \mid s_t, O_{t:T}) = \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)} = \frac{\exp(Q(s_t, a_t))}{\exp(V(s_t))} = \fcolorbox{red}{white}{$\exp(Q(s_t, a_t) - V(s_t))$}. \tag{2}
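In log space, (2) is simply a softmax over Q values. A small sketch with made-up Q values for a single state:

```python
import numpy as np

# Made-up Q values for a single state and three actions.
Q = np.array([1.0, 0.5, -0.2])                   # Q(s_t, a) = log beta_t(s_t, a)

# Soft value: V(s_t) = log sum_a exp(Q(s_t, a)), computed with a max shift for stability.
# (This equals log beta_t(s_t) up to the constant uniform action prior, which only
#  shifts V and drops out once the policy is normalized.)
V = Q.max() + np.log(np.exp(Q - Q.max()).sum())

policy = np.exp(Q - V)                           # equation (2): exp(Q(s_t, a) - V(s_t))
print(policy, policy.sum())                      # a softmax over Q; sums to 1
```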

Tying back to RL

Exponent of the advantage

The value \exp(Q(s_t, a_t) - V(s_t)) in (2) is the exponent of the advantage of action a_t over the baseline value V(s_t). In the context of RL, the advantage is a measure of how much better or worse an action is compared to the average performance of the policy from a given state. It is used to evaluate the relative value of an action within a specific state. The advantage function is defined as:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t).

By directly comparing Q(s_t, a_t) to V(s_t), the algorithm prioritizes actions that lead to higher rewards relative to the average, making it more efficient at learning an optimal policy. The advantage provides a balance between exploring actions with potential (positive advantage) and avoiding suboptimal actions (negative advantage).

The exponent of the advantage \exp(Q(s_t, a_t) - V(s_t)) converts the advantage into a positive, weighted score where actions with higher advantages contribute exponentially more to the total score. The term \exp(Q(s_t, a_t)) alone might grow very large, but subtracting V(s_t) normalizes the scale relative to the value of the state.

When normalized, the exponential term becomes a probability distribution: the softmax over the Q(s_t, a_t) values of all actions a_t, which assigns higher probabilities to actions with higher Q(s_t, a_t):

\pi(a_t \mid s_t) = \frac{\exp(Q(s_t, a_t))}{\sum_{a'} \exp(Q(s_t, a'))} = \exp(Q(s_t, a_t) - V(s_t)), \quad \text{where } V(s_t) = \log \sum_{a'} \exp(Q(s_t, a')).

Connecting to Maximum Entropy Reinforcement Learning

To jump to the conclusion, the optimal action conditional from our PGM equates to the optimal policy \pi^\star(a_t \mid s_t) in the maximum entropy reinforcement learning framework [3]:

\pi^\star(a_t \mid s_t) = \exp(Q^\star(s_t, a_t) - V^\star(s_t)),

where Q^\star(s_t, a_t) and V^\star(s_t) are the state-action value function and value function of the optimal policy.

In maximum entropy reinforcement learning, the goal is to maximize both the expected reward and the entropy of the policy:

\max_\pi \mathbb{E} \left[ \sum_{t=1}^T r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right],

where H(\pi(\cdot \mid s_t)) is the entropy of the policy at state s_t and \alpha is a temperature parameter that trades off reward maximization against entropy.
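As a toy illustration of this trade-off, consider a single-step, single-state problem (a bandit), where the entropy-regularized optimum is a softmax over rewards with temperature \alpha. The rewards and temperatures below are made up:

```python
import numpy as np

# Toy single-step, single-state example (a bandit) with made-up rewards.
r = np.array([1.0, 0.8, 0.1])

def entropy(p):
    return -np.sum(p * np.log(p))

def objective(p, alpha):
    """Expected reward plus alpha-weighted policy entropy (the objective above, with T = 1)."""
    return np.dot(p, r) + alpha * entropy(p)

for alpha in [0.1, 1.0, 10.0]:
    # For a one-step problem the entropy-regularized optimum is a softmax with temperature alpha.
    pi = np.exp(r / alpha)
    pi /= pi.sum()
    print(alpha, np.round(pi, 3), round(objective(pi, alpha), 3))
# Small alpha -> nearly greedy on reward; large alpha -> nearly uniform (high entropy).
```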

Deriving the entropy maximizing policy

[4] shows how soft policy iteration, which involves the exponent of the advantage, reaches a policy that maximizes expected reward and entropy. [5] also presents an optimization-based approximate inference approach, variational inference, to approximate p(a_t \mid s_t, O_{t:T}) from our PGM, since exact inference of the optimal action conditional is intractable:

p(a_t \mid s_t, O_{t:T}) \propto \textcolor{red}{\frac{p(O_{t:T} \mid s_t, a_t)}{p(O_{t:T} \mid s_t)}}.

Both the numerator and the denominator on the right-hand side involve rolling out all possible sequences of states and actions, i.e., integrating over all possible trajectories, which grows exponentially with the time horizon T and is computationally infeasible.

To apply variational inference to this problem, the goal is to fit a parameterized policy \pi(a_t \mid s_t) such that the trajectory distribution

\hat{p}(\tau) \propto \mathbb{1}[p(\tau) \neq 0] \prod_{t=1}^T \pi(a_t \mid s_t)

matches the distribution (restricted to deterministic dynamics):

p(\tau \mid O_{1:T}) \propto \mathbb{1}[p(\tau) \neq 0] \exp \left( \sum_{t=1}^T r(s_t, a_t) \right).

We can therefore view the inference process as minimizing the KL divergence between these two trajectory distributions, which is given by:

D_{\text{KL}}(\hat{p}(\tau) \parallel p(\tau)) = -\mathbb{E}_{\tau \sim \hat{p}(\tau)} \left[ \log p(\tau) - \log \hat{p}(\tau) \right].

Negating both sides and substituting in the expressions for p(\tau) and \hat{p}(\tau), we get

\begin{aligned} -D_{\text{KL}}(\hat{p}(\tau) \parallel p(\tau)) &= \mathbb{E}_{\tau \sim \hat{p}(\tau)} \Bigg[ \log p(s_1) + \sum_{t=1}^T \big( \log p(s_{t+1} \mid s_t, a_t) + r(s_t, a_t) \big) \\ &\quad \quad - \log p(s_1) - \sum_{t=1}^T \big( \log p(s_{t+1} \mid s_t, a_t) + \log \pi(a_t \mid s_t) \big) \Bigg] \\ &= \mathbb{E}_{\tau \sim \hat{p}(\tau)} \left[ \sum_{t=1}^T \big( r(s_t, a_t) - \log \pi(a_t \mid s_t) \big) \right] \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p}(s_t, a_t)} \big[ r(s_t, a_t) - \log \pi(a_t \mid s_t) \big] \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p}(s_t, a_t)} \fcolorbox{red}{white}{$\big[ r(s_t, a_t) \big]$} + \mathbb{E}_{s_t \sim \hat{p}(s_t)} \fcolorbox{blue}{white}{$\big[ H(\pi(a_t \mid s_t)) \big]$}. \end{aligned}

This shows that minimizing the KL divergence corresponds to maximizing the expected reward and the expected conditional entropy. As we hinted at early on, this differs from the standard RL objective, where we only maximize reward, and it is therefore separately referred to as maximum entropy reinforcement learning.
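As a sanity check of the last equality, the sketch below enumerates trajectories of a made-up two-step deterministic MDP and confirms that the expected reward-minus-log-policy term equals the per-step expected reward plus expected policy entropy. Like the derivation above, it treats p(\tau) only up to its normalization constant.

```python
import numpy as np
from itertools import product

# Tiny deterministic MDP (2 states, 2 actions, horizon 2); all numbers are made up.
S, A, T = 2, 2, 2
next_state = np.array([[0, 1],            # next_state[s, a] = s'
                       [1, 0]])
reward = np.array([[1.0, 0.2],
                   [0.5, 0.3]])
pi = np.array([[0.7, 0.3],                # an arbitrary stochastic policy pi(a | s)
               [0.4, 0.6]])
s1 = 0

# Left-hand side: E_{tau ~ p_hat}[ sum_t r(s_t, a_t) - log pi(a_t | s_t) ],
# i.e. the negative KL to the unnormalized p(tau) ∝ exp(sum_t r), as in the derivation.
lhs = 0.0
for actions in product(range(A), repeat=T):   # enumerate all trajectories
    s, prob, ret, logq = s1, 1.0, 0.0, 0.0
    for a in actions:
        prob *= pi[s, a]
        logq += np.log(pi[s, a])
        ret += reward[s, a]
        s = next_state[s, a]
    lhs += prob * (ret - logq)

# Right-hand side: sum_t E[r(s_t, a_t)] + E[H(pi(. | s_t))].
state_dist = np.zeros(S)
state_dist[s1] = 1.0
rhs = 0.0
for t in range(T):
    rhs += state_dist @ (pi * reward).sum(axis=1)          # expected reward at step t
    rhs += state_dist @ (-(pi * np.log(pi)).sum(axis=1))   # expected policy entropy at step t
    # Push the state distribution one step forward through the deterministic dynamics.
    new_dist = np.zeros(S)
    for s in range(S):
        for a in range(A):
            new_dist[next_state[s, a]] += state_dist[s] * pi[s, a]
    state_dist = new_dist

print(lhs, rhs)                                            # the two sides agree
assert np.isclose(lhs, rhs)
```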

Conclusion

We have so far walked through formulating an RL problem as one that can be solved with probabilistic inference, arriving at a special variation of RL: maximum entropy reinforcement learning. In the bigger picture, the extensibility and compositionality of graphical models can likely be leveraged to produce more sophisticated reinforcement learning methods, and the framework of probabilistic inference can offer a powerful toolkit for deriving effective and convergent learning algorithms for the corresponding models.

A particularly exciting recent development is the intersection of maximum entropy reinforcement learning and latent variable models, where the graphical model for control as inference is augmented with additional variables for modeling time-correlated stochasticity for exploration [6], [7] or higher-level control through learned latent action spaces [8], [9].

References

  1. “Hidden Markov Model,” Wikipedia, 2024.
  2. M. I. Jordan, R. Diankov, and X. Liu, “Sum-Product, Max A Posteriori, Bayesians and Frequentists (9/16/04).”
  3. B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning.”
  4. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” 2018, doi: 10.48550/arXiv.1801.01290.
  5. S. Levine, “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review,” 2018.
  6. K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an Embedding Space for Transferable Robot Skills,” in International Conference on Learning Representations, 2018.
  7. A. Gupta, R. Mendonca, Y. X. Liu, P. Abbeel, and S. Levine, “Meta-Reinforcement Learning of Structured Exploration Strategies,” 2018, doi: 10.48550/arXiv.1802.07245.
  8. T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement Learning with Deep Energy-Based Policies,” in Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 1352–1361.
  9. T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent Space Policies for Hierarchical Reinforcement Learning,” 2018, doi: 10.48550/arXiv.1804.02808.