Reinforcement Learning as Probabilistic Inference - Part 2

Building on part 1, we establish decision-making as a probabilistic graphical model, connect RL to a trajectory prediction problem, and formulate optimal trajectory prediction as probabilistic inference.

Decision Making as a Probabilistic Graphical Model (PGM)

Decision-making problems (formalized as RL or optimal control) can be viewed as a generalized Probabilistic Graphical Model (PGM) augmented with utilities or rewards, which are treated as extrinsic signals. PGMs are a mathematical framework for representing and reasoning about uncertain systems. They use a graph-based representation to compactly encode a complex distribution over a high-dimensional space. It turns out that the graph as a representation of a set of independencies, and the graph as a skeleton for factorizing a distribution, are, in a deep sense, equivalent! In this graphical representation, illustrated in the figure below, the nodes (or ovals) correspond to the variables in our domain, and the edges correspond to direct probabilistic interactions between them [1].

Figure 1: Different perspectives on probabilistic graphical models: top — the graphical representation; middle — the independencies induced by the graph structure; bottom — the factorization induced by the graph structure. (a) A sample Bayesian network. (b) A sample Markov network. [1]

[1] describes two families of graphical representations of distributions: Bayesian networks, which use directed acyclic graphs, and Markov networks, which use undirected graphs. Both concisely represent the independence properties and the factorization structure of a distribution.

Conditional independencies among variables are encoded as statements of the form $(X \perp Y \mid Z)$; for a Bayesian network, each variable is independent of its non-descendants given its parents.

Factorization, i.e. how the joint probability distribution breaks into smaller, simpler components, is explicitly represented as a product of local factors; for a Bayesian network over variables $x_1, \dots, x_n$,

$$ p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid \mathrm{Pa}(x_i)), $$

where $\mathrm{Pa}(x_i)$ denotes the parents of $x_i$ in the graph.

In the framework of PGMs, it is sufficient to write down the model and pose the question; tasks such as parameter estimation, marginalization, or conditional probability computation then arise directly from the graphical structure.
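To make this concrete, here is a minimal sketch (not from [1]; the network and CPD numbers are hypothetical) of a three-node chain Bayesian network $A \to B \to C$ in Python. The graph induces the factorization $p(a, b, c) = p(a)\,p(b \mid a)\,p(c \mid b)$, and a conditional query such as $p(C = 1 \mid A = 1)$ is answered directly from that factorization by enumeration.

```python
import itertools

# Hypothetical CPDs for a chain Bayesian network A -> B -> C (all variables binary).
p_A = {0: 0.7, 1: 0.3}
p_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_B_given_A[a][b]
p_C_given_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p_C_given_B[b][c]

def joint(a, b, c):
    """Factorization induced by the graph: p(a, b, c) = p(a) p(b | a) p(c | b)."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Conditional query p(C = 1 | A = 1) by summing the joint over the hidden variable B.
numerator = sum(joint(1, b, 1) for b in (0, 1))
denominator = sum(joint(1, b, c) for b, c in itertools.product((0, 1), repeat=2))
print("p(C=1 | A=1) =", numerator / denominator)
```

For larger models, exact enumeration is replaced by inference algorithms such as variable elimination, but the principle is the same: pose a query against the factorized model.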

A common approach to solving decision-making problems is to model the underlying dynamics as a probabilistic model, e.g. with model-based RL, and then determine an optimal plan or policy in a separate process, e.g. trajectory optimization. In this frame, determining an optimal course of action or an optimal decision-making strategy (a policy) is treated as a fundamentally different type of problem from probabilistic inference.

Reinforcement Learning ↔ Trajectory Prediction

Is it possible to cast the decision-making problem as an inference problem using a particular type of PGM? [2] claims it is possible to formulate the two so that they return the same result: a trajectory following the policy we get from solving the sequential decision-making task with RL, and an optimal trajectory obtained by conditioning a probabilistic model trained for trajectory prediction.

The Decision Making Problem - standard RL formulation

We will use $s \in S$ to denote states and $a \in A$ to denote actions, which may each be discrete or continuous. States evolve according to the stochastic dynamics $p(s_{t+1} \mid s_t, a_t)$, which are in general unknown. We will follow a discrete-time, finite-horizon derivation with horizon $T$, and omit discount factors for now. A task in this framework is defined by a reward function $r(s_t, a_t)$. Solving a task typically involves recovering a policy $p(a_t \mid s_t, \theta)$, which specifies a distribution over actions conditioned on the state and parameterized by a parameter vector $\theta$.
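To keep the notation concrete in what follows, here is a minimal sketch (the MDP and its numbers are made up, not from [2]) of these ingredients for a two-state, two-action MDP: a transition tensor for $p(s_{t+1} \mid s_t, a_t)$, a reward table $r(s_t, a_t)$, and a softmax policy $p(a_t \mid s_t, \theta)$ parameterized by a table of logits $\theta$.

```python
import numpy as np

n_states, n_actions = 2, 2   # hypothetical tiny MDP

# Dynamics p(s_{t+1} | s_t, a_t): shape (S, A, S'), each row sums to 1 (numbers made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])

# Reward r(s_t, a_t): shape (S, A).
r = np.array([[0.0, 1.0],
              [0.5, 2.0]])

def policy(theta):
    """Softmax policy p(a_t | s_t, theta), where theta is an (S, A) table of logits."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))  # stabilized softmax
    return z / z.sum(axis=1, keepdims=True)

theta = np.zeros((n_states, n_actions))  # uniform policy to start with
print(policy(theta))                     # each row is p(a | s)
```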

A standard reinforcement learning policy search problem is then given by the following maximization:

$$ \theta^\star = \textcolor{blue}{\underbrace{\textcolor{black}{\arg\max_\theta}}_{\text{Vector of policy parameters}}} \textcolor{blue}{\underbrace{\textcolor{black}{\sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p(s_t, a_t \mid \theta)}\left[ r(s_t, a_t) \right]}}_{\text{Expected rewards over trajectories}}} $$

We are trying to find the vector of policy parameters $\theta$ that maximizes the expected reward $\sum_t r(s_t, a_t)$, with states and actions drawn according to the policy at each step $t = 1$ to $T$ of the trajectory. Let's express this expectation in terms of a distribution over trajectories. Starting with:

$$ \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p(s_t, a_t \mid \theta)} \left[ \dots \right], $$

we can rewrite the expectation as a sum over all possible state-action sequences from $t = 1$ to $T$:

$$ \sum_{(s_1, a_1), \dots, (s_T, a_T)} p(s_1, a_1, \dots, s_T, a_T \mid \theta) \left[ \dots \right]. $$

Recognizing that the sequence $(s_1, a_1, \dots, s_T, a_T)$ represents a trajectory $\tau$, we can express this as:

$$ \sum_{\tau} p(\tau) \left[ \dots \right], $$

where:

$$ \begin{aligned} p(\tau) &= p(s_1, a_1, \dots, s_T, a_T \mid \theta) \\ &= p(s_1) \prod_{t=1}^T \textcolor{blue}{\underbrace{\textcolor{black}{p(a_t \mid s_t, \theta)}}_{\text{Action likelihood}}} \, \textcolor{blue}{\underbrace{\textcolor{black}{p(s_{t+1} \mid s_t, a_t)}}_{\text{Transition likelihood}}}. \end{aligned} $$

This gives the final expression we are trying to optimize in a standard reinforcement learning policy search problem:

$$ \theta^\star = \textcolor{blue}{\underbrace{\textcolor{black}{\arg\max_\theta}}_{\text{Vector of policy parameters}}} \textcolor{blue}{\underbrace{\textcolor{black}{\sum_{\tau} p(\tau) \left[ \sum_{t=1}^T r(s_t, a_t) \right]}}_{\text{Expected total reward over trajectories}}} \tag{1} $$
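Equation (1) can be checked by brute force on a tiny MDP. The sketch below (same hypothetical two-state MDP as above; nothing here comes from [2]) enumerates every trajectory $\tau = (s_1, a_1, \dots, s_T, a_T)$, computes $p(\tau)$ from the policy and transition factors, and sums $p(\tau) \sum_t r(s_t, a_t)$.

```python
import itertools
import numpy as np

# Same hypothetical MDP shapes as above: dynamics P[s, a, s'], reward r[s, a],
# initial-state distribution p1[s], and a softmax policy with logits theta[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 2.0]])
p1 = np.array([1.0, 0.0])
T = 3

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def expected_return(theta):
    """Equation (1): sum over all trajectories of p(tau) * sum_t r(s_t, a_t)."""
    pi = policy(theta)
    total = 0.0
    # A trajectory tau = (s_1, a_1, ..., s_T, a_T); enumerate all of them.
    for states in itertools.product(range(2), repeat=T):
        for actions in itertools.product(range(2), repeat=T):
            p_tau = p1[states[0]]
            for t in range(T):
                p_tau *= pi[states[t], actions[t]]         # action likelihood
                if t + 1 < T:
                    p_tau *= P[states[t], actions[t], states[t + 1]]  # transition likelihood
            total += p_tau * sum(r[s, a] for s, a in zip(states, actions))
    return total

theta = np.zeros((2, 2))
print("expected total reward:", expected_return(theta))
```

This enumeration is exponential in $T$ and only works for toy problems; policy gradient methods estimate the same expectation from sampled trajectories instead.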

Formulating the Decision-Making PGM

The next question we have to ask is: how can we formulate a probabilistic graphical model such that inferring the posterior action conditional $p(a_t \mid s_t, \theta)$ gives us the optimal policy?

We introduce an additional binary random variable $O_t$, called the optimality variable, into this model, where $O_t = 1$ denotes that time step $t$ is optimal and $O_t = 0$ denotes that it is not.

Figure 2: The PGM for RL, augmented with optimality variables.
We formulate RL as a PGM with optimality variables so that we can infer the most probable action sequence, or the most probable action distributions, while conditioning on the optimality variables being true. [2]

We choose the distribution over the optimality variable as follows:

$$ \textcolor{blue}{\underbrace{\textcolor{black}{p(O_t = 1 \mid s_t, a_t)}}_{\text{Unnormalized probability of time step } t \text{ being optimal}}} = \textcolor{red}{\underbrace{\textcolor{black}{\exp(r(s_t, a_t))}}_{\text{Proportional to the exponential of the reward}}}. $$

This seems a bit arbitrary; why are we taking the exponential of the reward?

This form resembles the Boltzmann distribution in statistical mechanics, where probabilities are proportional to $\exp(-E / T)$, with $E$ as energy and $T$ as temperature. Similarly, in reinforcement learning (RL), the reward $r(s_t, a_t)$ plays a role analogous to negative energy, and taking $\exp(r(s_t, a_t))$ ensures that higher rewards correspond to higher probabilities.

What this is really getting at is the following: if $p(O_t = 1 \mid s_t, a_t)$ represents the probability that time step $t$ is optimal, then the reward $r(s_t, a_t)$ can be interpreted as a log-probability:

$$ \log p(O_t = 1 \mid s_t, a_t) \propto r(s_t, a_t). $$

Exponentiating ensures that these (unnormalized) probabilities are non-negative. This approach is consistent with softmax-based action selection in RL, where the probability of choosing an action is proportional to $\exp(r(s_t, a_t))$; it allows smooth exploration by providing a probabilistic weighting over possible actions.
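As a small illustration of this softmax weighting (a sketch with made-up rewards, not taken from [2]), the snippet below turns a vector of rewards into a Boltzmann distribution over actions; a temperature parameter controls how greedy the selection is.

```python
import numpy as np

def boltzmann(rewards, temperature=1.0):
    """p(a) proportional to exp(r(s, a) / temperature), computed stably."""
    logits = np.asarray(rewards, dtype=float) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rewards = [1.0, 2.0, 0.5]                  # hypothetical r(s, a) for three actions
print(boltzmann(rewards, temperature=1.0)) # smooth preference for the best action
print(boltzmann(rewards, temperature=0.1)) # nearly greedy
```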

Planning = Inferring the optimal trajectory

From our PGM formulation with the optimality variable, we can infer the posterior probability of a trajectory given optimality at all steps of the trajectory. Given a policy (we haven't optimized the policy parameters yet), if we would like to plan an optimal action sequence starting from some initial state $s_1$, we can condition on $o_{1:T}$ and choose $p(s_1) = \delta(s_1)$. In other words, maximum a posteriori inference corresponds to a kind of planning problem!

$$ \begin{aligned} \fcolorbox{blue}{white}{$p(\tau \mid o_{1:T})$} &\propto p(\tau, o_{1:T}) \\ &= p(s_1) \prod_{t=1}^T p(O_t = 1 \mid s_t, a_t)\, p(s_{t+1} \mid s_t, a_t) \\ &= p(s_1) \prod_{t=1}^T \exp(r(s_t, a_t))\, p(s_{t+1} \mid s_t, a_t) \\ &= \fcolorbox{green}{white}{$p(s_1) \prod_{t=1}^T p(s_{t+1} \mid s_t, a_t)$}\; \fcolorbox{red}{white}{$\exp \left( \sum_{t=1}^T r(s_t, a_t) \right)$}. \end{aligned} $$

* We are not directly concerned with the normalizing constant (the marginal $p(o_{1:T})$); the unnormalized joint probability is all we need here.

In the case where the dynamics are deterministic, the first term is a constant for all trajectories that are dynamically feasible:

$$ p(\tau \mid o_{1:T}) \propto \fcolorbox{red}{white}{$\mathbb{1}[p(\tau) \neq 0]$} \exp \left( \sum_{t=1}^T r(s_t, a_t) \right). $$

Here, the indicator function simply states that the trajectory $\tau$ is dynamically consistent (meaning that $p(s_{t+1} \mid s_t, a_t) \neq 0$ at every step) and that the initial state is correct. So the probability of a trajectory, given that all steps are optimal, is exponentially proportional to the sum of its rewards. In other words, the trajectory with the highest cumulative reward is the most probable trajectory, which is exactly what we want.
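To see this concretely, here is a minimal sketch for a hypothetical deterministic MDP (nothing below comes from [2]). With deterministic dynamics, each action sequence starting from $s_1$ produces exactly one feasible trajectory, so weighting action sequences by $\exp\left(\sum_t r(s_t, a_t)\right)$ and normalizing gives the trajectory posterior, and the highest-reward trajectory is the most probable one.

```python
import itertools
import numpy as np

# Hypothetical deterministic MDP: next_state[s, a] and reward r[s, a], horizon T.
next_state = np.array([[0, 1],
                       [1, 0]])
r = np.array([[0.0, 1.0],
              [2.0, 0.5]])
T, s1 = 3, 0

def rollout(actions):
    """Unroll the deterministic dynamics from s1 for a given action sequence."""
    s, states, total_r = s1, [], 0.0
    for a in actions:
        states.append(s)
        total_r += r[s, a]
        s = next_state[s, a]
    return states, total_r

# p(tau | o_{1:T}) is proportional to exp(sum_t r(s_t, a_t)) over feasible trajectories.
action_seqs = list(itertools.product(range(2), repeat=T))
weights = np.array([np.exp(rollout(seq)[1]) for seq in action_seqs])
posterior = weights / weights.sum()

best = action_seqs[int(np.argmax(posterior))]
print("most probable (optimal) action sequence:", best)
```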

Recovering the Optimal Policy

Now we know how to find the optimal plan given a policy, but we are not done yet. Remember from equation (1) that we are trying to find the optimal policy, the one that maximizes the expected cumulative reward across all possible trajectories.

Let's start by writing down the optimal action conditional within our probabilistic graphical model. This still assumes a fixed policy that is not necessarily optimal:

$$ p(a_t \mid s_t, O_{t:T} = 1) $$

Before we go ahead, we want to distinguish between the optimal action conditional above and the action conditional given the optimal policy (which is the expected outcome of the standard RL frame), since the optimal action conditional is independent of the parameter $\theta$:

$$ p(a_t \mid s_t, \theta^\star) $$

Nevertheless, the optimal action conditional and the action conditional given the optimal policy both aim to select actions that lead to optimal outcomes and are therefore closely related. We will continue deriving the optimal action conditional in part 3!

References

  1. D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2009.
  2. S. Levine, “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review,” arXiv preprint arXiv:1805.00909, 2018.