Definition
A discrete-time partially observable Markov decision process (POMDP) models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S,A,T,R,\Omega,O,\gamma)$, where
- $S$ is a set of states;
- $A$ is a set of actions;
- $T$ is a set of conditional transition probabilities between states;
- $R: S \times A \to \mathbb{R}$ is the reward function;
- $\Omega$ is a set of observations;
- $O$ is a set of conditional observation probabilities;
- $\gamma \in [0, 1]$ is the discount factor.
At each time period, the environment is in some state $s\in S$. The agent takes an action $a\in A$, which causes the environment to transition to state $s'$ with probability $T(s'\mid s,a)$. At the same time, the agent receives an observation $o\in \Omega$ which depends on the new state of the environment, $s'$, and on the action just taken, $a$, with probability $O(o\mid s',a)$. Finally, the agent receives a reward $r$ equal to $R(s, a)$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: $\mathbb{E}\left[\sum_{t=0}^{\infty}{\gamma}^t r_t\right]$, where $r_t$ is the reward earned at time $t$. The discount factor $\gamma$ determines how much immediate rewards are favored over more distant rewards.
When $\gamma =0$ the agent only cares about which action will yield the largest expected immediate reward; when $\gamma =1$ the agent cares about maximizing the expected sum of future rewards.
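To make the interaction loop concrete, the following sketch encodes a hypothetical two-state POMDP as plain Python dictionaries and accumulates the discounted return $\sum_{t}\gamma^t r_t$ for a randomly acting agent. The state, action, and observation names, the particular probabilities, and the `simulate_episode` helper are illustrative assumptions, not part of the formal definition.

```python
import random

# A hypothetical two-state POMDP; all names and numbers are illustrative.
S = ["s0", "s1"]                       # states
A = ["stay", "switch"]                 # actions
Omega = ["obs0", "obs1"]               # observations
gamma = 0.95                           # discount factor

# T[(s, a)] is the distribution over next states s' given (s, a).
T = {
    ("s0", "stay"):   {"s0": 0.9, "s1": 0.1},
    ("s0", "switch"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"):   {"s0": 0.1, "s1": 0.9},
    ("s1", "switch"): {"s0": 0.8, "s1": 0.2},
}
# O[(s_next, a)] is the distribution over observations o given (s', a).
O = {
    ("s0", "stay"):   {"obs0": 0.85, "obs1": 0.15},
    ("s0", "switch"): {"obs0": 0.85, "obs1": 0.15},
    ("s1", "stay"):   {"obs0": 0.15, "obs1": 0.85},
    ("s1", "switch"): {"obs0": 0.15, "obs1": 0.85},
}
# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 1.0, ("s0", "switch"): -0.5,
    ("s1", "stay"): -1.0, ("s1", "switch"): 0.5,
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def simulate_episode(horizon=50):
    """Run the agent-environment loop and return the (truncated) discounted return."""
    s = random.choice(S)               # hidden initial state
    discounted_return = 0.0
    for t in range(horizon):
        a = random.choice(A)           # placeholder policy: act uniformly at random
        r = R[(s, a)]                  # reward depends on the current state and action
        s_next = sample(T[(s, a)])     # environment transitions to s'
        o = sample(O[(s_next, a)])     # the agent only sees the observation o, never s'
        discounted_return += gamma ** t * r
        s = s_next
    return discounted_return

print(simulate_episode())
```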
Belief MDP
The belief $b(s)$ of the agent is a function describing the probability of being in state $s$ at a given time. After having taken the action $a$ and observed $o$, the agent needs to update its belief about the state the environment may be in. Since the state is Markovian (by assumption), maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted $b' = \tau(b,a,o)$. Below we describe how this belief update is computed.
After reaching $s'$, the agent observes $o\in \Omega$ with probability $O(o\mid s',a)$. Let $b$ be a probability distribution over the state space $S$: $b(s)$ denotes the probability that the environment is in state $s$. Given $b(s)$, after taking action $a$ and observing $o$, the updated belief is
$b'(s') = \eta\, O(o\mid s',a)\sum_{s\in S}T(s'\mid s,a)\,b(s),$
where $\eta =\frac{1}{\mathbb{P}(o\mid b,a)}$ is a normalizing constant with $\mathbb{P}(o\mid b,a)=\sum_{s'\in S}O(o\mid s',a)\sum_{s\in S}T(s'\mid s,a)\,b(s)$.
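A direct transcription of this update might look as follows. The function signature is an assumption; `T` and `O` are expected as dictionaries keyed by $(s, a)$ and $(s', a)$ respectively, mapping outcomes to probabilities, as in the earlier sketch.

```python
def belief_update(b, a, o, states, T, O):
    """Compute b' = tau(b, a, o) for a belief b given as a {state: probability} dict.

    T[(s, a)] maps next states to probabilities, and O[(s_next, a)] maps
    observations to probabilities (same layout as the sketch above).
    """
    b_next = {}
    for s_next in states:
        # Unnormalized update: O(o | s', a) * sum_s T(s' | s, a) * b(s)
        b_next[s_next] = O[(s_next, a)].get(o, 0.0) * sum(
            T[(s, a)].get(s_next, 0.0) * b[s] for s in states
        )
    norm = sum(b_next.values())        # equals P(o | b, a)
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief and action.")
    return {s_next: p / norm for s_next, p in b_next.items()}

# Example, reusing S, T, O from the earlier sketch: start from a uniform belief.
# b0 = {s: 1.0 / len(S) for s in S}
# b1 = belief_update(b0, "stay", "obs1", S, T, O)
```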
A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting belief MDP is defined on a continuous state space: even when the “originating” POMDP has a finite number of states, there are infinitely many belief states in $B$, because a belief is a probability distribution (a mixture) over the states in $S$.
Formally, the belief MDP is defined as a tuple $(B,A,\tau,r,\gamma)$ where
- $B$ is the set of belief states over the POMDP states;
- $A$ is the same finite set of actions as for the original POMDP;
- $\tau$ is the belief state transition function;
- $r:B \times A \to \mathbb{R}$ is the reward function on belief states;
- $\gamma$ is the discount factor equal to the $\gamma$ in the original POMDP.
Of these, $\tau$ and $r$ need to be derived from the original POMDP. $\tau$ is
$\tau(b,a,b') = \sum_{o\in \Omega}\mathbb{P}(b'\mid b,a,o)\,\mathbb{P}(o\mid b,a),$
where $\mathbb{P}(o\mid b,a)$ is the value derived in the previous section and
$\mathbb{P}(b'\mid b,a,o) = \begin{cases}1 & \text{if the belief update }\tau(b,a,o)\text{ returns }b'\\0 & \text{otherwise.}\end{cases}$
The belief MDP reward function ($r$) is the expected reward from the POMDP reward function over the belief state distribution:
$r(b,a) = \sum_{s\in S}b(s)\,R(s,a).$
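Under the same dictionary layout as the earlier sketches, these two derived quantities reduce to a few lines of Python; the function names are assumptions made for illustration.

```python
def observation_prob(b, a, o, states, T, O):
    """P(o | b, a) = sum_{s'} O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    return sum(
        O[(s_next, a)].get(o, 0.0)
        * sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in states)
        for s_next in states
    )

def belief_reward(b, a, R):
    """r(b, a) = sum_s b(s) * R(s, a), the expected immediate reward under belief b."""
    return sum(b[s] * R[(s, a)] for s in b)
```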
The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
Policy and Value Function
Unlike the “originating” POMDP (where the actions available to the agent can depend on the hidden state), in the corresponding belief MDP all belief states allow all actions, since the agent (almost) always assigns some probability to being in any of the originating states. As such, the policy $\pi$ specifies an action $a=\pi(b)$ for any belief $b$.
Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When $R$ defines a cost, the objective becomes the minimization of the expected cost.
The expected reward for policy $\pi$ starting from belief $b_{0}$ is defined as
$V^{\pi}(b_{0}) = \sum_{t=0}^{\infty}\gamma^{t}\,r(b_{t},a_{t}) = \sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}\bigl[R(s_{t},a_{t})\mid b_{0},\pi\bigr],$
where $\gamma<1$ is the discount factor. The optimal policy $\pi^{\ast}$ is obtained by optimizing the long-term reward:
$\pi^{\ast} = \underset{\pi}{\operatorname{argmax}}\;V^{\pi}(b_{0}),$
where $b_{0}$ is the initial belief.
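For a fixed policy, $V^{\pi}(b_{0})$ can be estimated by Monte Carlo rollouts: sample a hidden start state from $b_{0}$, act according to $\pi(b)$ while tracking the belief, and average the (truncated) discounted returns. The sketch below assumes the toy POMDP dictionaries and the `belief_update` helper from the earlier sketches; the horizon and rollout count are arbitrary illustrative choices.

```python
import random

def estimate_value(policy, b0, states, T, R, O, gamma, horizon=100, n_rollouts=1000):
    """Monte Carlo estimate of V^pi(b0) by averaging truncated discounted returns.

    `policy` maps a belief (a {state: probability} dict) to an action;
    `belief_update` is the helper sketched in the previous section.
    """
    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    total = 0.0
    for _ in range(n_rollouts):
        s = sample(b0)                 # hidden state drawn from the initial belief
        b = dict(b0)
        ret = 0.0
        for t in range(horizon):       # truncation approximates the infinite sum (gamma < 1)
            a = policy(b)              # the policy sees only the belief, never s
            ret += gamma ** t * R[(s, a)]
            s = sample(T[(s, a)])      # hidden transition
            o = sample(O[(s, a)])      # observation generated from the new hidden state
            b = belief_update(b, a, o, states, T, O)
        total += ret
    return total / n_rollouts

# Example, reusing the toy POMDP from the earlier sketches:
# v = estimate_value(lambda b: "stay", {"s0": 0.5, "s1": 0.5}, S, T, R, O, gamma)
```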
The optimal policy, denoted by $\pi^{\ast}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function $V^{\ast}$. This value function is the solution to the Bellman optimality equation:
$V^{\ast}(b) = \max_{a\in A}\Bigl[r(b,a) + \gamma\sum_{o\in \Omega}\mathbb{P}(o\mid b,a)\,V^{\ast}\bigl(\tau(b,a,o)\bigr)\Bigr].$
For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex, and can be represented as a finite set of vectors. In the infinite-horizon formulation, $V^{\ast}$ remains convex and can be approximated arbitrarily closely by a finite set of vectors. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an $\epsilon$-optimal value function, and this update preserves piecewise linearity and convexity. By improving the value, the policy is implicitly improved. Another dynamic programming technique, called policy iteration, explicitly represents and improves the policy instead. These are the counterparts of value iteration and policy iteration for fully observable MDPs.
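The vector representation and the dynamic-programming update can be illustrated with a small point-based value iteration sketch: the value function is stored as a finite set of $\alpha$-vectors, and a backup is applied at a fixed set of belief points. This is a rough illustrative approximation under assumed array layouts (rows indexed by $s$, columns by $s'$ or $o$), not a full solver; the toy numbers mirror the two-state example used earlier.

```python
import numpy as np

def pbvi_backup(Gamma, beliefs, T, O, R, gamma):
    """One point-based dynamic-programming backup over a fixed set of beliefs.

    Gamma   : list of alpha-vectors, each of shape (n_states,)
    beliefs : list of belief points, each of shape (n_states,)
    T[a]    : (n_states, n_states) array with T[a][s, s2] = P(s2 | s, a)
    O[a]    : (n_states, n_obs) array with O[a][s2, o] = P(o | s2, a)
    R       : (n_states, n_actions) array of rewards R(s, a)
    """
    n_states, n_actions = R.shape
    n_obs = O[0].shape[1]
    new_Gamma = []
    for b in beliefs:
        best_vec, best_val = None, -np.inf
        for a in range(n_actions):
            alpha_a = R[:, a].astype(float)
            for o in range(n_obs):
                # g(s) = sum_{s2} T(s2 | s, a) * O(o | s2, a) * alpha(s2),
                # keeping the candidate alpha-vector that is best for this belief point.
                candidates = [T[a] @ (O[a][:, o] * alpha) for alpha in Gamma]
                alpha_a = alpha_a + gamma * max(candidates, key=lambda g: float(b @ g))
            val = float(b @ alpha_a)
            if val > best_val:
                best_val, best_vec = val, alpha_a
        new_Gamma.append(best_vec)
    return new_Gamma

# Toy usage mirroring the two-state example from the earlier sketches:
T = [np.array([[0.9, 0.1], [0.1, 0.9]]),      # action "stay"
     np.array([[0.2, 0.8], [0.8, 0.2]])]      # action "switch"
O = [np.array([[0.85, 0.15], [0.15, 0.85]]),  # P(o | s', a) for "stay"
     np.array([[0.85, 0.15], [0.15, 0.85]])]  # P(o | s', a) for "switch"
R = np.array([[1.0, -0.5], [-1.0, 0.5]])      # R(s, a)
gamma, beliefs = 0.95, [np.array([p, 1.0 - p]) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]

Gamma = [np.zeros(2)]                          # initial value function V = 0
for _ in range(100):                           # repeated backups improve the value estimate
    Gamma = pbvi_backup(Gamma, beliefs, T, O, R, gamma)
print([float(max(b @ alpha for alpha in Gamma)) for b in beliefs])
```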