Markov Decision Process

A Markov decision process is a $5$-tuple $(S,A,P_{\cdot}(\cdot,\cdot),R_{\cdot}(\cdot,\cdot),\gamma)$, where

  • $S$ is a finite set of states.

  • $A$ is a finite set of actions (alternatively, $A(s)$ is the finite set of actions available from state $s$).

  • $R_{a}(s,s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$ under action $a$. It may also be written as $R(s)$ or $R(s,a)$.

  • $P_a(s,s')=P(s_{t+1}=s'\mid s_t=s,a_t=a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$.

  • $\gamma\in[0,1]$ is a discount factor, which represents the difference in importance between future rewards and present rewards.

The solution to an MDP is a policy $\pi: S \rightarrow A$ that maximizes the long-term expected reward. In reinforcement learning domains, we simply assume the policy is deterministic and history-independent, i.e. it depends only on the current state.
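
As a concrete (made-up) illustration, a small finite MDP can be stored as a nested dictionary keyed by state and then action, in the same `P[s][a] = [(probability, next_state, reward, done), ...]` layout exposed by the gym FrozenLake environment used later in this post:

# A toy 2-state, 2-action MDP; the numbers are invented purely for illustration.
# P[s][a] is a list of (probability, next_state, reward, done) tuples.
P = {
    0: {0: [(0.9, 0, 0.0, False), (0.1, 1, 1.0, False)],
        1: [(0.2, 0, 0.0, False), (0.8, 1, 1.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],   # state 1 is absorbing/terminal here
        1: [(1.0, 1, 0.0, True)]},
}
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor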

Total Reward

$V(s)$, the state value function, is the expected total discounted reward from time-step $t$,

$$V(s)=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+k})\;\middle|\;s_{t}=s\right].$$

We can show that

$$\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+k})=R(s_{t})+\gamma\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+1+k}),$$

i.e. the discounted return from time $t$ equals the immediate reward plus $\gamma$ times the discounted return from time $t+1$.
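
As a quick numeric sketch (the reward sequence here is hypothetical), the discounted sum can be computed directly:

gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards received at t, t+1, t+2, t+3
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349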

Policies

The long-term, delayed reward under a policy $\pi$ is given by

$$V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+k})\;\middle|\;s_{t}=s\right],$$

which is not equal to the immediate reward $R(s)$.

From the Bellman equation, the value function can be decomposed into two parts: the immediate reward $R(s)$ and the discounted value of the successor state $\gamma V(s_{t+1})$,

$$V(s)=R(s)+\gamma\,\mathbb{E}\left[V(s_{t+1})\mid s_{t}=s\right].$$

A similar result holds for $V^{\pi}(s)$:

$$V^{\pi}(s)=R(s)+\gamma\sum_{s'}P_{\pi(s)}(s,s')\,V^{\pi}(s').$$
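
Because the Bellman expectation equation is linear in $V^{\pi}$, a fixed policy can also be evaluated exactly by solving the linear system $(I-\gamma P^{\pi})V^{\pi}=R^{\pi}$. A minimal sketch, assuming a made-up 2-state chain with a known transition matrix and reward vector:

import numpy as np

gamma = 0.9
# Hypothetical 2-state example: P_pi[s, s2] is the transition probability
# under a fixed policy, R_pi[s] the expected immediate reward in state s.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 0.0])

# Solve (I - gamma * P_pi) V = R_pi for the exact value function.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # approximately [1.818, 0.0]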

Policy-Iterated RL

Policy iteration evaluates and improves the policy in turn. Once a policy $\pi$ has been improved by using $V^{\pi}$ to yield a better policy $\pi'$, we can then compute $V^{\pi'}$ and improve it again to yield an even better policy. We thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_{0}\xrightarrow{E}V^{\pi_{0}}\xrightarrow{I}\pi_{1}\xrightarrow{E}V^{\pi_{1}}\xrightarrow{I}\pi_{2}\xrightarrow{E}\cdots\xrightarrow{I}\pi^{\ast}\xrightarrow{E}V^{\ast},$$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
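
The improvement step is the standard greedy step with respect to the current value function (this is what the `policy_improvement` function below computes):

$$\pi'(s)=\arg\max_{a}\sum_{s'}P_{a}(s,s')\left[R_{a}(s,s')+\gamma V^{\pi}(s')\right].$$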

Value-Iterated RL

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then exact convergence to $V^{\pi}$ occurs only in the limit.

In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration. It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:

$$V_{k+1}(s)=\max_{a}\sum_{s'}P_{a}(s,s')\left[R_{a}(s,s')+\gamma V_{k}(s')\right]$$

for all $s\in S$. For arbitrary $V_0$, the sequence $\{V_k\}$ can be shown to converge to $V^\ast$ under the same conditions that guarantee the existence of $V^\ast$.
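
Once $V^{\ast}$ (or a close approximation to it) is available, an optimal policy can be read off greedily, which is what the `extract_policy` function below computes:

$$\pi^{\ast}(s)=\arg\max_{a}\sum_{s'}P_{a}(s,s')\left[R_{a}(s,s')+\gamma V^{\ast}(s')\right].$$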

Code Example

import numpy as np
import gym
from gym import wrappers

Policy-Iterated RL

def policy_evaluation(env, policy, gamma=1.0):
    """ Iteratively evaluate the value function under a fixed policy.
    Alternatively, we could formulate a set of linear equations in terms of v[s]
    and solve them to find the value function.
    """
    v = np.zeros(env.nS)
    eps = 1e-10
    while True:
        prev_v = np.copy(v)
        for s in range(env.nS):
            policy_a = policy[s]
            # Bellman expectation backup for the action chosen by the policy
            v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][policy_a]])
        if np.sum(np.fabs(prev_v - v)) <= eps:
            # value converged
            break
    return v


def policy_improvement(env, v, gamma=1.0):
    """ Extract the greedy policy given a value function """
    policy = np.zeros(env.nS)
    for s in range(env.nS):
        q_sa = np.zeros(env.nA)
        for a in range(env.nA):
            q_sa[a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in env.P[s][a]])
        policy[s] = np.argmax(q_sa)
    return policy


def policy_iteration(env, gamma=1.0):
    """ Policy-Iteration algorithm """
    # initialize a random policy
    policy = np.random.choice(env.nA, size=(env.nS))
    max_iterations = 200000
    for i in range(max_iterations):
        # evaluate the current policy
        old_policy_v = policy_evaluation(env, policy, gamma)
        # improve the policy greedily with respect to its value function
        new_policy = policy_improvement(env, old_policy_v, gamma)
        if np.all(policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            break
        policy = new_policy
    return policy

Evaluation

26
def run_episode(env, policy, gamma=1.0, render=False):
    """ Runs an episode and returns the total (discounted) reward """
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward


def evaluate_policy(env, policy, gamma=1.0, n=100, render=False):
    """ Average total reward over n episodes """
    scores = [run_episode(env, policy, gamma, render) for _ in range(n)]
    return np.mean(scores)

env_name = 'FrozenLake8x8-v0'
env = gym.make(env_name).unwrapped
optimal_policy = policy_iteration(env, gamma=1.0)
scores = evaluate_policy(env, optimal_policy, gamma=1.0, render=False)
print('Average scores = ', np.mean(scores))
# Policy-Iteration converged at step 13.
# Average scores = 1.0

Value-Iterated RL

def extract_policy(env, v, gamma=1.0):
    """ Extract the greedy policy given a value function """
    policy = np.zeros(env.nS)
    for s in range(env.nS):
        q_sa = np.zeros(env.action_space.n)
        for a in range(env.action_space.n):
            for next_sr in env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        policy[s] = np.argmax(q_sa)
    return policy


def value_iteration(env, gamma=1.0):
    """ Value-Iteration algorithm """
    v = np.zeros(env.nS)  # initialize value function
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.nS):
            # one Bellman optimality backup per state
            q_sa = [sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][a]]) for a in range(env.nA)]
            v[s] = max(q_sa)
        if np.sum(np.fabs(prev_v - v)) <= eps:
            print('Value-iteration converged at iteration# %d.' % (i + 1))
            break
    return v

Evaluation

def run_episode(env, policy, gamma=1.0, render=False):
    """ Evaluates policy by using it to run an episode and finding its
    total reward.
    args:
        env: gym environment.
        policy: the policy to be used.
        gamma: discount factor.
        render: boolean to turn rendering on/off.
    returns:
        total reward: real value of the total reward received by agent under policy.
    """
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward


def evaluate_policy(env, policy, gamma=1.0, n=100):
    """ Evaluates a policy by running it n times.
    returns:
        average total reward
    """
    scores = [
        run_episode(env, policy, gamma=gamma, render=False)
        for _ in range(n)]
    return np.mean(scores)

env_name = 'FrozenLake8x8-v0'
gamma = 1.0
env = gym.make(env_name).unwrapped
optimal_v = value_iteration(env, gamma)
policy = extract_policy(env, optimal_v, gamma)
policy_score = evaluate_policy(env, policy, gamma, n=1000)
print('Policy average score = ', policy_score)

# Value-iteration converged at iteration# 2357.
# Policy average score = 1.0
