Markov Decision Process

Markov Decision Process

A Markov decision process (MDP) is described by a 5-tuple $(S,A,P_{\cdot}(\cdot,\cdot),R_{\cdot}(\cdot,\cdot),\gamma)$, where

  • $S$ is a finite set of states;

  • $A(s)$ is a finite set of actions available in state $s\in S$;

  • $R_{a}(s,s')$ is the immediate reward received when action $a$ moves the system from state $s$ to state $s'$; it is also often written as $R(s)$ or $R(s,a)$;

  • $P_a(s,s')=P(s_{t+1}=s'\mid s_t=s,a_t=a)$ is the probability that taking action $a$ in state $s$ leads to state $s'$;

  • $\gamma\in[0,1]$ is a discount factor that expresses how much rewards received at different times matter relative to one another.

The goal of an MDP is to find a policy $\pi(s)\rightarrow a$ that maximizes the expected long-term reward. In the reinforcement learning setting studied here, the MDP policies are of the simplest kind: history-independent and deterministic.
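To make the tuple concrete, here is a minimal sketch of how such an MDP can be stored in code. The layout mirrors the `env.P[s][a]` lists of `(probability, next state, reward, done)` tuples exposed by the gym FrozenLake environment used later in this post; the two states, two actions, and rewards below are invented purely for illustration.

# A hypothetical MDP with two states and two actions, in the same layout as gym's env.P:
# P[s][a] is a list of (probability, next_state, reward, done) outcomes of taking a in s.
P = {
    0: {0: [(0.9, 0, 0.0, False), (0.1, 1, 1.0, True)],   # action 0: usually stay in state 0
        1: [(0.2, 0, 0.0, False), (0.8, 1, 1.0, True)]},  # action 1: usually reach terminal state 1
    1: {0: [(1.0, 1, 0.0, True)],                          # state 1 is terminal: every action stays there
        1: [(1.0, 1, 0.0, True)]},
}
gamma = 0.9  # discount factor

# Expected immediate reward R(s=0, a=1), summing over the possible outcomes:
print(sum(p * r for p, _, r, _ in P[0][1]))  # 0.8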

Total Reward

The utility function, or value function, $V(s)$ of a state $s$ is the total discounted reward accumulated from time $t$ onward,

$$V(s_t)=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+k})\right].$$

It can be shown, by expanding the sum and factoring out one power of $\gamma$, that the discounted return from time $t$ equals the reward at time $t$ plus $\gamma$ times the return from time $t+1$; this recursion is what the Bellman equations below build on.

Policy

The optimal policy satisfies

$$\pi^{\ast}(s)=\arg\max_{a}\sum_{s'}P_a(s,s')\left[R_a(s,s')+\gamma V^{\ast}(s')\right],$$

where $V^{\ast}$ denotes the optimal value function.

Given a policy $\pi$, the expected long-term discounted reward is

$$V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R(s_{t+k})\,\Big|\,s_t=s\right].$$

This is different from the immediate reward $R(s)$.

By the Bellman equation, the value function can be decomposed into two parts: the immediate reward $R(s)$ and the discounted value of the next state, $\gamma V(S_{t+1})$:

$$V(s_t)=\mathbb{E}\left[R(s_t)+\gamma V(s_{t+1})\right].$$

A similar result holds for $V^{\pi}(s)$.
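Written out with the transition probabilities and rewards from the tuple definition above, the decomposition for a fixed policy $\pi$ is the Bellman expectation equation

$$V^{\pi}(s)=\sum_{s'}P_{\pi(s)}(s,s')\left[R_{\pi(s)}(s,s')+\gamma V^{\pi}(s')\right],$$

one linear equation per state. This is also why, as the code below notes, $V^{\pi}$ could be obtained by solving a linear system instead of by iterative evaluation.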

Policy Iteration

Policy iteration alternates between evaluating and improving the policy. Once a policy $\pi$ has been improved, using its value function $V^{\pi}$, into a better policy $\pi'$, we can compute the new value function $V^{\pi'}$ and improve the policy again. We therefore obtain a monotonically improving sequence of policies:

$$\pi_0\xrightarrow{E}V^{\pi_0}\xrightarrow{I}\pi_1\xrightarrow{E}V^{\pi_1}\xrightarrow{I}\pi_2\xrightarrow{E}\cdots\xrightarrow{I}\pi^{\ast}\xrightarrow{E}V^{\ast},$$

where $\xrightarrow{E}$ denotes policy evaluation and $\xrightarrow{I}$ denotes policy improvement. In this process every policy is strictly better than its predecessor unless it is already optimal, and since a finite MDP has only finitely many policies, the algorithm is guaranteed to converge.
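Concretely, the improvement step $\xrightarrow{I}$ is the greedy update with respect to the current value function, which is what the `policy_improvement` function in the code below computes:

$$\pi'(s)=\arg\max_{a}\sum_{s'}P_a(s,s')\left[R_a(s,s')+\gamma V^{\pi}(s')\right].$$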

Value Iteration

One drawback of policy iteration is that the policy evaluation inside every iteration must itself sweep repeatedly over the entire state set before it converges. If instead we perform only a single evaluation-and-improvement sweep per iteration, the value function we converge to is still the value function of an optimal policy. In the update below, the improved policy is never computed explicitly; we maximize over actions directly inside the value-function update:

$$V_{k+1}(s)=\max_{a}\sum_{s'}P_a(s,s')\left[R_a(s,s')+\gamma V_k(s')\right]$$

for every $s\in S$. Starting from an arbitrary $V_0$, the sequence $\{V_k\}$ can be shown to converge to $V^{\ast}$ whenever $V^{\ast}$ exists.
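As a minimal illustration of this convergence claim (separate from the gym code below), the sketch runs the update above on a hypothetical two-state MDP whose optimal values are easy to check by hand; the states, actions, and rewards are made up purely for illustration.

import numpy as np

# A hypothetical two-state MDP: state 1 is absorbing and gives no reward.
# P[s][a] is a list of (probability, next_state, reward) triples, invented for illustration.
P = {
    0: {0: [(1.0, 0, 0.0)],    # "stay": remain in state 0, reward 0
        1: [(1.0, 1, 1.0)]},   # "go": move to the absorbing state, reward 1
    1: {0: [(1.0, 1, 0.0)],
        1: [(1.0, 1, 0.0)]},
}
gamma = 0.9

V = np.zeros(2)
for k in range(4):
    # one value-iteration sweep: V_{k+1}(s) = max_a sum_s' p * (r + gamma * V_k(s'))
    V = np.array([max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                      for a in P[s]) for s in sorted(P)])
    print(k + 1, V)
# The iterates settle at V* = [1.0, 0.0] after the first sweep.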

Code

import numpy as np
import gym
from gym import wrappers

Policy Iteration

def policy_evaluation(env, policy, gamma=1.0):
    """ Iteratively evaluate the value function under a fixed policy.
    Alternatively, we could formulate a set of linear equations in terms of v[s]
    and solve them to find the value function exactly.
    """
    v = np.zeros(env.nS)
    eps = 1e-10
    while True:
        prev_v = np.copy(v)
        for s in range(env.nS):
            policy_a = policy[s]
            # Bellman expectation backup for the action prescribed by the policy
            v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][policy_a]])
        if np.sum(np.fabs(prev_v - v)) <= eps:
            # value function converged
            break
    return v


def policy_improvement(env, v, gamma=1.0):
    """ Extract the greedy policy for a given value function """
    policy = np.zeros(env.nS, dtype=int)
    for s in range(env.nS):
        q_sa = np.zeros(env.nA)
        for a in range(env.nA):
            # one-step lookahead: Q(s, a) = sum over outcomes of p * (r + gamma * v(s'))
            q_sa[a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in env.P[s][a]])
        policy[s] = np.argmax(q_sa)
    return policy


def policy_iteration(env, gamma=1.0):
    """ Policy-iteration algorithm: alternate policy evaluation and policy improvement """
    # initialize with a random policy
    policy = np.random.choice(env.nA, size=(env.nS))
    max_iterations = 200000
    for i in range(max_iterations):
        # evaluate the value function of the current policy
        old_policy_v = policy_evaluation(env, policy, gamma)
        # improve the policy greedily with respect to that value function
        new_policy = policy_improvement(env, old_policy_v, gamma)
        if np.all(policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            break
        policy = new_policy
    return policy

Model Evaluation

def run_episode(env, policy, gamma=1.0, render=False):
    """ Run a single episode with the given policy and return the total discounted reward """
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward


def evaluate_policy(env, policy, gamma=1.0, n=100, render=False):
    """ Average the total reward of the policy over n episodes """
    scores = [run_episode(env, policy, gamma, render) for _ in range(n)]
    return np.mean(scores)

env_name = 'FrozenLake8x8-v0'
env = gym.make(env_name).unwrapped
optimal_policy = policy_iteration(env, gamma = 1.0)
scores = evaluate_policy(env, optimal_policy, gamma = 1.0, render = False)
print('Average scores = ', np.mean(scores))
# Policy-Iteration converged at step 13.
# Average scores = 1.0

Value Iteration

def extract_policy(env, v, gamma=1.0):
    """ Extract the greedy policy for a given value function """
    policy = np.zeros(env.nS, dtype=int)
    for s in range(env.nS):
        q_sa = np.zeros(env.nA)
        for a in range(env.nA):
            for next_sr in env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        policy[s] = np.argmax(q_sa)
    return policy


def value_iteration(env, gamma=1.0):
    """ Value-iteration algorithm """
    v = np.zeros(env.nS)  # initialize the value function
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.nS):
            # Bellman optimality backup: take the best one-step lookahead value
            q_sa = [sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][a]]) for a in range(env.nA)]
            v[s] = max(q_sa)
        if np.sum(np.fabs(prev_v - v)) <= eps:
            print('Value-iteration converged at iteration# %d.' % (i + 1))
            break
    return v

Model Evaluation

def run_episode(env, policy, gamma=1.0, render=False):
    """ Evaluates a policy by using it to run an episode and finding its
    total reward.
    args:
        env: gym environment.
        policy: the policy to be used.
        gamma: discount factor.
        render: boolean to turn rendering on/off.
    returns:
        total reward: real value of the total reward received by the agent under the policy.
    """
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward


def evaluate_policy(env, policy, gamma=1.0, n=100):
    """ Evaluates a policy by running it n times.
    returns:
        average total reward
    """
    scores = [
        run_episode(env, policy, gamma=gamma, render=False)
        for _ in range(n)]
    return np.mean(scores)

env_name = 'FrozenLake8x8-v0'
gamma = 1.0
env = gym.make(env_name).unwrapped
optimal_v = value_iteration(env, gamma)
policy = extract_policy(env, optimal_v, gamma)
policy_score = evaluate_policy(env, policy, gamma, n=1000)
print('Policy average score = ', policy_score)

# Value-iteration converged at iteration# 2357.
# Policy average score = 1.0
