Introduction
Q-learning is a model-free reinforcement learning method: it does not need to build a model of the environment. The goal of Q-learning is to learn a policy, which guides an agent to take an appropriate action in each state. For any finite Markov decision process, Q-learning can find an optimal policy that maximizes the expected total reward.
Preliminaries
Reinforcement learning involves an agent, a state set $S$, and an action set $A$. When the agent observes a state $s_t\in S$, it chooses an action $a_t\in A$; it then receives a reward $r_t$ and moves to the next state $s_{t+1}$.
Our goal is to maximize the future total reward, i.e., the expectation of the total reward given the current state. In practice, we usually use a weighted sum, e.g., the discounted criterion, to compute the expected total reward.
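For example, under the discounted criterion with a discount factor $\gamma\in[0,1)$, the return from time $t$ is the weighted sum
$$G_t=\sum_{k=0}^{\infty}\gamma^k r_{t+k},$$
so rewards further in the future contribute less.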
Q-learning is used to learn a function $Q:S\times A\mapsto \mathbb{R}$. At any state $s_t$, the agent can choose an action $a_t$ such that $Q(s_t,a_t)$ is maximized.
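In other words, once $Q$ has been learned, the greedy policy it induces is $\pi(s)=\arg\max_{a\in A}Q(s,a)$.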
Algorithm
To use Q-learning, we have to find a way to learn the $Q$ function. The simplest way is to maintain a table of size $|S|\times |A|$ and use value iteration to update the Q-table. The update rule is given by
$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\right),$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor, which requires $s_t$, $a_t$, $r_t$, and $s_{t+1}$ at each update step.
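As a quick worked example with hypothetical numbers: if $\alpha=0.5$, $\gamma=0.9$, $Q(s_t,a_t)=1$, $r_t=1$, and $\max_a Q(s_{t+1},a)=2$, the update gives $Q(s_t,a_t)\leftarrow 1+0.5\times(1+0.9\times 2-1)=1.9$.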
Code
```python
import gym  # OpenAI Gym provides standard RL environments
```
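Building on this import, below is a minimal sketch of tabular Q-learning on the FrozenLake-v1 environment. The environment name, hyperparameters, and episode count are illustrative choices, and the snippet assumes the classic gym API in which `reset()` returns an observation and `step()` returns four values; newer gym/gymnasium releases return `(obs, info)` and a five-tuple instead.

```python
import gym
import numpy as np

# Illustrative hyperparameters (not tuned).
ALPHA = 0.1      # learning rate
GAMMA = 0.99     # discount factor
EPSILON = 0.1    # exploration rate for epsilon-greedy
EPISODES = 5000

# FrozenLake-v1 has discrete states and actions, so an |S| x |A| table suffices.
env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))

for _ in range(EPISODES):
    s = env.reset()  # classic gym API; newer versions return (obs, info)
    done = False
    while not done:
        # Epsilon-greedy: explore with probability EPSILON, otherwise act greedily.
        if np.random.rand() < EPSILON:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)  # newer versions return a five-tuple
        # Q-learning update rule from the Algorithm section.
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```

After training, the greedy policy $\pi(s)=\arg\max_a Q(s,a)$ can be read directly off the table.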
Variants
- Deep Q-learning
- Double Q-learning
- Double Deep Q-learning
- Delayed Q-learning
- Greedy GQ