📖 Value Function Methods (2)
# Q Learning
## recap: fitted Q-iteration is off-policy

### Q-iteration optimization

## Online Q-learning algorithms
1. take exactly one action $\mathbf{a}_i$ and observe $\left(\mathbf{s}_i, \mathbf{a}_i, \mathbf{s}_i^{\prime}, r_i\right)$
   - off-policy: the action does not need to come from the latest policy
2. $\mathbf{y}_i=r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma \max _{\mathbf{a}^{\prime}} Q_\phi\left(\mathbf{s}_i^{\prime}, \mathbf{a}^{\prime}\right)$
3. $\phi \leftarrow \phi-\alpha \frac{d Q_\phi}{d \phi}\left(\mathbf{s}_i, \mathbf{a}_i\right)\left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right)-\mathbf{y}_i\right)$
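
A minimal sketch of steps 1–3 in PyTorch, assuming a discrete action space and a small MLP `q_net` that maps a state to a vector of Q-values; the names (`q_net`, `online_q_step`, `gamma`) and the termination handling via `done` are illustrative choices, not part of the lecture pseudocode:

```python
import torch
import torch.nn as nn

gamma = 0.99
obs_dim, n_actions = 4, 2                    # e.g. a CartPole-sized problem
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def online_q_step(s, a, r, s_next, done):
    """One gradient step on a single observed transition (s, a, s', r)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # step 2: y_i = r + gamma * max_a' Q_phi(s', a'); the target is held fixed,
    # which matches the semi-gradient update dQ/dphi * (Q - y) in step 3
    with torch.no_grad():
        y = r + gamma * (1.0 - float(done)) * q_net(s_next).max()

    # step 3: gradient of 0.5 * (Q_phi(s, a) - y)^2 w.r.t. phi is dQ/dphi * (Q - y)
    q_sa = q_net(s)[a]
    loss = 0.5 * (q_sa - y) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Calling `online_q_step` once per environment transition gives the fully online algorithm; DQN (see the reference below) adds a replay buffer and a target network on top of this basic update.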
### exploration with Q-learning
The greedy argmax gives the final policy, but during online learning we also need a policy that selects actions to explore. Common choices (see the sketch after this list):
1. `simple argmax`:
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{l}
1 \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\
0 \text { otherwise }
\end{array}\right.
$$
- not good in the online setting: the initial policy can be quite bad, so the single action it picks may also be bad; this makes the subsequent targets $\mathbf{y}_i$ inaccurate, which makes the $Q$ model inaccurate, and the policy then improves slowly
2. `epsilon-greedy`:
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{l}
1-\epsilon \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\
\epsilon /(|\mathcal{A}|-1) \text { otherwise }
\end{array}\right.
$$
- smoothing: gives the other actions some probability
- drawback: if several actions are (nearly) equally good, only one of them, the argmax, gets a large probability, while the other good actions still get very small probabilities
3. `Boltzmann exploration`
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right) \propto \exp \left(Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)
$$
- good actions all receive relatively large (and roughly equal) probabilities
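
A minimal sketch of the three rules above, assuming `q_values` holds the vector $Q_\phi\left(\mathbf{s}_t, \cdot\right)$ over a discrete action set; the function names and the temperature parameter in the Boltzmann rule are illustrative additions, not part of the formulas above:

```python
import numpy as np

def greedy(q_values):
    """1. simple argmax: deterministic, no exploration."""
    return int(np.argmax(q_values))

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random):
    """2. argmax with prob. 1 - epsilon; each other action with prob. epsilon / (|A| - 1)."""
    best = int(np.argmax(q_values))
    if rng.random() < epsilon:
        others = [a for a in range(len(q_values)) if a != best]
        return int(rng.choice(others))
    return best

def boltzmann(q_values, temperature=1.0, rng=np.random):
    """3. sample with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()                   # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```

Note how Boltzmann exploration addresses the epsilon-greedy drawback: two actions with nearly equal Q-values get nearly equal sampling probabilities.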
## References
1. [https://huggingface.co/blog/deep-rl-dqn](https://huggingface.co/blog/deep-rl-dqn)