📖 Value Function Methods (2)

# Q Learning

## Recap: fitted Q-iteration is off-policy

![](https://i.imgur.com/eyVrTD7.png)

### Q-iteration optimization

![](https://i.imgur.com/AqTXAWv.png)

## Online Q-learning algorithms

1. take exactly one action $\mathbf{a}_i$ and observe $\left(\mathbf{s}_i, \mathbf{a}_i, \mathbf{s}_i^{\prime}, r_i\right)$
    - off-policy: the action does not have to come from the latest policy
2. compute the target $\mathbf{y}_i=r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma \max _{\mathbf{a}^{\prime}} Q_\phi\left(\mathbf{s}_i^{\prime}, \mathbf{a}^{\prime}\right)$
3. take one gradient step $\phi \leftarrow \phi-\alpha \frac{d Q_\phi}{d \phi}\left(\mathbf{s}_i, \mathbf{a}_i\right)\left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right)-\mathbf{y}_i\right)$

### Exploration with Q-learning

To get the final policy (see the code sketch at the end of this note):

1. `simple argmax`:
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{l} 1 \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\ 0 \text { otherwise } \end{array}\right.
$$
    - not good in the online setting: the initial policy may be poor, so the single action it generates may also be poor; this makes the subsequent targets $\mathbf{y}_i$ inaccurate, which in turn makes the $Q$ model inaccurate, and the policy improves slowly
2. `epsilon-greedy`:
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{l} 1-\epsilon \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\ \epsilon /(|\mathcal{A}|-1) \text { otherwise } \end{array}\right.
$$
    - smoothing: assigns some probability to all other actions
    - drawback: if several actions are equally good, only one of them gets the large probability while the others remain very unlikely
3. `Boltzmann exploration`:
$$
\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right) \propto \exp \left(Q_\phi\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)
$$
    - good actions all get comparably large (roughly equal) probabilities

## Literature

1. [https://huggingface.co/blog/deep-rl-dqn](https://huggingface.co/blog/deep-rl-dqn)
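
## Code sketch: online Q-learning with exploration

A minimal sketch of the online Q-learning loop and the exploration rules above. The toy MDP (`n_states`, `n_actions`, the random tables `P` and `R`), the tabular Q-function, and all hyperparameter values are invented here purely for illustration. With a tabular parameterization, the gradient $\frac{dQ_\phi}{d\phi}$ is just an indicator on the visited $(\mathbf{s}_i, \mathbf{a}_i)$ entry, so step 3 becomes an in-place update of that entry.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 10, 4
gamma, alpha, epsilon, n_steps = 0.99, 0.1, 0.1, 5000

# Hypothetical toy dynamics: P[s, a] -> deterministic next state, R[s, a] -> reward.
P = rng.integers(n_states, size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))  # Q_phi, here a table of parameters


def epsilon_greedy(q_values, eps):
    """1 - eps on the argmax action, eps / (|A| - 1) on each other action."""
    greedy = int(np.argmax(q_values))
    if rng.random() < 1 - eps:
        return greedy
    others = [a for a in range(len(q_values)) if a != greedy]
    return int(rng.choice(others))


def boltzmann(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s, a)): equally good actions get equal probability."""
    logits = (q_values - q_values.max()) / temperature  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))


s = int(rng.integers(n_states))
for _ in range(n_steps):
    # 1. take exactly one action and observe (s, a, s', r)
    a = epsilon_greedy(Q[s], epsilon)          # or: boltzmann(Q[s])
    s_next, r = int(P[s, a]), R[s, a]
    # 2. y = r + gamma * max_a' Q(s', a')
    y = r + gamma * Q[s_next].max()
    # 3. phi <- phi - alpha * (dQ/dphi) * (Q(s, a) - y); tabular gradient is 1 at (s, a)
    Q[s, a] -= alpha * (Q[s, a] - y)
    s = s_next

print("greedy policy per state:", Q.argmax(axis=1))
```

With a function approximator instead of a table, step 3 would become a gradient step on $\phi$ (e.g. an autograd update on $\frac{1}{2}(Q_\phi(\mathbf{s}_i, \mathbf{a}_i) - \mathbf{y}_i)^2$ with $\mathbf{y}_i$ treated as a constant), but the loop structure stays the same.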