📖 Value Function Methods (1)

==Question: Can we omit the policy gradient completely?==

- $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$: how much better $\mathbf{a}_t$ is than the average action according to $\pi$
- $\arg \max _{\mathbf{a}_t} A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$: the best action from $\mathbf{s}_t$, if we then follow $\pi$
- these hold regardless of what $\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$ is!

So let's forget policies. **Basic idea**: at each iteration, build a new policy by

$\pi^{\prime}\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{ll}1 & \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\ 0 & \text { otherwise }\end{array}\right.$

- this policy is at least as good as $\pi$ (and probably better)

# Policy Iteration

![](https://i.imgur.com/yM5mYvF.png)

To evaluate $A^\pi(\mathbf{s}, \mathbf{a})$, as before:

$A^\pi(\mathbf{s}, \mathbf{a})=r(\mathbf{s}, \mathbf{a})+\gamma E\left[V^\pi\left(\mathbf{s}^{\prime}\right)\right]-V^\pi(\mathbf{s})$

so let's evaluate $V^\pi(\mathbf{s})$!

## 1) Dynamic Programming

![](https://i.imgur.com/cu9jYXu.png)

### Policy Iteration with Dynamic Programming

![](https://i.imgur.com/RMUFwK8.png)

(a tabular code sketch of this loop appears at the end of these notes)

### Even Simpler Dynamic Programming

![](https://i.imgur.com/stGMiOX.png)

## 2) Fitted Value Iteration

How do we represent $V(\mathbf{s})$?

- big table, one entry for each discrete $\mathbf{s}$
  ![](https://i.imgur.com/Jd4CIqQ.png)
  problem: too many discrete states, e.g.
  ![](https://i.imgur.com/zdGzVHd.png)
- neural net function $V: \mathcal{S} \rightarrow \mathbb{R}$
  ![](https://i.imgur.com/pIHNd84.png)

### fitted value iteration algorithm

![](https://i.imgur.com/Ki6zDU7.png)

> step 1 needs to know the outcomes of different actions (transition dynamics)! What if we don't know them? --> fitted Q-iteration

## 3) Fitted Q-Iteration

### fitted Q-iteration algorithm

![](https://i.imgur.com/hy49HKb.png)

<font color = 'green'>+ works even with off-policy samples (unlike actor-critic)<br>+ only one network, no high-variance policy gradient</font>

<font color = 'red'>- no convergence guarantees for non-linear function approximation (more on this later)</font>

==Full fitted Q-iteration algorithm:==

![](https://i.imgur.com/dirfawN.png)

- hyperparameters: dataset size $N$, collection policy, number of iterations $K$, number of gradient steps $S$
- (a code sketch of this loop appears at the end of these notes)

# Summary - value-based methods

![](https://i.imgur.com/A1vnXn1.png)

- Don't learn a policy explicitly
- Just learn the value function or Q-function
- If we have a value function, we have a policy ($\arg\max$ over $V$/$Q$)
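
## Appendix: code sketches

A minimal tabular sketch of policy iteration with dynamic programming from the section above: evaluate $V^\pi$ using the known transition model, then act greedily, i.e. take $\arg\max_a A^\pi(\mathbf{s}, \mathbf{a})$ (equivalently $\arg\max_a [r(\mathbf{s},\mathbf{a})+\gamma E[V^\pi(\mathbf{s}')]]$, since $V^\pi(\mathbf{s})$ does not depend on $\mathbf{a}$). The random MDP below (sizes, reward and transition tensors, iteration counts) is purely illustrative, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 3, 0.9          # illustrative sizes
R = rng.standard_normal((N_STATES, N_ACTIONS))  # r(s, a)
P = rng.dirichlet(np.ones(N_STATES), (N_STATES, N_ACTIONS))  # p(s' | s, a)

policy = np.zeros(N_STATES, dtype=int)  # deterministic pi(s) -> a
for _ in range(50):
    # policy evaluation: iterate V(s) <- r(s, pi(s)) + gamma * E_{s'}[V(s')]
    V = np.zeros(N_STATES)
    for _ in range(200):
        V = R[np.arange(N_STATES), policy] + GAMMA * (P[np.arange(N_STATES), policy] @ V)
    # policy improvement: pi'(s) = argmax_a [r(s, a) + gamma * E_{s'}[V(s')]]
    Q = R + GAMMA * P @ V
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy:", policy)
```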
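
A minimal sketch of the full fitted Q-iteration loop, assuming a discrete action space and a small MLP Q-network in PyTorch. The data-collection stub `sample_transitions` and all hyperparameter values are illustrative placeholders; in practice the dataset would come from rolling out some collection policy in the environment.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 2
GAMMA = 0.99
N_TRANSITIONS = 1000   # dataset size N
K_ITERS = 10           # target-regression iterations K
S_GRAD_STEPS = 100     # gradient steps S per iteration

# Q_phi(s) -> vector of Q-values, one entry per discrete action
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def sample_transitions(n):
    """Placeholder for 'collect dataset {(s_i, a_i, s'_i, r_i)} using some policy'."""
    s = torch.randn(n, STATE_DIM)
    a = torch.randint(NUM_ACTIONS, (n,))
    r = torch.randn(n)
    s_next = torch.randn(n, STATE_DIM)
    return s, a, r, s_next

s, a, r, s_next = sample_transitions(N_TRANSITIONS)

for _ in range(K_ITERS):
    # set targets y_i = r_i + gamma * max_a' Q_phi(s'_i, a'); no gradient through targets
    with torch.no_grad():
        y = r + GAMMA * q_net(s_next).max(dim=1).values
    # fit phi by minimizing sum_i (Q_phi(s_i, a_i) - y_i)^2 for S gradient steps
    for _ in range(S_GRAD_STEPS):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# In the full algorithm, the outer loop now collects more transitions (e.g. with an
# epsilon-greedy policy around argmax_a Q_phi, making the samples off-policy) and repeats.
```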