📖 Actor Critic

# Policy Evaluation

## improving the previous policy gradient

Previous policy gradient:

![](https://i.imgur.com/5izrWy0.png)

- $\hat{Q}_{i, t}$: estimate of the expected reward-to-go ($Q\left(\mathbf{s}_t, \mathbf{a}_t\right)$) if we take action $\mathbf{a}_{i, t}$ in state $\mathbf{s}_{i, t}$

Can we get a better estimate?

- $Q\left(\mathbf{s}_t, \mathbf{a}_t\right)=\sum_{t^{\prime}=t}^T E_{\pi_\theta}\left[r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right) \mid \mathbf{s}_t, \mathbf{a}_t\right]$: true expected reward-to-go
- ==Problem with $\hat{Q}_{i, t}$==: it is estimated from only a single sample --> ideally we would draw many samples starting from the current $(s_{t},a_{t})$ to estimate the expectation
- ==Problem with the baseline==: previously $b_t=\frac{1}{N} \sum_i Q\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)$; better: $b = V\left(\mathbf{s}_t\right)=E_{\mathbf{a}_t \sim \pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}\left[Q\left(\mathbf{s}_t, \mathbf{a}_t\right)\right]$
    - $V(s_{t})$ is the average reward-to-go over the actions the policy can take at $s_{t}$

==>

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right) A^\pi\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)
$$

- $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=Q^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)-V^\pi\left(\mathbf{s}_t\right)$: the *advantage function*: how much better $\mathbf{a}_t$ is than the average performance of the policy in state $s_{t}$

> In practice we cannot compute $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$ exactly; we can only approximate it. The better the approximation, the lower the variance.

**idea**: train a model for policy evaluation / return estimation

## NNs for policy evaluation (value model)

$$
Q^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=r\left(\mathbf{s}_t, \mathbf{a}_t\right)+E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t\right)}\left[V^\pi\left(\mathbf{s}_{t+1}\right)\right]
$$

- $\approx r\left(\mathbf{s}_t, \mathbf{a}_t\right)+V^\pi\left(\mathbf{s}_{t+1}\right)$
- --> $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right) \approx r\left(\mathbf{s}_t, \mathbf{a}_t\right)+V^\pi\left(\mathbf{s}_{t+1}\right)-V^\pi\left(\mathbf{s}_t\right)$
- so let's just train a model for $V^{\pi}(s)$:
    - $s$ --> NN $f_{\phi}(s)$ --> $V^{\pi}(s)$

![](https://i.imgur.com/9Ml5YEa.png)

- this process is also called policy evaluation: once we know $V^{\pi}(s)$, we can immediately compute the objective $J(\theta)=E_{\mathbf{s}_1 \sim p\left(\mathbf{s}_1\right)}\left[V^\pi\left(\mathbf{s}_1\right)\right]$

### training

![](https://i.imgur.com/pYNcITP.jpg)

![](https://i.imgur.com/bPK6shO.jpg)

# From Evaluation to Actor Critic

![](https://i.imgur.com/RddBZD4.png)

- it's called "actor-critic": the "actor" is the policy $\pi_\theta$, the "critic" is the value function $\hat{V}_\phi^\pi$

## aside: discount factors

![](https://i.imgur.com/urt4qov.png)

**discount factors for policy gradient:**

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\left(\sum_{t^{\prime}=t}^T \gamma^{t^{\prime}-t} r\left(\mathbf{s}_{i, t^{\prime}}, \mathbf{a}_{i, t^{\prime}}\right)\right)
$$

## actor-critic algorithms (with discount)

**batch actor-critic algorithm:**
1. sample $\left\{\mathbf{s}_i, \mathbf{a}_i\right\}$ from $\pi_\theta(\mathbf{a} \mid \mathbf{s})$ (run it on the robot)
2. fit $\hat{V}_\phi^\pi(\mathbf{s})$ to the sampled reward sums
3. evaluate $\hat{A}^\pi\left(\mathbf{s}_i, \mathbf{a}_i\right)=r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}_i^{\prime}\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_i\right)$
4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta\left(\mathbf{a}_i \mid \mathbf{s}_i\right) \hat{A}^\pi\left(\mathbf{s}_i, \mathbf{a}_i\right)$
5. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
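
As a concrete reference for the five steps above, here is a minimal sketch of one batch actor-critic update in PyTorch. It assumes a small discrete-action task; the network sizes, learning rates, and names (`policy`, `value_fn`, `batch_actor_critic_update`) are illustrative rather than from the lecture, and the value fit here uses the bootstrapped target $r+\gamma \hat{V}_\phi^\pi(\mathbf{s}^{\prime})$ in place of the Monte Carlo reward sums mentioned in step 2.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
# hypothetical actor (logits over actions) and critic V_phi(s)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_v = torch.optim.Adam(value_fn.parameters(), lr=1e-3)


def batch_actor_critic_update(states, actions, rewards, next_states, dones):
    """states/next_states: float (N, obs_dim); actions: long (N,);
    rewards/dones: float (N,). All sampled from pi_theta (step 1)."""
    # step 2: fit V_phi(s) -- one regression step toward r + gamma * V_phi(s')
    with torch.no_grad():
        v_target = rewards + gamma * (1.0 - dones) * value_fn(next_states).squeeze(-1)
    v_loss = ((value_fn(states).squeeze(-1) - v_target) ** 2).mean()
    opt_v.zero_grad()
    v_loss.backward()
    opt_v.step()

    # step 3: advantage estimate A-hat = r + gamma * V_phi(s') - V_phi(s)
    with torch.no_grad():
        adv = (rewards + gamma * (1.0 - dones) * value_fn(next_states).squeeze(-1)
               - value_fn(states).squeeze(-1))

    # steps 4-5: policy gradient with the estimated advantage, then a gradient step
    log_prob = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    pi_loss = -(log_prob * adv).mean()
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
```
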
**online actor-critic algorithm:**
1. take action $\mathbf{a} \sim \pi_\theta(\mathbf{a} \mid \mathbf{s})$, get $\left(\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime}, r\right)$
2. update $\hat{V}_\phi^\pi$ using the target $r+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)$
3. evaluate $\hat{A}^\pi(\mathbf{s}, \mathbf{a})=r(\mathbf{s}, \mathbf{a})+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)-\hat{V}_\phi^\pi(\mathbf{s})$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \hat{A}^\pi(\mathbf{s}, \mathbf{a})$
5. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$

# Actor-critic design decisions

## Architecture design

How should we represent the actor $\pi_\theta(\mathbf{a} \mid \mathbf{s})$ and the critic $\hat{V}_\phi^\pi(\mathbf{s})$ in the online actor-critic algorithm above?

**Two-Network design**

![](https://i.imgur.com/MxWQonM.png)

+: <font color = 'green'>simple & stable</font>
-: <font color = 'red'>no shared features between actor & critic</font>

**Shared-Network design**

![](https://i.imgur.com/zWwzKZc.png)

## Online actor-critic in practice

![](https://i.imgur.com/PefF0qR.png)

> step 5 (gradient descent/ascent) does not work well on a single sample --> it works best with a batch (e.g. parallel workers)

### Variants

#### 1. off-policy online actor-critic

![](https://i.imgur.com/0D7pWmV.png)

<font color = 'red'>

1. The replay buffer holds actions $a$ generated by older policies for the stored states $s$ (collected earlier).
2. In later iterations, the new policy generates new actions $a$ for the same states $s$, and the new tuples $\left(\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime}, r\right)$ are put back into the replay buffer, **but the states $s$ are not refreshed (no new interaction with the environment)**.

**Problems**:

1. The $V^{\pi}(\mathbf{s}^{\prime})$ in the step-3 target does not reflect the value of the actions the latest policy would take ==> do not learn $V$; learn $Q$ instead
    ![](https://i.imgur.com/KKlWL5R.png)
2. Likewise, the action in step 5 should come from the latest policy rather than from the replay buffer: $\mathbf{a}_i^\pi \sim \pi_\theta\left(\mathbf{a} \mid \mathbf{s}_i\right)$
3. in practice, step 5 does not use the advantage function; it uses $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta\left(\mathbf{a}_i^\pi \mid \mathbf{s}_i\right) \hat{Q}^\pi\left(\mathbf{s}_i, \mathbf{a}_i^\pi\right)$
    - higher variance, but convenient
    - the higher variance is acceptable here: we can reduce it by sampling many new actions (while keeping the states unchanged)

</font>

# critics as baselines in policy gradients

## critics as state-dependent baselines

![](https://i.imgur.com/SAZHlBO.png)

## critics as action-dependent baselines

![](https://i.imgur.com/KoOpbak.png)

## Eligibility traces & n-step returns

![](https://i.imgur.com/P28tZPz.png)

## Generalized advantage estimation

![](https://i.imgur.com/QL0XJ51.png)
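
The last two sections fit together as follows: an $n$-step return trades bias for variance by cutting the return after $n$ steps and bootstrapping with the critic, and generalized advantage estimation (GAE) averages all $n$-step advantages with exponential weights, $\hat{A}_t^{\mathrm{GAE}}=\sum_{l \geq 0}(\gamma \lambda)^l \delta_{t+l}$ where $\delta_t=r_t+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}_{t+1}\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_t\right)$. A minimal NumPy sketch of the backward recursion (function name and default hyperparameters are illustrative, not from the lecture):

```python
import numpy as np


def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE over one trajectory segment.

    rewards, dones: arrays of length T; values: V(s_0 .. s_{T-1});
    last_value: V(s_T), used to bootstrap the final step."""
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * (1.0 - dones) * next_values - values
    adv = np.zeros(T)
    running = 0.0
    # backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * (1.0 - dones[t]) * running
        adv[t] = running
    return adv
```

Setting $\lambda=0$ recovers the one-step advantage $r+\gamma \hat{V}_\phi^\pi(\mathbf{s}^{\prime})-\hat{V}_\phi^\pi(\mathbf{s})$ used in the actor-critic algorithms above, while $\lambda=1$ recovers the Monte Carlo return minus the baseline $\hat{V}_\phi^\pi(\mathbf{s}_t)$.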