📖 Actor Critic
# Policy Evaluation
## improve previous policy gradient
Previous policy gradient:

- $\hat{Q}_{i, t}:$ estimate of the expected reward-to-go ($Q\left(\mathbf{s}_t, \mathbf{a}_t\right)$) if we take action $\mathbf{a}_{i, t}$ in state $\mathbf{s}_{i, t}$. Can we get a better estimate?
- $Q\left(\mathbf{s}_t, \mathbf{a}_t\right)=\sum_{t^{\prime}=t}^T E_{\pi_\theta}\left[r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right) \mid \mathbf{s}_t, \mathbf{a}_t\right]:$ true expected reward-to-go
- ==Problem with $\hat{Q}_{i, t}$==: it is estimated from only a single sample --> ideally we would draw many samples starting from the same $(\mathbf{s}_{t},\mathbf{a}_{t})$ and average them
- ==Problem with the baseline==: previous: $b_t=\frac{1}{N} \sum_i Q\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)$;
better: $b = V\left(\mathbf{s}_t\right)=E_{\mathbf{a}_t \sim \pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}\left[Q\left(\mathbf{s}_t, \mathbf{a}_t\right)\right]$
- $V(\mathbf{s}_{t})$ is the reward-to-go averaged over the different actions the policy can take in $\mathbf{s}_{t}$
==>
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right) A^\pi\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)
$$
- $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=Q^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)-V^\pi\left(\mathbf{s}_t\right):$ *advantage function*: how much better $\mathbf{a}_t$ is than the average performance of the policy in the state $s_{t}$
> In practice we cannot compute $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$ exactly; we can only approximate it. The better the approximation, the lower the variance
**idea**: train a model for policy evaluation/return estimate
## NNs for policy evaluation (value model)
$$
Q^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=r\left(\mathbf{s}_t, \mathbf{a}_t\right)+E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t\right)}\left[V^\pi\left(\mathbf{s}_{t+1}\right)\right]
$$
- $\approx r\left(\mathbf{s}_t, \mathbf{a}_t\right)+V^\pi\left(\mathbf{s}_{t+1}\right)$ (approximate the expectation with the single observed next state $\mathbf{s}_{t+1}$)
- --> $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right) \approx r\left(\mathbf{s}_t, \mathbf{a}_t\right)+V^\pi\left(\mathbf{s}_{t+1}\right)-V^\pi\left(\mathbf{s}_t\right)$
- let's just train a model for $V^{\pi}(s)$
- $s$ --> NN: $f_{\phi}(s)$ --> $V^{\pi}(s)$

- This step is also called policy evaluation: once we know $V^{\pi}(\mathbf{s})$, we immediately obtain the objective $J(\theta)=E_{\mathbf{s}_1 \sim p\left(\mathbf{s}_1\right)}\left[V^\pi\left(\mathbf{s}_1\right)\right]$
### training


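A minimal sketch of the Monte Carlo version: regress $\hat{V}_\phi^\pi(\mathbf{s}_{i,t})$ onto sampled reward-to-go targets $y_{i,t}=\sum_{t^{\prime}=t}^{T} r(\mathbf{s}_{i,t^{\prime}}, \mathbf{a}_{i,t^{\prime}})$ with a supervised loss (a bootstrapped target $r+\hat{V}_\phi^\pi(\mathbf{s}_{t+1})$ is a lower-variance alternative). The network size and observation dimension below are placeholder assumptions:

```python
# Sketch: Monte Carlo policy evaluation with a small value network (PyTorch).
import torch
import torch.nn as nn

value_net = nn.Sequential(            # f_phi(s) ~ V^pi(s); obs_dim=4 is a placeholder
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value(states, returns, epochs=50):
    """Regress V_phi(s_{i,t}) onto sampled reward-to-go targets y_{i,t}."""
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)           # shape [N*T]
        loss = nn.functional.mse_loss(pred, returns)   # supervised regression loss
        value_opt.zero_grad()
        loss.backward()
        value_opt.step()
```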
# From Evaluation to Actor Critic

- it's called "actor-critic": the "actor" is the policy $\pi_\theta$, and the "critic" is the value function $\hat{V}_\phi^\pi$ (it evaluates how good the actor's actions are)
## aside: discount factors

**discount factors for policy gradient:**
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\left(\sum_{t^{\prime}=t}^T \gamma^{t^{\prime}-t} r\left(\mathbf{s}_{i, t^{\prime}}, \mathbf{a}_{i, t^{\prime}}\right)\right)
$$
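As a small illustration of the discounted reward-to-go term in the sum above, a plain-Python sketch (the reward list in the usage comment is a placeholder):

```python
# Sketch: discounted reward-to-go sum_{t'=t}^T gamma^{t'-t} r_{t'} for one trajectory.
def discounted_rewards_to_go(rewards, gamma=0.99):
    out, running = [], 0.0
    for r in reversed(rewards):        # accumulate from the end of the trajectory
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

# e.g. discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.9) -> [2.71, 1.9, 1.0]
```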
## actor-critic algorithms (with discount)
**batch actor-critic algorithm** (a code sketch follows the list):
1. sample $\left\{\mathbf{s}_i, \mathbf{a}_i\right\}$ from $\pi_\theta(\mathbf{a} \mid \mathbf{s})$ (run it on the robot)
2. fit $\hat{V}_\phi^\pi(\mathbf{s})$ to sampled reward sums
3. evaluate $\hat{A}^\pi\left(\mathbf{s}_i, \mathbf{a}_i\right)=r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}_i^{\prime}\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_i\right)$
4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta\left(\mathbf{a}_i \mid \mathbf{s}_i\right) \hat{A}^\pi\left(\mathbf{s}_i, \mathbf{a}_i\right)$
5. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
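A hedged sketch of one iteration of the batch algorithm, reusing the `value_net`/`fit_value` pieces from the training sketch above and assuming a `policy_net` that maps states to action logits (environment interaction for step 1 is omitted):

```python
import torch

# One batch actor-critic update (sketch). Tensors: states [B, obs_dim],
# actions [B], next_states [B, obs_dim], rewards [B], returns [B].
def batch_actor_critic_step(policy_net, policy_opt, value_net, fit_value,
                            states, actions, next_states, rewards, returns,
                            gamma=0.99):
    fit_value(states, returns)                               # step 2: fit V_phi to reward sums
    with torch.no_grad():                                    # step 3: advantage estimates
        adv = (rewards + gamma * value_net(next_states).squeeze(-1)
               - value_net(states).squeeze(-1))
    log_prob = torch.distributions.Categorical(              # step 4: policy gradient
        logits=policy_net(states)).log_prob(actions)
    loss = -(log_prob * adv).mean()                          # ascent on J via descent on -J
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()                                        # step 5
```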
**online actor-critic algorithm** (a code sketch follows the list):
1. take action $\mathbf{a} \sim \pi_\theta(\mathbf{a} \mid \mathbf{s})$, get $\left(\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime}, r\right)$
2. update $\hat{V}_\phi^\pi$ using target $r+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)$
3. evaluate $\hat{A}^\pi(\mathbf{s}, \mathbf{a})=r(\mathbf{s}, \mathbf{a})+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)-\hat{V}_\phi^\pi(\mathbf{s})$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \hat{A}^\pi(\mathbf{s}, \mathbf{a})$
5. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
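The corresponding single-transition update, with the bootstrapped TD target for the critic (again a sketch; network and optimizer names are assumptions, and the inputs are tensors):

```python
import torch
import torch.nn.functional as F

# One online actor-critic update from a single transition (s, a, s', r).
def online_actor_critic_step(policy_net, policy_opt, value_net, value_opt,
                             s, a, s_next, r, gamma=0.99):
    # step 2: move V_phi(s) toward the bootstrapped target r + gamma * V_phi(s')
    target = (r + gamma * value_net(s_next).squeeze(-1)).detach()
    v_loss = F.mse_loss(value_net(s).squeeze(-1), target)
    value_opt.zero_grad(); v_loss.backward(); value_opt.step()
    # step 3: advantage estimate from the updated critic
    with torch.no_grad():
        adv = r + gamma * value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)
    # steps 4-5: single-sample policy gradient step
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    pi_loss = -log_prob * adv
    policy_opt.zero_grad(); pi_loss.backward(); policy_opt.step()
```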
# Actor-critic design decisions
## Architecture design
online actor-critic algorithm:
1. take action $\mathbf{a} \sim \pi_\theta(\mathbf{a} \mid \mathbf{s})$, get $\left(\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime}, r\right)$
2. update $\hat{V}_\phi^\pi$ using target $r+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)$
3. evaluate $\hat{A}^\pi(\mathbf{s}, \mathbf{a})=r(\mathbf{s}, \mathbf{a})+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}^{\prime}\right)-\hat{V}_\phi^\pi(\mathbf{s})$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \hat{A}^\pi(\mathbf{s}, \mathbf{a})$
5. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
**Two-Network design**

+: <font color = 'green'>simple \& stable</font>
-: <font color = 'red'>no shared features between actor \& critic</font>
**Shared-Network design**

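The shared design lets the actor and critic reuse features learned by a common trunk, at the price of two losses flowing into the same parameters, which tends to make training less stable and harder to tune. A minimal sketch of such a network (all names are illustrative):

```python
import torch.nn as nn

# Sketch of a shared-network design: one trunk, two heads.
class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: logits of pi(a|s)
        self.value_head = nn.Linear(hidden, 1)           # critic: estimate of V^pi(s)

    def forward(self, s):
        h = self.trunk(s)                                # shared features
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```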
## Online actor-critic in practice

> step 5 (gradient descent/ascent) on a single sample does not work well (the gradient estimate is too high-variance) --> it works best with a batch (e.g. from parallel workers), as sketched below
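One simple way to form such a batch is to step several environment copies in parallel and do one update per synchronized batch of transitions; a rough sketch, where `envs`, `select_action`, and the Gym-style `step()` API are assumptions, not part of the notes:

```python
# Sketch: build an update batch from several parallel workers instead of a single sample.
def collect_parallel_transitions(envs, observations, policy_net):
    batch = []
    for env, obs in zip(envs, observations):
        action = select_action(policy_net, obs)          # a ~ pi_theta(a|s), hypothetical helper
        next_obs, reward, done, info = env.step(action)  # Gym-style step
        batch.append((obs, action, next_obs, reward))
    return batch  # perform one actor-critic update on this whole batch
```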
### Variants
#### 1. off-policy online actor-critic

<font color = 'red'>1. The replay buffer is filled in advance with transitions whose actions $\mathbf{a}$ were generated by older policies for the stored states $\mathbf{s}$.
2. In later iterations the new policy generates new actions $\mathbf{a}$ for the same states $\mathbf{s}$, and the new $\left(\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime}, r\right)$ tuples are pushed into the replay buffer, **but the states $\mathbf{s}$ themselves are not refreshed (no interaction with the environment)**.
**Problems**:
1. In step 3, the $V^{\pi}(\mathbf{s}^{\prime})$ inside the target does not reflect the value of the actions produced by the latest policy ==> do not learn $V$, learn $Q$ instead

2. Likewise, the action $a_{i}$ in step 5 should come from the latest policy rather than from the replay buffer: $\mathbf{a}_i^\pi \sim \pi_\theta\left(\mathbf{a} \mid \mathbf{s}_i\right)$
3. in practice, step 5 does not use the advantage function but instead uses $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta\left(\mathbf{a}_i^\pi \mid \mathbf{s}_i\right) \hat{Q}^\pi\left(\mathbf{s}_i, \mathbf{a}_i^\pi\right)$ (see the sketch after this list)
- higher variance, but convenient
- the higher variance is acceptable here: we can reduce it by sampling many new actions (while keeping the states unchanged)</font>
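A minimal sketch of the resulting off-policy update, assuming a replay buffer with a `sample()` method, a Q-network callable as `q_net(s, a)`, and a discrete-action policy (all of these names are illustrative):

```python
import torch
import torch.nn.functional as F

# Off-policy actor-critic step (sketch): old (s, a, s', r) tuples come from a
# replay buffer, but fresh actions from the current policy are used for the
# Q target and for the actor update.
def off_policy_actor_critic_step(q_net, q_opt, policy_net, policy_opt,
                                 buffer, gamma=0.99):
    s, a, s_next, r = buffer.sample()
    # critic: learn Q instead of V; the target action comes from the current policy
    with torch.no_grad():
        a_next = torch.distributions.Categorical(logits=policy_net(s_next)).sample()
        target = r + gamma * q_net(s_next, a_next)
    q_loss = F.mse_loss(q_net(s, a), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # actor: use a fresh action a^pi ~ pi_theta(a|s_i), not the buffer action
    dist = torch.distributions.Categorical(logits=policy_net(s))
    a_pi = dist.sample()
    pi_loss = -(dist.log_prob(a_pi) * q_net(s, a_pi).detach()).mean()
    policy_opt.zero_grad(); pi_loss.backward(); policy_opt.step()
```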
# critic as baselines in policy gradients
## critic as state-dependent baselines

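A common way to write the resulting estimator: keep the Monte Carlo reward-to-go but subtract the learned critic as a state-dependent baseline, which lowers variance without biasing the gradient:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\left(\left(\sum_{t^{\prime}=t}^T \gamma^{t^{\prime}-t} r\left(\mathbf{s}_{i, t^{\prime}}, \mathbf{a}_{i, t^{\prime}}\right)\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_{i, t}\right)\right)
$$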
## critic as action-dependent baselines

## Eligibility traces \& n-step returns

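Following the usual definition, the $n$-step estimator sums $n$ actual rewards before bootstrapping with the critic, so small $n$ gives low variance but more bias, and large $n$ the reverse:
$$
\hat{A}_n^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=\sum_{t^{\prime}=t}^{t+n-1} \gamma^{t^{\prime}-t} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)+\gamma^n \hat{V}_\phi^\pi\left(\mathbf{s}_{t+n}\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_t\right)
$$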
## Generalized advantage estimation

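Generalized advantage estimation (Schulman et al.) averages the $n$-step estimators with exponentially decaying weights controlled by $\lambda$, written in terms of the one-step TD errors $\delta_{t^{\prime}}$:
$$
\hat{A}_{\mathrm{GAE}}^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)=\sum_{t^{\prime}=t}^{\infty}(\gamma \lambda)^{t^{\prime}-t} \delta_{t^{\prime}}, \quad \delta_{t^{\prime}}=r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)+\gamma \hat{V}_\phi^\pi\left(\mathbf{s}_{t^{\prime}+1}\right)-\hat{V}_\phi^\pi\left(\mathbf{s}_{t^{\prime}}\right)
$$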