📖 Policy Gradient
# finite horizon case
goal:

$$
J(\theta)=E_{\tau \sim p_\theta(\tau)}\left[\sum_t r\left(\mathbf{s}_t, \mathbf{a}_t\right)\right] \approx \frac{1}{N} \sum_i \sum_t r\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)
$$
Algorithm:
1. run the policy in the real world $N$ times to collect $N$ trajectories (larger $N$ gives a more accurate estimate)
2. evaluate $J(\theta)$ and $\nabla_\theta J(\theta)$ ==using Monte Carlo estimation== (see the sketch right after this list)
3. update policy: $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
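A minimal sketch of the Monte Carlo estimate in step 2, assuming the $N$ sampled trajectories are stored as a list of per-step reward sequences (the names `rewards` and `estimate_J` are illustrative, not fixed by these notes):

```python
import numpy as np

def estimate_J(rewards):
    """Monte Carlo estimate of J(theta).

    rewards[i][t] holds r(s_{i,t}, a_{i,t}) for trajectory i at step t,
    so the estimate is (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t}).
    """
    return float(np.mean([np.sum(traj) for traj in rewards]))
```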
# REINFORCE Algorithm

1. sample $\left\{\tau^i\right\}$ from $\pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$ (run the policy)
2. $\nabla_\theta J(\theta) \approx \sum_i\left(\sum_t \nabla_\theta \log \pi_\theta\left(\mathbf{a}_t^i \mid \mathbf{s}_t^i\right)\right)\left(\sum_t r\left(\mathbf{s}_t^i, \mathbf{a}_t^i\right)\right)$
3. $\theta \leftarrow \theta+\alpha \nabla_\theta J(\theta)$
Concretely, computing the gradient:
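A hedged PyTorch sketch of steps 2–3, using the common "pseudo-loss" trick: build a scalar whose autograd gradient equals the REINFORCE estimator above, then take one gradient ascent step. `policy_net`, `optimizer`, and the trajectory layout are assumptions for illustration (discrete actions), not part of the original notes.

```python
import torch

def reinforce_step(policy_net, optimizer, trajectories):
    """One REINFORCE update from a batch of sampled trajectories.

    Each element of `trajectories` is assumed to be a dict with
      'states'  : float tensor [T, state_dim]
      'actions' : long tensor  [T]   (discrete actions)
      'rewards' : float tensor [T]
    """
    pseudo_loss = 0.0
    for traj in trajectories:
        logits = policy_net(traj["states"])                  # [T, n_actions]
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(traj["actions"])           # log pi_theta(a_t | s_t)
        total_reward = traj["rewards"].sum()                 # sum_t r(s_t, a_t)
        # Minus sign: minimizing this loss performs gradient *ascent* on J(theta).
        pseudo_loss = pseudo_loss - log_probs.sum() * total_reward
    # Averaging over trajectories only rescales the step size alpha.
    pseudo_loss = pseudo_loss / len(trajectories)

    optimizer.zero_grad()
    pseudo_loss.backward()   # autograd yields the sampled policy gradient
    optimizer.step()         # theta <- theta + alpha * grad_theta J(theta)
```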

# Understanding Policy Gradient
supervised learning analogy: inputs X: $\mathbf{s}_{t}$, labels Y: $\mathbf{a}_{t}$
MLE:
$$
\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\right)
$$
Policy Gradient:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\right)\left(\sum_{t=1}^T r\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)\right)
$$
In supervised learning, maximum likelihood aims to make the given training data as likely as possible. In RL the situation is different: the sampled actions are not necessarily good (they were produced by an earlier, imperfect policy), so we weight each log-likelihood gradient term by the reward. ==The policy gradient can therefore be seen as a weighted version of the MLE gradient.==
- good trajectories (high reward) ==> made more likely
- bad trajectories (low reward) ==> made less likely
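The weighted-MLE view is visible directly when both objectives are written as pseudo-losses; the toy tensors below are illustrative placeholders for one sampled trajectory.

```python
import torch

# Stand-ins for one sampled trajectory (illustrative values only).
log_probs = torch.randn(10, requires_grad=True)   # log pi_theta(a_t | s_t), t = 1..T
total_reward = torch.tensor(3.7)                  # sum_t r(s_t, a_t)

# Maximum likelihood: every sampled action is treated as equally "correct".
loss_ml = -log_probs.sum()

# Policy gradient: the same log-likelihood terms, weighted by the trajectory's
# total reward, so high-reward trajectories are made more likely and
# low-reward ones less so.
loss_pg = -(total_reward * log_probs.sum())
```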
## Partial Observability
In the policy gradient derivation, the Markov property is never actually used! So we can apply policy gradient in partially observed MDPs without modification, conditioning the policy on observations instead of states.
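Concretely, the same estimator applies with observations $\mathbf{o}_{i, t}$ in place of states inside the policy:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{o}_{i, t}\right)\right)\left(\sum_{t=1}^T r\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)\right)
$$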
## Problems with Policy Gradient

# Reducing Variance
## Causality trick
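The standard statement of the trick: the policy at time $t$ cannot affect rewards already collected at earlier steps, so each log-probability is weighted only by the reward-to-go $\hat{Q}_{i, t}$ rather than the full return, which lowers the variance of the estimator:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\left(\sum_{t^{\prime}=t}^T r\left(\mathbf{s}_{i, t^{\prime}}, \mathbf{a}_{i, t^{\prime}}\right)\right)=\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right) \hat{Q}_{i, t}
$$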

## Baselines trick
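The standard form: subtract a constant baseline $b$ (e.g. the average return) from the reward. The estimator stays unbiased because $E\left[\nabla_\theta \log p_\theta(\tau)\, b\right]=b \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\, d\tau=b \int \nabla_\theta p_\theta(\tau)\, d\tau=b\, \nabla_\theta 1=0$, while the variance is typically reduced:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta\left(\tau_i\right)\left[r\left(\tau_i\right)-b\right], \quad b=\frac{1}{N} \sum_{i=1}^N r\left(\tau_i\right)
$$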

# Off-policy policy gradient
policy gradient is on-policy:
- the expectation in $\nabla_\theta J(\theta)=E_{\tau \sim p_\theta(\tau)}[\cdot]$ is taken under the current policy's own trajectory distribution, so the samples must come from $\pi_\theta$
- every time the policy changes (i.e., after each gradient step), we must run the new policy to collect new samples in order to estimate the objective and its gradient
- Neural networks change only a little bit with each gradient step
- On-policy learning can be extremely inefficient!
<font color='red'>what if we don't have samples from $p_\theta(\tau) ?$
(we have samples from some $\bar{p}(\tau)$ instead)
- $\bar{p}(\tau)$ could be a previous policy (so we can reuse old samples)</font>
**Trick: importance sampling**:
This trick is used when we want the expectation of a function of a random variable $x$ under $p(x)$, i.e. $E_{x \sim p(x)}[f(x)]$, but we only have samples drawn from some other distribution $q(x)$.
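The identity behind the trick (assuming $q(x)>0$ wherever $p(x) f(x) \neq 0$):

$$
E_{x \sim p(x)}[f(x)]=\int p(x) f(x)\, dx=\int q(x) \frac{p(x)}{q(x)} f(x)\, dx=E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]
$$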

- this derivation involves no approximation: the importance-weighted estimator is unbiased
- the two expectations are equal, but their variances can be very different
## derive the gradient
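Applying importance sampling at the trajectory level (a sketch of the standard derivation, with samples drawn from the old policy $\pi_\theta$ and a new parameter vector $\theta^{\prime}$ being evaluated), the initial-state and transition terms cancel in the ratio, leaving only policy terms:

$$
J\left(\theta^{\prime}\right)=E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta^{\prime}}(\tau)}{p_\theta(\tau)} r(\tau)\right], \quad \frac{p_{\theta^{\prime}}(\tau)}{p_\theta(\tau)}=\prod_{t=1}^T \frac{\pi_{\theta^{\prime}}\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}{\pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}
$$

Differentiating with respect to $\theta^{\prime}$ (only the numerator of the importance weight depends on it) gives

$$
\nabla_{\theta^{\prime}} J\left(\theta^{\prime}\right)=E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta^{\prime}}(\tau)}{p_\theta(\tau)} \nabla_{\theta^{\prime}} \log p_{\theta^{\prime}}(\tau)\, r(\tau)\right]=E_{\tau \sim p_\theta(\tau)}\left[\left(\prod_{t=1}^T \frac{\pi_{\theta^{\prime}}\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}{\pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)}\right)\left(\sum_{t=1}^T \nabla_{\theta^{\prime}} \log \pi_{\theta^{\prime}}\left(\mathbf{a}_t \mid \mathbf{s}_t\right)\right)\left(\sum_{t=1}^T r\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)\right]
$$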

