📖 Value Function Methods (1)
==Question: Can we omit policy gradient completely?==
- $A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$ : how much better is $\mathbf{a}_t$ than the average action according to $\pi$
- $\arg \max _{\mathbf{a}_t} A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right):$ best action from $\mathbf{s}_t$, if we then follow $\pi$
- these are regardless of what $\pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$ is!
let's forget policies:
**basic idea**: for each iteration, build a new policy by: $\pi^{\prime}\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\left\{\begin{array}{l}1 \text { if } \mathbf{a}_t=\arg \max _{\mathbf{a}_t} A^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right) \\ 0 \text { otherwise }\end{array}\right.$
- at least as good as $\pi$ (and probably better)
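
In the tabular case, this improvement step is just a per-state argmax. A minimal sketch, assuming the advantage estimates live in an array `A` of shape `[num_states, num_actions]` (the array and its shape are illustrative assumptions):

```python
import numpy as np

def greedy_policy(A):
    """Deterministic pi'(a|s): probability 1 on argmax_a A^pi(s, a), 0 elsewhere."""
    num_states, num_actions = A.shape
    pi = np.zeros((num_states, num_actions))
    pi[np.arange(num_states), A.argmax(axis=1)] = 1.0
    return pi
```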
# Policy Iteration

To evaluate $A^\pi(\mathbf{s}, \mathbf{a})$: as before: $A^\pi(\mathbf{s}, \mathbf{a})=r(\mathbf{s}, \mathbf{a})+\gamma E\left[V^\pi\left(\mathbf{s}^{\prime}\right)\right]-V^\pi(\mathbf{s})$
let's evaluate $V^\pi(\mathbf{s})$!
## 1) Dynamic Programming

### Policy Iteration with Dynamic Programming
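A sketch of the standard loop (assuming known dynamics $p\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$ and a state space small enough to keep $V^\pi$ in a table):
1. policy evaluation: evaluate $V^\pi(\mathbf{s})$ (which gives $A^\pi(\mathbf{s}, \mathbf{a})$)
2. policy improvement: set $\pi \leftarrow \pi^{\prime}$ (the argmax policy above)

Step 1 can be done with the bootstrapped update $V^\pi(\mathbf{s}) \leftarrow E_{\mathbf{a} \sim \pi(\mathbf{a} \mid \mathbf{s})}\left[r(\mathbf{s}, \mathbf{a})+\gamma E_{\mathbf{s}^{\prime} \sim p\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)}\left[V^\pi\left(\mathbf{s}^{\prime}\right)\right]\right]$, swept over all states until the values stop changing.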

### Even Simpler Dynamic Programming
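If we are going to take the argmax anyway, we can fold the improvement step into the value update and skip the explicit policy entirely (a sketch of the standard simplification, i.e. value iteration):
1. set $Q(\mathbf{s}, \mathbf{a}) \leftarrow r(\mathbf{s}, \mathbf{a})+\gamma E\left[V\left(\mathbf{s}^{\prime}\right)\right]$
2. set $V(\mathbf{s}) \leftarrow \max _{\mathbf{a}} Q(\mathbf{s}, \mathbf{a})$

A tabular Python sketch; the reward table `R`, transition tensor `P`, and discount `gamma` are illustrative assumptions:

```python
import numpy as np

def value_iteration(R, P, gamma, num_iters=1000):
    """Tabular value iteration.

    R: rewards, shape [S, A]                -- r(s, a)
    P: transition probabilities, [S, A, S]  -- p(s' | s, a)
    """
    V = np.zeros(R.shape[0])
    for _ in range(num_iters):
        Q = R + gamma * (P @ V)   # Q(s, a) = r(s, a) + gamma * E[V(s')]
        V = Q.max(axis=1)         # V(s) <- max_a Q(s, a)
    return V
```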

## 2) Fitted Value Iteration
how do we represent $V(\mathbf{s})$?
- big table, one entry for each discrete $\mathbf{s}$
- neural net function $V: \mathcal{S} \rightarrow \mathbb{R}$

problem with the big table: too many discrete states once the state space is large or continuous (curse of dimensionality), so in practice we fit a neural net
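
A minimal sketch of such a network in PyTorch (the layer sizes and `obs_dim` are arbitrary assumptions):

```python
import torch.nn as nn

class ValueNet(nn.Module):
    """V_phi: S -> R as a small MLP."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):               # s: [batch, obs_dim]
        return self.net(s).squeeze(-1)  # V_phi(s): [batch]
```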

### fitted value iteration algorithm
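The usual two-step loop (sketch): fit $V_\phi$ to bootstrapped max targets.
1. set $y_i \leftarrow \max _{\mathbf{a}_i}\left(r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma E\left[V_\phi\left(\mathbf{s}_i^{\prime}\right)\right]\right)$
2. set $\phi \leftarrow \arg \min _\phi \frac{1}{2} \sum_i\left\|V_\phi\left(\mathbf{s}_i\right)-y_i\right\|^2$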

> step 1 needs to know the outcome of each candidate action (the transition dynamics)! What if we don't know them? --> fitted Q-iteration
## 3) Fitted Q-Iteration
### fitted Q-iteration algorithm
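Work with $Q_\phi(\mathbf{s}, \mathbf{a})$ instead of $V_\phi(\mathbf{s})$, so the max over actions happens inside the learned function and no dynamics model is needed (sketch of the usual loop):
1. set $y_i \leftarrow r\left(\mathbf{s}_i, \mathbf{a}_i\right)+\gamma E\left[V_\phi\left(\mathbf{s}_i^{\prime}\right)\right]$, approximating $E\left[V_\phi\left(\mathbf{s}_i^{\prime}\right)\right] \approx \max _{\mathbf{a}^{\prime}} Q_\phi\left(\mathbf{s}_i^{\prime}, \mathbf{a}^{\prime}\right)$
2. set $\phi \leftarrow \arg \min _\phi \frac{1}{2} \sum_i\left\|Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right)-y_i\right\|^2$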

<font color = 'green'>+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient</font>
<font color = 'red'>- no convergence guarantees for non-linear function approximation (more on this later)</font>
==Full fitted Q-iteration algorithm:==

- hyperparameters: dataset size $N$, collection policy, number of iterations $K$, number of gradient steps $S$
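
A compact Python sketch showing where $N$, $K$, and $S$ enter the loop; the Gymnasium-style `env` (discrete actions), the Q-network `q_net` (state batch → per-action values), and the epsilon-greedy collection policy are all illustrative assumptions, not part of the lecture's specification:

```python
import numpy as np
import torch

def fitted_q_iteration(env, q_net, *, N=10_000, K=10, S=100,
                       gamma=0.99, eps=0.1, lr=1e-3, outer_iters=50):
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(outer_iters):
        # 1. collect a dataset {(s_i, a_i, r_i, s'_i)} of size N using a
        #    collection policy (epsilon-greedy here); off-policy data is fine
        data = []
        s, _ = env.reset()
        for _ in range(N):
            with torch.no_grad():
                q = q_net(torch.as_tensor(s, dtype=torch.float32))
            a = env.action_space.sample() if np.random.rand() < eps else int(q.argmax())
            s2, r, terminated, truncated, _ = env.step(a)
            data.append((s, a, r, s2, float(terminated)))
            s = s2 if not (terminated or truncated) else env.reset()[0]
        states = torch.as_tensor(np.array([d[0] for d in data]), dtype=torch.float32)
        actions = torch.as_tensor([d[1] for d in data], dtype=torch.int64)
        rewards = torch.as_tensor([d[2] for d in data], dtype=torch.float32)
        next_states = torch.as_tensor(np.array([d[3] for d in data]), dtype=torch.float32)
        dones = torch.as_tensor([d[4] for d in data], dtype=torch.float32)

        for _ in range(K):  # K inner iterations per dataset
            # 2. y_i <- r_i + gamma * max_a' Q_phi(s'_i, a')   (0 beyond terminal states)
            with torch.no_grad():
                y = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values
            # 3. S gradient steps on 1/2 * sum_i ||Q_phi(s_i, a_i) - y_i||^2
            for _ in range(S):
                q_sa = q_net(states).gather(1, actions[:, None]).squeeze(1)
                loss = 0.5 * ((q_sa - y) ** 2).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
    return q_net
```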
# Summary
- value-based methods

- Don't learn a policy explicitly
- Just learn value or Q-function
	- If we have the value/Q-function, we implicitly have a policy: act greedily, $\arg \max _{\mathbf{a}_t} Q\left(\mathbf{s}_t, \mathbf{a}_t\right)$