# Notation

$\mathbf{s}_t$: environment state
$\mathbf{o}_t$: observation
$\mathbf{a}_t$: action
$r\left(\mathbf{s}_t, \mathbf{a}_t\right)$: reward function
$\pi_\theta\left(\mathbf{a}_t \mid \mathbf{o}_t\right)$: policy
$\pi_\theta\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$: policy (fully observed)
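
To make the symbols concrete, here is a minimal sketch of the agent-environment loop they describe. The toy environment, its interface (`reset`, `observe`, `reward`, `step`), and the linear-Gaussian policy are all illustrative assumptions, not from the notes or any particular RL library.

```python
# Hypothetical sketch: how s_t, o_t, a_t, r(s_t, a_t), and pi_theta
# fit together in one rollout. Nothing here is a real library API.

import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Toy environment: s_{t+1} = s_t + a_t, with noisy observations."""
    def reset(self):
        return np.zeros(2)                              # initial state s_1
    def observe(self, s_t):
        return s_t + 0.1 * rng.normal(size=2)           # o_t: noisy view of s_t
    def reward(self, s_t, a_t):
        return -np.sum(s_t**2) - 0.1 * np.sum(a_t**2)   # r(s_t, a_t)
    def step(self, s_t, a_t):
        return s_t + a_t                                # next state s_{t+1}

def pi_theta(theta, o_t):
    """pi_theta(a_t | o_t): a linear-Gaussian policy as a stand-in."""
    W, b = theta
    mean = W @ o_t + b
    return mean + 0.1 * rng.normal(size=mean.shape)     # sample a_t

env = ToyEnv()
theta = (-0.5 * np.eye(2), np.zeros(2))  # policy parameters theta
s_t = env.reset()
for t in range(10):
    o_t = env.observe(s_t)        # o_t: observation of s_t
    a_t = pi_theta(theta, o_t)    # a_t ~ pi_theta(a_t | o_t)
    r_t = env.reward(s_t, a_t)    # r(s_t, a_t): reward
    s_t = env.step(s_t, a_t)      # environment transitions
    print(f"t={t}  r={r_t:.3f}")
```

In the fully observed case, the policy conditions on the state directly, i.e. `pi_theta(theta, s_t)` in place of `pi_theta(theta, o_t)` above.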
