diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md index cea548a..cba8d29 100644 --- a/docs/chapter4/chapter4_questions&keywords.md +++ b/docs/chapter4/chapter4_questions&keywords.md @@ -100,9 +100,11 @@ \nabla logp_{\theta}(\tau|{\theta}) = \sum_{t=1}^T \nabla_{\theta}log \pi_{\theta}(a_t|s_t) $$ 带入第三个式子,可以将其化简为: - $$ - \nabla_{\theta}J(\theta) =E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] = E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ = \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))] - $$ + $$\begin{aligned} + \nabla_{\theta}J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\ + &= E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ + &= \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))] + \end{aligned}$$ - 高冷的面试官:可以说一下你了解到的基于梯度策略的优化时的小技巧吗?