diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md index ec4f328..eb04dd4 100644 --- a/docs/chapter4/chapter4_questions&keywords.md +++ b/docs/chapter4/chapter4_questions&keywords.md @@ -83,10 +83,11 @@ \theta^* = \theta + \alpha\nabla J({\theta}) $$ 所以我们仅仅需要计算(更新)$\nabla J({\theta})$ ,也就是计算回报函数 $J({\theta})$ 关于 $\theta$ 的梯度,也就是策略梯度,计算方法如下: - $$ - \nabla_{\theta}J(\theta) = \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau} - =\int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau}=E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] - $$ + $$\begin{aligned} + \nabla_{\theta}J(\theta) &= \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau} \\ + &= \int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau} \\ + &= E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] + \end{aligned}$$ 接着我们继续讲上式展开,对于 $p_{\theta}(\tau)$ ,即 $p_{\theta}(\tau|{\theta})$ : $$ p_{\theta}(\tau|{\theta}) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t,a_t)