Update chapter4_questions&keywords.md

This commit is contained in:
Yiyuan Yang
2021-05-24 15:04:19 +08:00
committed by GitHub
parent a0633fba6f
commit 96fac5674c

View File

@@ -83,10 +83,11 @@
\theta^* = \theta + \alpha\nabla J({\theta})
$$
所以我们仅仅需要计算(更新)$\nabla J({\theta})$ ,也就是计算回报函数 $J({\theta})$ 关于 $\theta$ 的梯度,也就是策略梯度,计算方法如下:
$$
\nabla_{\theta}J(\theta) = \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau}
=\int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau}=E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)]
$$
$$\begin{aligned}
\nabla_{\theta}J(\theta) &= \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau} \\
&= \int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau} \\
&= E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)]
\end{aligned}$$
接着我们继续讲上式展开,对于 $p_{\theta}(\tau)$ ,即 $p_{\theta}(\tau|{\theta})$ :
$$
p_{\theta}(\tau|{\theta}) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t,a_t)