From 0d43b83ab69b07a9a8466e950f7449b1453c57a6 Mon Sep 17 00:00:00 2001
From: David Young <46375780+yyysjz1997@users.noreply.github.com>
Date: Fri, 5 Feb 2021 23:44:11 +0800
Subject: [PATCH] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index b58d456..83043bd 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -104,4 +104,13 @@
   \nabla_{\theta}J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[\nabla_{\theta}\log p_{\theta}(\tau)r(\tau)] = E_{\tau \sim p_{\theta}}[(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^T r(s_t,a_t))] = \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}\log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^T r(s_{i,t},a_{i,t}))]
   $$
+  - Aloof interviewer: Could you describe some of the tricks you know for policy-gradient-based optimization?
+
+    Answer:
+
+    1. **Add a baseline:** If every reward is greater than 0, then every sampled state-action pair has its probability pushed up. To deal with this, we usually subtract a term b, called the baseline, from the reward. After subtracting b, the term $R(\tau^n)-b$ can be either positive or negative. So if the total reward $R(\tau^n)$ is greater than b, we raise the probability of that action; if the total reward is less than b, then even if it is positive, a small positive reward is still not good enough, and we lower the probability of that term. If $R(\tau^n) < b$, the probability of taking that action in that state should go down.
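For reference beyond the patch text itself, below is a minimal NumPy sketch of the baseline trick the added answer describes: each trajectory's summed log-probability gradient is weighted by $R(\tau^n)-b$, with the mean return used as b. The function name `policy_gradient_with_baseline`, its argument layout, and the mean-return baseline are illustrative assumptions, not code from the tutorial.

```python
import numpy as np

def policy_gradient_with_baseline(log_prob_grads, returns):
    """REINFORCE-style gradient estimate with a baseline.

    log_prob_grads: one array per trajectory, holding
        sum_t grad_theta log pi_theta(a_t | s_t) for that trajectory.
    returns: total reward R(tau^n) of each trajectory.
    """
    returns = np.asarray(returns, dtype=float)
    # Use the average return as the baseline b, so R(tau^n) - b is
    # positive for better-than-average trajectories and negative otherwise.
    b = returns.mean()
    grad = np.zeros_like(np.asarray(log_prob_grads[0], dtype=float))
    for g, R in zip(log_prob_grads, returns):
        grad += np.asarray(g, dtype=float) * (R - b)
    # Average over the N sampled trajectories, as in the 1/N sum above.
    return grad / len(returns)

# Toy usage with made-up numbers: two trajectories, a 3-dim parameter vector.
g1, g2 = np.array([0.2, -0.1, 0.3]), np.array([-0.4, 0.5, 0.1])
print(policy_gradient_with_baseline([g1, g2], returns=[10.0, 4.0]))
```

With this weighting, trajectories whose return exceeds the baseline contribute a positive push to their actions' log-probabilities, while below-baseline trajectories contribute a negative one, which is exactly the sign behavior the added answer argues for.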