From 90b91444794d715bce9ffa357b76312ff1930148 Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 09:27:40 +0800
Subject: [PATCH 1/8] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index 119e976..cea548a 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -101,7 +101,7 @@
   $$
   Substituting this into the third equation, it can be simplified to:
   $$
-  \nabla_{\theta}J(\theta) =E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] = E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] = \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
+  \nabla_{\theta}J(\theta) =E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] = E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ = \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
   $$
 
 - Aloof interviewer: Can you tell me about the tricks you know for policy-gradient-based optimization?

From cfcc10815a05b533255c5b0a0019aade852eca87 Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 09:38:15 +0800
Subject: [PATCH 2/8] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index cea548a..cba8d29 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -100,9 +100,11 @@
   \nabla logp_{\theta}(\tau|{\theta}) = \sum_{t=1}^T \nabla_{\theta}log \pi_{\theta}(a_t|s_t)
   $$
   Substituting this into the third equation, it can be simplified to:
-  $$
-  \nabla_{\theta}J(\theta) =E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] = E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ = \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
-  $$
+  $$\begin{aligned}
+  \nabla_{\theta}J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\
+  &= E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\
+  &= \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
+  \end{aligned}$$
 
 - Aloof interviewer: Can you tell me about the tricks you know for policy-gradient-based optimization?
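The quantity being reformatted in the two patches above is the sampled (Monte Carlo) policy-gradient estimator. To make it concrete, here is a minimal NumPy sketch of that estimator under an assumed linear-softmax policy; the environment-free random trajectories and names such as `policy_gradient_estimate` are illustrative assumptions, not something taken from the chapter.

```python
# Minimal NumPy sketch of the sampled policy-gradient estimator in the equation above:
#   grad J(theta) ~= (1/N) * sum_i [ (sum_t grad log pi(a_{i,t}|s_{i,t})) * (sum_t r(s_{i,t},a_{i,t})) ]
# The linear-softmax policy, the random trajectories, and all names below are
# illustrative assumptions, not part of the chapter text.
import numpy as np


def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def grad_log_pi(theta, state, action):
    """Gradient of log pi(action|state) for pi(a|s) = softmax(theta @ s)."""
    probs = softmax(theta @ state)
    grad = -np.outer(probs, state)  # -pi(a'|s) * s for every action a'
    grad[action] += state           # +s for the action actually taken
    return grad


def policy_gradient_estimate(theta, trajectories):
    """Average of (sum_t grad log pi) * (undiscounted return) over N trajectories."""
    total = np.zeros_like(theta)
    for traj in trajectories:       # traj is a list of (state, action, reward) tuples
        grad_sum = np.zeros_like(theta)
        ret = 0.0
        for state, action, reward in traj:
            grad_sum += grad_log_pi(theta, state, action)
            ret += reward
        total += grad_sum * ret
    return total / len(trajectories)


# Tiny usage example with random data, just to show the shapes and the ascent step.
rng = np.random.default_rng(0)
state_dim, n_actions, alpha = 4, 3, 0.01
theta = rng.normal(size=(n_actions, state_dim))
trajectories = [
    [(rng.normal(size=state_dim), int(rng.integers(n_actions)), float(rng.normal()))
     for _ in range(5)]
    for _ in range(10)
]
theta = theta + alpha * policy_gradient_estimate(theta, trajectories)  # theta <- theta + alpha * grad J
```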
From 14a00afcc0c3fe6f5387b7950da74368bf9684bc Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 09:41:07 +0800
Subject: [PATCH 3/8] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index cba8d29..29c4a19 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -101,9 +101,9 @@
   $$
   Substituting this into the third equation, it can be simplified to:
   $$\begin{aligned}
-  \nabla_{\theta}J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\
-  &= E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\
-  &= \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
+  \nabla_{\theta}J(\theta) &=& E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\
+  &=& E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\
+  &=& \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
   \end{aligned}$$
 
 - Aloof interviewer: Can you tell me about the tricks you know for policy-gradient-based optimization?

From beafd08c46ce6438a2cc97f73ae2c2898b4e7808 Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 09:43:57 +0800
Subject: [PATCH 4/8] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index 29c4a19..ec4f328 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -101,9 +101,9 @@
   $$
   Substituting this into the third equation, it can be simplified to:
   $$\begin{aligned}
-  \nabla_{\theta}J(\theta) &=& E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\
-  &=& E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\
-  &=& \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
+  \nabla_{\theta}J(\theta) &= E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)] \\
+  &= E_{\tau \sim p_{\theta}}[(\nabla_{\theta}log\pi_{\theta}(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\
+  &= \frac{1}{N}\sum_{i=1}^N[(\sum_{t=1}^T\nabla_{\theta}log \pi_{\theta}(a_{i,t}|s_{i,t}))(\sum_{t=1}^Nr(s_{i,t},a_{i,t}))]
   \end{aligned}$$
 
 - Aloof interviewer: Can you tell me about the tricks you know for policy-gradient-based optimization?

From 7fe103be4231990faad18aa4822ee94f366e2450 Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 15:00:25 +0800
Subject: [PATCH 5/8] Update chapter1_questions&keywords.md

---
 docs/chapter1/chapter1_questions&keywords.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/chapter1/chapter1_questions&keywords.md b/docs/chapter1/chapter1_questions&keywords.md
index 88d525e..4a57b13 100644
--- a/docs/chapter1/chapter1_questions&keywords.md
+++ b/docs/chapter1/chapter1_questions&keywords.md
@@ -74,6 +74,7 @@
 - What are the differences between policy-iteration-based and value-iteration-based reinforcement learning methods?
 
   Answer:
+
   1. In policy-iteration-based reinforcement learning, the agent formulates a policy over actions (deciding which action to take in a given state) and operates according to it; the reinforcement learning algorithm optimizes the policy directly, so that the resulting policy obtains the maximum reward. In value-iteration-based reinforcement learning, the agent does not need an explicit policy; it maintains a value table or value function and uses it to select the action with the highest value.
   2. Value-iteration-based methods can only be applied in discontinuous, discrete environments (such as Go or certain games); in scenarios with very large action sets or continuous actions (such as robot control), they can hardly learn good results (whereas policy-iteration-based methods can select continuous actions according to the learned policy);
   3. Value-iteration-based reinforcement learning algorithms include Q-learning, Sarsa, etc., while policy-iteration-based reinforcement learning algorithms include the policy gradient algorithm, etc.

From a0633fba6f6564ddfccb096143cd7e0d7bcb5ae7 Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 15:02:19 +0800
Subject: [PATCH 6/8] Update chapter3_questions&keywords.md

---
 docs/chapter3/chapter3_questions&keywords.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/chapter3/chapter3_questions&keywords.md b/docs/chapter3/chapter3_questions&keywords.md
index 01ea01c..5ced147 100644
--- a/docs/chapter3/chapter3_questions&keywords.md
+++ b/docs/chapter3/chapter3_questions&keywords.md
@@ -5,7 +5,6 @@
 - **P function and R function:** The P function reflects the state-transition probabilities, i.e., the randomness of the environment, and the R function is the reward function. However, we are usually in an unknown environment (i.e., the P function and R function are unknown).
 - **Q-table representation:** A tabular representation in which the columns are the agent's actions and the rows are the environment's states; each entry corresponds to the situation of the agent and the environment at a given moment, and the corresponding reward feedback is used to make choices. In general the Q-table is an already-trained table, but we can also update it after every single step, using the Q value of the next state to update the Q value of the current state (i.e., the temporal-difference method).
 - **Temporal difference (TD):** A way of updating the Q function (Q value), namely using the Q value of the next step, $Q(S_{t+1},A_{t+1})$, to update the Q value of the current step, $Q(S_t,A_t)$. The full update formula is: $Q(S_t,A_t) \larr Q(S_t,A_t) + \alpha [R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]$
-
 - **SARSA algorithm:** A single-step-update reinforcement learning algorithm that updates the state of the previous time step, and an on-policy method. Because every update of the value function requires the previous state, the previous action, the reward, the current state, and the action about to be taken, i.e., the values $(S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})$, it is called the SARSA algorithm. In each loop, the agent uses $(S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})$ to update the Q value (function) of the previous step.
 
 ## 2 Questions

From 96fac5674ca52aeaeebbc85e36617bc594fd22ac Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 15:04:19 +0800
Subject: [PATCH 7/8] Update chapter4_questions&keywords.md

---
 docs/chapter4/chapter4_questions&keywords.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/chapter4/chapter4_questions&keywords.md b/docs/chapter4/chapter4_questions&keywords.md
index ec4f328..eb04dd4 100644
--- a/docs/chapter4/chapter4_questions&keywords.md
+++ b/docs/chapter4/chapter4_questions&keywords.md
@@ -83,10 +83,11 @@
   \theta^* = \theta + \alpha\nabla J({\theta})
   $$
   So we only need to compute (update) $\nabla J({\theta})$, that is, the gradient of the return function $J({\theta})$ with respect to $\theta$, which is the policy gradient. It is computed as follows:
-  $$
-  \nabla_{\theta}J(\theta) = \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau}
-  =\int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau}=E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)]
-  $$
+  $$\begin{aligned}
+  \nabla_{\theta}J(\theta) &= \int {\nabla}_{\theta}p_{\theta}(\tau)r(\tau)d_{\tau} \\
+  &= \int p_{\theta}{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)d_{\tau} \\
+  &= E_{\tau \sim p_{\theta}(\tau)}[{\nabla}_{\theta}logp_{\theta}(\tau)r(\tau)]
+  \end{aligned}$$
   Next we expand the expression above further. For $p_{\theta}(\tau)$, i.e., $p_{\theta}(\tau|{\theta})$:
   $$
   p_{\theta}(\tau|{\theta}) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t,a_t)

From 28db2b58e1f4ed8c6797058527639caf2e57130f Mon Sep 17 00:00:00 2001
From: Yiyuan Yang
Date: Mon, 24 May 2021 15:05:22 +0800
Subject: [PATCH 8/8] Update chapter5_questions&keywords.md

---
 docs/chapter5/chapter5_questions&keywords.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/chapter5/chapter5_questions&keywords.md b/docs/chapter5/chapter5_questions&keywords.md
index 8f43579..f260a6d 100644
--- a/docs/chapter5/chapter5_questions&keywords.md
+++ b/docs/chapter5/chapter5_questions&keywords.md
@@ -35,9 +35,11 @@
 - Aloof interviewer: Could you tell me what importance sampling is?
 
   Answer: A method that uses another data distribution to approximate the distribution of interest; it can be seen as a way of correcting the expectation. The formula is:
-  $$
-  \int f(x) p(x) d x=\int f(x) \frac{p(x)}{q(x)} q(x) d x=E_{x \sim q}[f(x){\frac{p(x)}{q(x)}}]=E_{x \sim p}[f(x)]
-  $$
+  $$\begin{aligned}
+  \int f(x) p(x) d x &= \int f(x) \frac{p(x)}{q(x)} q(x) d x \\
+  &= E_{x \sim q}[f(x){\frac{p(x)}{q(x)}}] \\
+  &= E_{x \sim p}[f(x)]
+  \end{aligned}$$
   Once the distribution $q$ is known, we can use the formula above to compute the expectation under the distribution $p$. That is, we can use samples from $q$ in place of samples from $p$; this is importance sampling.
 
 - Aloof interviewer: What is the difference between on-policy and off-policy?
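The identity rewritten in the last hunk above, $E_{x \sim p}[f(x)] = E_{x \sim q}[f(x)\frac{p(x)}{q(x)}]$, can be checked numerically. Below is a small Monte Carlo sketch; the particular choices of $f$, the target $p$, and the proposal $q$ are illustrative assumptions, not taken from the chapter.

```python
# Minimal Monte Carlo check of the importance-sampling identity above:
#   E_{x~p}[f(x)] = E_{x~q}[f(x) * p(x)/q(x)]
# The specific f, p (target), and q (proposal) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)


def f(x):
    return x ** 2


def p_pdf(x):
    # Target distribution p: standard normal N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)


def q_pdf(x):
    # Proposal distribution q: N(1, 2^2), deliberately different from p
    return np.exp(-0.5 * ((x - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))


n = 200_000
x_q = rng.normal(loc=1.0, scale=2.0, size=n)   # samples drawn from q only
weights = p_pdf(x_q) / q_pdf(x_q)              # importance weights p(x)/q(x)
is_estimate = np.mean(f(x_q) * weights)        # estimates E_{x~p}[f(x)]

print(is_estimate)  # close to 1.0, the exact value of E[x^2] under N(0, 1)
```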