Merge branch 'master' of https://github.com/datawhalechina/easy-rl

2022-12-04 20:51:38 +08:00
parent c1415c2794 e241934c19
commit f030fe283d
13 changed files with 33 additions and 20 deletions
--- a/docs/chapter3/chapter3.md
+++ b/docs/chapter3/chapter3.md
@@ -598,7 +598,7 @@ Q学习是一种**异策略（off-policy）**算法。如图 3.31 所示，异
 <div align=center>
 <img width="550" src="../img/ch3/3.17.png"/>
 </div>
-<div align=center>Figure 3.21 Off-policy</div>
+<div align=center>Figure 3.31 Off-policy</div>


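 As another example, shown in Figure 3.32, suppose the environment is a stormy sea, but the learning policy is too "timid" to learn by interacting with the environment directly, so we introduce an exploratory policy. The exploratory policy is a pirate who fears no wind or waves; it is very aggressive and can explore the environment. It therefore gathers a great deal of experience, which it can "write up as a script" and "feed" to the learning policy. The learning policy can then learn from the script.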
--- a/papers/DQN/PDF/Deep
+++ b/papers/DQN/PDF/Deep
--- a/papers/DQN/PDF/Dueling
+++ b/papers/DQN/PDF/Dueling
--- a/papers/DQN/PDF/Prioritized
+++ b/papers/DQN/PDF/Prioritized
--- a/papers/DQN/PDF/Rainbow_Combining
+++ b/papers/DQN/PDF/Rainbow_Combining
--- a/papers/Policy_gradient/PDF/Proximal
+++ b/papers/Policy_gradient/PDF/Proximal
--- a/papers/Policy_gradient/Proximal
+++ b/papers/Policy_gradient/Proximal
@@ -80,7 +80,7 @@ $$Var_{x\sim p}[\frac{q(x)}{p(x)}f(x)]=\mathbb{E}_{x \sim p}[(\frac{q(x)}{p(x)}f

 > An example from the Mushroom Book, *Easy RL*:
 >
-> ![image-20221101210127972](image-20221101210127972.png)
+> ![image-20221101210127972](img/PPO_1.png)
 >
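 > Here the red line is the curve of f(x), and the blue and green lines are two different distributions over x; the higher the curve, the more likely that x is drawn from that distribution. Under p(x), the expectation of f(x) is negative. When importance sampling draws data from q to estimate the expectation under p, however, samples from q(x) very likely land where f(x) is positive and only rarely where f(x) is negative. A sample with negative f(x) is multiplied by a very large importance weight, which makes the importance-sampling estimate of the expectation correct, but only if such a point is actually sampled. In practice, with too few samples, such points may never be drawn, and the importance-sampling estimate fails.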
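
The failure mode described above can be reproduced with a small experiment. This is only a sketch under assumed settings that are not in the original: p = N(-1, 1), q = N(1, 1), and f(x) = x, so the true value of E_p[f] is -1 while most samples from q have f(x) > 0. The function names are hypothetical.

```python
import math
import random


def importance_weight(x, mu_p=-1.0, mu_q=1.0):
    """Density ratio p(x)/q(x) for unit-variance Gaussians N(mu_p, 1) and N(mu_q, 1)."""
    return math.exp(-0.5 * (x - mu_p) ** 2 + 0.5 * (x - mu_q) ** 2)


def is_estimate(n, rng, mu_q=1.0):
    """Estimate E_{x~p}[f(x)] with f(x) = x from n samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, 1.0)            # sample from q, not from p
        total += importance_weight(x) * x   # reweight f(x) by p(x)/q(x)
    return total / n


rng = random.Random(0)  # fixed seed so the run is repeatable

# A handful of tiny-sample estimates: these scatter widely around the true value -1.
small = [is_estimate(10, rng) for _ in range(5)]

# One large-sample estimate: rare negative x values with large weights are now sampled.
large = is_estimate(200_000, rng)

print("estimates with n=10:", [round(v, 3) for v in small])
print("estimate with n=200000:", round(large, 3))
```

With only 10 samples per estimate, points with negative x (and large weights) are rarely drawn, so the small-sample estimates are unreliable; the large-sample estimate should approach -1.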

@@ -141,13 +141,13 @@ $$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t[min(r_t(\theta) \hat A_t,clip(r_t(\theta),

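 The loss function is modified this way to **keep it within a bounded range**, which **bounds the gradient** and ultimately **limits how far the policy parameters move in each update**. This controls the gap between the distributions of two consecutive policies, so that updates using samples collected by the previous policy remain valid.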

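-PPO-clip does not use the KL divergence; the authors plot a comparison of how the other functions change as the update proceeds ![image-20221102181630708](image-20221102181630708.png)
+PPO-clip does not use the KL divergence; the authors plot a comparison of how the other functions change as the update proceeds ![image-20221102181630708](img/PPO_2.png)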

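 As the time step increases, both the KL divergence and the loss function grow, whereas the clipped loss function stays within a bounded level.

 The authors found experimentally that $\epsilon=0.2$ works best.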
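
The clipping behaviour described in this hunk can be sketched for a single sample. This is a minimal illustration of the term inside $L^{CLIP}$, assuming scalar inputs; the helper names `clip` and `clipped_surrogate` are hypothetical, and `eps=0.2` is the value quoted above.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample term min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum gives a pessimistic bound on the unclipped term.
    return min(unclipped, clipped)


# Probability ratios outside [0.8, 1.2] combined with both advantage signs.
for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(f"r={ratio}, A={adv}: {clipped_surrogate(ratio, adv):.2f}")
```

When the advantage is positive the term is capped once the ratio exceeds 1 + eps, and when it is negative the term is capped once the ratio drops below 1 - eps; in the other two cases the unclipped value is kept.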

-#### 2. Adaptive penalty coefficient (PPO-Penalty)
+#### 2. Adaptive penalty coefficient (PPO-pen)

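 Following the idea of TRPO, the KL divergence is used to measure the gap between the old and new policy distributions. Here, however, an adaptive parameter $\beta$ is used. The algorithm proceeds as follows: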

@@ -166,15 +166,15 @@ PPO-clip方法并未使用KL散度，作者画图对比其他函数随着更新

 #### 3. PPO algorithm overview

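-With importance sampling, PPO can reuse the most recently sampled data for multiple updates. The Loss Function is replaced with $L^{CLIP}\,\,or\,\,L^{KLPEN}$.
+With importance sampling, PPO can reuse the most recently sampled data for multiple updates. The LossFunction is replaced with $L^{CLIP}\,\,or\,\,L^{KLPEN}$.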

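-Following popular practice, a "regularization" term from RL, the $entropy \quad bonus\,\,S[\pi_\theta](s_t)$, is added (to increase exploration). Computing the advantage to reduce variance also uses the state-value function $V_\theta(s_t)$ (the policy and value function sometimes share a network), so $L_{t}^{VF}(\theta)=(V_\theta(s_t)-V_t^{targ})^2$ is added to train the network to estimate the state-value function accurately. The final loss function takes the following form:
+Following popular practice, a "regularization" term from RL, the $entropy bonus\,\,S[\pi_\theta](s_t)$, is added (to increase exploration). Computing the advantage to reduce variance also uses the state-value function $V_\theta(s_t)$ (the policy and value function sometimes share a network), so $L_{t}^{VF}(\theta)=(V_\theta(s_t)-V_t^{targ})^2$ is added to train the network to estimate the state-value function accurately. The final loss function takes the following form: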

 $$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)]\tag{13}$$

 The algorithm is as follows:

-![image-20221103144745532](image-20221103144745532.png)
+![image-20221103144745532](img/PPO_3.png)
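
Equation (13) can be sketched as a batch average in plain Python. This is an illustration under assumptions, not the authors' implementation: the function name `ppo_objective` is hypothetical, and the defaults `c1=0.5` and `c2=0.01` are placeholder coefficients chosen for the example.

```python
def ppo_objective(ratios, advantages, values, value_targets, entropies,
                  c1=0.5, c2=0.01, eps=0.2):
    """Batch average of L_clip - c1 * L_vf + c2 * S, mirroring equation (13).

    The result is an objective to maximize; negate it to obtain a loss.
    """
    total = 0.0
    for r, a, v, v_targ, s in zip(ratios, advantages, values,
                                  value_targets, entropies):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        l_clip = min(r * a, clipped_r * a)   # clipped surrogate term
        l_vf = (v - v_targ) ** 2             # squared value-function error
        total += l_clip - c1 * l_vf + c2 * s
    return total / len(ratios)


# One sample: L_clip = 2.0, L_vf = 4.0, S = 0.5
value = ppo_objective([1.0], [2.0], [1.0], [3.0], [0.5])
print(round(value, 6))
```

The value-error term enters with a minus sign because the whole expression is maximized, while the entropy bonus enters with a plus sign to reward exploration.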

 ### IV. Experiments

@@ -182,13 +182,13 @@ $$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\the

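 The authors compared three loss functions: without clipping or penalty, with clipping, and with the penalty (neither a shared state-value network nor the entropy bonus is used here), i.e. equations $(6),(10),(12)$, on OpenAI MuJoCo physics engine tasks under different hyperparameters. The results are as follows: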

-![image-20221103145827303](image-20221103145827303.png)
+![image-20221103145827303](img/PPO_4.png)

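 On these tasks, PPO-clip achieved the highest score, and the adaptive penalty coefficient scored slightly higher than the fixed penalty coefficient.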

 #### 2. Comparison with other continuous-control algorithms

-![image-20221103150107326](image-20221103150107326.png)
+![image-20221103150107326](img/PPO_5.png)

 PPO-clip largely outperforms the earlier algorithms.

@@ -198,4 +198,12 @@ PPO-clip基本上超越了原有的算法。

 ### V. Conclusion

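-We introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent (sampling once, reusing many times) to perform each policy update. These methods have the stability and reliability of TRPO but are much simpler to implement, requiring only a few lines of code changes to a vanilla policy gradient implementation. They apply in more general settings (for example, when a joint architecture is used for the policy and value function) and have better overall performance.
+We introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent (sampling once, reusing many times) to perform each policy update. These methods have the stability and reliability of TRPO but are much simpler to implement, requiring only a few lines of code changes to a vanilla policy gradient implementation. They apply in more general settings (for example, when a joint architecture is used for the policy and value function) and have better overall performance.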
+
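+### VI. Author information
+
+于天琪 (Yu Tianqi), a student in the Chen Geng Experimental Class at Harbin Engineering University
+
+Zhihu homepage: https://www.zhihu.com/people/Yutianqi
+
+QQ: 2206422122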
--- a/papers/Policy_gradient/img/PPO_1.png
+++ b/papers/Policy_gradient/img/PPO_1.png
--- a/papers/Policy_gradient/img/PPO_2.png
+++ b/papers/Policy_gradient/img/PPO_2.png
--- a/papers/Policy_gradient/img/PPO_3.png
+++ b/papers/Policy_gradient/img/PPO_3.png
--- a/papers/Policy_gradient/img/PPO_4.png
+++ b/papers/Policy_gradient/img/PPO_4.png
--- a/papers/Policy_gradient/img/PPO_5.png
+++ b/papers/Policy_gradient/img/PPO_5.png
--- a/papers/readme.md
+++ b/papers/readme.md
@@ -12,22 +12,27 @@
 | --------------- | ------------------------------------------------------------ | --------------------------------------------- | -------- |
 | DQN             | Playing Atari with Deep Reinforcement Learning (**DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.md)  [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1312.5602               |          |
 |                 | Deep Recurrent Q-Learning for Partially Observable MDPs [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.pdf) | https://arxiv.org/abs/1507.06527              |          |
-|                 | Dueling Network Architectures for Deep Reinforcement Learning (**Dueling DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.md) | https://arxiv.org/abs/1511.06581              |          |
-|                 | Deep Reinforcement Learning with Double Q-learning (**Double DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.md) | https://arxiv.org/abs/1509.06461              |          |
-|                 | Prioritized Experience Replay (**PER**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Prioritized%20Experience%20Replay.md) | https://arxiv.org/abs/1511.05952              |          |
-|                 | Rainbow: Combining Improvements in Deep Reinforcement Learning (**Rainbow**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.md) | https://arxiv.org/abs/1710.02298              |          |
+|                 | Dueling Network Architectures for Deep Reinforcement Learning (**Dueling DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.pdf) | https://arxiv.org/abs/1511.06581              |          |
+|                 | Deep Reinforcement Learning with Double Q-learning (**Double DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.pdf) | https://arxiv.org/abs/1509.06461              |          |
+|                 | Prioritized Experience Replay (**PER**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Prioritized%20Experience%20Replay.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Prioritized%20Experience%20Replay.pdf) | https://arxiv.org/abs/1511.05952              |          |
+|                 | Rainbow: Combining Improvements in Deep Reinforcement Learning (**Rainbow**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1710.02298              |          |
 | Policy gradient | Asynchronous Methods for Deep Reinforcement Learning (**A3C**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Asynchronous%20Methods%20for%20Deep%20Reinforcement%20Learning.md) | https://arxiv.org/abs/1602.01783              |          |
-|                 | Trust Region Policy Optimization (**TRPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Trust%20Region%20Policy%20Optimization.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Trust%20Region%20Policy%20Optimization.pdf)| https://arxiv.org/abs/1502.05477              |          |
+|                 | Trust Region Policy Optimization (**TRPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Trust%20Region%20Policy%20Optimization.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Trust%20Region%20Policy%20Optimization.pdf) | https://arxiv.org/abs/1502.05477              |          |
 |                 | High-Dimensional Continuous Control Using Generalized Advantage Estimation (**GAE**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/High-Dimensional%20Continuous%20Control%20Using%20Generalized%20Advantage%20Estimation.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/High-Dimensional%20Continuous%20Control%20Using%20Generalised%20Advantage%20Estimation.pdf) | https://arxiv.org/abs/1506.02438              |          |
-|                 | Proximal Policy Optimization Algorithms (**PPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Proximal%20Policy%20Optimization%20Algorithms.md) | https://arxiv.org/abs/1707.06347              |          |
+|                 | Proximal Policy Optimization Algorithms (**PPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Proximal%20Policy%20Optimization%20Algorithms.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Proximal%20Policy%20Optimization%20Algorithms.pdf) | https://arxiv.org/abs/1707.06347              |          |
 |                 | Emergence of Locomotion Behaviours in Rich Environments (**PPO-Penalty**) | https://arxiv.org/abs/1707.02286              |          |
-|                 | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (**ACKTR**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.pdf)| https://arxiv.org/abs/1708.05144              |          |
+|                 | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (**ACKTR**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.pdf) | https://arxiv.org/abs/1708.05144              |          |
 |                 | Sample Efficient Actor-Critic with Experience Replay (**ACER**) | https://arxiv.org/abs/1611.01224              |          |
-|                 | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor(**SAC**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.pdf) | https://arxiv.org/abs/1801.01290              |          |
+|                 | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (**SAC**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.pdf) | https://arxiv.org/abs/1801.01290              |          |
 |                 | Deterministic Policy Gradient Algorithms (**DPG**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Deterministic%20Policy%20Gradient%20Algorithms.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Deterministic%20Policy%20Gradient%20Algorithms.pdf) | http://proceedings.mlr.press/v32/silver14.pdf |          |
 |                 | Continuous Control With Deep Reinforcement Learning (**DDPG**) | https://arxiv.org/abs/1509.02971              |          |
-|                 | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf)| https://arxiv.org/abs/1802.09477              |          |
+|                 | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf) | https://arxiv.org/abs/1802.09477              |          |
 |                 | A Distributional Perspective on Reinforcement Learning (**C51**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1707.06887              |          |
-|                 |                                                              |                                               |          |
+|                 | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (**Q-Prop**) | https://arxiv.org/abs/1611.02247              |          |
+|                 | Action-depedent Control Variates for Policy Optimization via Stein’s Identity (**Stein Control Variates**) | https://arxiv.org/abs/1710.11198              |          |
+|                 | The Mirage of Action-Dependent Baselines in Reinforcement Learning | https://arxiv.org/abs/1802.10031              |          |
+|                 | Bridging the Gap Between Value and Policy Based Reinforcement Learning (**PCL**) | https://arxiv.org/abs/1702.08892              |          |
+
+