johnjim0816
2022-12-04 20:51:38 +08:00
13 changed files with 33 additions and 20 deletions


@@ -598,7 +598,7 @@ Q-learning is an **off-policy** algorithm. As shown in Figure 3.31, off-
<div align=center>
<img width="550" src="../img/ch3/3.17.png"/>
</div>
<div align=center>Figure 3.31 Off-policy learning</div>

Another example, as shown in Figure 3.32: suppose the environment is a stormy sea, but the learning policy is too "timid" to learn by interacting with the environment directly. So we introduce an exploratory policy: a pirate who fears no wind or waves, aggressive enough to explore the environment. The exploratory policy therefore accumulates a great deal of experience; it can "write this experience up" and "feed" it to the learning policy, which then learns from these write-ups.



@@ -80,7 +80,7 @@ $$Var_{x\sim p}[\frac{q(x)}{p(x)}f(x)]=\mathbb{E}_{x \sim p}[(\frac{q(x)}{p(x)}f
> An example taken from the "mushroom book" *Easy RL*:
>
> ![image-20221101210127972](img/PPO_1.png)
>
> Here the red line is the curve of f(x), and the green lines are different distributions of x; the higher a curve, the more likely that x is to be drawn from that distribution. Under the sample distribution p(x), the expectation of f(x) is negative. When importance sampling draws from q to estimate an expectation under p, an x sampled from q(x) will almost always give a positive f(x) and only very rarely a negative one. A point where f(x) is negative is multiplied by an especially large importance weight, which is what keeps the importance-sampling estimate of the expectation of f(x) correct, but only if such a point is actually sampled. In practice, too few samples may mean such a point is never drawn, and the importance-sampling estimate fails.
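This failure mode is easy to reproduce numerically. A minimal sketch, where the distributions, f(x)=x, and the sample size are all illustrative choices rather than the book's figure:

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(n, seed=0):
    """Estimate E_{x~p}[f(x)] with f(x)=x and p=N(-2,1), drawing samples from q=N(+2,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(2.0, 1.0)                       # sample from q
        w = normal_pdf(x, -2.0) / normal_pdf(x, 2.0)  # importance weight p(x)/q(x)
        total += w * x
    return total / n

# The weight explodes exactly where q almost never samples: at x = -2 it is
# e^8 (about 3000), while near q's mode it is e^-8 (about 0.0003).  The true
# value is E_{x~p}[x] = -2, but a small-sample estimate is typically stuck
# near 0, because no point carrying a large weight is ever drawn.
print(normal_pdf(-2.0, -2.0) / normal_pdf(-2.0, 2.0))  # weight at x = -2
print(importance_estimate(100))                        # usually nowhere near -2
```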
@@ -141,13 +141,13 @@ $$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t[min(r_t(\theta) \hat A_t,clip(r_t(\theta),
The loss function is reshaped this way to **keep the loss within a bounded range**, which in turn **bounds the gradient** and ultimately **limits how far the policy parameters move in one update**. This controls the gap between the old and new policy distributions, so that updates using samples drawn under the previous policy remain valid.

PPO-clip does not use the KL divergence. The authors plot how other objectives evolve over the course of the updates: ![image-20221102181630708](img/PPO_2.png)

As the number of timesteps grows, both the KL divergence and the loss keep increasing, while the clipped loss stays within a bounded range.

The authors found experimentally that $\epsilon=0.2$ works best.
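For concreteness, the per-sample clipped term of $L^{CLIP}$ fits in a few lines of plain Python; a sketch with function and argument names of my own choosing:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """One term of L^CLIP: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage the objective stops rewarding ratios above
# 1 + eps; with a negative advantage there is no further gain from pushing
# the ratio below 1 - eps.  Either way, one batch of old samples cannot
# drag the new policy far from the old one.
print(ppo_clip_term(1.5, 1.0))   # capped at 1.2
print(ppo_clip_term(0.5, -1.0))  # pinned at -0.8
```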
#### 2. Adaptive penalty coefficient (PPO-Penalty)

Inheriting the idea of TRPO, this variant uses the KL divergence to measure the gap between the new and old policy distributions, but with an adaptive coefficient $\beta$. The algorithm proceeds as follows:
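The heart of the adaptive rule can be sketched as follows; the halving/doubling thresholds (factor 1.5, factor 2) are the ones suggested in the PPO paper, while the variable names are mine:

```python
def update_beta(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient beta after each policy update."""
    if measured_kl < target_kl / 1.5:
        beta /= 2.0   # policies still close: penalize less, allow bigger steps
    elif measured_kl > target_kl * 1.5:
        beta *= 2.0   # policies drifting apart: penalize more
    return beta

print(update_beta(1.0, 0.001, 0.01))  # KL well below target -> 0.5
print(update_beta(1.0, 0.05, 0.01))   # KL well above target -> 2.0
```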
@@ -168,13 +168,13 @@ PPO-clip does not use the KL divergence. The authors plot how other objectives evolve over the course of the updates
With importance sampling, PPO can reuse the samples from the previous rollout for multiple updates. The loss function is replaced by $L^{CLIP}$ or $L^{KLPEN}$.

Following common practice, an RL "regularizer" is added: the entropy bonus $S[\pi_\theta](s_t)$, which encourages exploration. Computing the advantage with reduced variance also requires the state-value function $V_\theta(s_t)$ (the policy and value function sometimes share a network), so the term $L_{t}^{VF}(\theta)=(V_\theta(s_t)-V_t^{targ})^2$ is added so that the network learns to estimate the state value accurately. The final loss function has the form:

$$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)]\tag{13}$$
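Per sample, equation (13) is just a weighted sum of the three terms. A minimal sketch, where the coefficient values and the discrete-entropy form are illustrative assumptions rather than the paper's settings:

```python
import math

def ppo_objective(l_clip, value, value_target, action_probs, c1=1.0, c2=0.01):
    """Per-sample L^{CLIP+VF+S} = L^CLIP - c1 * L^VF + c2 * S[pi](s_t), to be maximized."""
    l_vf = (value - value_target) ** 2                               # value-function loss
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)   # entropy bonus S
    return l_clip - c1 * l_vf + c2 * entropy

# A perfect value estimate and a uniform two-action policy leave only the
# entropy bonus, 0.01 * ln(2).
print(ppo_objective(0.0, 1.0, 1.0, [0.5, 0.5]))
```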
The algorithm is as follows:

![image-20221103144745532](img/PPO_3.png)

### 4. Experiments
@@ -182,13 +182,13 @@ $$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\the
The authors compare three loss functions: no clipping or penalty, clipping, and penalty (here without the shared state-value architecture or the entropy bonus), i.e., equations $(6),(10),(12)$, with different hyperparameters on OpenAI MuJoCo physics engine tasks. The results are as follows:

![image-20221103145827303](img/PPO_4.png)

On these tasks the PPO-clip algorithm obtains the highest score, and the adaptive penalty coefficient scores slightly higher than the fixed one.

#### 2. Comparison with other continuous-control algorithms

![image-20221103150107326](img/PPO_5.png)

PPO-clip essentially outperforms the prior algorithms.
@@ -199,3 +199,11 @@ PPO-clip essentially outperforms the prior algorithms.
### 5. Conclusion

We have introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent (sample once, update many times) to perform each policy update. These methods have the stability and reliability of TRPO but are much simpler to implement, requiring only a few lines of change to a vanilla policy-gradient implementation; they apply in more general settings (for example, when using a joint architecture for the policy and value function), and they have better overall performance.

### 6. Author information

Yu Tianqi, student in the Chen Geng Elite Class at Harbin Engineering University

Zhihu profile: https://www.zhihu.com/people/Yutianqi

QQ: 2206422122



@@ -12,14 +12,14 @@
| --------------- | ------------------------------------------------------------ | --------------------------------------------- | -------- |
| DQN | Playing Atari with Deep Reinforcement Learning (**DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1312.5602 | |
| | Deep Recurrent Q-Learning for Partially Observable MDPs [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.pdf) | https://arxiv.org/abs/1507.06527 | |
| | Dueling Network Architectures for Deep Reinforcement Learning (**Dueling DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.pdf) | https://arxiv.org/abs/1511.06581 | |
| | Deep Reinforcement Learning with Double Q-learning (**Double DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.pdf) | https://arxiv.org/abs/1509.06461 | |
| | Prioritized Experience Replay (**PER**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Prioritized%20Experience%20Replay.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Prioritized%20Experience%20Replay.pdf) | https://arxiv.org/abs/1511.05952 | |
| | Rainbow: Combining Improvements in Deep Reinforcement Learning (**Rainbow**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1710.02298 | |
| Policy gradient | Asynchronous Methods for Deep Reinforcement Learning (**A3C**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Asynchronous%20Methods%20for%20Deep%20Reinforcement%20Learning.md) | https://arxiv.org/abs/1602.01783 | |
| | Trust Region Policy Optimization (**TRPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Trust%20Region%20Policy%20Optimization.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Trust%20Region%20Policy%20Optimization.pdf) | https://arxiv.org/abs/1502.05477 | |
| | High-Dimensional Continuous Control Using Generalized Advantage Estimation (**GAE**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/High-Dimensional%20Continuous%20Control%20Using%20Generalized%20Advantage%20Estimation.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/High-Dimensional%20Continuous%20Control%20Using%20Generalised%20Advantage%20Estimation.pdf) | https://arxiv.org/abs/1506.02438 | |
| | Proximal Policy Optimization Algorithms (**PPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Proximal%20Policy%20Optimization%20Algorithms.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Proximal%20Policy%20Optimization%20Algorithms.pdf) | https://arxiv.org/abs/1707.06347 | |
| | Emergence of Locomotion Behaviours in Rich Environments (**PPO-Penalty**) | https://arxiv.org/abs/1707.02286 | |
| | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (**ACKTR**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.pdf) | https://arxiv.org/abs/1708.05144 | |
| | Sample Efficient Actor-Critic with Experience Replay (**ACER**) | https://arxiv.org/abs/1611.01224 | |
@@ -28,6 +28,11 @@
| | Continuous Control With Deep Reinforcement Learning (**DDPG**) | https://arxiv.org/abs/1509.02971 | |
| | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf) | https://arxiv.org/abs/1802.09477 | |
| | A Distributional Perspective on Reinforcement Learning (**C51**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1707.06887 | |
| | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (**Q-Prop**) | https://arxiv.org/abs/1611.02247 | |
| | Action-dependent Control Variates for Policy Optimization via Stein's Identity (**Stein Control Variates**) | https://arxiv.org/abs/1710.11198 | |
| | The Mirage of Action-Dependent Baselines in Reinforcement Learning | https://arxiv.org/abs/1802.10031 | |
| | Bridging the Gap Between Value and Policy Based Reinforcement Learning (**PCL**) | https://arxiv.org/abs/1702.08892 | |