diff --git a/docs/chapter3/chapter3.md b/docs/chapter3/chapter3.md
index 58d9537..d217513 100644
--- a/docs/chapter3/chapter3.md
+++ b/docs/chapter3/chapter3.md
@@ -598,7 +598,7 @@ Q-learning is an **off-policy** algorithm. As shown in Figure 3.31, off-
-Figure 3.21 Off-policy
+Figure 3.31 Off-policy
 As another example, shown in Figure 3.32, suppose the environment is a stormy sea and the learning policy is too "timid" to learn by interacting with the environment directly. We therefore introduce an exploratory policy: a pirate who fears neither wind nor waves, is very aggressive, and can explore the environment. The exploratory policy thus accumulates a lot of experience, which it can "write up as notes" and "feed" to the learning policy. The learning policy can then learn from the notes.
diff --git a/papers/DQN/PDF/Deep Reinforcement Learning with Double Q-learning.pdf b/papers/DQN/PDF/Deep Reinforcement Learning with Double Q-learning.pdf
new file mode 100644
index 0000000..ed13053
Binary files /dev/null and b/papers/DQN/PDF/Deep Reinforcement Learning with Double Q-learning.pdf differ
diff --git a/papers/DQN/PDF/Dueling Network Architectures for Deep Reinforceme.pdf b/papers/DQN/PDF/Dueling Network Architectures for Deep Reinforceme.pdf
new file mode 100644
index 0000000..4cdce35
Binary files /dev/null and b/papers/DQN/PDF/Dueling Network Architectures for Deep Reinforceme.pdf differ
diff --git a/papers/DQN/PDF/Prioritized Experience Replay.pdf b/papers/DQN/PDF/Prioritized Experience Replay.pdf
new file mode 100644
index 0000000..d784b49
Binary files /dev/null and b/papers/DQN/PDF/Prioritized Experience Replay.pdf differ
diff --git a/papers/DQN/PDF/Rainbow_Combining Improvements in Deep Reinforcement Learning.pdf b/papers/DQN/PDF/Rainbow_Combining Improvements in Deep Reinforcement Learning.pdf
new file mode 100644
index 0000000..19e6950
Binary files /dev/null and b/papers/DQN/PDF/Rainbow_Combining Improvements in Deep Reinforcement Learning.pdf differ
diff --git a/papers/Policy_gradient/PDF/Proximal Policy Optimization Algorithms.pdf b/papers/Policy_gradient/PDF/Proximal Policy Optimization Algorithms.pdf
new file mode 100644
index 0000000..8e12af3
Binary files /dev/null and b/papers/Policy_gradient/PDF/Proximal Policy Optimization Algorithms.pdf differ
diff --git a/papers/Policy_gradient/Proximal Policy Optimization Algorithms.md b/papers/Policy_gradient/Proximal Policy Optimization Algorithms.md
index d372c31..f8fcc2f 100644
--- a/papers/Policy_gradient/Proximal Policy Optimization Algorithms.md
+++ b/papers/Policy_gradient/Proximal Policy Optimization Algorithms.md
@@ -80,7 +80,7 @@ $$Var_{x\sim p}[\frac{q(x)}{p(x)}f(x)]=\mathbb{E}_{x \sim p}[(\frac{q(x)}{p(x)}f

 > Here is an example from the Mushroom Book, *Easy RL*:
 >
-> ![image-20221101210127972](image-20221101210127972.png)
+> ![image-20221101210127972](img/PPO_1.png)
 >
 > Here the red line is the curve of f(x), and the blue and green lines are two different distributions of x; the higher the vertical coordinate, the more likely that x is to be drawn from the distribution. Under the sample distribution p(x), the expectation of f(x) is negative. When importance sampling draws data from q to estimate that expectation under p, however, a sample from q(x) very likely gives a positive f(x) and only rarely gives a negative f(x). A sample with negative f(x) is multiplied by a very large importance weight, which makes the importance-sampling estimate of the expectation correct, but only if such a sample is actually drawn. In practice, too few samples may mean that no such point is ever drawn, and importance sampling then fails.

@@ -141,13 +141,13 @@ $$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t[min(r_t(\theta) \hat A_t,clip(r_t(\theta),

 The loss function is modified this way to **keep the loss within a bounded range**, which **limits the gradient** and ultimately **limits how far the policy parameters are updated**. This controls the gap between the distributions of two consecutive policies, so that updates using samples collected by the previous policy remain valid.

-PPO-clip does not use the KL divergence. The authors plot how the other functions change over the course of an update: ![image-20221102181630708](image-20221102181630708.png)
+PPO-clip does not use the KL divergence. The authors plot how the other functions change over the course of an update: ![image-20221102181630708](img/PPO_2.png)

 As the time step increases, both the KL divergence and the loss function grow, whereas the clipped loss function stays within a certain level.

 The authors find experimentally that $\epsilon=0.2$ works best.

-#### 2. Adaptive penalty coefficient (PPO-Penalty)
+#### 2. Adaptive penalty coefficient (PPO-pen)

 Following the idea of TRPO, the KL divergence is used to measure the gap between the new and old policy distributions. Here, however, an adaptive parameter $\beta$ is used. The procedure is:

@@ -166,15 +166,15 @@ PPO-clip does not use the KL divergence. The authors plot how the other functions change over the

 #### 3. PPO algorithm overview

-With importance sampling, PPO can reuse the most recently sampled data for multiple updates. The Loss Function is replaced with $L^{CLIP}\,\,or\,\,L^{KLPEN}$.
+With importance sampling, PPO can reuse the most recently sampled data for multiple updates. The LossFunction is replaced with $L^{CLIP}\,\,or\,\,L^{KLPEN}$.

-Following common practice, an RL ''regularization'' term, the $entropy \quad bonus\,\,S[\pi_\theta](s_t)$, is added (to encourage exploration). The state-value function $V_\theta(s_t)$ is also needed when computing the advantage to reduce variance (the policy and the value function sometimes share a network), so $L_{t}^{VF}(\theta)=(V_\theta(s_t)-V_t^{targ})^2$ is added to train the network to estimate a reasonably accurate state-value function. The final loss function takes the following form:
+Following common practice, an RL ''regularization'' term, the $entropy bonus\,\,S[\pi_\theta](s_t)$, is added (to encourage exploration). The state-value function $V_\theta(s_t)$ is also needed when computing the advantage to reduce variance (the policy and the value function sometimes share a network), so $L_{t}^{VF}(\theta)=(V_\theta(s_t)-V_t^{targ})^2$ is added to train the network to estimate a reasonably accurate state-value function. The final loss function takes the following form:

 $$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)]\tag{13}$$

 The algorithm is as follows:

-![image-20221103144745532](image-20221103144745532.png)
+![image-20221103144745532](img/PPO_3.png)

 ### IV. Experiments

@@ -182,13 +182,13 @@ $$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\the

 The authors compare three loss functions: no clipping or penalty, clipping, and penalty (with no shared structure for the state-value function and no entropy bonus), i.e. Eqs. $(6),(10),(12)$, under different hyperparameters on OpenAI Gym tasks that use the MuJoCo physics engine. The results are as follows:

-![image-20221103145827303](image-20221103145827303.png)
+![image-20221103145827303](img/PPO_4.png)

 On these tasks, PPO-clip achieves the highest score, and the adaptive penalty coefficient scores slightly higher than the fixed penalty coefficient.

 #### 2. Comparison with other continuous-control algorithms

-![image-20221103150107326](image-20221103150107326.png)
+![image-20221103150107326](img/PPO_5.png)

 PPO-clip largely outperforms the earlier algorithms.

@@ -198,4 +198,12 @@ PPO-clip largely outperforms the earlier algorithms.

 ### V. Conclusion

-We have introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent (reusing each batch of samples several times) to perform each policy update. These methods have the stability and reliability of TRPO but are much simpler to implement, requiring only a few lines of code changed in a vanilla policy gradient implementation, apply in more general settings (for example, when a joint architecture is used for the policy and value function), and have better overall performance.
\ No newline at end of file
+We have introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent (reusing each batch of samples several times) to perform each policy update. These methods have the stability and reliability of TRPO but are much simpler to implement, requiring only a few lines of code changed in a vanilla policy gradient implementation, apply in more general settings (for example, when a joint architecture is used for the policy and value function), and have better overall performance.
+
+### VI. Author information
+
+Yu Tianqi (于天琪), a student in the Chen Geng Experimental Class at Harbin Engineering University
+
+Zhihu profile: https://www.zhihu.com/people/Yutianqi
+
+QQ: 2206422122
\ No newline at end of file
diff --git a/papers/Policy_gradient/img/PPO_1.png b/papers/Policy_gradient/img/PPO_1.png
new file mode 100644
index 0000000..7cbed79
Binary files /dev/null and b/papers/Policy_gradient/img/PPO_1.png differ
diff --git a/papers/Policy_gradient/img/PPO_2.png b/papers/Policy_gradient/img/PPO_2.png
new file mode 100644
index 0000000..f46aafa
Binary files /dev/null and b/papers/Policy_gradient/img/PPO_2.png differ
diff --git a/papers/Policy_gradient/img/PPO_3.png b/papers/Policy_gradient/img/PPO_3.png
new file mode 100644
index 0000000..af2e769
Binary files /dev/null and b/papers/Policy_gradient/img/PPO_3.png differ
diff --git a/papers/Policy_gradient/img/PPO_4.png b/papers/Policy_gradient/img/PPO_4.png
new file mode 100644
index 0000000..549b448
Binary files /dev/null and b/papers/Policy_gradient/img/PPO_4.png differ
diff --git a/papers/Policy_gradient/img/PPO_5.png b/papers/Policy_gradient/img/PPO_5.png
new file mode 100644
index 0000000..3e78076
Binary files /dev/null and b/papers/Policy_gradient/img/PPO_5.png differ
diff --git a/papers/readme.md b/papers/readme.md
index 0432429..54c86a6 100644
--- a/papers/readme.md
+++ b/papers/readme.md
@@ -12,22 +12,27 @@
 | --------------- | ------------------------------------------------------------ | --------------------------------------------- | -------- |
 | DQN | Playing Atari with Deep Reinforcement Learning (**DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Playing%20Atari%20with%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1312.5602 | |
 | | Deep Recurrent Q-Learning for Partially Observable MDPs [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Recurrent%20Q-Learning%20for%20Partially%20Observable%20MDPs.pdf) | https://arxiv.org/abs/1507.06527 | |
-| | Dueling Network Architectures for Deep Reinforcement Learning (**Dueling DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.md) | https://arxiv.org/abs/1511.06581 | |
-| | Deep Reinforcement Learning with Double Q-learning (**Double DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.md) | https://arxiv.org/abs/1509.06461 | |
-| | Prioritized Experience Replay (**PER**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Prioritized%20Experience%20Replay.md) | https://arxiv.org/abs/1511.05952 | |
-| | Rainbow: Combining Improvements in Deep Reinforcement Learning (**Rainbow**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.md) | https://arxiv.org/abs/1710.02298 | |
+| | Dueling Network Architectures for Deep Reinforcement Learning (**Dueling DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Dueling%20Network%20Architectures%20for%20Deep%20Reinforceme.pdf) | https://arxiv.org/abs/1511.06581 | |
+| | Deep Reinforcement Learning with Double Q-learning (**Double DQN**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.pdf) | https://arxiv.org/abs/1509.06461 | |
+| | Prioritized Experience Replay (**PER**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Prioritized%20Experience%20Replay.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Prioritized%20Experience%20Replay.pdf) | https://arxiv.org/abs/1511.05952 | |
+| | Rainbow: Combining Improvements in Deep Reinforcement Learning (**Rainbow**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/DQN/PDF/Rainbow_Combining%20Improvements%20in%20Deep%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1710.02298 | |
 | Policy gradient | Asynchronous Methods for Deep Reinforcement Learning (**A3C**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Asynchronous%20Methods%20for%20Deep%20Reinforcement%20Learning.md) | https://arxiv.org/abs/1602.01783 | |
-| | Trust Region Policy Optimization (**TRPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Trust%20Region%20Policy%20Optimization.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Trust%20Region%20Policy%20Optimization.pdf)| https://arxiv.org/abs/1502.05477 | |
+| | Trust Region Policy Optimization (**TRPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Trust%20Region%20Policy%20Optimization.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Trust%20Region%20Policy%20Optimization.pdf) | https://arxiv.org/abs/1502.05477 | |
 | | High-Dimensional Continuous Control Using Generalized Advantage Estimation (**GAE**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/High-Dimensional%20Continuous%20Control%20Using%20Generalized%20Advantage%20Estimation.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/High-Dimensional%20Continuous%20Control%20Using%20Generalised%20Advantage%20Estimation.pdf) | https://arxiv.org/abs/1506.02438 | |
-| | Proximal Policy Optimization Algorithms (**PPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Proximal%20Policy%20Optimization%20Algorithms.md) | https://arxiv.org/abs/1707.06347 | |
+| | Proximal Policy Optimization Algorithms (**PPO**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Proximal%20Policy%20Optimization%20Algorithms.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Proximal%20Policy%20Optimization%20Algorithms.pdf) | https://arxiv.org/abs/1707.06347 | |
 | | Emergence of Locomotion Behaviours in Rich Environments (**PPO-Penalty**) | https://arxiv.org/abs/1707.02286 | |
-| | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (**ACKTR**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.pdf)| https://arxiv.org/abs/1708.05144 | |
+| | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (**ACKTR**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Scalable%20trust-region%20method%20for%20deep%20reinforcement%20learning%20using%20Kronecker-factored.pdf) | https://arxiv.org/abs/1708.05144 | |
 | | Sample Efficient Actor-Critic with Experience Replay (**ACER**) | https://arxiv.org/abs/1611.01224 | |
-| | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor(**SAC**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.pdf) | https://arxiv.org/abs/1801.01290 | |
+| | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (**SAC**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Soft%20Actor-Critic_Off-Policy%20Maximum%20Entropy%20Deep%20Reinforcement%20Learning%20with%20a%20Stochastic%20Actor.pdf) | https://arxiv.org/abs/1801.01290 | |
 | | Deterministic Policy Gradient Algorithms (**DPG**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Deterministic%20Policy%20Gradient%20Algorithms.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Deterministic%20Policy%20Gradient%20Algorithms.pdf) | http://proceedings.mlr.press/v32/silver14.pdf | |
 | | Continuous Control With Deep Reinforcement Learning (**DDPG**) | https://arxiv.org/abs/1509.02971 | |
-| | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf)| https://arxiv.org/abs/1802.09477 | |
+| | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf) | https://arxiv.org/abs/1802.09477 | |
 | | A Distributional Perspective on Reinforcement Learning (**C51**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1707.06887 | |
-| | | | |
+| | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (**Q-Prop**) | https://arxiv.org/abs/1611.02247 | |
+| | Action-depedent Control Variates for Policy Optimization via Stein’s Identity (**Stein Control Variates**) | https://arxiv.org/abs/1710.11198 | |
+| | The Mirage of Action-Dependent Baselines in Reinforcement Learning | https://arxiv.org/abs/1802.10031 | |
+| | Bridging the Gap Between Value and Policy Based Reinforcement Learning (**PCL**) | https://arxiv.org/abs/1702.08892 | |
+
+
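The clipped objective and the combined loss of Eq. (13) discussed in the PPO notes can be sketched in a few lines of plain Python. This is a minimal illustration, not code from this repository: the function names `clip`, `clipped_surrogate`, and `combined_objective` are hypothetical, the defaults for `c1` and `c2` are placeholder assumptions, and the inputs are per-sample probability ratios and advantage estimates that are assumed to have been computed elsewhere.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios: per-sample probability ratios r_t(theta) (assumed precomputed).
    advantages: per-sample advantage estimates (assumed precomputed).
    """
    terms = [
        min(r * a, clip(r, 1.0 - epsilon, 1.0 + epsilon) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


def combined_objective(l_clip, l_vf, entropy, c1=1.0, c2=0.01):
    """Eq. (13) for one batch: L^CLIP - c1 * L^VF + c2 * S (to be maximized).

    c1 and c2 are illustrative defaults, not values taken from the notes.
    """
    return l_clip - c1 * l_vf + c2 * entropy


if __name__ == "__main__":
    ratios = [0.5, 1.0, 1.5]
    # Positive advantages: the gain for r = 1.5 is capped at 1 + epsilon.
    print(clipped_surrogate(ratios, [1.0, 1.0, 1.0]))     # approximately 0.9
    # Negative advantages: r = 1.5 is not clipped, so it is fully penalized.
    print(clipped_surrogate(ratios, [-1.0, -1.0, -1.0]))  # approximately -1.1
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: a ratio outside $[1-\epsilon, 1+\epsilon]$ is ignored only when moving further would improve the objective, which is what limits the size of each policy update.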