Merge branch 'master' of github.com:datawhalechina/easy-rl

This commit is contained in:
qiwang067
2022-12-14 14:14:29 +08:00
48 changed files with 11131 additions and 5 deletions

View File

@@ -1,3 +1,6 @@
[![GitHub issues](https://img.shields.io/github/issues/datawhalechina/easy-rl)](https://github.com/datawhalechina/easy-rl/issues) [![GitHub stars](https://img.shields.io/github/stars/datawhalechina/easy-rl)](https://github.com/datawhalechina/easy-rl/stargazers) [![GitHub forks](https://img.shields.io/github/forks/datawhalechina/easy-rl)](https://github.com/datawhalechina/easy-rl/network) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fdatawhalechina%2Feasy-rl%2F&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com) <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://img.shields.io/badge/license-CC%20BY--NC--SA%204.0-lightgrey" /></a>
# 蘑菇书EasyRL
Prof. 李宏毅 (Hung-yi Lee)'s Deep Reinforcement Learning course is one of the classic Chinese-language video series on reinforcement learning. His humorous teaching style makes the otherwise obscure theory easy to follow, and he explains it through many entertaining examples; for instance, he often uses Atari games to illustrate reinforcement learning algorithms. For completeness, we have also incorporated Prof. 周博磊 (Bolei Zhou)'s Intro to Reinforcement Learning, Prof. 李科浇's World-Champion Hands-on RL from Scratch, and several other classic reinforcement learning resources as supplements. For anyone who wants to get started with reinforcement learning through Chinese-language materials, this is highly recommended.
@@ -80,7 +83,11 @@ PDF版本是全书初稿人民邮电出版社的编辑老师们对初稿进
| [Chapter 13: AlphaStar Paper Walkthrough](https://datawhalechina.github.io/easy-rl/#/chapter13/chapter13) | | |
## Hands-on Algorithms
[Click here](https://github.com/johnjim0816/rl-tutorials) or open the ```projects``` folder on this page for the hands-on code.
The hands-on part includes the book's companion code and the JoyRL code:
* [Mushroom Book companion code](https://github.com/datawhalechina/easy-rl/tree/master/notebooks)
* [JoyRL offline version](https://github.com/johnjim0816/rl-tutorials/tree/master/joyrl)
* [JoyRL online version](https://github.com/datawhalechina/joyrl)
## Notes on Classic RL Papers
@@ -133,6 +140,10 @@ url = {https://github.com/datawhalechina/easy-rl}
Special thanks to [@Sm1les](https://github.com/Sm1les) and [@LSGOMYP](https://github.com/LSGOMYP) for their help and support of this project.
Many thanks as well to everyone for following Easy-RL.
[![Stargazers repo roster for @datawhalechina/easy-rl](https://reporoster.com/stars/datawhalechina/easy-rl)](https://github.com/datawhalechina/easy-rl/stargazers)
[![Forkers repo roster for @datawhalechina/easy-rl](https://reporoster.com/forks/datawhalechina/easy-rl)](https://github.com/datawhalechina/easy-rl/network/members)
## Follow Us
Scan the QR code below to follow the Datawhale WeChat official account, then reply with the keyword “强化学习” (reinforcement learning) to join the Easy-RL reader group.
<div align=center><img src="https://raw.githubusercontent.com/datawhalechina/easy-rl/master/docs/res/qrcode.jpeg" width = "250" height = "270" alt="Datawhale is an open-source organization focused on AI. With the vision of “for the learner — growing together with learners”, it builds an open-source learning community that is most valuable to learners. Follow us and grow together."></div>

370
notebooks/A2C.ipynb Normal file

File diff suppressed because one or more lines are too long

559
notebooks/DDPG.ipynb Normal file

File diff suppressed because one or more lines are too long

541
notebooks/DQN.ipynb Normal file

File diff suppressed because one or more lines are too long

490
notebooks/DoubleDQN.ipynb Normal file

File diff suppressed because one or more lines are too long

482
notebooks/DuelingDQN.ipynb Normal file

File diff suppressed because one or more lines are too long

748
notebooks/MonteCarlo.ipynb Normal file

File diff suppressed because one or more lines are too long

582
notebooks/NoisyDQN.ipynb Normal file

File diff suppressed because one or more lines are too long

644
notebooks/PER_DQN.ipynb Normal file

File diff suppressed because one or more lines are too long

522
notebooks/PPO.ipynb Normal file

File diff suppressed because one or more lines are too long

142
notebooks/PPO暂存.md Normal file
View File

@@ -0,0 +1,142 @@
## Overview
PPO is an on-policy algorithm with strong performance. It evolved from TRPO and, like TRPO, is a policy gradient method; it is currently OpenAI's default reinforcement learning algorithm. For the underlying theory, see [the PPO chapter](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5). PPO has two main variants: one adds a KL penalty, the other uses clipping. This article implements the latter, i.e., ```PPO-clip```.
## Pseudocode
Before implementing the algorithm, it helps to understand the pseudocode, shown below:
![在这里插入图片描述](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70.png)
This is a suitable figure found via Google (left unmodified). Here ```k``` denotes the ```k```-th episode. Step 6 optimizes the objective (the part after ```argmax```) with stochastic gradient methods — ascent on the objective, i.e., descent on its negative. If this is hard to follow, see the [PPO paper](https://arxiv.org/abs/1707.06347):
![在这里插入图片描述](assets/20210323154236878.png)
Step 7 is simply a squared loss: the squared difference between the actual return and the predicted value.
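For reference, the clipped surrogate objective maximized in step 6 (from the PPO paper) is

$$
L^{CLIP}(\theta)=\mathbb{E}_t\left[\min \left(r_t(\theta) \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right], \quad r_t(\theta)=\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)}
$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clip range (```policy_clip``` in the code below).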
## Hands-on Code
[Click here for the full code](https://github.com/JohnJim0816/rl-tutorials/tree/master/PPO)
### PPOMemory
Step 3 requires collecting a trajectory, so we define a ```PPOMemory``` buffer to store the relevant quantities:
```python
import numpy as np

class PPOMemory:
def __init__(self, batch_size):
self.states = []
self.probs = []
self.vals = []
self.actions = []
self.rewards = []
self.dones = []
self.batch_size = batch_size
def sample(self):
batch_step = np.arange(0, len(self.states), self.batch_size)
indices = np.arange(len(self.states), dtype=np.int64)
np.random.shuffle(indices)
batches = [indices[i:i+self.batch_size] for i in batch_step]
return np.array(self.states),\
np.array(self.actions),\
np.array(self.probs),\
np.array(self.vals),\
np.array(self.rewards),\
np.array(self.dones),\
batches
def push(self, state, action, probs, vals, reward, done):
self.states.append(state)
self.actions.append(action)
self.probs.append(probs)
self.vals.append(vals)
self.rewards.append(reward)
self.dones.append(done)
def clear(self):
self.states = []
self.probs = []
self.actions = []
self.rewards = []
self.dones = []
self.vals = []
```
Here the ```push``` method stores the collected quantities in the memory, and ```sample``` shuffles them into random mini-batches for the stochastic gradient step (step 6).
### PPO model
The model consists of two networks, an actor and a critic:
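For clarity, here is a minimal usage sketch (with made-up shapes and values, not part of the original code) showing how the buffer is filled and then sampled into mini-batches:
```python
import numpy as np

memory = PPOMemory(batch_size=5)
for t in range(20):  # pretend we collected 20 transitions
    memory.push(state=np.random.rand(4), action=1, probs=-0.69, vals=0.5, reward=1.0, done=False)
states, actions, probs, vals, rewards, dones, batches = memory.sample()
print(states.shape, len(batches))  # (20, 4) and 4 mini-batches of 5 indices each
```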
```python
import torch.nn as nn
from torch.distributions.categorical import Categorical
class Actor(nn.Module):
def __init__(self,n_states, n_actions,
hidden_dim=256):
super(Actor, self).__init__()
self.actor = nn.Sequential(
nn.Linear(n_states, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, n_actions),
nn.Softmax(dim=-1)
)
def forward(self, state):
dist = self.actor(state)
dist = Categorical(dist)
return dist
class Critic(nn.Module):
def __init__(self, n_states,hidden_dim=256):
super(Critic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(n_states, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
def forward(self, state):
value = self.critic(state)
return value
```
Here the actor outputs a probability distribution (a Categorical; other distributions are possible, see torch.distributions), while the critic outputs a value for the current state. The critic's input dimension could also be ```n_states+n_actions```, i.e., feeding the action into the critic as well, which can work somewhat better; interested readers are encouraged to try it.
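A minimal sketch (with hypothetical dimensions) of how the two networks are queried during a rollout:
```python
import torch

actor, critic = Actor(n_states=4, n_actions=2), Critic(n_states=4)
state = torch.rand(1, 4)          # a batch containing a single 4-dimensional state
dist = actor(state)               # Categorical distribution over the 2 actions
action = dist.sample()            # sampled action index
log_prob = dist.log_prob(action)  # log-probability, stored as `probs` in PPOMemory
value = critic(state)             # state-value estimate, stored as `vals`
```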
### PPO update
We define an ```update``` function that implements steps 6 and 7 of the pseudocode:
```python
def update(self):
for _ in range(self.n_epochs):
state_arr, action_arr, old_prob_arr, vals_arr,\
reward_arr, dones_arr, batches = \
self.memory.sample()
values = vals_arr
### compute advantage ###
advantage = np.zeros(len(reward_arr), dtype=np.float32)
for t in range(len(reward_arr)-1):
discount = 1
a_t = 0
for k in range(t, len(reward_arr)-1):
a_t += discount*(reward_arr[k] + self.gamma*values[k+1]*\
(1-int(dones_arr[k])) - values[k])
discount *= self.gamma*self.gae_lambda
advantage[t] = a_t
advantage = torch.tensor(advantage).to(self.device)
### SGD ###
values = torch.tensor(values).to(self.device)
for batch in batches:
states = torch.tensor(state_arr[batch], dtype=torch.float).to(self.device)
old_probs = torch.tensor(old_prob_arr[batch]).to(self.device)
actions = torch.tensor(action_arr[batch]).to(self.device)
dist = self.actor(states)
critic_value = self.critic(states)
critic_value = torch.squeeze(critic_value)
new_probs = dist.log_prob(actions)
prob_ratio = new_probs.exp() / old_probs.exp()
weighted_probs = advantage[batch] * prob_ratio
weighted_clipped_probs = torch.clamp(prob_ratio, 1-self.policy_clip,
1+self.policy_clip)*advantage[batch]
actor_loss = -torch.min(weighted_probs, weighted_clipped_probs).mean()
returns = advantage[batch] + values[batch]
critic_loss = (returns-critic_value)**2
critic_loss = critic_loss.mean()
total_loss = actor_loss + 0.5*critic_loss
self.actor_optimizer.zero_grad()
self.critic_optimizer.zero_grad()
total_loss.backward()
self.actor_optimizer.step()
self.critic_optimizer.step()
self.memory.clear()
```
This function first retrieves the collected trajectory from the memory, then computes the advantage via GAE, updates the networks with stochastic gradient descent, and finally clears the memory so the next trajectory can be collected.
The resulting training curve looks like this:
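In formula form, the nested loop above computes the generalized advantage estimate (GAE)

$$
\hat{A}_t=\sum_{l=0}^{T-t-1}(\gamma \lambda)^{l} \delta_{t+l}, \qquad \delta_t=r_t+\gamma V\left(s_{t+1}\right)\left(1-d_t\right)-V\left(s_t\right)
$$

where $\lambda$ corresponds to ```gae_lambda``` and $d_t$ is the done flag.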
![在这里插入图片描述](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70-20210405110725113.png)

View File

@@ -0,0 +1,202 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 定义算法\n",
"\n",
"最基础的策略梯度算法就是REINFORCE算法又称作Monte-Carlo Policy Gradient算法。我们策略优化的目标如下\n",
"\n",
"$$\n",
"J_{\\theta}= \\Psi_{\\pi} \\nabla_\\theta \\log \\pi_\\theta\\left(a_t \\mid s_t\\right)\n",
"$$\n",
"\n",
"其中$\\Psi_{\\pi}$在REINFORCE算法中表示衰减的回报(具体公式见伪代码)也可以用优势来估计也就是我们熟知的A3C算法这个在后面包括GAE算法中都会讲到。\n",
"\n",
"### 1.1. 策略函数设计\n",
"\n",
"既然策略梯度是直接对策略函数进行梯度计算那么策略函数如何设计呢一般来讲有两种设计方式一个是softmax函数另外一个是高斯分布$\\mathbb{N}\\left(\\phi(\\mathbb{s})^{\\mathbb{\\pi}} \\theta, \\sigma^2\\right)$,前者用于离散动作空间,后者多用于连续动作空间。\n",
"\n",
"softmax函数可以表示为\n",
"$$\n",
"\\pi_\\theta(s, a)=\\frac{e^{\\phi(s, a)^{T_\\theta}}}{\\sum_b e^{\\phi(s, b)^{T^T}}}\n",
"$$\n",
"对应的梯度为:\n",
"$$\n",
"\\nabla_\\theta \\log \\pi_\\theta(s, a)=\\phi(s, a)-\\mathbb{E}_{\\pi_\\theta}[\\phi(s,)\n",
"$$\n",
"高斯分布对应的梯度为:\n",
"$$\n",
"\\nabla_\\theta \\log \\pi_\\theta(s, a)=\\frac{\\left(a-\\phi(s)^T \\theta\\right) \\phi(s)}{\\sigma^2}\n",
"$$\n",
"但是对于一些特殊的情况,例如在本次演示中动作维度=2且为离散空间这个时候可以用伯努利分布来实现这种方式其实是不推荐的这里给大家做演示也是为了展现一些特殊情况启发大家一些思考例如BernoulliBinomialGaussian分布之间的关系。简单说来Binomial分布$n = 1$时就是Bernoulli分布$n \\rightarrow \\infty$时就是Gaussian分布。\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2. 模型设计\n",
"\n",
"前面讲到尽管本次演示是离散空间但是由于动作维度等于2此时就可以用特殊的高斯分布来表示策略函数即伯努利分布。伯努利的分布实际上是用一个概率作为输入然后从中采样动作伯努利采样出来的动作只可能是0或1就像投掷出硬币的正反面。在这种情况下我们的策略模型就需要在MLP的基础上将状态作为输入将动作作为倒数第二层输出并在最后一层增加激活函数来输出对应动作的概率。不清楚激活函数作用的同学可以再看一遍深度学习相关的知识简单来说其作用就是增加神经网络的非线性。既然需要输出对应动作的概率那么输出的值需要处于0-1之间此时sigmoid函数刚好满足我们的需求实现代码参考如下。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"class PGNet(nn.Module):\n",
" def __init__(self, input_dim,output_dim,hidden_dim=128):\n",
" \"\"\" 初始化q网络为全连接网络\n",
" input_dim: 输入的特征数即环境的状态维度\n",
" output_dim: 输出的动作维度\n",
" \"\"\"\n",
" super(PGNet, self).__init__()\n",
" self.fc1 = nn.Linear(input_dim, hidden_dim) # 输入层\n",
" self.fc2 = nn.Linear(hidden_dim,hidden_dim) # 隐藏层\n",
" self.fc3 = nn.Linear(hidden_dim, output_dim) # 输出层\n",
" def forward(self, x):\n",
" x = F.relu(self.fc1(x))\n",
" x = F.relu(self.fc2(x))\n",
" x = torch.sigmoid(self.fc3(x))\n",
" return x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3. 更新函数设计\n",
"\n",
"前面提到我们的优化目标也就是策略梯度算法的损失函数如下:\n",
"$$\n",
"J_{\\theta}= \\Psi_{\\pi} \\nabla_\\theta \\log \\pi_\\theta\\left(a_t \\mid s_t\\right)\n",
"$$\n",
"\n",
"我们需要拆开成两个部分$\\Psi_{\\pi}$和$\\nabla_\\theta \\log \\pi_\\theta\\left(a_t \\mid s_t\\right)$分开计算,首先看值函数部分$\\Psi_{\\pi}$在REINFORCE算法中值函数是从当前时刻开始的衰减回报如下\n",
"$$\n",
"G \\leftarrow \\sum_{k=t+1}^{T} \\gamma^{k-1} r_{k}\n",
"$$\n",
"\n",
"这个实际用代码来实现的时候可能有点绕,我们可以倒过来看,在同一回合下,我们的终止时刻是$T$,那么对应的回报$G_T=\\gamma^{T-1}r_T$,而对应的$G_{T-1}=\\gamma^{T-2}r_{T-1}+\\gamma^{T-1}r_T$,在这里代码中我们使用了一个动态规划的技巧,如下:\n",
"```python\n",
"running_add = running_add * self.gamma + reward_pool[i] # running_add初始值为0\n",
"```\n",
"这个公式也是倒过来循环的,第一次的值等于:\n",
"$$\n",
"running\\_add = r_T\n",
"$$\n",
"第二次的值则等于:\n",
"$$\n",
"running\\_add = r_T*\\gamma+r_{T-1}\n",
"$$\n",
"第三次的值等于:\n",
"$$\n",
"running\\_add = (r_T*\\gamma+r_{T-1})*\\gamma+r_{T-2} = r_T*\\gamma^2+r_{T-1}*\\gamma+r_{T-2}\n",
"$$\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torch.distributions import Bernoulli\n",
"from torch.autograd import Variable\n",
"import numpy as np\n",
"\n",
"class PolicyGradient:\n",
" \n",
" def __init__(self, model,memory,cfg):\n",
" self.gamma = cfg['gamma']\n",
" self.device = torch.device(cfg['device']) \n",
" self.memory = memory\n",
" self.policy_net = model.to(self.device)\n",
" self.optimizer = torch.optim.RMSprop(self.policy_net.parameters(), lr=cfg['lr'])\n",
"\n",
" def sample_action(self,state):\n",
"\n",
" state = torch.from_numpy(state).float()\n",
" state = Variable(state)\n",
" probs = self.policy_net(state)\n",
" m = Bernoulli(probs) # 伯努利分布\n",
" action = m.sample()\n",
" \n",
" action = action.data.numpy().astype(int)[0] # 转为标量\n",
" return action\n",
" def predict_action(self,state):\n",
"\n",
" state = torch.from_numpy(state).float()\n",
" state = Variable(state)\n",
" probs = self.policy_net(state)\n",
" m = Bernoulli(probs) # 伯努利分布\n",
" action = m.sample()\n",
" action = action.data.numpy().astype(int)[0] # 转为标量\n",
" return action\n",
" \n",
" def update(self):\n",
" state_pool,action_pool,reward_pool= self.memory.sample()\n",
" state_pool,action_pool,reward_pool = list(state_pool),list(action_pool),list(reward_pool)\n",
" # Discount reward\n",
" running_add = 0\n",
" for i in reversed(range(len(reward_pool))):\n",
" if reward_pool[i] == 0:\n",
" running_add = 0\n",
" else:\n",
" running_add = running_add * self.gamma + reward_pool[i]\n",
" reward_pool[i] = running_add\n",
"\n",
" # Normalize reward\n",
" reward_mean = np.mean(reward_pool)\n",
" reward_std = np.std(reward_pool)\n",
" for i in range(len(reward_pool)):\n",
" reward_pool[i] = (reward_pool[i] - reward_mean) / reward_std\n",
"\n",
" # Gradient Desent\n",
" self.optimizer.zero_grad()\n",
"\n",
" for i in range(len(reward_pool)):\n",
" state = state_pool[i]\n",
" action = Variable(torch.FloatTensor([action_pool[i]]))\n",
" reward = reward_pool[i]\n",
" state = Variable(torch.from_numpy(state).float())\n",
" probs = self.policy_net(state)\n",
" m = Bernoulli(probs)\n",
" loss = -m.log_prob(action) * reward # Negtive score function x reward\n",
" # print(loss)\n",
" loss.backward()\n",
" self.optimizer.step()\n",
" self.memory.clear()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.7.13 ('easyrl')",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.7.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "8994a120d39b6e6a2ecc94b4007f5314b68aa69fc88a7f00edf21be39b41f49c"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

23
notebooks/README.md Normal file
View File

@@ -0,0 +1,23 @@
# Mushroom Book (EasyRL) Companion Code
## Installation
Currently supports Python 3.7 and Gym 0.25.2.
Create the Conda environment (Anaconda must be installed first):
```bash
conda create -n joyrl python=3.7
conda activate joyrl
pip install -r requirements.txt
```
Install Torch:
```bash
# CPU
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cpuonly -c pytorch
# GPU
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
# GPU install via pip (CUDA 11.3 wheels)
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
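After installation, an optional sanity check (assuming the ```joyrl``` environment above is activated) is to print the installed Torch version:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```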

896
notebooks/Sarsa.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

View File

@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 值迭代算法\n",
"作者stzhao\n",
"github: https://github.com/zhaoshitian"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 一、定义环境\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import sys,os\n",
"curr_path = os.path.abspath('')\n",
"parent_path = os.path.dirname(curr_path)\n",
"sys.path.append(parent_path)\n",
"from envs.simple_grid import DrunkenWalkEnv"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"def all_seed(env,seed = 1):\n",
" ## 这个函数主要是为了固定随机种子\n",
" import numpy as np\n",
" import random\n",
" import os\n",
" env.seed(seed) \n",
" np.random.seed(seed)\n",
" random.seed(seed)\n",
" os.environ['PYTHONHASHSEED'] = str(seed) \n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"env = DrunkenWalkEnv(map_name=\"theAlley\")\n",
"all_seed(env, seed = 1) # 设置随机种子为1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 二、价值迭代算法\n"
]
},
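{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell iterates the Q-value form of the Bellman optimality backup (shown here as a reference sketch, using the same quantities as the code):\n",
"$$\n",
"Q_{k+1}(s, a) = \\sum_{s'} P(s' \\mid s, a)\\left[r(s, a, s') + \\gamma \\max_{a'} Q_k(s', a')\\right]\n",
"$$\n",
"and it stops once the largest change `delta` falls below `theta`, or after 100 sweeps.\n"
]
},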
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def value_iteration(env, theta=0.005, discount_factor=0.9):\n",
" Q = np.zeros((env.nS, env.nA)) # 初始化一个Q表格\n",
" count = 0\n",
" while True:\n",
" delta = 0.0\n",
" Q_tmp = np.zeros((env.nS, env.nA))\n",
" for state in range(env.nS):\n",
" for a in range(env.nA):\n",
" accum = 0.0\n",
" reward_total = 0.0\n",
" for prob, next_state, reward, done in env.P[state][a]:\n",
" accum += prob* np.max(Q[next_state, :])\n",
" reward_total += prob * reward\n",
" Q_tmp[state, a] = reward_total + discount_factor * accum\n",
" delta = max(delta, abs(Q_tmp[state, a] - Q[state, a]))\n",
" Q = Q_tmp\n",
" \n",
" count += 1\n",
" if delta < theta or count > 100: # 这里设置了即使算法没有收敛跑100次也退出循环\n",
" break \n",
" return Q"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[2.25015697e+22 2.53142659e+22 4.50031394e+22 2.53142659e+22]\n",
" [2.81269621e+22 5.41444021e+22 1.01257064e+23 5.41444021e+22]\n",
" [6.32856648e+22 1.21824905e+23 2.27828393e+23 1.21824905e+23]\n",
" [1.42392746e+23 2.74106036e+23 5.12613885e+23 2.74106036e+23]\n",
" [3.20383678e+23 5.76690620e+23 1.15338124e+24 5.76690620e+23]\n",
" [7.20863276e+23 1.38766181e+24 2.59510779e+24 1.38766181e+24]\n",
" [1.62194237e+24 3.12223906e+24 5.83899253e+24 3.12223906e+24]\n",
" [3.64937033e+24 7.02503789e+24 1.31377332e+25 7.02503789e+24]\n",
" [8.21108325e+24 1.47799498e+25 2.95598997e+25 1.47799498e+25]\n",
" [1.84749373e+25 3.55642543e+25 6.65097743e+25 3.55642543e+25]\n",
" [4.15686089e+25 8.00195722e+25 1.49646992e+26 8.00195722e+25]\n",
" [9.35293701e+25 1.80044037e+26 3.36705732e+26 1.80044037e+26]\n",
" [5.89235032e+26 7.36543790e+26 7.57587898e+26 7.36543790e+26]]\n"
]
}
],
"source": [
"Q = value_iteration(env)\n",
"print(Q)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n"
]
}
],
"source": [
"policy = np.zeros([env.nS, env.nA]) # 初始化一个策略表格\n",
"for state in range(env.nS):\n",
" best_action = np.argmax(Q[state, :]) #根据价值迭代算法得到的Q表格选择出策略\n",
" policy[state, best_action] = 1\n",
"\n",
"policy = [int(np.argwhere(policy[i]==1)) for i in range(env.nS) ]\n",
"print(policy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 三、测试"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"num_episode = 1000 # 测试1000次\n",
"def test(env,policy):\n",
" \n",
" rewards = [] # 记录所有回合的奖励\n",
" success = [] # 记录该回合是否成功走到终点\n",
" for i_ep in range(num_episode):\n",
" ep_reward = 0 # 记录每个episode的reward\n",
" state = env.reset() # 重置环境, 重新开一局(即开始新的一个回合) 这里state=0\n",
" while True:\n",
" action = policy[state] # 根据算法选择一个动作\n",
" next_state, reward, done, _ = env.step(action) # 与环境进行一个交互\n",
" state = next_state # 更新状态\n",
" ep_reward += reward\n",
" if done:\n",
" break\n",
" if state==12: # 即走到终点\n",
" success.append(1)\n",
" else:\n",
" success.append(0)\n",
" rewards.append(ep_reward)\n",
" acc_suc = np.array(success).sum()/num_episode\n",
" print(\"测试的成功率是:\", acc_suc)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"测试的成功率是: 0.64\n"
]
}
],
"source": [
"test(env, policy)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('RL')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "88a829278351aa402b7d6303191a511008218041c5cfdb889d81328a3ea60fbc"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,153 @@
# This code comes from OpenAI Baselines and is used to run environments in parallel subprocesses
# https://github.com/openai/baselines/tree/master/baselines/common/vec_env
import numpy as np
from multiprocessing import Process, Pipe
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
class VecEnv(object):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
def close(self):
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.nenvs = nenvs
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def __len__(self):
return self.nenvs

243
notebooks/envs/racetrack.py Normal file
View File

@@ -0,0 +1,243 @@
import time
import random
import numpy as np
import os
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
from IPython.display import clear_output
from gym.spaces import Discrete,Box
from gym import Env
from matplotlib import colors
class RacetrackEnv(Env) :
"""
Class representing a race-track environment inspired by exercise 5.12 in Sutton & Barto 2018 (p.111).
Please do not make changes to this class - it will be overwritten with a clean version when it comes to marking.
The dynamics of this environment are detailed in this coursework exercise's jupyter notebook, although I have
included rather verbose comments here for those of you who are interested in how the environment has been
implemented (though this should not impact your solution code).
"""
ACTIONS_DICT = {
0 : (1, -1), # Acc Vert., Brake Horiz.
1 : (1, 0), # Acc Vert., Hold Horiz.
2 : (1, 1), # Acc Vert., Acc Horiz.
3 : (0, -1), # Hold Vert., Brake Horiz.
4 : (0, 0), # Hold Vert., Hold Horiz.
5 : (0, 1), # Hold Vert., Acc Horiz.
6 : (-1, -1), # Brake Vert., Brake Horiz.
7 : (-1, 0), # Brake Vert., Hold Horiz.
8 : (-1, 1) # Brake Vert., Acc Horiz.
}
CELL_TYPES_DICT = {
0 : "track",
1 : "wall",
2 : "start",
3 : "goal"
}
metadata = {'render_modes': ['human'],
"render_fps": 4,}
def __init__(self,render_mode = 'human') :
# Load racetrack map from file.
self.track = np.flip(np.loadtxt(os.path.dirname(__file__)+"/track.txt", dtype = int), axis = 0)
# Discover start grid squares.
self.initial_states = []
for y in range(self.track.shape[0]) :
for x in range(self.track.shape[1]) :
if (self.CELL_TYPES_DICT[self.track[y, x]] == "start") :
self.initial_states.append((y, x))
high= np.array([np.finfo(np.float32).max, np.finfo(np.float32).max, np.finfo(np.float32).max, np.finfo(np.float32).max])
self.observation_space = Box(low=-high, high=high, shape=(4,), dtype=np.float32)
self.action_space = Discrete(9)
self.is_reset = False
def step(self, action : int) :
"""
Takes a given action in the environment's current state, and returns a next state,
reward, and whether the next state is done or not.
Arguments:
action {int} -- The action to take in the environment's current state. Should be an integer in the range [0-8].
Raises:
RuntimeError: Raised when the environment needs resetting.\n
TypeError: Raised when an action of an invalid type is given.\n
ValueError: Raised when an action outside the range [0-8] is given.\n
Returns:
A tuple of:\n
{(int, int, int, int)} -- The next state, a tuple of (y_pos, x_pos, y_velocity, x_velocity).\n
{int} -- The reward earned by taking the given action in the current environment state.\n
{bool} -- Whether the environment's next state is done or not.\n
"""
# Check whether a reset is needed.
if (not self.is_reset) :
raise RuntimeError(".step() has been called when .reset() is needed.\n" +
"You need to call .reset() before using .step() for the first time, and after an episode ends.\n" +
".reset() initialises the environment at the start of an episode, then returns an initial state.")
# Check that action is the correct type (either a python integer or a numpy integer).
if (not (isinstance(action, int) or isinstance(action, np.integer))) :
raise TypeError("action should be an integer.\n" +
"action value {} of type {} was supplied.".format(action, type(action)))
# Check that action is an allowed value.
if (action < 0 or action > 8) :
raise ValueError("action must be an integer in the range [0-8] corresponding to one of the legal actions.\n" +
"action value {} was supplied.".format(action))
# Update Velocity.
# With probability 0.8, update velocity components as intended.
if (np.random.uniform() < 0.8) :
(d_y, d_x) = self.ACTIONS_DICT[action]
# With probability 0.2, do not change the velocity components.
else :
(d_y, d_x) = (0, 0)
self.velocity = (self.velocity[0] + d_y, self.velocity[1] + d_x)
# Keep velocity within bounds (-10, 10). Note: self.velocity is a tuple, so rebuild it rather than assigning to an index.
self.velocity = (max(-10, min(10, self.velocity[0])), max(-10, min(10, self.velocity[1])))
# Update Position.
new_position = (self.position[0] + self.velocity[0], self.position[1] + self.velocity[1])
reward = 0
done = False
# If position is out-of-bounds, return to start and set velocity components to zero.
if (new_position[0] < 0 or new_position[1] < 0 or new_position[0] >= self.track.shape[0] or new_position[1] >= self.track.shape[1]) :
self.position = random.choice(self.initial_states)
self.velocity = (0, 0)
reward -= 10
# If position is in a wall grid-square, return to start and set velocity components to zero.
elif (self.CELL_TYPES_DICT[self.track[new_position]] == "wall") :
self.position = random.choice(self.initial_states)
self.velocity = (0, 0)
reward -= 10
# If position is in a track grid-square or a start-square, update position.
elif (self.CELL_TYPES_DICT[self.track[new_position]] in ["track", "start"]) :
self.position = new_position
# If position is in a goal grid-square, end episode.
elif (self.CELL_TYPES_DICT[self.track[new_position]] == "goal") :
self.position = new_position
reward += 10
done = True
# If this gets reached, then the student has touched something they shouldn't have. Naughty!
else :
raise RuntimeError("You've met with a terrible fate, haven't you?\nDon't modify things you shouldn't!")
# Penalise every timestep.
reward -= 1
# Require a reset if the current state is done.
if (done) :
self.is_reset = False
# Return next state, reward, and whether the episode has ended.
return np.array([self.position[0], self.position[1], self.velocity[0], self.velocity[1]]), reward, done,{}
def reset(self,seed=None) :
"""
Resets the environment, ready for a new episode to begin, then returns an initial state.
The initial state will be a starting grid square randomly chosen using a uniform distribution,
with both components of the velocity being zero.
Returns:
{(int, int, int, int)} -- an initial state, a tuple of (y_pos, x_pos, y_velocity, x_velocity).
"""
# Pick random starting grid-square.
self.position = random.choice(self.initial_states)
# Set both velocity components to zero.
self.velocity = (0, 0)
self.is_reset = True
return np.array([self.position[0], self.position[1], self.velocity[0], self.velocity[1]])
def render(self, render_mode = 'human') :
"""
Renders a pretty matplotlib plot representing the current state of the environment.
Calling this method on subsequent timesteps will update the plot.
This is VERY VERY SLOW and will slow down training a lot. Only use for debugging/testing.
Arguments:
render_mode {str} -- Only 'human' rendering is supported; each call pauses for 0.1 seconds before returning.
"""
# Turn interactive render_mode on.
plt.ion()
fig = plt.figure(num = "env_render")
ax = plt.gca()
ax.clear()
clear_output(wait = True)
# Prepare the environment plot and mark the car's position.
env_plot = np.copy(self.track)
env_plot[self.position] = 4
env_plot = np.flip(env_plot, axis = 0)
# Plot the gridworld.
cmap = colors.ListedColormap(["white", "black", "green", "red", "yellow"])
bounds = list(range(6))
norm = colors.BoundaryNorm(bounds, cmap.N)
ax.imshow(env_plot, cmap = cmap, norm = norm, zorder = 0)
# Plot the velocity.
if (not self.velocity == (0, 0)) :
ax.arrow(self.position[1], self.track.shape[0] - 1 - self.position[0], self.velocity[1], -self.velocity[0],
path_effects=[pe.Stroke(linewidth=1, foreground='black')], color = "yellow", width = 0.1, length_includes_head = True, zorder = 2)
# Set up axes.
ax.grid(which = 'major', axis = 'both', linestyle = '-', color = 'k', linewidth = 2, zorder = 1)
ax.set_xticks(np.arange(-0.5, self.track.shape[1] , 1));
ax.set_xticklabels([])
ax.set_yticks(np.arange(-0.5, self.track.shape[0], 1));
ax.set_yticklabels([])
# Draw everything.
#fig.canvas.draw()
#fig.canvas.flush_events()
plt.show()
# time sleep
time.sleep(0.1)
def get_actions(self) :
"""
Returns the available actions in the current state - will always be a list
of integers in the range [0-8].
"""
return [*self.ACTIONS_DICT]
if __name__ == "__main__":
num_steps = 1000000
env = RacetrackEnv()
state = env.reset()
print(state)
for _ in range(num_steps) :
next_state, reward, done,_ = env.step(random.choice(env.get_actions()))
print(next_state)
env.render()
if (done) :
_ = env.reset()

View File

@@ -0,0 +1,303 @@
#!/usr/bin/env python
# simple_grid.py
# based on frozen_lake.py
# adapted by Frans Oliehoek.
#
import sys
from contextlib import closing
import numpy as np
from io import StringIO
#from six import StringIO, b
import gym
from gym import utils
from gym import Env, spaces
from gym.utils import seeding
def categorical_sample(prob_n, np_random):
"""
Sample from categorical distribution
Each row specifies class probabilities
"""
prob_n = np.asarray(prob_n)
csprob_n = np.cumsum(prob_n)
return (csprob_n > np_random.rand()).argmax()
class DiscreteEnv(Env):
"""
Has the following members
- nS: number of states
- nA: number of actions
- P: transitions (*)
- isd: initial state distribution (**)
(*) dictionary of lists, where
P[s][a] == [(probability, nextstate, reward, done), ...]
(**) list or array of length nS
"""
def __init__(self, nS, nA, P, isd):
self.P = P
self.isd = isd
self.lastaction = None # for rendering
self.nS = nS
self.nA = nA
self.action_space = spaces.Discrete(self.nA)
self.observation_space = spaces.Discrete(self.nS)
self.seed()
self.s = categorical_sample(self.isd, self.np_random)
def seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
def reset(self):
self.s = categorical_sample(self.isd, self.np_random)
self.lastaction = None
return int(self.s)
def step(self, a):
transitions = self.P[self.s][a]
i = categorical_sample([t[0] for t in transitions], self.np_random)
p, s, r, d = transitions[i]
self.s = s
self.lastaction = a
return (int(s), r, d, {"prob": p})
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
MAPS = {
"theAlley": [
"S...H...H...G"
],
"walkInThePark": [
"S.......",
".....H..",
"........",
"......H.",
"........",
"...H...G"
],
"1Dtest": [
],
"4x4": [
"S...",
".H.H",
"...H",
"H..G"
],
"8x8": [
"S.......",
"........",
"...H....",
".....H..",
"...H....",
".HH...H.",
".H..H.H.",
"...H...G"
],
}
POTHOLE_PROB = 0.2
BROKEN_LEG_PENALTY = -5
SLEEP_DEPRIVATION_PENALTY = -0.0
REWARD = 10
def generate_random_map(size=8, p=0.8):
"""Generates a random valid map (one that has a path from start to goal)
:param size: size of each side of the grid
:param p: probability that a tile is frozen
"""
valid = False
# DFS to check that it's a valid path.
def is_valid(res):
frontier, discovered = [], set()
frontier.append((0,0))
while frontier:
r, c = frontier.pop()
if not (r,c) in discovered:
discovered.add((r,c))
directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
for x, y in directions:
r_new = r + x
c_new = c + y
if r_new < 0 or r_new >= size or c_new < 0 or c_new >= size:
continue
if res[r_new][c_new] == 'G':
return True
if (res[r_new][c_new] not in '#H'):
frontier.append((r_new, c_new))
return False
while not valid:
p = min(1, p)
res = np.random.choice(['.', 'H'], (size, size), p=[p, 1-p])
res[0][0] = 'S'
res[-1][-1] = 'G'
valid = is_valid(res)
return ["".join(x) for x in res]
class DrunkenWalkEnv(DiscreteEnv):
"""
A simple grid environment, completely based on the code of 'FrozenLake', credits to
the original authors.
You're finding your way home (G) after a great party which was happening at (S).
Unfortunately, due to recreational intoxication you find yourself only moving into
the intended direction 80% of the time, and perpendicular to that the other 20%.
To make matters worse, the local community has been cutting the budgets for pavement
maintenance, which means that the way to home is full of potholes, which are very likely
to make you trip. If you fall, you are obviously magically transported back to the party,
without getting some of that hard-earned sleep.
S...
.H.H
...H
H..G
S : starting point
. : normal pavement
H : pothole, you have a POTHOLE_PROB chance of tripping
G : goal, time for bed
The episode ends when you reach the goal or trip.
You receive a reward of +10 if you reach the goal,
but get a SLEEP_DEPRIVATION_PENALTY and otherwise.
"""
metadata = {'render.modes': ['human', 'ansi']}
def __init__(self, desc=None, map_name="4x4",is_slippery=True):
""" This generates a map and sets all transition probabilities.
(by passing constructed nS, nA, P, isd to DiscreteEnv)
"""
if desc is None and map_name is None:
desc = generate_random_map()
elif desc is None:
desc = MAPS[map_name]
self.desc = desc = np.asarray(desc,dtype='c')
self.nrow, self.ncol = nrow, ncol = desc.shape
self.reward_range = (0, 1)
nA = 4
nS = nrow * ncol
isd = np.array(desc == b'S').astype('float64').ravel()
isd /= isd.sum()
# We need to pass 'P' to DiscreteEnv:
# P dictionary dict of dicts of lists, where
# P[s][a] == [(probability, nextstate, reward, done), ...]
P = {s : {a : [] for a in range(nA)} for s in range(nS)}
def convert_rc_to_s(row, col):
return row*ncol + col
#def inc(row, col, a):
def intended_destination(row, col, a):
if a == LEFT:
col = max(col-1,0)
elif a == DOWN:
row = min(row+1,nrow-1)
elif a == RIGHT:
col = min(col+1,ncol-1)
elif a == UP:
row = max(row-1,0)
return (row, col)
def construct_transition_for_intended(row, col, a, prob, li):
""" this constructs a transition to the "intended_destination(row, col, a)"
and adds it to the transition list (which could be for a different action b).
"""
newrow, newcol = intended_destination(row, col, a)
newstate = convert_rc_to_s(newrow, newcol)
newletter = desc[newrow, newcol]
done = bytes(newletter) in b'G'
rew = REWARD if newletter == b'G' else SLEEP_DEPRIVATION_PENALTY
li.append( (prob, newstate, rew, done) )
#THIS IS WHERE THE MATRIX OF TRANSITION PROBABILITIES IS COMPUTED.
for row in range(nrow):
for col in range(ncol):
# specify transitions for s=(row, col)
s = convert_rc_to_s(row, col)
letter = desc[row, col]
for a in range(4):
# specify transitions for action a
li = P[s][a]
if letter in b'G':
# We are at the goal ('G')....
# This is a strange case:
# - conceptually, we can think of this as:
# always transition to a 'terminated' state where we will get 0 reward.
#
# - But in gym, in practice, this case should not be happening at all!!!
# Gym will already have returned 'done' when transitioning TO the goal state (not from it).
# So we will never use the transition probabilities *from* the goal state.
# So, from gym's perspective we could specify anything we like here. E.g.,:
# li.append((1.0, 59, 42000000, True))
#
# However, if we want to be able to use the transition matrix to do value iteration, it is important
# that we get 0 reward ever after.
li.append((1.0, s, 0, True))
if letter in b'H':
#We are at a pothole ('H')
#when we are at a pothole, we trip with prob. POTHOLE_PROB
li.append((POTHOLE_PROB, s, BROKEN_LEG_PENALTY, True))
construct_transition_for_intended(row, col, a, 1.0 - POTHOLE_PROB, li)
else:
# We are at normal pavement (.)
# with prob. 0.8 we move as intended:
construct_transition_for_intended(row, col, a, 0.8, li)
# but with prob. 0.1 each, we move perpendicular to the intended direction:
for b in [(a-1)%4, (a+1)%4]:
construct_transition_for_intended(row, col, b, 0.1, li)
super(DrunkenWalkEnv, self).__init__(nS, nA, P, isd)
def action_to_string(self, action_index):
s ="{}".format(["Left","Down","Right","Up"][action_index])
return s
def render(self, mode='human'):
outfile = StringIO() if mode == 'ansi' else sys.stdout
row, col = self.s // self.ncol, self.s % self.ncol
desc = self.desc.tolist()
desc = [[c.decode('utf-8') for c in line] for line in desc]
desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True)
if self.lastaction is not None:
outfile.write(" (last action was '{action}')\n".format( action=self.action_to_string(self.lastaction) ) )
else:
outfile.write("\n")
outfile.write("\n".join(''.join(line) for line in desc)+"\n")
if mode != 'human':
with closing(outfile):
return outfile.getvalue()
if __name__ == "__main__":
# env = DrunkenWalkEnv(map_name="walkInThePark")
env = DrunkenWalkEnv(map_name="theAlley")
n_states = env.observation_space.n
n_actions = env.action_space.n

15
notebooks/envs/track.txt Normal file
View File

@@ -0,0 +1,15 @@
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 0 0 0 0 3 3 3 3 3 1
1 1 1 1 1 1 0 0 0 0 0 0 0 3 3 3 3 3 1
1 1 1 1 1 0 0 0 0 0 0 0 0 3 3 3 3 3 1
1 1 1 1 0 0 0 0 0 0 0 0 0 3 3 3 3 3 1
1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Binary file not shown.

After

Width:  |  Height:  |  Size: 121 KiB

View File

@@ -0,0 +1,11 @@
pyyaml==6.0
ipykernel==6.15.1
jupyter==1.0.0
matplotlib==3.5.3
seaborn==0.12.1
dill==0.3.5.1
argparse==1.4.0
pandas==1.3.5
pyglet==1.5.26
importlib-metadata<5.0
setuptools==65.2.0

View File

@@ -0,0 +1,123 @@
## Action-dependent Control Variates for Policy Optimization via Stein's Identity
Authors: Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, Qiang Liu
Venue: ICLR 2018
Paper link: https://arxiv.org/abs/1710.11198
**Highlight: proposes a method for reducing the variance of policy-gradient estimators and a systematic way to construct baseline functions, lowering variance during training and improving sample efficiency.**
### **Motivation (Why):**
Policy gradient algorithms tend to have high variance in their gradient estimates, which hurts sample efficiency: many samples are needed before the estimate becomes reliable. Previous work reduced variance with state-dependent baselines, but with limited success; this paper studies baselines that depend on both the state and the action.
### **Main Idea (What):**
#### **Policy Gradient Recap**
**Policy gradient**
A reinforcement learning problem can be viewed as a Markov decision process over environment states $s \in S$ and agent actions $a \in A$. In an unknown environment, the process is governed by a transition probability $T\left(s^{\prime} \mid s, a\right)$ and a reward $r(s, a)$ received after taking action $a$ in state $s$. The agent's action $a$ is chosen by a policy $\pi(a \mid s)$. In policy gradient methods, we consider a family of candidate policies $\pi_\theta(a \mid s)$ parameterized by $\theta$ and obtain the best policy by maximizing the expected cumulative reward, or return,
$$
J(\theta)=\mathbb{E}_{s \sim \rho_\pi, a \sim \pi(a \mid s)}[R(\tau)],
$$
The gradient of $J(\theta)$ can be written as
$$
\nabla_\theta J(\theta)=\mathbb{E}_\pi\left[\nabla_\theta \log \pi(a \mid s) Q^\pi(s, a)\right],
$$
where
$$Q^\pi(s, a)=\mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r\left(s_t, a_t\right) \mid s_1=s, a_1=a\right]$$
The simplest way to estimate $\nabla_\theta J(\theta)$ is to collect many samples $\left\{\left(s_t, a_t, r_t\right)\right\}_{t=1}^n$ and form the Monte-Carlo estimate
$$
\hat{\nabla}_\theta J(\theta)=\frac{1}{n} \sum_{t=1}^n \gamma^{t-1} \nabla_\theta \log \pi\left(a_t \mid s_t\right) \hat{Q}^\pi\left(s_t, a_t\right),
$$
where $\hat{Q}^\pi\left(s_t, a_t\right)$ is an estimate of $Q^\pi\left(s_t, a_t\right)$, e.g., $\hat{Q}^\pi\left(s_t, a_t\right)=\sum_{j \geq t} \gamma^{j-t} r_j$.
This estimator, however, has high variance, so control variates are introduced to reduce the variance while keeping the expectation unchanged.
**Control variates**
To estimate an expectation $\mu=\mathbb{E}_\tau[g(s, a)]$, find a function $f(s, a)$ satisfying $\mathbb{E}_\tau[f(s, a)]=0$. Then $\mu$ can be estimated by
$$
\hat{\mu}=\frac{1}{n} \sum_{t=1}^n\left(g\left(s_t, a_t\right)-f\left(s_t, a_t\right)\right),
$$
Its variance is $\operatorname{var}_\tau(g-f) / n$ and it remains unbiased (the expectation is still $\mu$). The key is to find a suitable $f$. Previous work typically used an estimate of the state-value function $V(s)$, or a constant, as the control variate, since such functions do not bias the gradient estimate. The gradient estimator with a control variate is:
$$
\hat{\nabla}_\theta J(\theta)=\frac{1}{n} \sum_{t=1}^n \nabla_\theta \log \pi\left(a_t \mid s_t\right) (\hat{Q}^\pi\left(s_t, a_t\right)-\phi(s_t)),
$$
However, state-dependent control variates alone cannot reduce the variance to zero; ideally, we would like a control variate that depends on both the state and the action.
#### Policy Gradient with Stein Control Variates
Using Stein's identity, a control variate $\phi(s,a)$ that depends on both state and action is introduced. A dimensional mismatch arises in the process, which is resolved (and proven) via a reparameterization trick; this yields a construction method for Stein control variates and, finally, a whole family of them.
**Stein's identity**
By Stein's identity, for any function $\phi(s, a)$ satisfying suitable regularity conditions,
$$
\mathbb{E}_{\pi(a \mid s)}\left[\nabla_a \log \pi(a \mid s) \phi(s, a)+\nabla_a \phi(s, a)\right]=0, \quad \forall s
$$
This suggests how to construct control variates. Note that the left-hand side of the identity can be written as $\int \nabla_a(\pi(a \mid s) \phi(s, a)) d a$.
**Stein control variates**
The dimension of the left-hand side above does not match that of the policy gradient: the former is taken with respect to $a$, the latter with respect to $\theta$. We therefore need a link between $\nabla_a \log \pi(a \mid s)$ and $\nabla_\theta \log \pi(a \mid s)$ so that Stein's identity yields control variates usable in the policy gradient. This is done as follows.
We can express $a \sim \pi_\theta(a \mid s)$ as $a=f_\theta(s, \xi)$, where $\xi$ is a random noise independent of $\theta$. The paper writes $\pi(a, \xi \mid s)$ for the joint distribution of $(a, \xi)$ given $s$, so that $\pi(a \mid s)=\int \pi(a \mid s, \xi) \pi(\xi) d \xi$, where $\pi(\xi)$ is the density of $\xi$ and $\pi(a \mid s, \xi)=\delta(a-f(s, \xi))$ with $\delta$ the Dirac delta function.
<img src="img/Stein1.png" alt="image-20221129094054025" style="zoom:67%;" />
The figure above shows Theorem 3.1 of the paper; it closes the dimensional gap mentioned earlier and allows control variates to be constructed from Stein's identity. In Equations 8–9, the authors then plug the control variate into the policy gradient and give the resulting estimator.
**Constructing the control variate**
Two approaches are considered for constructing the control variate: one fits $\phi(s,a)$ to the Q function so that it is as close to $Q$ as possible, thereby reducing the variance; the other directly minimizes the variance of the estimator.
### **Main Contribution (How):**
The paper studies Stein control variates, a variance-reduction technique for policy gradients that improves sample efficiency. The proposed method generalizes several earlier approaches and demonstrates practical advantages on several challenging RL tasks.
#### Algorithm
<img src="img/Stein2.png" alt="截屏2022-12-05 19.58.49" style="zoom:67%;" />
PPO with Stein control variates.
#### Experiments
The proposed variance-reduction method is combined with PPO and TRPO and evaluated on continuous-control MuJoCo environments, showing that baseline functions built with Stein control variates significantly improve sample efficiency and speed up training.
All experiments use Gaussian noise. Following the earlier discussion, the baseline is set to $\phi_w(s, a)=\hat{V}^\pi(s)+\psi_w(s, a)$, where $\hat{V}^\pi$ is an estimate of the value function and $\psi_w(s, a)$ is a function with parameters $w$; $w$ is obtained either by fitting the Q function (FitQ) or by minimizing the variance (MinVar). The authors experiment with linear, quadratic, and fully connected neural-network forms of $\psi_w(s, a)$, with the following results:
<img src="img/Stein3.png" alt="image 2022-11-29 100857" style="zoom:80%;" />
The authors then run TRPO on Walker2d-v1 and Hopper-v1, finding that all variants that use Stein control variates to reduce variance outperform the earlier Q-Prop algorithm.
<img src="img/Stein4.png" alt="2022-11-29 101823" style="zoom: 80%;" />
<img src="img/Stein5.png" alt="2022-11-29 101905" style="zoom:80%;" />
Finally, the authors evaluate PPO with Stein control variates:
![2022-11-29 103457](img/Stein6.png)
#### Advantages of the proposed method:
1. It effectively reduces the variance of the gradient estimator and improves sample efficiency.
2. It allows more flexible baseline functions, including linear, quadratic, and general non-linear forms.
### About the Author
吴文昊, master's student at Xi'an Jiaotong University. Contact: wwhwwh05@qq.com

View File

@@ -0,0 +1,147 @@
## Bridging the Gap Between Value and Policy Based Reinforcement Learning
Authors: Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans
Venue: NIPS 2017, Google Brain
Paper link: https://arxiv.org/abs/1702.08892
**Highlight: introduces entropy regularization and proposes a stable off-policy training method for reinforcement learning**
### **Motivation (Why):**
Policy-based and value-based RL algorithms each have their own strengths and weaknesses: policy-based methods are mostly on-policy, stable but sample-inefficient, while value-based methods are mostly off-policy, sample-efficient but unstable. Prior attempts to combine the two left some theoretical issues unresolved, so there is still much room for improvement. Using entropy regularization, the authors study the relationship between the policy and softmax value consistency and derive a stable, policy-based, off-policy reinforcement learning algorithm.
### **Main Idea (What):**
#### **Notation**
The core idea is to perturb the action selection and modify the optimization objective.
The authors write $O_{ER}(s,\pi)$ for the expected return obtained by following policy $\pi$ from state $s$, playing the same role as the familiar Q function $Q(s,a)$.
<img src="img/PCL1.png" alt="1" style="zoom:150%;" />
Following the Bellman equation, the optimal value function $V^{\circ}(s)$ for a given state and the optimal policy $\pi^{\circ}$ are defined as:
$$
V^{\circ}(s)=\max _\pi O_{\mathrm{ER}}(s, \pi) \\
\pi^{\circ}=\operatorname{argmax}_\pi O_{\mathrm{ER}}(s, \pi).
$$
These can be written in the following recursive form:
![2](img/PCL2.png)
#### Consistency Analysis
The paper studies the optimal state value in a softmax fashion. “Softmax” here is in contrast to a hard max: instead of always picking the highest-value action in each state, an entropy regularizer is added so that the chosen action is only “softly” value-maximal; the entropy term also helps prevent convergence to sub-optimal solutions.
The regularized expected reward takes the following form (ENT stands for entropy):
$$
O_{\mathrm{ENT}}(s, \pi)=O_{\mathrm{ER}}(s, \pi)+\tau \mathbb{H}(s, \pi),
$$
where $\tau$ is a tunable parameter and $\mathbb{H}(s, \pi)$ is defined as:
$$
\mathbb{H}(s, \pi)=\sum_a \pi(a \mid s)\left[-\log \pi(a \mid s)+\gamma \mathbb{H}\left(s^{\prime}, \pi\right)\right] .
$$
The regularized expected reward can also be written recursively:
$$
O_{\mathrm{ENT}}(s, \pi)=\sum_a \pi(a \mid s)\left[r(s, a)-\tau \log \pi(a \mid s)+\gamma O_{\mathrm{ENT}}\left(s^{\prime}, \pi\right)\right]
$$
Let $V^*(s)=\max _\pi O_{\mathrm{ENT}}(s, \pi)$ denote the soft optimal value of state $s$, and let $\pi^*(a \mid s)$ denote the optimal policy, i.e., the action distribution that attains the soft optimal value in state $s$. The optimal action is no longer deterministic, because the entropy term perturbs the maximization and encourages the policy to stay stochastic. The optimal policy is written as:
$$
\pi^*(a \mid s) \propto \exp \left\{\left(r(s, a)+\gamma V^*\left(s^{\prime}\right)\right) / \tau\right\}
$$
Substituting this into the previous expressions gives
$$
V^*(s)=O_{\mathrm{ENT}}\left(s, \pi^*\right)=\tau \log \sum_a \exp \left\{\left(r(s, a)+\gamma V^*\left(s^{\prime}\right)\right) / \tau\right\} .\\
Q^*(s, a)=r(s, a)+\gamma V^*\left(s^{\prime}\right)=r(s, a)+\gamma \tau \log \sum_{a^{\prime}} \exp \left(Q^*\left(s^{\prime}, a^{\prime}\right) / \tau\right)
$$
#### Consistency Between Optimal Values and the Optimal Policy
Writing the optimal policy as
$$
\pi^*(a \mid s)=\frac{\exp \left\{\left(r(s, a)+\gamma V^*\left(s^{\prime}\right)\right) / \tau\right\}}{\exp \left\{V^*(s) / \tau\right\}}
$$
and taking logarithms on both sides yields a relation between the soft optimal values of adjacent states:
<img src="img/PCL3.png" alt="3" style="zoom:150%;" />
Since the theorem above relates adjacent states, it can be applied repeatedly to obtain a relation between the soft optimal values of two states separated by several steps:
<img src="img/PCL4.png" alt="4" style="zoom:150%;" />
We can then estimate soft optimal values and optimize the policy according to these equations; the authors also give a theorem for checking convergence.
![5](img/PCL5.png)
### **Main Contribution (How):**
By introducing an entropy regularizer, the estimate of state values and the policy can be optimized jointly, using either on-policy or off-policy data; the method outperforms baseline algorithms on a range of games.
#### Algorithm
**Path Consistency Learning (PCL)**
With entropy regularization, the relation between the optimal value function and the optimal policy lets us search for both along trajectories simultaneously. The authors define the following consistency function:
$$
C\left(s_{i: i+d}, \theta, \phi\right)=-V_\phi\left(s_i\right)+\gamma^d V_\phi\left(s_{i+d}\right)+\sum_{j=0}^{d-1} \gamma^j\left[r\left(s_{i+j}, a_{i+j}\right)-\tau \log \pi_\theta\left(a_{i+j} \mid s_{i+j}\right)\right].
$$
where $s_{i: i+d} \equiv\left(s_i, a_i, \ldots, s_{i+d-1}, a_{i+d-1}, s_{i+d}\right)$ is a sub-trajectory of length $d$. The training objective is to find a value function and a policy that drive the consistency as close to zero as possible, which leads to the Path Consistency Learning (PCL) algorithm with the objective:
$$
O_{\mathrm{PCL}}(\theta, \phi)=\sum_{s_{i: i+d} \in E} \frac{1}{2} C\left(s_{i: i+d}, \theta, \phi\right)^2
$$
The parameter updates are:
$$
\begin{aligned}
\Delta \theta & =\eta_\pi C\left(s_{i: i+d}, \theta, \phi\right) \sum_{j=0}^{d-1} \gamma^j \nabla_\theta \log \pi_\theta\left(a_{i+j} \mid s_{i+j}\right) \\
\Delta \phi & =\eta_v C\left(s_{i: i+d}, \theta, \phi\right)\left(\nabla_\phi V_\phi\left(s_i\right)-\gamma^d \nabla_\phi V_\phi\left(s_{i+d}\right)\right)
\end{aligned}
$$
The PCL update can use on-policy data as well as off-policy data from a replay buffer; in the paper, samples from both are mixed during training.
**Unified PCL (UPCL)**
The algorithm above optimizes two separate models for the optimal value function and the optimal policy. Using the Q-function form, the authors combine the policy and the value into a single model:
$$
\begin{aligned}
V_\rho(s) & =\tau \log \sum_a \exp \left\{Q_\rho(s, a) / \tau\right\} \\
\pi_\rho(a \mid s) & =\exp \left\{\left(Q_\rho(s, a)-V_\rho(s)\right) / \tau\right\}
\end{aligned}
$$
where $\rho$ are the parameters of the unified model, updated as follows:
$$
\begin{aligned}
\Delta \rho= & \eta_\pi C\left(s_{i: i+d}, \rho\right) \sum_{j=0}^{d-1} \gamma^j \nabla_\rho \log \pi_\rho\left(a_{i+j} \mid s_{i+j}\right)+ \\
& \eta_v C\left(s_{i: i+d}, \rho\right)\left(\nabla_\rho V_\rho\left(s_i\right)-\gamma^d \nabla_\rho V_\rho\left(s_{i+d}\right)\right)
\end{aligned}
$$
#### Experiments
The authors compare PCL and UPCL with the existing A3C and DQN algorithms on several tasks.
![6](img/PCL6.png)
PCL matches A3C on some tasks and surpasses it on several challenging ones; on all tasks, PCL performs better than DQN.
![7](img/PCL7.png)
Training curves of PCL versus UPCL.
![8](img/PCL8.png)
Performance of PCL-expert, which mixes some good expert data into training, compared with plain PCL.
### About the Author
吴文昊, master's student at Xi'an Jiaotong University. Contact: wwhwwh05@qq.com

View File

@@ -0,0 +1,62 @@
## The Mirage of Action-Dependent Baselines in Reinforcement Learning
Authors: George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, Sergey Levine
Paper link: https://arxiv.org/abs/1802.10031v3
**Highlight:** the paper critiques and re-evaluates earlier work, arguing that the reported variance reductions came from implementation tricks that introduce bias rather than from truly unbiased estimators, and proposes a horizon-aware state-value function.
### **Motivation (Why):**
While decomposing and numerically studying the variance of policy-gradient estimators, the authors unexpectedly found that the variance-reduction methods of earlier papers do not actually reduce the variance. Inspecting the open-source code of those papers, they found that the reported gains came from implementation tricks. Based on their variance analysis, the authors identify where improvements are possible and significantly improve performance with a small modification.
### **Main Idea (What):**
#### **Variance Decomposition**
The authors decompose the variance of the policy gradient as follows
$$
\begin{aligned}
\Sigma=& \underbrace{\mathbb{E}_{s, a}\left[\operatorname{Var}_{\tau \mid s, a}(\hat{A}(s, a, \tau) \nabla \log \pi(a \mid s))\right]}_{\Sigma_\tau} \\
&+\underbrace{\mathbb{E}_s\left[\operatorname{Var}_{a \mid s}((\hat{A}(s, a)-\phi(s, a)) \nabla \log \pi(a \mid s))\right]}_{\Sigma_a} \\
&+\underbrace{\operatorname{Var}_s\left(\mathbb{E}_{a \mid s}[\hat{A}(s, a) \nabla \log \pi(a \mid s)]\right)}_{\Sigma_s}
\end{aligned}
$$
They analyze where the variance of the policy gradient comes from: we only collect finitely many states $s$, take only one action $a$ in each state, and use only one sample $\tau$ in the computation. Intuitively, $\Sigma_\tau$, $\Sigma_a$, and $\Sigma_s$ capture the variance arising from each of these three limitations. The second term, $\Sigma_a$, is the one affected by the baseline function and can also be written as $\Sigma_a^{\phi(s)}$. When $\phi(s)$ is optimal, $\Sigma_a$ vanishes, leaving only $\Sigma_\tau + \Sigma_s$. Hence, when $\Sigma_a$ is large relative to the other two terms, a good baseline can substantially reduce the variance. For example, in the Cliffworld environment a wrong action may cause the agent to fall off the cliff and receive a large negative reward, so reducing the action-induced variance $\Sigma_a$ can noticeably improve performance.
#### **Measuring the Variance**
By reproducing the code of previous work, the authors found that many methods claiming to reduce the variance of the gradient estimator do not actually do so; instead, they introduce bias into the expectation, and that is what improves performance.
![截屏2022-12-05 20.40.09](img/Mirage1.png)
Training performance of the unbiased Q-Prop estimator, the biased Q-Prop estimator, and TRPO.
### **Main Contribution (How):**
The authors propose a horizon-aware value function, which adds a prediction of future rewards $\sum_{i=t}^T \gamma^{i-t} \hat{r}(s_t)$ on top of the usual value function.
$$
\hat{V}\left(s_t\right)=\left(\sum_{i=t}^T \gamma^{i-t}\right) \hat{r}\left(s_t\right)+\hat{V}^{\prime}\left(s_t\right)
$$
<img src="img/Mirage2.png" alt="截屏2022-12-05 20.50.10" style="zoom:67%;" />
TRPO combined with the horizon-aware value function and with other baseline functions, compared across different tasks.
#### Main Contributions
1. Proposes a way to reduce policy-gradient variance with a more accurate value function or state-dependent baseline, and introduces a horizon-aware value function that accounts for reward discounting.
2. Carefully studies the effect of state-action-dependent baselines on the variance and reproduces the open-source code of prior work, finding that the reported variance reductions came from bias introduced in the implementations rather than from truly unbiased estimators.
3. Combines the proposed horizon-aware value function, as well as the truly unbiased baselines from prior work, with TRPO; only the baseline proposed here performs slightly better than TRPO, while the other methods perform worse.
The authors also point to future work: studying the trade-off between bias and variance in gradient computation.
### About the Author
吴文昊, master's student at Xi'an Jiaotong University. Contact: wwhwwh05@qq.com

Binary file not shown.

After

Width:  |  Height:  |  Size: 973 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 985 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 14 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 11 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 52 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 195 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 161 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 88 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 123 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 546 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 90 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 106 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 210 KiB

View File

@@ -29,9 +29,9 @@
| | Addressing Function Approximation Error in Actor-Critic Methods (**TD3**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Addressing%20Function%20Approximation%20Error%20in%20Actor-Critic%20Methods.pdf) | https://arxiv.org/abs/1802.09477 | |
| | A Distributional Perspective on Reinforcement Learning (**C51**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/A%20Distributional%20Perspective%20on%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1707.06887 | |
| | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (**Q-Prop**) | https://arxiv.org/abs/1611.02247 | |
| | Action-depedent Control Variates for Policy Optimization via Steins Identity (**Stein Control Variates**) | https://arxiv.org/abs/1710.11198 | |
| | The Mirage of Action-Dependent Baselines in Reinforcement Learning | https://arxiv.org/abs/1802.10031 | |
| | Bridging the Gap Between Value and Policy Based Reinforcement Learning (**PCL**) | https://arxiv.org/abs/1702.08892 | |
| | Action-depedent Control Variates for Policy Optimization via Steins Identity (**Stein Control Variates**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Action-depedent%20Control%20Variates%20for%20Policy%20Optimization%20via%20Stein%E2%80%99s%20Identity.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Action-depedent%20Control%20Variates%20for%20Policy%20Optimization%20via%20Stein%E2%80%99s%20Identity.pdf)| https://arxiv.org/abs/1710.11198 | |
| | The Mirage of Action-Dependent Baselines in Reinforcement Learning [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/The%20Mirage%20of%20Action-Dependent%20Baselines%20in%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/The%20Mirage%20of%20Action-Dependent%20Baselines%20in%20Reinforcement%20Learning.pdf)| https://arxiv.org/abs/1802.10031 | |
| | Bridging the Gap Between Value and Policy Based Reinforcement Learning (**PCL**) [[Markdown]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/Bridging%20the%20Gap%20Between%20Value%20and%20Policy%20Based%20Reinforcement%20Learning.md) [[PDF]](https://github.com/datawhalechina/easy-rl/blob/master/papers/Policy_gradient/PDF/Bridging%20the%20Gap%20Between%20Value%20and%20Policy%20Based%20Reinforcement%20Learning.pdf) | https://arxiv.org/abs/1702.08892 | |

Submodule projects deleted from 07c9fc1d30