Delete projects

This commit is contained in:
johnjim0816
2022-11-15 00:44:53 +08:00
parent 18226036a1
commit 9735f463b0
359 changed files with 0 additions and 29475 deletions

projects/.gitignore vendored
View File

@@ -1,10 +0,0 @@
.DS_Store
.ipynb_checkpoints
__pycache__
.vscode
test.py
pseudocodes.aux
pseudocodes.log
pseudocodes.synctex.gz
pseudocodes.out
pseudocodes.toc

View File

@@ -1,21 +0,0 @@
MIT License
Copyright (c) 2020 John Jim
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -1,115 +0,0 @@
## 0. Preface
This project is for learning fundamental RL algorithms. It is mainly aimed at RL beginners and non-specialists who need to apply RL, and it strives to provide **detailed comments** and a **clear structure**.
Note that this project focuses on hands-on practice. It is recommended to first grasp the theoretical basics of the relevant algorithms before using it; for the theory, see the [EasyRL book ("Mushroom Book")](https://github.com/datawhalechina/easy-rl), which the author co-wrote.
Future plans include, but are not limited to, multi-agent algorithms, an RL Python package, and a graphical RL programming platform.
## 1. Project Overview
The project consists of the following parts:
* [Jupyter Notebook](./notebooks/): algorithms written as notebooks, with fairly detailed hands-on guidance; recommended for newcomers.
* [codes](./codes/): algorithms written as Python scripts, in a style close to real projects; recommended for readers with some coding experience. Their structure is described below.
* [Assets](./assets/): currently contains Chinese pseudocode for each RL algorithm.
The [codes](./codes/) directory is organized into the following scripts (a minimal sketch of this pattern follows the list):
* ```[algorithm_name].py```: the algorithm itself, e.g. ```dqn.py```; each algorithm includes some basic modules such as a ```Replay Buffer``` and an ```MLP``` (multi-layer perceptron);
* ```task.py```: the task script, which contains the ```argparse```-based parameters plus the training and testing functions; the training function ```train``` follows the pseudocode, so it is the best entry point for reading the code;
* ```utils.py```: helper functions such as saving results and plotting; in real projects or research it is recommended to log results with ```Tensorboard``` and then plot with ```matplotlib``` and ```seaborn```.
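As a rough illustration of how these pieces fit together, here is a minimal sketch of the `task.py`-style training loop. The `RandomAgent` placeholder and its `sample_action`/`update` methods are stand-ins for the real agent classes, not code from this repo.
```python
import random
import gym

class RandomAgent:
    """Placeholder with the sample_action/update interface used by the real agents."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def sample_action(self, state):
        return random.randrange(self.n_actions)
    def update(self):
        pass  # a real agent would perform one learning step here, following the pseudocode

def train(env, agent, train_eps=10, max_steps=200):
    rewards = []
    for i_ep in range(train_eps):
        state = env.reset()
        ep_reward = 0.0
        for _ in range(max_steps):
            action = agent.sample_action(state)
            next_state, reward, done, _ = env.step(action)  # old (4-tuple) Gym step API
            agent.update()
            state, ep_reward = next_state, ep_reward + reward
            if done:
                break
        rewards.append(ep_reward)
        print(f"Episode: {i_ep + 1}/{train_eps}, Reward: {ep_reward:.1f}")
    return rewards

if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    train(env, RandomAgent(env.action_space.n))
```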
## 2. Algorithm List
Note: clicking an algorithm name jumps to the corresponding algorithm under [codes](./codes/); please browse the other versions yourself.
| Algorithm | Reference | Author | Notes |
| :-------------------------------------: | :----------------------------------------------------------: | :--------------------------------------------------: | :--: |
| [Policy Gradient](codes/PolicyGradient) | [Policy Gradient paper](https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | [johnjim0816](https://github.com/johnjim0816) | |
| [Monte Carlo](codes/MonteCarlo) | | [johnjim0816](https://github.com/johnjim0816) | |
| [DQN](codes/DQN) | | [johnjim0816](https://github.com/johnjim0816) | |
| DQN-CNN | | | TBD |
| [PER_DQN](codes/PER_DQN) | [PER DQN Paper](https://arxiv.org/abs/1511.05952) | [wangzhongren](https://github.com/wangzhongren-code) | |
| [DoubleDQN](codes/DoubleDQN) | [Double DQN Paper](https://arxiv.org/abs/1509.06461) | [johnjim0816](https://github.com/johnjim0816) | |
| [SoftQ](codes/SoftQ) | [Soft Q-learning paper](https://arxiv.org/abs/1702.08165) | [johnjim0816](https://github.com/johnjim0816) | |
| [SAC](codes/SAC) | [SAC paper](https://arxiv.org/pdf/1812.05905.pdf) | | |
| [SAC-Discrete](codes/SAC) | [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf) | | |
| SAC-S | [SAC-S paper](https://arxiv.org/abs/1801.01290) | | |
| DSAC | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | | TBD |
## 3. Algorithm Environments
For notes on the algorithm environments, see [env](./codes/envs/README.md).
## 4. Runtime Environment
The main dependencies are Python 3.7, PyTorch 1.10.0, and Gym 0.25.2.
### 4.1. Create a Conda Environment
```bash
conda create -n easyrl python=3.7
conda activate easyrl # activate the environment
```
### 4.2. Install Torch
Install the CPU version:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cpuonly -c pytorch
```
Install the CUDA version:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
```
If you need a mirror to speed up the Torch installation, open the [Tsinghua mirror](https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/), choose your OS (e.g. ```win-64```), copy the link, and run:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/win-64/
```
Alternatively, install with pip (CUDA version only):
```bash
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
### 4.3. Verify the CUDA Torch Installation
Skip this step for the CPU version of Torch. Run the following Python snippet; if it prints True, the CUDA version was installed successfully:
```python
import torch
print(torch.cuda.is_available())
```
### 4.4. Install Gym
```bash
pip install gym==0.25.2
```
If you need the Atari environments, install them separately:
```bash
pip install gym[atari,accept-rom-license]==0.25.2
```
### 4.5. Install Other Dependencies
Run the following in the project root:
```bash
pip install -r requirements.txt
```
## 5. Usage
For [codes](./codes/), `cd` into the corresponding algorithm directory, e.g. `DQN`, and run:
```bash
python task0.py
```
Or load a config file:
```bash
python task0.py --yaml configs/CartPole-v1_DQN_Train.yaml
```
For the [Jupyter Notebooks](./notebooks/):
* just run the corresponding ipynb file.
## 6. Tips
VS Code is recommended for working on the project; to get started, see the [VS Code guide](https://blog.csdn.net/JohnJim0/article/details/126366454).

View File

@@ -1,359 +0,0 @@
\documentclass[11pt]{ctexart}
\usepackage{ctex}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{hyperref}
% \usepackage[hidelinks]{hyperref} 去除超链接的红色框
\usepackage{setspace}
\usepackage{titlesec}
\usepackage{float} % 调用该包能够使用[H]
% \pagestyle{plain} % 去除页眉但是保留页脚编号都去掉plain换empty
% 更改脚注为圆圈
\usepackage{pifont}
\makeatletter
\newcommand*{\circnum}[1]{%
\expandafter\@circnum\csname c@#1\endcsname
}
\newcommand*{\@circnum}[1]{%
\ifnum#1<1 %
\@ctrerr
\else
\ifnum#1>20 %
\@ctrerr
\else
\ding{\the\numexpr 171+(#1)\relax}%
\fi
\fi
}
\makeatother
\renewcommand*{\thefootnote}{\circnum{footnote}}
\begin{document}
\tableofcontents % 目录注意要运行两下或者vscode保存两下才能显示
% \singlespacing
\clearpage
\section{模版备用}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 测试
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{脚注}
\clearpage
\section{Q learning算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{Q-learning算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化Q表$Q(s,a)$为任意值,但其中$Q(s_{terminal},\cdot)=0$,即终止状态对应的Q值为0
\FOR {回合数 = $1,M$}
\STATE 重置环境,获得初始状态$s_1$
\FOR {时步 = $1,T$}
\STATE 根据$\varepsilon-greedy$策略采样动作$a_t$
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE $Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha[r_t+\gamma\max _{a}Q(s_{t+1},a)-Q(s_t,a_t)]$
\STATE 更新状态$s_t \leftarrow s_{t+1}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Reinforcement Learning: An Introduction}
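For readers who prefer code, here is a small NumPy sketch of the tabular update above; the table sizes and hyperparameters are arbitrary placeholders, not values from this repo.
```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))          # Q(s, a) initialized to 0, including terminal states
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def sample_action(state):
    # epsilon-greedy over the current Q table
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_update(state, action, reward, next_state, done):
    # off-policy target: bootstrap with max_a Q(s_{t+1}, a); no bootstrap at terminal states
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```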
\clearpage
\section{Sarsa算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{Sarsa算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化Q表$Q(s,a)$为任意值,但其中$Q(s_{terminal},\cdot)=0$,即终止状态对应的Q值为0
\FOR {回合数 = $1,M$}
\STATE 重置环境,获得初始状态$s_1$
\STATE 根据$\varepsilon-greedy$策略采样初始动作$a_1$
\FOR {时步 = $1,t$}
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 在$s_{t+1}$下根据$\varepsilon-greedy$策略采样动作$a_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE $Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha[r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)]$
\STATE 更新状态$s_t \leftarrow s_{t+1}$
\STATE 更新动作$a_t \leftarrow a_{t+1}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Reinforcement Learning: An Introduction}
\clearpage
\section{DQN算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{DQN算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\renewcommand{\algorithmicrequire}{\textbf{输入:}}
\renewcommand{\algorithmicensure}{\textbf{输出:}}
\begin{algorithmic}[1]
% \REQUIRE $n \geq 0 \vee x \neq 0$ % 输入
% \ENSURE $y = x^n$ % 输出
\STATE 初始化策略网络参数$\theta$ % 初始化
\STATE 复制参数到目标网络$\hat{Q} \leftarrow Q$
\STATE 初始化经验回放$D$
\FOR {回合数 = $1,M$}
\STATE 重置环境,获得初始状态$s_t$
\FOR {时步 = $1,t$}
\STATE 根据$\varepsilon-greedy$策略采样动作$a_t$
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 存储transition即$(s_t,a_t,r_t,s_{t+1})$到经验回放$D$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE$D$中采样一个batch的transition
\STATE 计算实际的$Q$值,即$y_{i}$\footnotemark[2]
\STATE 对损失 $L(\theta)=\left(y_{i}-Q\left(s_{i}, a_{i} ; \theta\right)\right)^{2}$关于参数$\theta$做随机梯度下降\footnotemark[3]
\ENDFOR
\STATE$C$个回合复制参数$\hat{Q}\leftarrow Q$\footnotemark[4]]
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Playing Atari with Deep Reinforcement Learning}
\footnotetext[2]{$y_{i}= \begin{cases}r_{i} & \text {对于终止状态} s_{i+1} \\ r_{i}+\gamma \max _{a^{\prime}} Q\left(s_{i+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{i+1}\end{cases}$}
\footnotetext[3]{$\theta_i \leftarrow \theta_i - \lambda \nabla_{\theta_{i}} L_{i}\left(\theta_{i}\right)$}
\footnotetext[4]{此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定}
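A hedged PyTorch sketch of the update step above (the target $y_i$ and the squared loss); the network sizes, optimizer, and learning rate are illustrative assumptions, not this repo's exact settings.
```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
policy_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(policy_net.state_dict())   # copy parameters to the target network
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    # states: (B, n_states); actions: (B,) long tensor; rewards/dones: (B,) float tensors
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)   # y_i: reward only at terminal s_{i+1}
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```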
\clearpage
\section{PER\_DQN算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{PER\_DQN算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\renewcommand{\algorithmicrequire}{\textbf{输入:}}
\renewcommand{\algorithmicensure}{\textbf{输出:}}
\begin{algorithmic}[1]
% \REQUIRE $n \geq 0 \vee x \neq 0$ % 输入
% \ENSURE $y = x^n$ % 输出
\STATE 初始化策略网络参数$\theta$ % 初始化
\STATE 复制参数到目标网络$\hat{Q} \leftarrow Q$
\STATE 初始化经验回放$D$
\FOR {回合数 = $1,M$}
\STATE 重置环境,获得初始状态$s_t$
\FOR {时步 = $1,t$}
\STATE 根据$\varepsilon-greedy$策略采样动作$a_t$
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 存储transition即$(s_t,a_t,r_t,s_{t+1})$到经验回放$D$并根据TD-error损失确定其优先级$p_t$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE 按照经验回放中的优先级别,每个样本采样概率为$P(j)=p_j^\alpha / \sum_i p_i^\alpha$,从$D$中采样一个大小为batch的transition
\STATE 计算各个样本重要性采样权重 $w_j=(N \cdot P(j))^{-\beta} / \max _i w_i$
\STATE 计算TD-error $\delta_j$ ; 并根据TD-error更新优先级$p_j$
\STATE 计算实际的$Q$值,即$y_{j}$\footnotemark[2]
\STATE 根据重要性采样权重调整损失 $L(\theta)=w_j\left(y_{j}-Q\left(s_{j}, a_{j} ; \theta\right)\right)^{2}$,并将其关于参数$\theta$做随机梯度下降\footnotemark[3]
\ENDFOR
\STATE$C$个回合复制参数$\hat{Q}\leftarrow Q$\footnotemark[4]]
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Playing Atari with Deep Reinforcement Learning}
\footnotetext[2]{$y_{i}= \begin{cases}r_{i} & \text {对于终止状态} s_{i+1} \\ r_{i}+\gamma \max _{a^{\prime}} Q\left(s_{i+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{i+1}\end{cases}$}
\footnotetext[3]{$\theta_i \leftarrow \theta_i - \lambda \nabla_{\theta_{i}} L_{i}\left(\theta_{i}\right)$}
\footnotetext[4]{此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定}
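A small NumPy sketch of the prioritized sampling step above, i.e. $P(j)=p_j^\alpha / \sum_i p_i^\alpha$ and the importance-sampling weights $w_j=(N \cdot P(j))^{-\beta} / \max_i w_i$. A practical implementation (like the repo's) would use a sum tree rather than `np.random.choice`, and the priorities below are random placeholders.
```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    probs = priorities ** alpha
    probs = probs / probs.sum()                           # P(j)
    idxs = np.random.choice(len(priorities), batch_size, p=probs)
    weights = (len(priorities) * probs[idxs]) ** (-beta)  # importance-sampling correction
    weights = weights / weights.max()                     # normalize by the largest weight
    return idxs, weights

priorities = np.abs(np.random.randn(1000)) + 1e-5         # e.g. |TD-error| + small constant
idxs, weights = sample_prioritized(priorities, batch_size=64)
```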
\clearpage
\section{Policy Gradient算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{REINFORCE算法(Monte-Carlo Policy Gradient)}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化策略参数$\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}($ e.g., to $\mathbf{0})$
\FOR {回合数 = $1,M$}
\STATE 根据策略$\pi(\cdot \mid \cdot, \boldsymbol{\theta})$采样一个(或几个)回合的transition
\FOR {时步 = $0,1,2,...,T-1$}
\STATE 计算回报$G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k}$
\STATE 更新策略$\boldsymbol{\theta} \leftarrow {\boldsymbol{\theta}+\alpha \gamma^{t}} G \nabla \ln \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Reinforcement Learning: An Introduction}
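A brief PyTorch sketch of one REINFORCE update: accumulate the discounted returns $G_t$ of a sampled episode and ascend $\nabla \ln \pi(A_t|S_t) G_t$, written here as a negated loss. Like most implementations it drops the extra $\gamma^t$ factor shown in the pseudocode; the optimizer is assumed to be passed in.
```python
import torch

gamma = 0.99

def reinforce_update(log_probs, rewards, optimizer):
    # log_probs: list of log pi(A_t | S_t) tensors collected while sampling the episode
    returns, G = [], 0.0
    for r in reversed(rewards):          # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```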
\clearpage
\section{Advantage Actor Critic算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{Q Actor Critic算法}}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化Actor参数$\theta$和Critic参数$w$
\FOR {回合数 = $1,M$}
\STATE 根据策略$\pi_{\theta}(a|s)$采样一个(或几个)回合的transition
\STATE {\bfseries 更新Critic参数\footnotemark[1]}
\FOR {时步 = $t+1,1$}
\STATE 计算Advantage $\delta_t = r_t + \gamma Q_w(s_{t+1},a_{t+1})-Q_w(s_t,a_t)$
\STATE $w \leftarrow w+\alpha_{w} \delta_{t} \nabla_{w} Q_w(s_t,a_t)$
\STATE $a_t \leftarrow a_{t+1}$,$s_t \leftarrow s_{t+1}$
\ENDFOR
\STATE 更新Actor参数$\theta \leftarrow \theta+\alpha_{\theta} Q_{w}(s, a) \nabla_{\theta} \log \pi_{\theta}(a \mid s)$
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{这里结合TD error的特性,按照从$t+1$到$1$计算Advantage更方便}
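A sketch of the actor-critic update above in PyTorch, using a state-value critic $V(s)$ and the one-step TD error as the advantage (the form the repo's A2C code effectively uses) rather than the $Q_w$ critic written in the pseudocode; the function signature and shapes are illustrative assumptions.
```python
import torch

def actor_critic_step(actor_optim, critic_optim, critic,
                      state, next_state, reward, done, log_prob, gamma=0.99):
    value = critic(state)                                   # V(s_t)
    with torch.no_grad():
        next_value = torch.zeros_like(value) if done else critic(next_state)
    td_error = reward + gamma * next_value - value          # delta_t, used as the advantage
    critic_loss = td_error.pow(2).mean()
    actor_loss = -(log_prob * td_error.detach()).mean()
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()
```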
\clearpage
\section{PPO-Clip算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{PPO-Clip算法}\footnotemark[1]\footnotemark[2]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化策略网络(Actor)参数$\theta$和价值网络(Critic)参数$\phi$
\STATE 初始化Clip参数$\epsilon$
\STATE 初始化epoch数量$K$
\STATE 初始化经验回放$D$
\STATE 初始化总时步数$c=0$
\FOR {回合数 = $1,2,\cdots,M$}
\STATE 重置环境,获得初始状态$s_0$
\FOR {时步 $t = 1,2,\cdots,T$}
\STATE 计数总时步$c \leftarrow c+1$
\STATE 根据策略$\pi_{\theta}$选择$a_t$
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 存储$(s_t,a_t,r_t,s_{t+1})$到经验回放$D$
\IF{$c$$C$整除\footnotemark[3]}
\FOR {$k= 1,2,\cdots,K$}
\STATE 计算优势估计$\hat{A}_t$
\STATE 计算重要性采样比率$r_t(\theta)=\pi_{\theta}(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$
\STATE 更新Actor参数,最大化clip目标$\min \left(r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)$
\STATE 更新Critic参数,最小化价值函数损失
\ENDFOR
\STATE 清空经验回放$D$
\ENDIF
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Proximal Policy Optimization Algorithms}
\footnotetext[2]{https://spinningup.openai.com/en/latest/algorithms/ppo.html}
\footnotetext[3]{\bfseries 即每$C$个时步更新策略}
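A short PyTorch sketch of the PPO-Clip surrogate referenced in the footnotes: over the K epochs the actor maximizes $\min(r_t \hat{A}_t, \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\hat{A}_t)$. The clip coefficient below is just a common default, not necessarily this repo's setting.
```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(theta)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()                       # negated: minimizing this maximizes the clip objective

# inside the K-epoch loop one would recompute new_log_probs on the stored batch,
# take this loss for the actor, and add an MSE value loss for the critic
```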
\clearpage
\section{DDPG算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{DDPG算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化critic网络$Q\left(s, a \mid \theta^Q\right)$和actor网络$\mu(s|\theta^{\mu})$的参数$\theta^Q$$\theta^{\mu}$
\STATE 初始化对应的目标网络参数,即$\theta^{Q^{\prime}} \leftarrow \theta^Q, \theta^{\mu^{\prime}} \leftarrow \theta^\mu$
\STATE 初始化经验回放$R$
\FOR {回合数 = $1,M$}
\STATE 选择动作$a_t=\mu\left(s_t \mid \theta^\mu\right)+\mathcal{N}_t$$\mathcal{N}_t$为探索噪声
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 存储transition$(s_t,a_t,r_t,s_{t+1})$到经验回放$R$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE$R$中取出一个随机批量的$(s_i,a_i,r_i,s_{i+1})$
\STATE 求得$y_i=r_i+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)$
\STATE 更新critic参数其损失为$L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2$
\STATE 更新actor参数$\left.\left.\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s_i}$
\STATE 软更新目标网络:$\theta^{Q^{\prime}} \leftarrow \tau \theta^Q+(1-\tau) \theta^{Q^{\prime}}$
$\theta^{\mu^{\prime}} \leftarrow \tau \theta^\mu+(1-\tau) \theta^{\mu^{\prime}}$
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Continuous control with deep reinforcement learning}
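A hedged PyTorch sketch of the DDPG equations above; `critic(state, action)` and `actor(state)` are assumed network interfaces, and the soft-update rate tau is an illustrative value.
```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, actor_target, critic, critic_target,
                actor_optim, critic_optim,
                states, actions, rewards, next_states, dones,
                gamma=0.99, tau=0.005):
    with torch.no_grad():
        next_actions = actor_target(next_states)
        y = rewards + gamma * (1 - dones) * critic_target(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()

    actor_loss = -critic(states, actor(states)).mean()            # deterministic policy gradient
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()

    # soft update of both target networks: theta' <- tau * theta + (1 - tau) * theta'
    for target, source in ((critic_target, critic), (actor_target, actor)):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.copy_(tau * sp.data + (1 - tau) * tp.data)
```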
\clearpage
\section{SoftQ算法}
\begin{algorithm}[H]
\floatname{algorithm}{{SoftQ算法}}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1]
\STATE 初始化参数$\theta$$\phi$% 初始化
\STATE 复制参数$\bar{\theta} \leftarrow \theta, \bar{\phi} \leftarrow \phi$
\STATE 初始化经验回放$D$
\FOR {回合数 = $1,M$}
\FOR {时步 = $1,t$}
\STATE 根据$\mathbf{a}_{t} \leftarrow f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)$采样动作,其中$\xi \sim \mathcal{N}(\mathbf{0}, \boldsymbol{I})$
\STATE 环境根据$a_t$反馈奖励$r_t$和下一个状态$s_{t+1}$
\STATE 存储transition即$(s_t,a_t,r_t,s_{t+1})$到经验回放$D$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新soft Q函数参数}
\STATE 对于每个$s^{(i)}_{t+1}$采样$\left\{\mathbf{a}^{(i, j)}\right\}_{j=0}^{M} \sim q_{\mathbf{a}^{\prime}}$
\STATE 计算empirical soft values $V_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}\right)$\footnotemark[1]
\STATE 计算empirical gradient $J_{Q}(\theta)$\footnotemark[2]
\STATE 根据$J_{Q}(\theta)$使用ADAM更新参数$\theta$
\STATE {\bfseries 更新策略:}
\STATE 对于每个$s^{(i)}_{t}$采样$\left\{\xi^{(i, j)}\right\}_{j=0}^{M} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{I})$
\STATE 计算$\mathbf{a}_{t}^{(i, j)}=f^{\phi}\left(\xi^{(i, j)}, \mathbf{s}_{t}^{(i)}\right)$
\STATE 使用经验估计计算$\Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)$\footnotemark[3]
\STATE 计算经验估计$\frac{\partial J_{\pi}\left(\phi ; \mathbf{s}_{t}\right)}{\partial \phi} \propto \mathbb{E}_{\xi}\left[\Delta f^{\phi}\left(\xi ; \mathbf{s}_{t}\right) \frac{\partial f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)}{\partial \phi}\right]$,即$\hat{\nabla}_{\phi} J_{\pi}$
\STATE 根据$\hat{\nabla}_{\phi} J_{\pi}$使用ADAM更新参数$\phi$
\ENDFOR
\STATE$C$个回合复制参数$\bar{\theta} \leftarrow \theta, \bar{\phi} \leftarrow \phi$
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{$V_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]$}
\footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\mathbf{s}_{t} \sim q_{\mathbf{s}_{t}}, \mathbf{a}_{t} \sim q_{\mathbf{a}_{t}}}\left[\frac{1}{2}\left(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$}
\footnotetext[3]{$\begin{aligned} \Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)=& \mathbb{E}_{\mathbf{a}_{t} \sim \pi^{\phi}}\left[\left.\kappa\left(\mathbf{a}_{t}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right) \nabla_{\mathbf{a}^{\prime}} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right.\\ &\left.+\left.\alpha \nabla_{\mathbf{a}^{\prime}} \kappa\left(\mathbf{a}^{\prime}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right] \end{aligned}$}
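A tiny NumPy sketch of the empirical soft value in footnote 1, $V_{\mathrm{soft}}(\mathbf{s}) = \alpha \log \mathbb{E}_{\mathbf{a}' \sim q}[\exp(Q_{\mathrm{soft}}(\mathbf{s},\mathbf{a}')/\alpha)/q(\mathbf{a}')]$, estimated from M sampled actions; the uniform proposal and random Q values below are placeholders.
```python
import numpy as np

def empirical_soft_value(q_values, proposal_density, alpha=1.0):
    # q_values: (M,) soft Q values at sampled actions; proposal_density: (M,) q(a') at those actions
    ratios = np.exp(q_values / alpha) / proposal_density
    return alpha * np.log(np.mean(ratios))

M = 32
q_values = np.random.randn(M)            # placeholder for Q_soft(s, a^(j))
proposal_density = np.full(M, 0.5)       # uniform proposal on [-1, 1], so q(a') = 1/2
print(empirical_soft_value(q_values, proposal_density))
```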
\clearpage
\section{SAC-S算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{SAC-S算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化参数$\psi, \bar{\psi}, \theta, \phi$
\FOR {回合数 = $1,M$}
\FOR {时步 = $1,t$}
\STATE 根据$\boldsymbol{a}_{t} \sim \pi_{\phi}\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t}\right)$采样动作$a_t$
\STATE 环境反馈奖励和下一个状态,$\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$
\STATE 存储transition到经验回放中$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE $\psi \leftarrow \psi-\lambda_{V} \hat{\nabla}_{\psi} J_{V}(\psi)$
\STATE $\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$
\STATE $\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$
\STATE $\bar{\psi} \leftarrow \tau \psi+(1-\tau) \bar{\psi}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor}
\clearpage
\section{SAC算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{SAC算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1]
\STATE 初始化网络参数$\theta_1,\theta_2$以及$\phi$ % 初始化
\STATE 复制参数到目标网络$\bar{\theta}_1 \leftarrow \theta_1,\bar{\theta}_2 \leftarrow \theta_2$
\STATE 初始化经验回放$D$
\FOR {回合数 = $1,M$}
\STATE 重置环境,获得初始状态$s_t$
\FOR {时步 = $1,t$}
\STATE 根据$\boldsymbol{a}_{t} \sim \pi_{\phi}\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t}\right)$采样动作$a_t$
\STATE 环境反馈奖励和下一个状态,$\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$
\STATE 存储transition到经验回放中$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$
\STATE 更新环境状态$s_t \leftarrow s_{t+1}$
\STATE {\bfseries 更新策略:}
\STATE 更新$Q$函数,$\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$\footnotemark[2]\footnotemark[3]
\STATE 更新策略权重,$\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ \footnotemark[4]
\STATE 调整temperature$\alpha \leftarrow \alpha-\lambda \hat{\nabla}_{\alpha} J(\alpha)$ \footnotemark[5]
\STATE 更新目标网络权重,$\bar{\theta}_{i} \leftarrow \tau \theta_{i}+(1-\tau) \bar{\theta}_{i}$ for $i \in\{1,2\}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Soft Actor-Critic Algorithms and Applications}
\footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\right)^{2}\right]$}
\footnotetext[3]{$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma\left(Q_{\bar{\theta}}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)-\alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t+1} \mid \mathbf{s}_{t+1}\right)\right)\right)\right)\right.$}
\footnotetext[4]{$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)+\left(\nabla_{\mathbf{a}_{t}} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$,$\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$}
\footnotetext[5]{$J(\alpha)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{t}}\left[-\alpha \log \pi_{t}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\alpha \overline{\mathcal{H}}\right]$}
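A brief PyTorch sketch of the critic target and temperature update implied by footnotes 2 to 5, using the usual minimum over the two target critics (the double-Q trick that the parameter pair $\theta_1,\theta_2$ suggests); alpha, gamma, and the target entropy are illustrative values.
```python
import torch

def sac_q_target(rewards, dones, next_q1, next_q2, next_log_probs, alpha=0.2, gamma=0.99):
    # next_q1/next_q2: the two target critics evaluated at (s_{t+1}, a_{t+1} ~ pi_phi)
    next_q = torch.min(next_q1, next_q2) - alpha * next_log_probs   # soft value estimate
    return rewards + gamma * (1 - dones) * next_q

def temperature_loss(log_probs, log_alpha, target_entropy):
    # J(alpha) from footnote 5, parameterized by log_alpha so that alpha stays positive
    return -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
```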
\clearpage
\end{document}

View File

@@ -1,7 +0,0 @@
## Script Descriptions
* `task0.py`: discrete-action task
* `task1.py`: discrete-action task; the only difference from `task0.py` is that the Actor's activation function is tanh instead of relu, which works better on `CartPole-v1`
* `task2.py`: continuous-action task, #TODO to be debugged

View File

@@ -1,24 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
eval_eps: 10
load_checkpoint: true
load_path: Train_CartPole-v1_A2C_20221030-211435
max_steps: 200
mode: test
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
actor_hidden_dim: 256
actor_lr: 0.0003
batch_size: 64
buffer_size: 100000
critic_hidden_dim: 256
critic_lr: 0.001
gamma: 0.99
hidden_dim: 256
target_update: 4

View File

@@ -1,23 +0,0 @@
2022-10-30 21:25:53 - r - INFO: - n_states: 4, n_actions: 2
2022-10-30 21:25:55 - r - INFO: - Start testing!
2022-10-30 21:25:55 - r - INFO: - Env: CartPole-v1, Algorithm: A2C, Device: cuda
2022-10-30 21:25:56 - r - INFO: - Episode: 1/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 2/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 3/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 4/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 5/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 6/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 7/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 8/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 9/20, Reward: 200.0, Step: 200
2022-10-30 21:25:56 - r - INFO: - Episode: 10/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 11/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 12/20, Reward: 190.0, Step: 190
2022-10-30 21:25:57 - r - INFO: - Episode: 13/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 14/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 15/20, Reward: 96.0, Step: 96
2022-10-30 21:25:57 - r - INFO: - Episode: 16/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 17/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 18/20, Reward: 200.0, Step: 200
2022-10-30 21:25:57 - r - INFO: - Episode: 19/20, Reward: 112.0, Step: 112
2022-10-30 21:25:57 - r - INFO: - Episode: 20/20, Reward: 200.0, Step: 200

Binary file not shown (image, 34 KiB).

View File

@@ -1,21 +0,0 @@
episodes,rewards,steps
0,200.0,200
1,200.0,200
2,200.0,200
3,200.0,200
4,200.0,200
5,200.0,200
6,200.0,200
7,200.0,200
8,200.0,200
9,200.0,200
10,200.0,200
11,190.0,190
12,200.0,200
13,200.0,200
14,96.0,96
15,200.0,200
16,200.0,200
17,200.0,200
18,112.0,112
19,200.0,200

View File

@@ -1,25 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
eval_eps: 10
eval_per_episode: 5
load_checkpoint: true
load_path: Train_CartPole-v1_A2C_20221031-232138
max_steps: 200
mode: test
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
actor_hidden_dim: 256
actor_lr: 0.0003
batch_size: 64
buffer_size: 100000
critic_hidden_dim: 256
critic_lr: 0.001
gamma: 0.99
hidden_dim: 256
target_update: 4

View File

@@ -1,28 +0,0 @@
2022-10-31 23:33:16 - r - INFO: - n_states: 4, n_actions: 2
2022-10-31 23:33:16 - r - INFO: - Actor model name: ActorSoftmaxTanh
2022-10-31 23:33:16 - r - INFO: - Critic model name: Critic
2022-10-31 23:33:16 - r - INFO: - ACMemory memory name: PGReplay
2022-10-31 23:33:16 - r - INFO: - agent name: A2C
2022-10-31 23:33:17 - r - INFO: - Start testing!
2022-10-31 23:33:17 - r - INFO: - Env: CartPole-v1, Algorithm: A2C, Device: cuda
2022-10-31 23:33:18 - r - INFO: - Episode: 1/20, Reward: 200.0, Step: 200
2022-10-31 23:33:18 - r - INFO: - Episode: 2/20, Reward: 200.0, Step: 200
2022-10-31 23:33:18 - r - INFO: - Episode: 3/20, Reward: 186.0, Step: 186
2022-10-31 23:33:18 - r - INFO: - Episode: 4/20, Reward: 200.0, Step: 200
2022-10-31 23:33:18 - r - INFO: - Episode: 5/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 6/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 7/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 8/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 9/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 10/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 11/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 12/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 13/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 14/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 15/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 16/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 17/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 18/20, Reward: 200.0, Step: 200
2022-10-31 23:33:19 - r - INFO: - Episode: 19/20, Reward: 200.0, Step: 200
2022-10-31 23:33:20 - r - INFO: - Episode: 20/20, Reward: 200.0, Step: 200
2022-10-31 23:33:20 - r - INFO: - Finish testing!

Binary file not shown (image, 31 KiB).

View File

@@ -1,21 +0,0 @@
episodes,rewards,steps
0,200.0,200
1,200.0,200
2,186.0,186
3,200.0,200
4,200.0,200
5,200.0,200
6,200.0,200
7,200.0,200
8,200.0,200
9,200.0,200
10,200.0,200
11,200.0,200
12,200.0,200
13,200.0,200
14,200.0,200
15,200.0,200
16,200.0,200
17,200.0,200
18,200.0,200
19,200.0,200

View File

@@ -1,23 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
eval_eps: 10
load_checkpoint: false
load_path: tasks
max_steps: 200
mode: train
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
actor_hidden_dim: 256
actor_lr: 0.0003
batch_size: 64
buffer_size: 100000
critic_hidden_dim: 256
critic_lr: 0.001
gamma: 0.99
hidden_dim: 256

Binary file not shown (image, 68 KiB).

View File

@@ -1,24 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
eval_eps: 10
eval_per_episode: 5
load_checkpoint: false
load_path: tasks
max_steps: 200
mode: train
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
actor_hidden_dim: 256
actor_lr: 0.0003
batch_size: 64
buffer_size: 100000
critic_hidden_dim: 256
critic_lr: 0.001
gamma: 0.99
hidden_dim: 256

Binary file not shown (image, 58 KiB).

View File

@@ -1,103 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-08-16 23:05:25
LastEditor: JiangJi
LastEditTime: 2022-11-01 00:33:49
Discription:
'''
import torch
import numpy as np
from torch.distributions import Categorical,Normal
class A2C:
def __init__(self,models,memories,cfg):
self.n_actions = cfg.n_actions
self.gamma = cfg.gamma
self.device = torch.device(cfg.device)
self.continuous = cfg.continuous
if hasattr(cfg,'action_bound'):
self.action_bound = cfg.action_bound
self.memory = memories['ACMemory']
self.actor = models['Actor'].to(self.device)
self.critic = models['Critic'].to(self.device)
self.actor_optim = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
self.critic_optim = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)
def sample_action(self,state):
# state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
# dist = self.actor(state)
# self.entropy = - np.sum(np.mean(dist.detach().cpu().numpy()) * np.log(dist.detach().cpu().numpy()))
# value = self.critic(state) # note that 'dist' need require_grad=True
# self.value = value.detach().cpu().numpy().squeeze(0)[0]
# action = np.random.choice(self.n_actions, p=dist.detach().cpu().numpy().squeeze(0)) # shape(p=(n_actions,1)
# self.log_prob = torch.log(dist.squeeze(0)[action])
if self.continuous:
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
mu, sigma = self.actor(state)
dist = Normal(self.action_bound * mu.view(1,), sigma.view(1,))
action = dist.sample()
value = self.critic(state)
# self.entropy = - np.sum(np.mean(dist.detach().cpu().numpy()) * np.log(dist.detach().cpu().numpy()))
self.value = value.detach().cpu().numpy().squeeze(0)[0] # detach() to avoid gradient
self.log_prob = dist.log_prob(action).squeeze(dim=0) # Tensor([0.])
self.entropy = dist.entropy().cpu().detach().numpy().squeeze(0) # detach() to avoid gradient
return action.cpu().detach().numpy()
else:
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
probs = self.actor(state)
dist = Categorical(probs)
action = dist.sample() # Tensor([0])
value = self.critic(state)
self.value = value.detach().cpu().numpy().squeeze(0)[0] # detach() to avoid gradient
self.log_prob = dist.log_prob(action).squeeze(dim=0) # Tensor([0.])
self.entropy = dist.entropy().cpu().detach().numpy().squeeze(0) # detach() to avoid gradient
return action.cpu().numpy().item()
@torch.no_grad()
def predict_action(self,state):
if self.continuous:
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
mu, sigma = self.actor(state)
dist = Normal(self.action_bound * mu.view(1,), sigma.view(1,))
action = dist.sample()
return action.cpu().detach().numpy()
else:
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
dist = self.actor(state)
# value = self.critic(state) # note that 'dist' need require_grad=True
# value = value.detach().cpu().numpy().squeeze(0)[0]
action = np.random.choice(self.n_actions, p=dist.detach().cpu().numpy().squeeze(0)) # shape(p=(n_actions,1)
return action
def update(self,next_state,entropy):
value_pool,log_prob_pool,reward_pool = self.memory.sample()
value_pool = torch.tensor(value_pool, device=self.device)
log_prob_pool = torch.stack(log_prob_pool)
next_state = torch.tensor(next_state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
next_value = self.critic(next_state)
returns = np.zeros_like(reward_pool)
for t in reversed(range(len(reward_pool))):
next_value = reward_pool[t] + self.gamma * next_value # G(s_{t},a{t}) = r_{t+1} + gamma * V(s_{t+1})
returns[t] = next_value
returns = torch.tensor(returns, device=self.device)
advantages = returns - value_pool
actor_loss = (-log_prob_pool * advantages).mean()
critic_loss = 0.5 * advantages.pow(2).mean()
tot_loss = actor_loss + critic_loss + 0.001 * entropy
self.actor_optim.zero_grad()
self.critic_optim.zero_grad()
tot_loss.backward()
self.actor_optim.step()
self.critic_optim.step()
self.memory.clear()
def save_model(self, path):
from pathlib import Path
# create path
Path(path).mkdir(parents=True, exist_ok=True)
torch.save(self.actor.state_dict(), f"{path}/actor_checkpoint.pt")
torch.save(self.critic.state_dict(), f"{path}/critic_checkpoint.pt")
def load_model(self, path):
self.actor.load_state_dict(torch.load(f"{path}/actor_checkpoint.pt"))
self.critic.load_state_dict(torch.load(f"{path}/critic_checkpoint.pt"))

View File

@@ -1,65 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-09-19 14:48:16
LastEditor: JiangJi
LastEditTime: 2022-10-30 01:21:50
Discription: #TODO to be updated to the new template
'''
import torch
import numpy as np
class A2C_2:
def __init__(self,models,memories,cfg):
self.n_actions = cfg.n_actions
self.gamma = cfg.gamma
self.device = torch.device(cfg.device)
self.memory = memories['ACMemory']
self.ac_net = models['ActorCritic'].to(self.device)
self.ac_optimizer = torch.optim.Adam(self.ac_net.parameters(), lr = cfg.lr)
def sample_action(self,state):
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
value, dist = self.ac_net(state) # note that 'dist' need require_grad=True
value = value.detach().numpy().squeeze(0)[0]
action = np.random.choice(self.n_actions, p=dist.detach().numpy().squeeze(0)) # shape(p=(n_actions,1)
return action,value,dist
def predict_action(self,state):
''' predict can be all wrapped with no_grad(), then donot need detach(), or you can just copy contents of 'sample_action'
'''
with torch.no_grad():
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
value, dist = self.ac_net(state)
value = value.numpy().squeeze(0)[0] # shape(value) = (1,)
action = np.random.choice(self.n_actions, p=dist.numpy().squeeze(0)) # shape(p=(n_actions,1)
return action,value,dist
def update(self,next_state,entropy):
value_pool,log_prob_pool,reward_pool = self.memory.sample()
next_state = torch.tensor(next_state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
next_value,_ = self.ac_net(next_state)
returns = np.zeros_like(reward_pool)
for t in reversed(range(len(reward_pool))):
next_value = reward_pool[t] + self.gamma * next_value # G(s_{t},a{t}) = r_{t+1} + gamma * V(s_{t+1})
returns[t] = next_value
returns = torch.tensor(returns, device=self.device)
value_pool = torch.tensor(value_pool, device=self.device)
advantages = returns - value_pool
log_prob_pool = torch.stack(log_prob_pool)
actor_loss = (-log_prob_pool * advantages).mean()
critic_loss = 0.5 * advantages.pow(2).mean()
ac_loss = actor_loss + critic_loss + 0.001 * entropy
self.ac_optimizer.zero_grad()
ac_loss.backward()
self.ac_optimizer.step()
self.memory.clear()
def save_model(self, path):
from pathlib import Path
# create path
Path(path).mkdir(parents=True, exist_ok=True)
torch.save(self.ac_net.state_dict(), f"{path}/a2c_checkpoint.pt")
def load_model(self, path):
self.ac_net.load_state_dict(torch.load(f"{path}/a2c_checkpoint.pt"))

View File

@@ -1,21 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
mode: test
load_checkpoint: true
load_path: Train_CartPole-v1_A2C_20221031-232138
max_steps: 200
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
continuous: false
batch_size: 64
buffer_size: 100000
gamma: 0.99
actor_lr: 0.0003
critic_lr: 0.001
target_update: 4

View File

@@ -1,19 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: CartPole-v1
mode: train
load_checkpoint: false
load_path: Train_CartPole-v1_DQN_20221026-054757
max_steps: 200
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 600
algo_cfg:
continuous: false
batch_size: 64
buffer_size: 100000
gamma: 0.99
lr: 0.001

View File

@@ -1,21 +0,0 @@
general_cfg:
algo_name: A2C
device: cuda
env_name: Pendulum-v1
mode: train
eval_per_episode: 200
load_checkpoint: false
load_path: Train_CartPole-v1_DQN_20221026-054757
max_steps: 200
save_fig: true
seed: 1
show_fig: false
test_eps: 20
train_eps: 1000
algo_cfg:
continuous: true
batch_size: 64
buffer_size: 100000
gamma: 0.99
actor_lr: 0.0003
critic_lr: 0.001

View File

@@ -1,38 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-30 00:53:03
LastEditor: JiangJi
LastEditTime: 2022-11-01 00:17:55
Discription: default parameters of A2C
'''
from common.config import GeneralConfig,AlgoConfig
class GeneralConfigA2C(GeneralConfig):
def __init__(self) -> None:
self.env_name = "CartPole-v1" # name of environment
self.algo_name = "A2C" # name of algorithm
self.mode = "train" # train or test
self.seed = 1 # random seed
self.device = "cuda" # device to use
self.train_eps = 1000 # number of episodes for training
self.test_eps = 20 # number of episodes for testing
self.max_steps = 200 # max steps for each episode
self.load_checkpoint = False
self.load_path = "tasks" # path to load model
self.show_fig = False # show figure or not
self.save_fig = True # save figure or not
class AlgoConfigA2C(AlgoConfig):
def __init__(self) -> None:
self.continuous = False # continuous or discrete action space
self.hidden_dim = 256 # hidden_dim for MLP
self.gamma = 0.99 # discount factor
self.actor_lr = 3e-4 # learning rate of actor
self.critic_lr = 1e-3 # learning rate of critic
self.actor_hidden_dim = 256 # hidden_dim for actor MLP
self.critic_hidden_dim = 256 # hidden_dim for critic MLP
self.buffer_size = 100000 # size of replay buffer
self.batch_size = 64 # batch size

View File

@@ -1,130 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-09-19 14:48:16
LastEditor: JiangJi
LastEditTime: 2022-10-30 01:21:15
Discription: #TODO to be updated to the new template
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import datetime
import argparse
import gym
import torch
import numpy as np
from common.utils import all_seed
from common.launcher import Launcher
from common.memories import PGReplay
from common.models import ActorCriticSoftmax
from envs.register import register_env
from a2c_2 import A2C_2
class Main(Launcher):
def get_args(self):
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='A2C',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CartPole-v0',type=str,help="name of environment")
parser.add_argument('--train_eps',default=2000,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
parser.add_argument('--ep_max_steps',default = 100000,type=int,help="steps per episode, much larger value can simulate infinite steps")
parser.add_argument('--gamma',default=0.99,type=float,help="discounted factor")
parser.add_argument('--lr',default=3e-4,type=float,help="learning rate")
parser.add_argument('--actor_hidden_dim',default=256,type=int)
parser.add_argument('--critic_hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=10,type=int,help="seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
args = parser.parse_args()
default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
}
args = {**vars(args),**default_args} # type(dict)
return args
def env_agent_config(self,cfg):
''' create env and agent
'''
register_env(cfg['env_name'])
env = gym.make(cfg['env_name'])
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
print(f"n_states: {n_states}, n_actions: {n_actions}")
cfg.update({"n_states":n_states,"n_actions":n_actions}) # update to cfg paramters
models = {'ActorCritic':ActorCriticSoftmax(cfg['n_states'],cfg['n_actions'], actor_hidden_dim = cfg['actor_hidden_dim'],critic_hidden_dim=cfg['critic_hidden_dim'])}
memories = {'ACMemory':PGReplay()}
agent = A2C_2(models,memories,cfg)
return env,agent
def train(self,cfg,env,agent):
print("Start training!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = [] # record steps for all episodes
for i_ep in range(cfg['train_eps']):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
ep_entropy = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg['ep_max_steps']):
action, value, dist = agent.sample_action(state) # sample action
next_state, reward, done, _ = env.step(action) # update env and return transitions
log_prob = torch.log(dist.squeeze(0)[action])
entropy = -np.sum(np.mean(dist.detach().numpy()) * np.log(dist.detach().numpy()))
agent.memory.push((value,log_prob,reward)) # save transitions
state = next_state # update state
ep_reward += reward
ep_entropy += entropy
ep_step += 1
if done:
break
agent.update(next_state,ep_entropy) # update agent
rewards.append(ep_reward)
steps.append(ep_step)
if (i_ep+1)%10==0:
print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Steps:{ep_step}')
print("Finish training!")
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
def test(self,cfg,env,agent):
print("Start testing!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = [] # record steps for all episodes
for i_ep in range(cfg['test_eps']):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg['ep_max_steps']):
action,_,_ = agent.predict_action(state) # predict action
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
ep_step += 1
if done:
break
rewards.append(ep_reward)
steps.append(ep_step)
print(f"Episode: {i_ep+1}/{cfg['test_eps']}, Steps:{ep_step}, Reward: {ep_reward:.2f}")
print("Finish testing!")
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

View File

@@ -1,142 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-30 01:19:43
LastEditor: JiangJi
LastEditTime: 2022-11-01 01:21:06
Discription:
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
from common.utils import all_seed,merge_class_attrs
from common.launcher import Launcher
from common.memories import PGReplay
from common.models import ActorSoftmax,Critic
from envs.register import register_env
from a2c import A2C
from config.config import GeneralConfigA2C,AlgoConfigA2C
class Main(Launcher):
def __init__(self) -> None:
super().__init__()
self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
def env_agent_config(self,cfg,logger):
''' create env and agent
'''
register_env(cfg.env_name)
env = gym.make(cfg.env_name,new_step_api=True) # create env
if cfg.seed !=0: # set random seed
all_seed(env,seed = cfg.seed)
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
# update to cfg paramters
setattr(cfg, 'n_states', n_states)
setattr(cfg, 'n_actions', n_actions)
models = {'Actor':ActorSoftmax(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
memories = {'ACMemory':PGReplay()}
agent = A2C(models,memories,cfg)
for k,v in models.items():
logger.info(f"{k} model name: {type(v).__name__}")
for k,v in memories.items():
logger.info(f"{k} memory name: {type(v).__name__}")
logger.info(f"agent name: {type(agent).__name__}")
return env,agent
def train_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
ep_entropy = 0 # entropy per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.sample_action(state) # sample action
next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
state = next_state # update state
ep_reward += reward
ep_entropy += agent.entropy
ep_step += 1
if terminated:
break
agent.update(next_state,ep_entropy) # update agent
return agent,ep_reward,ep_step
def test_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.predict_action(state) # predict action
next_state, reward, terminated, truncated , info = env.step(action)
state = next_state
ep_reward += reward
ep_step += 1
if terminated:
break
return agent,ep_reward,ep_step
# def train(self,cfg,env,agent,logger):
# logger.info("Start training!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.train_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0 # step per episode
# ep_entropy = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.sample_action(state) # sample action
# next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
# agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
# state = next_state # update state
# ep_reward += reward
# ep_entropy += agent.entropy
# ep_step += 1
# if terminated:
# break
# agent.update(next_state,ep_entropy) # update agent
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish training!")
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
# def test(self,cfg,env,agent,logger):
# logger.info("Start testing!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.test_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.predict_action(state) # predict action
# next_state, reward, terminated, truncated , info = env.step(action)
# state = next_state
# ep_reward += reward
# ep_step += 1
# if terminated:
# break
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish testing!")
# env.close()
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

View File

@@ -1,142 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-30 01:19:43
LastEditor: JiangJi
LastEditTime: 2022-11-01 01:21:12
Discription: the only difference from task0.py is that the actor here uses ActorSoftmaxTanh (tanh activation) instead of ActorSoftmax with ReLU
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
from common.utils import all_seed,merge_class_attrs
from common.launcher import Launcher
from common.memories import PGReplay
from common.models import ActorSoftmaxTanh,Critic
from envs.register import register_env
from a2c import A2C
from config.config import GeneralConfigA2C,AlgoConfigA2C
class Main(Launcher):
def __init__(self) -> None:
super().__init__()
self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
def env_agent_config(self,cfg,logger):
''' create env and agent
'''
register_env(cfg.env_name)
env = gym.make(cfg.env_name,new_step_api=True) # create env
if cfg.seed !=0: # set random seed
all_seed(env,seed = cfg.seed)
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
# update to cfg paramters
setattr(cfg, 'n_states', n_states)
setattr(cfg, 'n_actions', n_actions)
models = {'Actor':ActorSoftmaxTanh(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
memories = {'ACMemory':PGReplay()}
agent = A2C(models,memories,cfg)
for k,v in models.items():
logger.info(f"{k} model name: {type(v).__name__}")
for k,v in memories.items():
logger.info(f"{k} memory name: {type(v).__name__}")
logger.info(f"agent name: {type(agent).__name__}")
return env,agent
def train_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
ep_entropy = 0 # entropy per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.sample_action(state) # sample action
next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
state = next_state # update state
ep_reward += reward
ep_entropy += agent.entropy
ep_step += 1
if terminated:
break
agent.update(next_state,ep_entropy) # update agent
return agent,ep_reward,ep_step
def test_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.predict_action(state) # predict action
next_state, reward, terminated, truncated , info = env.step(action)
state = next_state
ep_reward += reward
ep_step += 1
if terminated:
break
return agent,ep_reward,ep_step
# def train(self,cfg,env,agent,logger):
# logger.info("Start training!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.train_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0 # step per episode
# ep_entropy = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.sample_action(state) # sample action
# next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
# agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
# state = next_state # update state
# ep_reward += reward
# ep_entropy += agent.entropy
# ep_step += 1
# if terminated:
# break
# agent.update(next_state,ep_entropy) # update agent
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish training!")
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
# def test(self,cfg,env,agent,logger):
# logger.info("Start testing!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.test_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.predict_action(state) # predict action
# next_state, reward, terminated, truncated , info = env.step(action)
# state = next_state
# ep_reward += reward
# ep_step += 1
# if terminated:
# break
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish testing!")
# env.close()
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

View File

@@ -1,149 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-30 01:19:43
LastEditor: JiangJi
LastEditTime: 2022-11-01 00:08:22
Discription: continuous action space
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
import torch
import numpy as np
from common.utils import all_seed,merge_class_attrs
from common.launcher import Launcher
from common.memories import PGReplay
from common.models import ActorNormal,Critic
from envs.register import register_env
from a2c import A2C
from config.config import GeneralConfigA2C,AlgoConfigA2C
class Main(Launcher):
def __init__(self) -> None:
super().__init__()
self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
def env_agent_config(self,cfg,logger):
''' create env and agent
'''
register_env(cfg.env_name)
env = gym.make(cfg.env_name,new_step_api=True) # create env
if cfg.seed !=0: # set random seed
all_seed(env,seed = cfg.seed)
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
try:
n_actions = env.action_space.n # action dimension
except AttributeError:
n_actions = env.action_space.shape[0]
logger.info(f"action bound: {abs(env.action_space.low.item())}")
setattr(cfg, 'action_bound', abs(env.action_space.low.item()))
logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
# update to cfg paramters
setattr(cfg, 'n_states', n_states)
setattr(cfg, 'n_actions', n_actions)
models = {'Actor':ActorNormal(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
memories = {'ACMemory':PGReplay()}
agent = A2C(models,memories,cfg)
for k,v in models.items():
logger.info(f"{k} model name: {type(v).__name__}")
for k,v in memories.items():
logger.info(f"{k} memory name: {type(v).__name__}")
logger.info(f"agent name: {type(agent).__name__}")
return env,agent
def train_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
ep_entropy = 0 # entropy per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.sample_action(state) # sample action
next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
state = next_state # update state
ep_reward += reward
ep_entropy += agent.entropy
ep_step += 1
if terminated:
break
agent.update(next_state,ep_entropy) # update agent
return agent,ep_reward,ep_step
def test_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
action = agent.predict_action(state) # predict action
next_state, reward, terminated, truncated , info = env.step(action)
state = next_state
ep_reward += reward
ep_step += 1
if terminated:
break
return agent,ep_reward,ep_step
# def train(self,cfg,env,agent,logger):
# logger.info("Start training!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.train_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0 # step per episode
# ep_entropy = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.sample_action(state) # sample action
# next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions
# agent.memory.push((agent.value,agent.log_prob,reward)) # save transitions
# state = next_state # update state
# ep_reward += reward
# ep_entropy += agent.entropy
# ep_step += 1
# if terminated:
# break
# agent.update(next_state,ep_entropy) # update agent
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish training!")
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
# def test(self,cfg,env,agent,logger):
# logger.info("Start testing!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.test_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# action = agent.predict_action(state) # predict action
# next_state, reward, terminated, truncated , info = env.step(action)
# state = next_state
# ep_reward += reward
# ep_step += 1
# if terminated:
# break
# rewards.append(ep_reward)
# steps.append(ep_step)
# logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
# logger.info("Finish testing!")
# env.close()
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

View File

@@ -1,5 +0,0 @@
## A2C
https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
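
For quick reference, below is a minimal, self-contained sketch of the advantage actor-critic loss used by the scripts in this folder (the 0.5 critic weight and 0.001 entropy coefficient mirror the training loop; the tensors here are dummies standing in for model outputs):

```python
import torch

# dummy rollout of 5 steps; in the real code these come from the Actor/Critic networks
log_probs = torch.randn(5, 1)   # log pi(a_t | s_t) from the actor's Categorical distribution
values = torch.randn(5, 1)      # V(s_t) from the critic
returns = torch.randn(5, 1)     # n-step discounted returns (see compute_returns)
entropy = torch.tensor(0.5)     # mean policy entropy over the rollout

advantage = returns - values                             # A(s_t, a_t) = R_t - V(s_t)
actor_loss = -(log_probs * advantage.detach()).mean()    # policy gradient with advantage baseline
critic_loss = advantage.pow(2).mean()                    # regress V(s_t) towards the return
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy  # entropy bonus encourages exploration
```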

View File

@@ -1,56 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2021-05-03 22:16:08
LastEditor: JiangJi
LastEditTime: 2022-07-20 23:54:40
Description:
Environment:
'''
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
class ActorCritic(nn.Module):
    ''' A2C network model with an Actor and a Critic
'''
def __init__(self, input_dim, output_dim, hidden_dim):
super(ActorCritic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
self.actor = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, output_dim),
nn.Softmax(dim=1),
)
def forward(self, x):
value = self.critic(x)
probs = self.actor(x)
dist = Categorical(probs)
return dist, value
class A2C:
    ''' A2C algorithm
'''
def __init__(self,n_states,n_actions,cfg) -> None:
self.gamma = cfg.gamma
self.device = torch.device(cfg.device)
self.model = ActorCritic(n_states, n_actions, cfg.hidden_size).to(self.device)
self.optimizer = optim.Adam(self.model.parameters())
def compute_returns(self,next_value, rewards, masks):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + self.gamma * R * masks[step]
returns.insert(0, R)
return returns
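# Usage sketch (assumed; mirrors the rollout loop in the multiprocessing training script
# further below): collect `rewards` and `masks` (1 - done) for an n-step rollout,
# bootstrap with the critic's value of the last state, then discount backwards:
#   _, next_value = model(next_state)
#   returns = compute_returns(next_value, rewards, masks)
# and use `returns - values` as the advantage for the actor and critic losses.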

View File

@@ -1,14 +0,0 @@
{
"algo_name": "A2C",
"env_name": "CartPole-v0",
"n_envs": 8,
"max_steps": 20000,
"n_steps": 5,
"gamma": 0.99,
"lr": 0.001,
"hidden_dim": 256,
"deivce": "cpu",
"result_path": "C:\\Users\\24438\\Desktop\\rl-tutorials/outputs/CartPole-v0/20220713-221850/results/",
"model_path": "C:\\Users\\24438\\Desktop\\rl-tutorials/outputs/CartPole-v0/20220713-221850/models/",
"save_fig": true
}

Binary file not shown.


View File

@@ -1,137 +0,0 @@
import sys,os
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add to system path
import gym
import numpy as np
import torch
import torch.optim as optim
import datetime
import argparse
from common.multiprocessing_env import SubprocVecEnv
from a3c import ActorCritic
from common.utils import save_results, make_dir
from common.utils import plot_rewards, save_args
def get_args():
""" Hyperparameters
"""
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # Obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='A2C',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CartPole-v0',type=str,help="name of environment")
parser.add_argument('--n_envs',default=8,type=int,help="numbers of environments")
    parser.add_argument('--max_steps',default=20000,type=int,help="total number of training steps")
    parser.add_argument('--n_steps',default=5,type=int,help="rollout steps per update")
parser.add_argument('--gamma',default=0.99,type=float,help="discounted factor")
parser.add_argument('--lr',default=1e-3,type=float,help="learning rate")
parser.add_argument('--hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/results/' )
parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/models/' ) # path to save models
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
args = parser.parse_args()
return args
def make_envs(env_name):
def _thunk():
env = gym.make(env_name)
env.seed(2)
return env
return _thunk
def test_env(env,model,vis=False):
state = env.reset()
if vis: env.render()
done = False
total_reward = 0
while not done:
state = torch.FloatTensor(state).unsqueeze(0).to(cfg.device)
dist, _ = model(state)
next_state, reward, done, _ = env.step(dist.sample().cpu().numpy()[0])
state = next_state
if vis: env.render()
total_reward += reward
return total_reward
def compute_returns(next_value, rewards, masks, gamma=0.99):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + gamma * R * masks[step]
returns.insert(0, R)
return returns
def train(cfg,envs):
print('Start training!')
print(f'Env:{cfg.env_name}, Algorithm:{cfg.algo_name}, Device:{cfg.device}')
env = gym.make(cfg.env_name) # a single env
env.seed(10)
n_states = envs.observation_space.shape[0]
n_actions = envs.action_space.n
model = ActorCritic(n_states, n_actions, cfg.hidden_dim).to(cfg.device)
optimizer = optim.Adam(model.parameters())
step_idx = 0
test_rewards = []
test_ma_rewards = []
state = envs.reset()
while step_idx < cfg.max_steps:
log_probs = []
values = []
rewards = []
masks = []
entropy = 0
# rollout trajectory
for _ in range(cfg.n_steps):
state = torch.FloatTensor(state).to(cfg.device)
dist, value = model(state)
action = dist.sample()
next_state, reward, done, _ = envs.step(action.cpu().numpy())
log_prob = dist.log_prob(action)
entropy += dist.entropy().mean()
log_probs.append(log_prob)
values.append(value)
rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device))
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device))
state = next_state
step_idx += 1
if step_idx % 100 == 0:
test_reward = np.mean([test_env(env,model) for _ in range(10)])
print(f"step_idx:{step_idx}, test_reward:{test_reward}")
test_rewards.append(test_reward)
if test_ma_rewards:
test_ma_rewards.append(0.9*test_ma_rewards[-1]+0.1*test_reward)
else:
test_ma_rewards.append(test_reward)
# plot(step_idx, test_rewards)
next_state = torch.FloatTensor(next_state).to(cfg.device)
_, next_value = model(next_state)
returns = compute_returns(next_value, rewards, masks)
log_probs = torch.cat(log_probs)
returns = torch.cat(returns).detach()
values = torch.cat(values)
advantage = returns - values
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
print('Finish training')
return {'rewards':test_rewards,'ma_rewards':test_ma_rewards}
if __name__ == "__main__":
cfg = get_args()
envs = [make_envs(cfg.env_name) for i in range(cfg.n_envs)]
envs = SubprocVecEnv(envs)
# training
res_dic = train(cfg,envs)
make_dir(cfg.result_path,cfg.model_path)
save_args(cfg)
save_results(res_dic, tag='train',
path=cfg.result_path)
    plot_rewards(res_dic['rewards'], res_dic['ma_rewards'], cfg, tag="train")  # plot results

View File

@@ -1,96 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-09 20:25:52
@LastEditor: John
LastEditTime: 2022-09-27 15:43:21
@Description:
@Environment: python 3.7.7
'''
import copy
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
class DDPG:
def __init__(self, models,memories,cfg):
        self.device = torch.device(cfg['device'])
        self.critic = models['critic'].to(self.device)
        self.target_critic = copy.deepcopy(self.critic) # target must be a separate copy, not an alias of the online critic
        self.actor = models['actor'].to(self.device)
        self.target_actor = copy.deepcopy(self.actor) # target must be a separate copy, not an alias of the online actor
# copy weights from critic to target_critic
for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
target_param.data.copy_(param.data)
# copy weights from actor to target_actor
for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
target_param.data.copy_(param.data)
self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=cfg['critic_lr'])
self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=cfg['actor_lr'])
self.memory = memories['memory']
self.batch_size = cfg['batch_size']
self.gamma = cfg['gamma']
self.tau = cfg['tau']
def sample_action(self, state):
state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
action = self.actor(state)
return action.detach().cpu().numpy()[0, 0]
@torch.no_grad()
def predict_action(self, state):
''' predict action
'''
state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
action = self.actor(state)
return action.cpu().numpy()[0, 0]
def update(self):
if len(self.memory) < self.batch_size: # when memory size is less than batch size, return
return
# sample a random minibatch of N transitions from R
state, action, reward, next_state, done = self.memory.sample(self.batch_size)
# convert to tensor
state = torch.FloatTensor(np.array(state)).to(self.device)
next_state = torch.FloatTensor(np.array(next_state)).to(self.device)
action = torch.FloatTensor(np.array(action)).to(self.device)
reward = torch.FloatTensor(reward).unsqueeze(1).to(self.device)
done = torch.FloatTensor(np.float32(done)).unsqueeze(1).to(self.device)
policy_loss = self.critic(state, self.actor(state))
policy_loss = -policy_loss.mean()
next_action = self.target_actor(next_state)
target_value = self.target_critic(next_state, next_action.detach())
expected_value = reward + (1.0 - done) * self.gamma * target_value
expected_value = torch.clamp(expected_value, -np.inf, np.inf)
value = self.critic(state, action)
value_loss = nn.MSELoss()(value, expected_value.detach())
self.actor_optimizer.zero_grad()
policy_loss.backward()
self.actor_optimizer.step()
self.critic_optimizer.zero_grad()
value_loss.backward()
self.critic_optimizer.step()
# soft update
for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
target_param.data.copy_(
target_param.data * (1.0 - self.tau) +
param.data * self.tau
)
for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
target_param.data.copy_(
target_param.data * (1.0 - self.tau) +
param.data * self.tau
)
def save_model(self,path):
from pathlib import Path
# create path
Path(path).mkdir(parents=True, exist_ok=True)
torch.save(self.actor.state_dict(), f"{path}/actor_checkpoint.pt")
def load_model(self,path):
self.actor.load_state_dict(torch.load(f"{path}/actor_checkpoint.pt"))
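# Usage sketch (assumed; the task script below shows the full loop):
#   action = agent.sample_action(state)          # deterministic actor output
#   action = ou_noise.get_action(action, step)   # exploration noise is added outside the agent
#   agent.memory.push((state, action, reward, next_state, done))
#   agent.update()                               # one gradient step per environment step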

View File

@@ -1,56 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-10 15:28:30
@LastEditor: John
LastEditTime: 2021-09-16 00:52:30
@Description:
@Environment: python 3.7.7
'''
import gym
import numpy as np
class NormalizedActions(gym.ActionWrapper):
    ''' rescale actions from the policy's [-1, 1] output range to the environment's [low, high] range
'''
def action(self, action):
low_bound = self.action_space.low
upper_bound = self.action_space.high
action = low_bound + (action + 1.0) * 0.5 * (upper_bound - low_bound)
action = np.clip(action, low_bound, upper_bound)
return action
def reverse_action(self, action):
low_bound = self.action_space.low
upper_bound = self.action_space.high
action = 2 * (action - low_bound) / (upper_bound - low_bound) - 1
action = np.clip(action, low_bound, upper_bound)
return action
class OUNoise(object):
    ''' Ornstein-Uhlenbeck noise for exploration in continuous action spaces
'''
def __init__(self, action_space, mu=0.0, theta=0.15, max_sigma=0.3, min_sigma=0.3, decay_period=100000):
        self.mu = mu # OU noise parameter: long-run mean
        self.theta = theta # OU noise parameter: mean-reversion rate
        self.sigma = max_sigma # OU noise parameter: volatility, starts at max_sigma
self.max_sigma = max_sigma
self.min_sigma = min_sigma
self.decay_period = decay_period
self.n_actions = action_space.shape[0]
self.low = action_space.low
self.high = action_space.high
self.reset()
def reset(self):
self.obs = np.ones(self.n_actions) * self.mu
def evolve_obs(self):
x = self.obs
dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(self.n_actions)
self.obs = x + dx
return self.obs
def get_action(self, action, t=0):
ou_obs = self.evolve_obs()
        self.sigma = self.max_sigma - (self.max_sigma - self.min_sigma) * min(1.0, t / self.decay_period) # sigma decays linearly over decay_period
        return np.clip(action + ou_obs, self.low, self.high) # add noise to the action, then clip to the action bounds
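# Usage sketch (assumed; matches the DDPG task script below):
#   env = NormalizedActions(gym.make('Pendulum-v1'))
#   ou_noise = OUNoise(env.action_space)
#   ou_noise.reset()                                   # at the start of each episode
#   action = ou_noise.get_action(agent.sample_action(state), i_step)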

View File

@@ -1,152 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
LastEditTime: 2022-09-27 15:50:12
@Description:
@Environment: python 3.7.7
'''
import sys,os
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add to system path
import datetime
import gym
import torch
import argparse
import torch.nn as nn
import torch.nn.functional as F
from env import NormalizedActions,OUNoise
from ddpg import DDPG
from common.utils import all_seed
from common.memories import ReplayBufferQue
from common.launcher import Launcher
from envs.register import register_env
class Actor(nn.Module):
def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
super(Actor, self).__init__()
self.linear1 = nn.Linear(n_states, hidden_dim)
self.linear2 = nn.Linear(hidden_dim, hidden_dim)
self.linear3 = nn.Linear(hidden_dim, n_actions)
self.linear3.weight.data.uniform_(-init_w, init_w)
self.linear3.bias.data.uniform_(-init_w, init_w)
def forward(self, x):
x = F.relu(self.linear1(x))
x = F.relu(self.linear2(x))
x = torch.tanh(self.linear3(x))
return x
class Critic(nn.Module):
def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
super(Critic, self).__init__()
self.linear1 = nn.Linear(n_states + n_actions, hidden_dim)
self.linear2 = nn.Linear(hidden_dim, hidden_dim)
self.linear3 = nn.Linear(hidden_dim, 1)
        # randomly initialize the output layer weights with small values
self.linear3.weight.data.uniform_(-init_w, init_w)
self.linear3.bias.data.uniform_(-init_w, init_w)
def forward(self, state, action):
        # concatenate state and action along dim 1
x = torch.cat([state, action], 1)
x = F.relu(self.linear1(x))
x = F.relu(self.linear2(x))
x = self.linear3(x)
return x
class Main(Launcher):
def get_args(self):
""" hyperparameters
"""
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='DDPG',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='Pendulum-v1',type=str,help="name of environment")
parser.add_argument('--train_eps',default=300,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
        parser.add_argument('--max_steps',default=100000,type=int,help="max steps per episode; a large value approximates an infinite horizon")
parser.add_argument('--gamma',default=0.99,type=float,help="discounted factor")
parser.add_argument('--critic_lr',default=1e-3,type=float,help="learning rate of critic")
parser.add_argument('--actor_lr',default=1e-4,type=float,help="learning rate of actor")
parser.add_argument('--memory_capacity',default=8000,type=int,help="memory capacity")
parser.add_argument('--batch_size',default=128,type=int)
parser.add_argument('--target_update',default=2,type=int)
parser.add_argument('--tau',default=1e-2,type=float)
parser.add_argument('--critic_hidden_dim',default=256,type=int)
parser.add_argument('--actor_hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=1,type=int,help="random seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
args = parser.parse_args()
default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
}
args = {**vars(args),**default_args} # type(dict)
return args
def env_agent_config(self,cfg):
register_env(cfg['env_name'])
env = gym.make(cfg['env_name'])
env = NormalizedActions(env) # decorate with action noise
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
n_states = env.observation_space.shape[0]
n_actions = env.action_space.shape[0]
print(f"n_states: {n_states}, n_actions: {n_actions}")
cfg.update({"n_states":n_states,"n_actions":n_actions}) # update to cfg paramters
models = {"actor":Actor(n_states,n_actions,hidden_dim=cfg['actor_hidden_dim']),"critic":Critic(n_states,n_actions,hidden_dim=cfg['critic_hidden_dim'])}
memories = {"memory":ReplayBufferQue(cfg['memory_capacity'])}
agent = DDPG(models,memories,cfg)
return env,agent
def train(self,cfg, env, agent):
print('Start training!')
ou_noise = OUNoise(env.action_space) # noise of action
rewards = [] # record rewards for all episodes
for i_ep in range(cfg['train_eps']):
state = env.reset()
ou_noise.reset()
ep_reward = 0
for i_step in range(cfg['max_steps']):
action = agent.sample_action(state)
action = ou_noise.get_action(action, i_step+1)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push((state, action, reward, next_state, done))
agent.update()
state = next_state
if done:
break
if (i_ep+1)%10 == 0:
print(f"Env:{i_ep+1}/{cfg['train_eps']}, Reward:{ep_reward:.2f}")
rewards.append(ep_reward)
print('Finish training!')
return {'rewards':rewards}
def test(self,cfg, env, agent):
print('Start testing!')
rewards = [] # record rewards for all episodes
for i_ep in range(cfg['test_eps']):
state = env.reset()
ep_reward = 0
for i_step in range(cfg['max_steps']):
action = agent.predict_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
state = next_state
if done:
break
rewards.append(ep_reward)
print(f"Episode:{i_ep+1}/{cfg['test_eps']}, Reward:{ep_reward:.1f}")
print('Finish testing!')
return {'rewards':rewards}
if __name__ == "__main__":
main = Main()
main.run()

View File

@@ -1,25 +0,0 @@
{
"algo_name": "DDPG",
"env_name": "Pendulum-v1",
"train_eps": 300,
"test_eps": 20,
"max_steps": 100000,
"gamma": 0.99,
"critic_lr": 0.001,
"actor_lr": 0.0001,
"memory_capacity": 8000,
"batch_size": 128,
"target_update": 2,
"tau": 0.01,
"critic_hidden_dim": 256,
"actor_hidden_dim": 256,
"device": "cpu",
"seed": 1,
"show_fig": false,
"save_fig": true,
"result_path": "/Users/jj/Desktop/rl-tutorials/codes/DDPG/outputs/Pendulum-v1/20220927-155053/results/",
"model_path": "/Users/jj/Desktop/rl-tutorials/codes/DDPG/outputs/Pendulum-v1/20220927-155053/models/",
"n_states": 3,
"n_actions": 1,
"training_time": 358.8142900466919
}

Binary file not shown.


View File

@@ -1,21 +0,0 @@
rewards
-116.045416124376
-126.18022935469217
-231.46338228458293
-246.40481094689758
-304.69493818839186
-124.39609191913091
-1.060003582878406
-114.19659653048288
-348.9745708742037
-116.10811133324769
-117.20146333694844
-118.66206784602966
-235.17836229762355
-356.14054913290624
-118.38579118156366
-351.9415915140771
-114.50877866098972
-124.775484599685
-226.47062962476875
-121.48872909193936

Binary file not shown.


View File

@@ -1,301 +0,0 @@
rewards
-1557.8518596631177
-1354.7599369723537
-1375.5732016629706
-1493.8609739040871
-1426.7116204537845
-1235.7920755027762
-1339.1647620443073
-1544.2379906560486
-1539.6232758780877
-1549.5690058648204
-1446.9193195793853
-1520.2666688767558
-1525.0116707122581
-1379.136573640111
-1532.702831768523
-1484.7552963941637
-1359.6699201737677
-1349.6805649166854
-1510.869999766432
-1515.8398785434708
-1447.4648656578254
-1537.3822077872178
-1249.6517039877456
-1350.0302666965736
-1529.4363372505607
-1320.28204807604
-1502.9248141320654
-1545.4861772197075
-1579.928789692619
-1413.296070504152
-1242.4673258663781
-1403.8672028946078
-1452.7199002523635
-871.6071114009982
-1324.1789316121412
-1313.3348146041249
-1059.8722927418046
-1054.232673559123
-973.8956270782459
-972.9936641224186
-972.9477399905655
-947.0613443333731
-737.3866328989184
-958.6068164634295
-739.6973395350705
-886.8383108399455
-775.1430379821574
-937.3115016337417
-700.875502951337
-829.9396339144109
-271.1629773396998
-493.5460684734584
-485.9321719313203
-858.3735607086766
-1145.3440084994113
-1121.1338201339777
-1191.5640831332366
-1350.0425368846784
-249.25438665107953
-727.9051714734406
-368.5579316240395
-392.0611344939354
-955.3231703741553
-488.27956192035265
-362.2734695759137
-949.5440839122496
-496.8460016912189
-726.6871514929877
-424.48641462866266
-954.7075428204689
-608.9650086409792
-848.6059768900151
-866.7052398755033
-856.9846415044439
-751.0342976129083
-749.5118249469103
-509.882299129811
-506.56154097018043
-906.0964475820368
-1318.3941416286855
-1422.2017011876615
-1523.1661091894277
-1209.2850593747999
-1415.0972750475833
-1533.2263827605834
-1405.8345530072663
-1244.3384723384913
-1237.4704845061992
-949.3394417935086
-981.1855396112669
-1241.224568444032
-1033.118364799829
-1017.2403725619487
-981.9727804516916
-853.1877724775591
-869.0652369861646
-1069.8265343327998
-371.73173813891884
-735.5887912713665
-1262.050240428957
-1242.985056062197
-1191.6867713427482
-1328.5323118458034
-1015.5308653784714
-895.3066515461381
-994.1114862316568
-761.4710321387583
-717.6979056272868
-782.302146467708
-640.4913147345328
-725.6469893076355
-497.5346232085584
-1027.1192149202325
-950.0117149822681
-956.1343737377374
-708.9489626669097
-964.5003064113283
-611.9111516886613
-612.3182791021098
-1100.0047939174613
-984.9262458612923
-858.7106075590494
-842.305917848386
-745.9043991922597
-741.2168858394704
-1143.0750387284456
-755.5257242325362
-745.8440029056219
-387.8717950334138
-764.6628701051523
-486.7967495537958
-485.13357559164814
-313.5415216767419
-611.3450529954782
-611.1570544377465
-507.6456747676814
-615.2032627013064
-242.37988821149764
-603.85498620892
-352.2672241055367
-155.99874664988383
-615.4003063516313
-384.9811293551548
-498.80727354456315
-407.6898591217813
-1213.6383844696395
-1122.2425748913884
-592.4819308883913
-478.2046833075051
-891.0254788311132
-482.40204115385
-339.34676196677407
-582.9985110154428
-213.38243627478826
-928.8434951613825
-1545.5433749195483
-1179.5016285049896
-1211.9549773601925
-1396.8082561792166
-1318.073128824395
-597.3837225413702
-564.7793352410449
-723.744223659601
-653.0145534050461
-847.6138123247009
-385.62784320332867
-245.25250602651928
-117.55094416757835
-864.0064774069044
-124.30221387458867
-244.4014050243669
-1148.861754008653
-914.4047868424254
-765.9394408203351
-124.05114610943177
-605.7641303826842
-616.3595829453579
-375.5024692962698
-253.51874076866997
-240.08405245866714
-503.96565579077225
-606.7646526173963
-502.6512112729435
-746.404013238678
-718.8658110051653
-125.65808359856703
-247.62256797883364
-363.69852213666803
-249.21801061415547
-491.7724416523124
-235.37050442527357
-609.6026403583944
-236.05731608228092
-381.19853850450454
-298.7683201867404
-127.64145601534942
-233.4300138495176
-129.11243486763516
-390.0092951263507
-1000.7729892969854
-249.60445310459787
-253.02347910759622
-129.04269174391223
-360.6321251486308
-377.26297602576534
-124.98466986009481
-245.47913567739212
-127.0885254550411
-118.11013006825459
-128.8682755001942
-497.3015586531096
-340.77352433313484
-514.4945799737978
-503.24077308842783
-627.9068157464455
-511.39396524392146
-763.8866112068075
-741.7885082408757
-617.4945380476306
-950.3176437519387
-643.4791402436576
-511.9377874351982
-573.6219349516633
-564.1297823875693
-242.06399233336583
-496.4020380325518
-360.56387982880364
-495.4590728336022
-503.7263345016764
-122.47964616802327
-254.16543926263168
-614.5335268729743
-234.3718017676852
-301.27514663062874
-387.64758894986204
-368.74492411716415
-364.43559131093593
-160.6845848115533
-504.1948947975429
-246.51676032967683
-251.5732500220603
-600.1463819723879
-247.17476928471288
-381.924164337607
-377.4773226068174
-378.511830774651
-126.69199895843033
-365.0506645811703
-130.45052114802874
-374.37400288581813
-502.37678159638887
-374.43552658473055
-241.157211525502
-388.9597456642503
-249.4412385534861
-114.71395078439846
-864.6882327286056
-626.8144095971478
-732.9226896140248
-368.24767905020394
-369.7425524469132
-398.07832598184626
-906.7113918582257
-252.2343258180765
-370.4258473086036
-736.0203154396909
-609.4605173515027
-661.1255920773486
-489.9605291008584
-364.1671188501402
-644.4029089587781
-477.9510457677364
-128.78294672880136
-373.74382001694886
-380.69931133982936
-372.60275628381805
-743.0410655515724
-597.558847789258
-387.94245652694394
-725.3939448944484
-409.1301313430852
-491.8442467896486
-123.0638156839621
-377.9292326597324
-489.27209762667974
-255.63227821371257
-379.5885382060625
-370.2312967024669
-250.94061817008688
-131.2125308195906
-600.3312016651868
-130.84444772735733
-312.6287688438562
-382.4144610039701
-259.03558003697265
-224.92206667096863
-376.81390821359685
-382.39993489751646
-380.25599578593636
-610.1016672243638

View File

@@ -1,25 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: CartPole-v1
eval_eps: 10
eval_per_episode: 5
load_checkpoint: true
load_path: Train_CartPole-v1_DQN_20221031-001201
max_steps: 200
mode: test
save_fig: true
seed: 0
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 64
buffer_size: 100000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
hidden_dim: 256
lr: 0.0001
target_update: 4

View File

@@ -1,14 +0,0 @@
2022-10-31 00:13:43 - r - INFO: - n_states: 4, n_actions: 2
2022-10-31 00:13:44 - r - INFO: - Start testing!
2022-10-31 00:13:44 - r - INFO: - Env: CartPole-v1, Algorithm: DQN, Device: cuda
2022-10-31 00:13:45 - r - INFO: - Episode: 1/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 2/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 3/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 4/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 5/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 6/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 7/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 8/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 9/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Episode: 10/10, Reward: 200.0, Step: 200
2022-10-31 00:13:45 - r - INFO: - Finish testing!

Binary file not shown.


View File

@@ -1,11 +0,0 @@
episodes,rewards,steps
0,200.0,200
1,200.0,200
2,200.0,200
3,200.0,200
4,200.0,200
5,200.0,200
6,200.0,200
7,200.0,200
8,200.0,200
9,200.0,200

View File

@@ -1,23 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: Acrobot-v1
load_checkpoint: false
load_path: Train_CartPole-v1_DQN_20221026-054757
max_steps: 100000
mode: train
save_fig: true
seed: 1
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 128
buffer_size: 200000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
hidden_dim: 256
lr: 0.002
target_update: 4

View File

@@ -1,104 +0,0 @@
2022-10-26 09:46:45 - r - INFO: - n_states: 6, n_actions: 3
2022-10-26 09:46:48 - r - INFO: - Start training!
2022-10-26 09:46:48 - r - INFO: - Env: Acrobot-v1, Algorithm: DQN, Device: cuda
2022-10-26 09:46:50 - r - INFO: - Episode: 1/100, Reward: -861.00: Epislon: 0.178
2022-10-26 09:46:50 - r - INFO: - Episode: 2/100, Reward: -252.00: Epislon: 0.111
2022-10-26 09:46:50 - r - INFO: - Episode: 3/100, Reward: -196.00: Epislon: 0.078
2022-10-26 09:46:51 - r - INFO: - Episode: 4/100, Reward: -390.00: Epislon: 0.041
2022-10-26 09:46:52 - r - INFO: - Episode: 5/100, Reward: -371.00: Epislon: 0.025
2022-10-26 09:46:52 - r - INFO: - Episode: 6/100, Reward: -237.00: Epislon: 0.019
2022-10-26 09:46:52 - r - INFO: - Episode: 7/100, Reward: -227.00: Epislon: 0.016
2022-10-26 09:46:53 - r - INFO: - Episode: 8/100, Reward: -228.00: Epislon: 0.014
2022-10-26 09:46:53 - r - INFO: - Episode: 9/100, Reward: -305.00: Epislon: 0.012
2022-10-26 09:46:54 - r - INFO: - Episode: 10/100, Reward: -234.00: Epislon: 0.011
2022-10-26 09:46:54 - r - INFO: - Episode: 11/100, Reward: -204.00: Epislon: 0.011
2022-10-26 09:46:55 - r - INFO: - Episode: 12/100, Reward: -277.00: Epislon: 0.010
2022-10-26 09:46:55 - r - INFO: - Episode: 13/100, Reward: -148.00: Epislon: 0.010
2022-10-26 09:46:56 - r - INFO: - Episode: 14/100, Reward: -372.00: Epislon: 0.010
2022-10-26 09:46:56 - r - INFO: - Episode: 15/100, Reward: -273.00: Epislon: 0.010
2022-10-26 09:46:56 - r - INFO: - Episode: 16/100, Reward: -105.00: Epislon: 0.010
2022-10-26 09:46:56 - r - INFO: - Episode: 17/100, Reward: -79.00: Epislon: 0.010
2022-10-26 09:46:57 - r - INFO: - Episode: 18/100, Reward: -112.00: Epislon: 0.010
2022-10-26 09:46:57 - r - INFO: - Episode: 19/100, Reward: -276.00: Epislon: 0.010
2022-10-26 09:46:57 - r - INFO: - Episode: 20/100, Reward: -148.00: Epislon: 0.010
2022-10-26 09:46:58 - r - INFO: - Episode: 21/100, Reward: -201.00: Epislon: 0.010
2022-10-26 09:46:58 - r - INFO: - Episode: 22/100, Reward: -173.00: Epislon: 0.010
2022-10-26 09:46:58 - r - INFO: - Episode: 23/100, Reward: -226.00: Epislon: 0.010
2022-10-26 09:46:59 - r - INFO: - Episode: 24/100, Reward: -154.00: Epislon: 0.010
2022-10-26 09:46:59 - r - INFO: - Episode: 25/100, Reward: -269.00: Epislon: 0.010
2022-10-26 09:46:59 - r - INFO: - Episode: 26/100, Reward: -191.00: Epislon: 0.010
2022-10-26 09:47:00 - r - INFO: - Episode: 27/100, Reward: -177.00: Epislon: 0.010
2022-10-26 09:47:00 - r - INFO: - Episode: 28/100, Reward: -209.00: Epislon: 0.010
2022-10-26 09:47:00 - r - INFO: - Episode: 29/100, Reward: -116.00: Epislon: 0.010
2022-10-26 09:47:00 - r - INFO: - Episode: 30/100, Reward: -117.00: Epislon: 0.010
2022-10-26 09:47:01 - r - INFO: - Episode: 31/100, Reward: -121.00: Epislon: 0.010
2022-10-26 09:47:01 - r - INFO: - Episode: 32/100, Reward: -208.00: Epislon: 0.010
2022-10-26 09:47:01 - r - INFO: - Episode: 33/100, Reward: -147.00: Epislon: 0.010
2022-10-26 09:47:02 - r - INFO: - Episode: 34/100, Reward: -104.00: Epislon: 0.010
2022-10-26 09:47:02 - r - INFO: - Episode: 35/100, Reward: -161.00: Epislon: 0.010
2022-10-26 09:47:02 - r - INFO: - Episode: 36/100, Reward: -144.00: Epislon: 0.010
2022-10-26 09:47:02 - r - INFO: - Episode: 37/100, Reward: -131.00: Epislon: 0.010
2022-10-26 09:47:03 - r - INFO: - Episode: 38/100, Reward: -226.00: Epislon: 0.010
2022-10-26 09:47:03 - r - INFO: - Episode: 39/100, Reward: -117.00: Epislon: 0.010
2022-10-26 09:47:03 - r - INFO: - Episode: 40/100, Reward: -344.00: Epislon: 0.010
2022-10-26 09:47:04 - r - INFO: - Episode: 41/100, Reward: -123.00: Epislon: 0.010
2022-10-26 09:47:04 - r - INFO: - Episode: 42/100, Reward: -232.00: Epislon: 0.010
2022-10-26 09:47:04 - r - INFO: - Episode: 43/100, Reward: -190.00: Epislon: 0.010
2022-10-26 09:47:05 - r - INFO: - Episode: 44/100, Reward: -176.00: Epislon: 0.010
2022-10-26 09:47:05 - r - INFO: - Episode: 45/100, Reward: -139.00: Epislon: 0.010
2022-10-26 09:47:06 - r - INFO: - Episode: 46/100, Reward: -410.00: Epislon: 0.010
2022-10-26 09:47:06 - r - INFO: - Episode: 47/100, Reward: -115.00: Epislon: 0.010
2022-10-26 09:47:06 - r - INFO: - Episode: 48/100, Reward: -118.00: Epislon: 0.010
2022-10-26 09:47:06 - r - INFO: - Episode: 49/100, Reward: -113.00: Epislon: 0.010
2022-10-26 09:47:07 - r - INFO: - Episode: 50/100, Reward: -355.00: Epislon: 0.010
2022-10-26 09:47:07 - r - INFO: - Episode: 51/100, Reward: -110.00: Epislon: 0.010
2022-10-26 09:47:07 - r - INFO: - Episode: 52/100, Reward: -148.00: Epislon: 0.010
2022-10-26 09:47:08 - r - INFO: - Episode: 53/100, Reward: -135.00: Epislon: 0.010
2022-10-26 09:47:08 - r - INFO: - Episode: 54/100, Reward: -220.00: Epislon: 0.010
2022-10-26 09:47:08 - r - INFO: - Episode: 55/100, Reward: -157.00: Epislon: 0.010
2022-10-26 09:47:09 - r - INFO: - Episode: 56/100, Reward: -130.00: Epislon: 0.010
2022-10-26 09:47:09 - r - INFO: - Episode: 57/100, Reward: -150.00: Epislon: 0.010
2022-10-26 09:47:09 - r - INFO: - Episode: 58/100, Reward: -254.00: Epislon: 0.010
2022-10-26 09:47:10 - r - INFO: - Episode: 59/100, Reward: -148.00: Epislon: 0.010
2022-10-26 09:47:10 - r - INFO: - Episode: 60/100, Reward: -108.00: Epislon: 0.010
2022-10-26 09:47:10 - r - INFO: - Episode: 61/100, Reward: -152.00: Epislon: 0.010
2022-10-26 09:47:10 - r - INFO: - Episode: 62/100, Reward: -107.00: Epislon: 0.010
2022-10-26 09:47:10 - r - INFO: - Episode: 63/100, Reward: -110.00: Epislon: 0.010
2022-10-26 09:47:11 - r - INFO: - Episode: 64/100, Reward: -266.00: Epislon: 0.010
2022-10-26 09:47:11 - r - INFO: - Episode: 65/100, Reward: -344.00: Epislon: 0.010
2022-10-26 09:47:12 - r - INFO: - Episode: 66/100, Reward: -93.00: Epislon: 0.010
2022-10-26 09:47:12 - r - INFO: - Episode: 67/100, Reward: -113.00: Epislon: 0.010
2022-10-26 09:47:12 - r - INFO: - Episode: 68/100, Reward: -191.00: Epislon: 0.010
2022-10-26 09:47:12 - r - INFO: - Episode: 69/100, Reward: -102.00: Epislon: 0.010
2022-10-26 09:47:13 - r - INFO: - Episode: 70/100, Reward: -187.00: Epislon: 0.010
2022-10-26 09:47:13 - r - INFO: - Episode: 71/100, Reward: -158.00: Epislon: 0.010
2022-10-26 09:47:13 - r - INFO: - Episode: 72/100, Reward: -166.00: Epislon: 0.010
2022-10-26 09:47:14 - r - INFO: - Episode: 73/100, Reward: -202.00: Epislon: 0.010
2022-10-26 09:47:14 - r - INFO: - Episode: 74/100, Reward: -179.00: Epislon: 0.010
2022-10-26 09:47:14 - r - INFO: - Episode: 75/100, Reward: -150.00: Epislon: 0.010
2022-10-26 09:47:14 - r - INFO: - Episode: 76/100, Reward: -170.00: Epislon: 0.010
2022-10-26 09:47:15 - r - INFO: - Episode: 77/100, Reward: -149.00: Epislon: 0.010
2022-10-26 09:47:15 - r - INFO: - Episode: 78/100, Reward: -119.00: Epislon: 0.010
2022-10-26 09:47:15 - r - INFO: - Episode: 79/100, Reward: -115.00: Epislon: 0.010
2022-10-26 09:47:15 - r - INFO: - Episode: 80/100, Reward: -97.00: Epislon: 0.010
2022-10-26 09:47:16 - r - INFO: - Episode: 81/100, Reward: -153.00: Epislon: 0.010
2022-10-26 09:47:16 - r - INFO: - Episode: 82/100, Reward: -97.00: Epislon: 0.010
2022-10-26 09:47:16 - r - INFO: - Episode: 83/100, Reward: -211.00: Epislon: 0.010
2022-10-26 09:47:16 - r - INFO: - Episode: 84/100, Reward: -195.00: Epislon: 0.010
2022-10-26 09:47:17 - r - INFO: - Episode: 85/100, Reward: -125.00: Epislon: 0.010
2022-10-26 09:47:17 - r - INFO: - Episode: 86/100, Reward: -155.00: Epislon: 0.010
2022-10-26 09:47:17 - r - INFO: - Episode: 87/100, Reward: -151.00: Epislon: 0.010
2022-10-26 09:47:18 - r - INFO: - Episode: 88/100, Reward: -194.00: Epislon: 0.010
2022-10-26 09:47:18 - r - INFO: - Episode: 89/100, Reward: -188.00: Epislon: 0.010
2022-10-26 09:47:18 - r - INFO: - Episode: 90/100, Reward: -195.00: Epislon: 0.010
2022-10-26 09:47:19 - r - INFO: - Episode: 91/100, Reward: -141.00: Epislon: 0.010
2022-10-26 09:47:19 - r - INFO: - Episode: 92/100, Reward: -132.00: Epislon: 0.010
2022-10-26 09:47:19 - r - INFO: - Episode: 93/100, Reward: -127.00: Epislon: 0.010
2022-10-26 09:47:19 - r - INFO: - Episode: 94/100, Reward: -195.00: Epislon: 0.010
2022-10-26 09:47:20 - r - INFO: - Episode: 95/100, Reward: -152.00: Epislon: 0.010
2022-10-26 09:47:20 - r - INFO: - Episode: 96/100, Reward: -145.00: Epislon: 0.010
2022-10-26 09:47:20 - r - INFO: - Episode: 97/100, Reward: -123.00: Epislon: 0.010
2022-10-26 09:47:20 - r - INFO: - Episode: 98/100, Reward: -176.00: Epislon: 0.010
2022-10-26 09:47:21 - r - INFO: - Episode: 99/100, Reward: -180.00: Epislon: 0.010
2022-10-26 09:47:21 - r - INFO: - Episode: 100/100, Reward: -124.00: Epislon: 0.010
2022-10-26 09:47:21 - r - INFO: - Finish training!

Binary file not shown.


View File

@@ -1,101 +0,0 @@
episodes,rewards,steps
0,-861.0,862
1,-252.0,253
2,-196.0,197
3,-390.0,391
4,-371.0,372
5,-237.0,238
6,-227.0,228
7,-228.0,229
8,-305.0,306
9,-234.0,235
10,-204.0,205
11,-277.0,278
12,-148.0,149
13,-372.0,373
14,-273.0,274
15,-105.0,106
16,-79.0,80
17,-112.0,113
18,-276.0,277
19,-148.0,149
20,-201.0,202
21,-173.0,174
22,-226.0,227
23,-154.0,155
24,-269.0,270
25,-191.0,192
26,-177.0,178
27,-209.0,210
28,-116.0,117
29,-117.0,118
30,-121.0,122
31,-208.0,209
32,-147.0,148
33,-104.0,105
34,-161.0,162
35,-144.0,145
36,-131.0,132
37,-226.0,227
38,-117.0,118
39,-344.0,345
40,-123.0,124
41,-232.0,233
42,-190.0,191
43,-176.0,177
44,-139.0,140
45,-410.0,411
46,-115.0,116
47,-118.0,119
48,-113.0,114
49,-355.0,356
50,-110.0,111
51,-148.0,149
52,-135.0,136
53,-220.0,221
54,-157.0,158
55,-130.0,131
56,-150.0,151
57,-254.0,255
58,-148.0,149
59,-108.0,109
60,-152.0,153
61,-107.0,108
62,-110.0,111
63,-266.0,267
64,-344.0,345
65,-93.0,94
66,-113.0,114
67,-191.0,192
68,-102.0,103
69,-187.0,188
70,-158.0,159
71,-166.0,167
72,-202.0,203
73,-179.0,180
74,-150.0,151
75,-170.0,171
76,-149.0,150
77,-119.0,120
78,-115.0,116
79,-97.0,98
80,-153.0,154
81,-97.0,98
82,-211.0,212
83,-195.0,196
84,-125.0,126
85,-155.0,156
86,-151.0,152
87,-194.0,195
88,-188.0,189
89,-195.0,196
90,-141.0,142
91,-132.0,133
92,-127.0,128
93,-195.0,196
94,-152.0,153
95,-145.0,146
96,-123.0,124
97,-176.0,177
98,-180.0,181
99,-124.0,125

View File

@@ -1,25 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: CartPole-v1
eval_eps: 10
eval_per_episode: 5
load_checkpoint: false
load_path: tasks
max_steps: 200
mode: train
save_fig: true
seed: 1
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 64
buffer_size: 100000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
hidden_dim: 256
lr: 0.0001
target_update: 800

View File

@@ -1,116 +0,0 @@
2022-10-31 00:12:01 - r - INFO: - n_states: 4, n_actions: 2
2022-10-31 00:12:01 - r - INFO: - Start training!
2022-10-31 00:12:01 - r - INFO: - Env: CartPole-v1, Algorithm: DQN, Device: cuda
2022-10-31 00:12:04 - r - INFO: - Episode: 1/100, Reward: 18.0, Step: 18
2022-10-31 00:12:04 - r - INFO: - Episode: 2/100, Reward: 35.0, Step: 35
2022-10-31 00:12:04 - r - INFO: - Episode: 3/100, Reward: 13.0, Step: 13
2022-10-31 00:12:04 - r - INFO: - Episode: 4/100, Reward: 32.0, Step: 32
2022-10-31 00:12:04 - r - INFO: - Episode: 5/100, Reward: 16.0, Step: 16
2022-10-31 00:12:04 - r - INFO: - Current episode 5 has the best eval reward: 15.30
2022-10-31 00:12:04 - r - INFO: - Episode: 6/100, Reward: 12.0, Step: 12
2022-10-31 00:12:04 - r - INFO: - Episode: 7/100, Reward: 13.0, Step: 13
2022-10-31 00:12:04 - r - INFO: - Episode: 8/100, Reward: 15.0, Step: 15
2022-10-31 00:12:04 - r - INFO: - Episode: 9/100, Reward: 11.0, Step: 11
2022-10-31 00:12:04 - r - INFO: - Episode: 10/100, Reward: 15.0, Step: 15
2022-10-31 00:12:04 - r - INFO: - Episode: 11/100, Reward: 9.0, Step: 9
2022-10-31 00:12:04 - r - INFO: - Episode: 12/100, Reward: 13.0, Step: 13
2022-10-31 00:12:04 - r - INFO: - Episode: 13/100, Reward: 13.0, Step: 13
2022-10-31 00:12:04 - r - INFO: - Episode: 14/100, Reward: 10.0, Step: 10
2022-10-31 00:12:04 - r - INFO: - Episode: 15/100, Reward: 9.0, Step: 9
2022-10-31 00:12:04 - r - INFO: - Episode: 16/100, Reward: 24.0, Step: 24
2022-10-31 00:12:04 - r - INFO: - Episode: 17/100, Reward: 8.0, Step: 8
2022-10-31 00:12:04 - r - INFO: - Episode: 18/100, Reward: 10.0, Step: 10
2022-10-31 00:12:04 - r - INFO: - Episode: 19/100, Reward: 11.0, Step: 11
2022-10-31 00:12:04 - r - INFO: - Episode: 20/100, Reward: 13.0, Step: 13
2022-10-31 00:12:04 - r - INFO: - Episode: 21/100, Reward: 12.0, Step: 12
2022-10-31 00:12:04 - r - INFO: - Episode: 22/100, Reward: 11.0, Step: 11
2022-10-31 00:12:04 - r - INFO: - Episode: 23/100, Reward: 9.0, Step: 9
2022-10-31 00:12:04 - r - INFO: - Episode: 24/100, Reward: 21.0, Step: 21
2022-10-31 00:12:05 - r - INFO: - Episode: 25/100, Reward: 14.0, Step: 14
2022-10-31 00:12:05 - r - INFO: - Episode: 26/100, Reward: 12.0, Step: 12
2022-10-31 00:12:05 - r - INFO: - Episode: 27/100, Reward: 9.0, Step: 9
2022-10-31 00:12:05 - r - INFO: - Episode: 28/100, Reward: 11.0, Step: 11
2022-10-31 00:12:05 - r - INFO: - Episode: 29/100, Reward: 12.0, Step: 12
2022-10-31 00:12:05 - r - INFO: - Episode: 30/100, Reward: 13.0, Step: 13
2022-10-31 00:12:05 - r - INFO: - Episode: 31/100, Reward: 10.0, Step: 10
2022-10-31 00:12:05 - r - INFO: - Episode: 32/100, Reward: 13.0, Step: 13
2022-10-31 00:12:05 - r - INFO: - Episode: 33/100, Reward: 18.0, Step: 18
2022-10-31 00:12:05 - r - INFO: - Episode: 34/100, Reward: 9.0, Step: 9
2022-10-31 00:12:05 - r - INFO: - Episode: 35/100, Reward: 10.0, Step: 10
2022-10-31 00:12:05 - r - INFO: - Episode: 36/100, Reward: 9.0, Step: 9
2022-10-31 00:12:05 - r - INFO: - Episode: 37/100, Reward: 10.0, Step: 10
2022-10-31 00:12:05 - r - INFO: - Episode: 38/100, Reward: 10.0, Step: 10
2022-10-31 00:12:05 - r - INFO: - Episode: 39/100, Reward: 10.0, Step: 10
2022-10-31 00:12:05 - r - INFO: - Episode: 40/100, Reward: 8.0, Step: 8
2022-10-31 00:12:06 - r - INFO: - Episode: 41/100, Reward: 9.0, Step: 9
2022-10-31 00:12:06 - r - INFO: - Episode: 42/100, Reward: 9.0, Step: 9
2022-10-31 00:12:06 - r - INFO: - Episode: 43/100, Reward: 20.0, Step: 20
2022-10-31 00:12:06 - r - INFO: - Episode: 44/100, Reward: 16.0, Step: 16
2022-10-31 00:12:06 - r - INFO: - Episode: 45/100, Reward: 17.0, Step: 17
2022-10-31 00:12:06 - r - INFO: - Current episode 45 has the best eval reward: 17.50
2022-10-31 00:12:06 - r - INFO: - Episode: 46/100, Reward: 17.0, Step: 17
2022-10-31 00:12:06 - r - INFO: - Episode: 47/100, Reward: 17.0, Step: 17
2022-10-31 00:12:06 - r - INFO: - Episode: 48/100, Reward: 18.0, Step: 18
2022-10-31 00:12:06 - r - INFO: - Episode: 49/100, Reward: 25.0, Step: 25
2022-10-31 00:12:06 - r - INFO: - Episode: 50/100, Reward: 31.0, Step: 31
2022-10-31 00:12:06 - r - INFO: - Current episode 50 has the best eval reward: 24.80
2022-10-31 00:12:06 - r - INFO: - Episode: 51/100, Reward: 22.0, Step: 22
2022-10-31 00:12:06 - r - INFO: - Episode: 52/100, Reward: 39.0, Step: 39
2022-10-31 00:12:06 - r - INFO: - Episode: 53/100, Reward: 36.0, Step: 36
2022-10-31 00:12:06 - r - INFO: - Episode: 54/100, Reward: 26.0, Step: 26
2022-10-31 00:12:07 - r - INFO: - Episode: 55/100, Reward: 33.0, Step: 33
2022-10-31 00:12:07 - r - INFO: - Current episode 55 has the best eval reward: 38.70
2022-10-31 00:12:07 - r - INFO: - Episode: 56/100, Reward: 56.0, Step: 56
2022-10-31 00:12:07 - r - INFO: - Episode: 57/100, Reward: 112.0, Step: 112
2022-10-31 00:12:07 - r - INFO: - Episode: 58/100, Reward: 101.0, Step: 101
2022-10-31 00:12:08 - r - INFO: - Episode: 59/100, Reward: 69.0, Step: 69
2022-10-31 00:12:08 - r - INFO: - Episode: 60/100, Reward: 75.0, Step: 75
2022-10-31 00:12:08 - r - INFO: - Episode: 61/100, Reward: 182.0, Step: 182
2022-10-31 00:12:09 - r - INFO: - Episode: 62/100, Reward: 52.0, Step: 52
2022-10-31 00:12:09 - r - INFO: - Episode: 63/100, Reward: 67.0, Step: 67
2022-10-31 00:12:09 - r - INFO: - Episode: 64/100, Reward: 53.0, Step: 53
2022-10-31 00:12:09 - r - INFO: - Episode: 65/100, Reward: 119.0, Step: 119
2022-10-31 00:12:10 - r - INFO: - Current episode 65 has the best eval reward: 171.90
2022-10-31 00:12:10 - r - INFO: - Episode: 66/100, Reward: 200.0, Step: 200
2022-10-31 00:12:10 - r - INFO: - Episode: 67/100, Reward: 74.0, Step: 74
2022-10-31 00:12:11 - r - INFO: - Episode: 68/100, Reward: 138.0, Step: 138
2022-10-31 00:12:11 - r - INFO: - Episode: 69/100, Reward: 149.0, Step: 149
2022-10-31 00:12:12 - r - INFO: - Episode: 70/100, Reward: 144.0, Step: 144
2022-10-31 00:12:12 - r - INFO: - Current episode 70 has the best eval reward: 173.70
2022-10-31 00:12:13 - r - INFO: - Episode: 71/100, Reward: 200.0, Step: 200
2022-10-31 00:12:13 - r - INFO: - Episode: 72/100, Reward: 198.0, Step: 198
2022-10-31 00:12:14 - r - INFO: - Episode: 73/100, Reward: 200.0, Step: 200
2022-10-31 00:12:14 - r - INFO: - Episode: 74/100, Reward: 200.0, Step: 200
2022-10-31 00:12:15 - r - INFO: - Episode: 75/100, Reward: 200.0, Step: 200
2022-10-31 00:12:16 - r - INFO: - Current episode 75 has the best eval reward: 200.00
2022-10-31 00:12:16 - r - INFO: - Episode: 76/100, Reward: 200.0, Step: 200
2022-10-31 00:12:17 - r - INFO: - Episode: 77/100, Reward: 200.0, Step: 200
2022-10-31 00:12:17 - r - INFO: - Episode: 78/100, Reward: 200.0, Step: 200
2022-10-31 00:12:18 - r - INFO: - Episode: 79/100, Reward: 200.0, Step: 200
2022-10-31 00:12:19 - r - INFO: - Episode: 80/100, Reward: 200.0, Step: 200
2022-10-31 00:12:19 - r - INFO: - Current episode 80 has the best eval reward: 200.00
2022-10-31 00:12:20 - r - INFO: - Episode: 81/100, Reward: 200.0, Step: 200
2022-10-31 00:12:20 - r - INFO: - Episode: 82/100, Reward: 200.0, Step: 200
2022-10-31 00:12:21 - r - INFO: - Episode: 83/100, Reward: 200.0, Step: 200
2022-10-31 00:12:21 - r - INFO: - Episode: 84/100, Reward: 200.0, Step: 200
2022-10-31 00:12:22 - r - INFO: - Episode: 85/100, Reward: 200.0, Step: 200
2022-10-31 00:12:23 - r - INFO: - Current episode 85 has the best eval reward: 200.00
2022-10-31 00:12:23 - r - INFO: - Episode: 86/100, Reward: 200.0, Step: 200
2022-10-31 00:12:24 - r - INFO: - Episode: 87/100, Reward: 200.0, Step: 200
2022-10-31 00:12:25 - r - INFO: - Episode: 88/100, Reward: 200.0, Step: 200
2022-10-31 00:12:25 - r - INFO: - Episode: 89/100, Reward: 200.0, Step: 200
2022-10-31 00:12:26 - r - INFO: - Episode: 90/100, Reward: 200.0, Step: 200
2022-10-31 00:12:27 - r - INFO: - Current episode 90 has the best eval reward: 200.00
2022-10-31 00:12:27 - r - INFO: - Episode: 91/100, Reward: 200.0, Step: 200
2022-10-31 00:12:28 - r - INFO: - Episode: 92/100, Reward: 200.0, Step: 200
2022-10-31 00:12:28 - r - INFO: - Episode: 93/100, Reward: 200.0, Step: 200
2022-10-31 00:12:29 - r - INFO: - Episode: 94/100, Reward: 200.0, Step: 200
2022-10-31 00:12:29 - r - INFO: - Episode: 95/100, Reward: 200.0, Step: 200
2022-10-31 00:12:30 - r - INFO: - Current episode 95 has the best eval reward: 200.00
2022-10-31 00:12:31 - r - INFO: - Episode: 96/100, Reward: 200.0, Step: 200
2022-10-31 00:12:31 - r - INFO: - Episode: 97/100, Reward: 200.0, Step: 200
2022-10-31 00:12:32 - r - INFO: - Episode: 98/100, Reward: 200.0, Step: 200
2022-10-31 00:12:32 - r - INFO: - Episode: 99/100, Reward: 200.0, Step: 200
2022-10-31 00:12:33 - r - INFO: - Episode: 100/100, Reward: 200.0, Step: 200
2022-10-31 00:12:33 - r - INFO: - Current episode 100 has the best eval reward: 200.00
2022-10-31 00:12:33 - r - INFO: - Finish training!

Binary file not shown.

@@ -1,101 +0,0 @@
episodes,rewards,steps
0,18.0,18
1,35.0,35
2,13.0,13
3,32.0,32
4,16.0,16
5,12.0,12
6,13.0,13
7,15.0,15
8,11.0,11
9,15.0,15
10,9.0,9
11,13.0,13
12,13.0,13
13,10.0,10
14,9.0,9
15,24.0,24
16,8.0,8
17,10.0,10
18,11.0,11
19,13.0,13
20,12.0,12
21,11.0,11
22,9.0,9
23,21.0,21
24,14.0,14
25,12.0,12
26,9.0,9
27,11.0,11
28,12.0,12
29,13.0,13
30,10.0,10
31,13.0,13
32,18.0,18
33,9.0,9
34,10.0,10
35,9.0,9
36,10.0,10
37,10.0,10
38,10.0,10
39,8.0,8
40,9.0,9
41,9.0,9
42,20.0,20
43,16.0,16
44,17.0,17
45,17.0,17
46,17.0,17
47,18.0,18
48,25.0,25
49,31.0,31
50,22.0,22
51,39.0,39
52,36.0,36
53,26.0,26
54,33.0,33
55,56.0,56
56,112.0,112
57,101.0,101
58,69.0,69
59,75.0,75
60,182.0,182
61,52.0,52
62,67.0,67
63,53.0,53
64,119.0,119
65,200.0,200
66,74.0,74
67,138.0,138
68,149.0,149
69,144.0,144
70,200.0,200
71,198.0,198
72,200.0,200
73,200.0,200
74,200.0,200
75,200.0,200
76,200.0,200
77,200.0,200
78,200.0,200
79,200.0,200
80,200.0,200
81,200.0,200
82,200.0,200
83,200.0,200
84,200.0,200
85,200.0,200
86,200.0,200
87,200.0,200
88,200.0,200
89,200.0,200
90,200.0,200
91,200.0,200
92,200.0,200
93,200.0,200
94,200.0,200
95,200.0,200
96,200.0,200
97,200.0,200
98,200.0,200
99,200.0,200

@@ -1,22 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: Acrobot-v1
mode: test
load_checkpoint: true
load_path: Train_Acrobot-v1_DQN_20221026-094645
max_steps: 100000
save_fig: true
seed: 1
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 128
buffer_size: 200000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
lr: 0.002
target_update: 4

@@ -1,22 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: Acrobot-v1
mode: train
load_checkpoint: false
load_path: Train_CartPole-v1_DQN_20221026-054757
max_steps: 100000
save_fig: true
seed: 1
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 128
buffer_size: 200000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
lr: 0.002
target_update: 4

@@ -1,22 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: CartPole-v1
mode: test
load_checkpoint: true
load_path: Train_CartPole-v1_DQN_20221031-001201
max_steps: 200
save_fig: true
seed: 0
show_fig: false
test_eps: 10
train_eps: 100
algo_cfg:
batch_size: 64
buffer_size: 100000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
lr: 0.0001
target_update: 4

@@ -1,22 +0,0 @@
general_cfg:
algo_name: DQN
device: cuda
env_name: CartPole-v1
mode: train
load_checkpoint: false
load_path: Train_CartPole-v1_DQN_20221026-054757
max_steps: 200
save_fig: true
seed: 0
show_fig: false
test_eps: 10
train_eps: 200
algo_cfg:
batch_size: 64
buffer_size: 100000
epsilon_decay: 500
epsilon_end: 0.01
epsilon_start: 0.95
gamma: 0.95
lr: 0.0001
target_update: 4
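
These dumped config files mirror the default parameter classes used by the scripts. A minimal sketch of reading one back with plain PyYAML (this is not the repo's own Launcher/config machinery, and the file name is only a placeholder):

```python
import yaml

# "config.yaml" stands in for one of the saved config files above
with open("config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# the files have two top-level sections: general_cfg and algo_cfg
general_cfg, algo_cfg = cfg["general_cfg"], cfg["algo_cfg"]
print(general_cfg["env_name"], general_cfg["mode"], algo_cfg["lr"], algo_cfg["target_update"])
```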

@@ -1,38 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-30 00:37:33
LastEditor: JiangJi
LastEditTime: 2022-10-31 00:11:57
Discription: default parameters of DQN
'''
from common.config import GeneralConfig,AlgoConfig
class GeneralConfigDQN(GeneralConfig):
def __init__(self) -> None:
self.env_name = "CartPole-v1" # name of environment
self.algo_name = "DQN" # name of algorithm
self.mode = "train" # train or test
self.seed = 1 # random seed
self.device = "cuda" # device to use
self.train_eps = 100 # number of episodes for training
self.test_eps = 10 # number of episodes for testing
self.max_steps = 200 # max steps for each episode
self.load_checkpoint = False
self.load_path = "tasks" # path to load model
self.show_fig = False # show figure or not
self.save_fig = True # save figure or not
class AlgoConfigDQN(AlgoConfig):
def __init__(self) -> None:
# setting epsilon_start equal to epsilon_end yields a fixed epsilon equal to epsilon_end
self.epsilon_start = 0.95 # epsilon start value
self.epsilon_end = 0.01 # epsilon end value
self.epsilon_decay = 500 # epsilon decay rate
self.hidden_dim = 256 # hidden_dim for MLP
self.gamma = 0.95 # discount factor
self.lr = 0.0001 # learning rate
self.buffer_size = 100000 # size of replay buffer
self.batch_size = 64 # batch size
self.target_update = 800 # target network update frequency (in steps)
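
Since the decay schedule is easy to misread, here is a minimal sketch (standard library only) of the exponential epsilon decay that the agent applies with the defaults above, evaluated at a few sample counts; the helper name `epsilon_at` is just for illustration:

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 0.95, 0.01, 500  # the defaults above

def epsilon_at(sample_count: int) -> float:
    # same exponential schedule as DQN.sample_action
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-sample_count / epsilon_decay)

for n in (0, 500, 1000, 2000, 5000):
    print(n, round(epsilon_at(n), 3))
# roughly 0.95, 0.356, 0.137, 0.027, 0.01 -- exploration is nearly gone after a few thousand sampled actions
```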

@@ -1,130 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2022-10-31 00:07:19
@Discription:
@Environment: python 3.7.7
'''
'''off-policy
'''
import torch
import torch.nn as nn
import torch.optim as optim
import copy
import random
import math
import numpy as np
class DQN:
def __init__(self,model,memory,cfg):
self.n_actions = cfg.n_actions
self.device = torch.device(cfg.device)
self.gamma = cfg.gamma
## e-greedy parameters
self.sample_count = 0 # sample count for epsilon decay
self.epsilon = cfg.epsilon_start
self.epsilon_start = cfg.epsilon_start
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.batch_size = cfg.batch_size
self.target_update = cfg.target_update
self.policy_net = model.to(self.device)
self.target_net = copy.deepcopy(model).to(self.device) # deep copy, otherwise the policy and target nets would share the same parameters
## copy parameters from policy net to target net
for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()):
target_param.data.copy_(param.data)
# self.target_net.load_state_dict(self.policy_net.state_dict()) # or use this to copy parameters
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.memory = memory
self.update_flag = False
def sample_action(self, state):
''' sample action with e-greedy policy
'''
self.sample_count += 1
# epsilon must decay (linearly, exponentially, etc.) to balance exploration and exploitation
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.sample_count / self.epsilon_decay)
if random.random() > self.epsilon:
with torch.no_grad():
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
q_values = self.policy_net(state)
action = q_values.max(1)[1].item() # choose action corresponding to the maximum q value
else:
action = random.randrange(self.n_actions)
return action
# @torch.no_grad()
# def sample_action(self, state):
# ''' sample action with e-greedy policy
# '''
# self.sample_count += 1
# # epsilon must decay(linear,exponential and etc.) for balancing exploration and exploitation
# self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
# math.exp(-1. * self.sample_count / self.epsilon_decay)
# if random.random() > self.epsilon:
# state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
# q_values = self.policy_net(state)
# action = q_values.max(1)[1].item() # choose action corresponding to the maximum q value
# else:
# action = random.randrange(self.n_actions)
# return action
def predict_action(self,state):
''' predict action
'''
with torch.no_grad():
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
q_values = self.policy_net(state)
action = q_values.max(1)[1].item() # choose action corresponding to the maximum q value
return action
def update(self):
if len(self.memory) < self.batch_size: # do not update until the replay buffer holds at least one batch of transitions
return
else:
if not self.update_flag:
print("Begin to update!")
self.update_flag = True
# sample a batch of transitions from replay buffer
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(
self.batch_size)
state_batch = torch.tensor(np.array(state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states)
action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(1) # shape(batchsize,1)
reward_batch = torch.tensor(reward_batch, device=self.device, dtype=torch.float).unsqueeze(1) # shape(batchsize,1)
next_state_batch = torch.tensor(np.array(next_state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states)
done_batch = torch.tensor(np.float32(done_batch), device=self.device).unsqueeze(1) # shape(batchsize,1)
# print(state_batch.shape,action_batch.shape,reward_batch.shape,next_state_batch.shape,done_batch.shape)
# compute current Q(s_t,a_t) from the policy net; the target 'y_j' in the pseudocode is expected_q_value_batch below
q_value_batch = self.policy_net(state_batch).gather(dim=1, index=action_batch) # shape(batchsize,1),requires_grad=True
# print(q_values.requires_grad)
# compute max over actions of Q(s_t+1,a); next_max_q_value comes from the target net and is treated as a constant in the update formula below, so it is detached (requires_grad=False)
next_max_q_value_batch = self.target_net(next_state_batch).max(1)[0].detach().unsqueeze(1)
# print(q_values.shape,next_q_values.shape)
# compute the expected Q value; for terminal states done_batch is 1, so expected_q_value equals the reward
expected_q_value_batch = reward_batch + self.gamma * next_max_q_value_batch* (1-done_batch)
# print(expected_q_value_batch.shape,expected_q_value_batch.requires_grad)
loss = nn.MSELoss()(q_value_batch, expected_q_value_batch) # both tensors have shape (batchsize,1)
# backpropagation
self.optimizer.zero_grad()
loss.backward()
# clip to avoid gradient explosion
for param in self.policy_net.parameters():
param.grad.data.clamp_(-1, 1)
self.optimizer.step()
if self.sample_count % self.target_update == 0: # target net update, target_update means "C" in the pseudocode
self.target_net.load_state_dict(self.policy_net.state_dict())
def save_model(self, fpath):
from pathlib import Path
# create path
Path(fpath).mkdir(parents=True, exist_ok=True)
torch.save(self.target_net.state_dict(), f"{fpath}/checkpoint.pt")
def load_model(self, fpath):
self.target_net.load_state_dict(torch.load(f"{fpath}/checkpoint.pt"))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)

@@ -1,138 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-12 11:09:54
LastEditor: JiangJi
LastEditTime: 2022-10-31 00:13:31
Discription: CartPole-v1,Acrobot-v1
'''
import sys,os
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add to system path
import gym
from common.utils import all_seed,merge_class_attrs
from common.models import MLP
from common.memories import ReplayBuffer
from common.launcher import Launcher
from envs.register import register_env
from dqn import DQN
from config.config import GeneralConfigDQN,AlgoConfigDQN
class Main(Launcher):
def __init__(self) -> None:
super().__init__()
self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigDQN())
self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigDQN())
def env_agent_config(self,cfg,logger):
''' create env and agent
'''
register_env(cfg.env_name)
env = gym.make(cfg.env_name,new_step_api=True) # create env
if cfg.seed !=0: # set random seed
all_seed(env,seed=cfg.seed)
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
# update cfg parameters
setattr(cfg, 'n_states', n_states)
setattr(cfg, 'n_actions', n_actions)
# cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
model = MLP(n_states,n_actions,hidden_dim=cfg.hidden_dim)
memory = ReplayBuffer(cfg.buffer_size) # replay buffer
agent = DQN(model,memory,cfg) # create agent
return env, agent
def train_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
ep_step += 1
action = agent.sample_action(state) # sample action
next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions under new_step_api of OpenAI Gym
agent.memory.push(state, action, reward,
next_state, terminated) # save transitions
agent.update() # update agent
state = next_state # update next state for env
ep_reward += reward #
if terminated:
break
return agent,ep_reward,ep_step
def test_one_episode(self, env, agent, cfg):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg.max_steps):
ep_step += 1
action = agent.predict_action(state) # sample action
next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions under new_step_api of OpenAI Gym
state = next_state # update next state for env
ep_reward += reward #
if terminated:
break
return agent,ep_reward,ep_step
# def train(self,env, agent,cfg,logger):
# ''' train
# '''
# logger.info("Start training!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.train_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# ep_step += 1
# action = agent.sample_action(state) # sample action
# next_state, reward, terminated, truncated , info = env.step(action) # update env and return transitions under new_step_api of OpenAI Gym
# agent.memory.push(state, action, reward,
# next_state, terminated) # save transitions
# state = next_state # update next state for env
# agent.update() # update agent
# ep_reward += reward #
# if terminated:
# break
# if (i_ep + 1) % cfg.target_update == 0: # target net update, target_update means "C" in pseucodes
# agent.target_net.load_state_dict(agent.policy_net.state_dict())
# steps.append(ep_step)
# rewards.append(ep_reward)
# logger.info(f'Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}: Epislon: {agent.epsilon:.3f}')
# logger.info("Finish training!")
# env.close()
# res_dic = {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
# return res_dic
# def test(self,cfg, env, agent,logger):
# logger.info("Start testing!")
# logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
# rewards = [] # record rewards for all episodes
# steps = [] # record steps for all episodes
# for i_ep in range(cfg.test_eps):
# ep_reward = 0 # reward per episode
# ep_step = 0
# state = env.reset() # reset and obtain initial state
# for _ in range(cfg.max_steps):
# ep_step+=1
# action = agent.predict_action(state) # predict action
# next_state, reward, terminated, _, _ = env.step(action)
# state = next_state
# ep_reward += reward
# if terminated:
# break
# steps.append(ep_step)
# rewards.append(ep_reward)
# logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}")
# logger.info("Finish testing!")
# env.close()
# return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

@@ -1,221 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2022-10-24 08:21:31
LastEditor: JiangJi
LastEditTime: 2022-10-26 09:50:49
Discription: Not finished
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
import torch
import datetime
import numpy as np
import argparse
from common.utils import all_seed
from common.models import MLP
from common.memories import ReplayBuffer
from common.launcher import Launcher
from envs.register import register_env
from dqn import DQN
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
resize = T.Compose([T.ToPILImage(),
T.Resize(40, interpolation=Image.CUBIC),
T.ToTensor()])
# xvfb-run -s "-screen 0 640x480x24" python main1.py
def get_cart_location(env,screen_width):
world_width = env.x_threshold * 2
scale = screen_width / world_width
return int(env.state[0] * scale + screen_width / 2.0) # MIDDLE OF CART
def get_screen(env):
# Returned screen requested by gym is 400x600x3, but is sometimes larger
# such as 800x1200x3. Transpose it into torch order (CHW).
screen = env.render().transpose((2, 0, 1))
# Cart is in the lower half, so strip off the top and bottom of the screen
_, screen_height, screen_width = screen.shape
screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
view_width = int(screen_width * 0.6)
cart_location = get_cart_location(env,screen_width)
if cart_location < view_width // 2:
slice_range = slice(view_width)
elif cart_location > (screen_width - view_width // 2):
slice_range = slice(-view_width, None)
else:
slice_range = slice(cart_location - view_width // 2,
cart_location + view_width // 2)
# Strip off the edges, so that we have a square image centered on a cart
screen = screen[:, :, slice_range]
# Convert to float, rescale, convert to torch tensor
# (this doesn't require a copy)
screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
screen = torch.from_numpy(screen)
# Resize, and add a batch dimension (BCHW)
return resize(screen)
class CNN(nn.Module):
def __init__(self, h, w, outputs):
super(CNN, self).__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
self.bn1 = nn.BatchNorm2d(16)
self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
self.bn2 = nn.BatchNorm2d(32)
self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
self.bn3 = nn.BatchNorm2d(32)
# Number of Linear input connections depends on output of conv2d layers
# and therefore the input image size, so compute it.
def conv2d_size_out(size, kernel_size = 5, stride = 2):
return (size - (kernel_size - 1) - 1) // stride + 1
convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
linear_input_size = convw * convh * 32
self.head = nn.Linear(linear_input_size, outputs)
# Called with either one element to determine next action, or a batch
# during optimization. Returns tensor([[left0exp,right0exp]...]).
def forward(self, x):
x = F.relu(self.bn1(self.conv1(x)))
x = F.relu(self.bn2(self.conv2(x)))
x = F.relu(self.bn3(self.conv3(x)))
return self.head(x.view(x.size(0), -1))
class Main(Launcher):
def get_args(self):
""" hyperparameters
"""
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='DQN',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CartPole-v1',type=str,help="name of environment")
parser.add_argument('--train_eps',default=800,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
parser.add_argument('--ep_max_steps',default = 100000,type=int,help="steps per episode, much larger value can simulate infinite steps")
parser.add_argument('--gamma',default=0.999,type=float,help="discounted factor")
parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon")
parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon")
parser.add_argument('--epsilon_decay',default=500,type=int,help="decay rate of epsilon, the higher value, the slower decay")
parser.add_argument('--lr',default=0.0001,type=float,help="learning rate")
parser.add_argument('--memory_capacity',default=100000,type=int,help="memory capacity")
parser.add_argument('--batch_size',default=128,type=int)
parser.add_argument('--target_update',default=4,type=int)
parser.add_argument('--hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cuda',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=10,type=int,help="seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
# please manually change the following args in this script if you want
parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/results' )
parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/models' )
args = parser.parse_args()
args = {**vars(args)} # type(dict)
return args
def env_agent_config(self,cfg):
''' create env and agent
'''
env = gym.make('CartPole-v1', new_step_api=True, render_mode='single_rgb_array').unwrapped
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
print(f"n_states: {n_states}, n_actions: {n_actions}")
cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
env.reset()
init_screen = get_screen(env)
_, screen_height, screen_width = init_screen.shape
model = CNN(screen_height, screen_width, n_actions)
memory = ReplayBuffer(cfg["memory_capacity"]) # replay buffer
agent = DQN(model,memory,cfg) # create agent
return env, agent
def train(self,cfg, env, agent):
''' train
'''
print("Start training!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = []
for i_ep in range(cfg["train_eps"]):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
last_screen = get_screen(env)
current_screen = get_screen(env)
state = current_screen - last_screen
for _ in range(cfg['ep_max_steps']):
ep_step += 1
action = agent.sample_action(state) # sample action
_, reward, done, _,_ = env.step(action) # update env and return transitions
last_screen = current_screen
current_screen = get_screen(env)
next_state = current_screen - last_screen
agent.memory.push(state.cpu().numpy(), action, reward,
next_state.cpu().numpy(), done) # save transitions
state = next_state # update next state for env
agent.update() # update agent
ep_reward += reward #
if done:
break
if (i_ep + 1) % cfg["target_update"] == 0: # target net update, target_update means "C" in pseucodes
agent.target_net.load_state_dict(agent.policy_net.state_dict())
steps.append(ep_step)
rewards.append(ep_reward)
if (i_ep + 1) % 10 == 0:
print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Step: {ep_step:d}, Epsilon: {agent.epsilon:.3f}')
print("Finish training!")
env.close()
res_dic = {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
return res_dic
def test(self,cfg, env, agent):
print("Start testing!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = []
for i_ep in range(cfg['test_eps']):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
last_screen = get_screen(env)
current_screen = get_screen(env)
state = current_screen - last_screen
for _ in range(cfg['ep_max_steps']):
ep_step+=1
action = agent.predict_action(state) # predict action
_, reward, done, _,_ = env.step(action)
last_screen = current_screen
current_screen = get_screen(env)
next_state = current_screen - last_screen
state = next_state
ep_reward += reward
if done:
break
steps.append(ep_step)
rewards.append(ep_reward)
print(f"Episode: {i_ep+1}/{cfg['test_eps']}Reward: {ep_reward:.2f}")
print("Finish testing!")
env.close()
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

@@ -1,39 +0,0 @@
Before working through this section you should be familiar with the basic DQN algorithm; see [DQN in practice](../DQN).
## Principle
Double-DQN was proposed in 2016 and was inspired by Double Q-learning (2010); see the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461).
Like Nature DQN, Double-DQN uses two networks: a current network (denoted $Q$) and a target network (usually denoted $Q'$; to keep the two apart we write $Q_{tar}$ below). Recall that for non-terminal states the target $Q_{tar}$ value is computed as follows:
![Nature DQN target](assets/20201222145725907.png)
In Double-DQN we no longer take the maximum $Q_{tar}$ value over actions directly from the target network; instead we first pick the action with the largest $Q$ value from the current network and then evaluate that action with the target network:
![Double DQN target](assets/20201222150225327.png)
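The two formula screenshots above are not reproduced in this text view, so here is the standard textbook form they correspond to, where $y_j$ is the target, $r_j$ the reward, $\gamma$ the discount factor and $s_{j+1}$ the next state:

$$y_j = r_j + \gamma \max_{a'} Q_{tar}(s_{j+1}, a') \qquad \text{(Nature DQN)}$$

$$y_j = r_j + \gamma\, Q_{tar}\big(s_{j+1}, \arg\max_{a'} Q(s_{j+1}, a')\big) \qquad \text{(Double DQN)}$$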
The point of Double-DQN: the max operator in Nature DQN drives the Q values quickly towards a plausible optimization target, but it easily overshoots and causes over-estimation, i.e. the learned model ends up with a large bias. To fix this, DDQN decouples the selection of the action used in the target from the evaluation of that target Q value, which removes the over-estimation; see the original paper for details.
The pseudocode is as follows:
![Double DQN pseudocode](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70.png)
Of course, the two networks can also act as current network and target network for each other, as follows:
![Alternative pseudocode](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70-20210328110837146.png)
Or, perhaps easier to follow, like this:
![Alternative pseudocode, second version](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70-20210328110837157.png)
## Code
The full program is on [github](https://github.com/JohnJim0816/reinforcement-learning-tutorials/tree/master/DoubleDQN). Given the principle above, the Double DQN change is actually very small: only a few lines in ```update``` need to be modified, as follows:
```python
'''This is how Nature DQN computes q_target:
next_q_state_value = self.target_net(
next_state_batch).max(1)[0].detach() # max of Q'(s_{t+1}) over actions for all next states, where Q' is the target network, e.g. tensor([ 0.0060, -0.0171,...,])
# compute q_target
# for terminal states done_batch[0]=1, so the corresponding expected_q_value equals reward
q_target = reward_batch + self.gamma * next_q_state_value * (1-done_batch[0])
'''
'''This is how Double DQN computes q_target, slightly different from Nature DQN:'''
next_target_values = self.target_net(
next_state_batch)
# pick the action that maximizes Q(s_{t+1}, a) under the policy net, then index next_target_values with it to get the target net's next_q_value, i.e. Q_tar(s_{t+1}, argmax_a Q(s_{t+1}, a))
next_target_q_value = next_target_values.gather(1, torch.max(next_q_values, 1)[1].unsqueeze(1)).squeeze(1)
q_target = reward_batch + self.gamma * next_target_q_value * (1-done_batch[0])
```
The reward curves are shown below:
![Reward comparison between Double DQN and Nature DQN](assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70-20210328110837128.png)
The lower blue and red curves show the training rewards of Double DQN and Nature DQN respectively, while the upper blue and green curves show the test rewards of Double DQN and Nature DQN.
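
The comparison figure itself is a binary file that is not included in this diff. As a minimal sketch of how such a curve could be reproduced from the `episodes,rewards,steps` result CSVs deleted in this commit (assuming pandas and matplotlib are available; the file name is only a placeholder, not a path from the repo):

```python
import pandas as pd
import matplotlib.pyplot as plt

# "res.csv" stands in for one of the results files above (columns: episodes, rewards, steps)
df = pd.read_csv("res.csv")
smoothed = df["rewards"].rolling(window=10, min_periods=1).mean()  # moving average for a smoother curve

plt.plot(df["episodes"], df["rewards"], alpha=0.3, label="raw reward")
plt.plot(df["episodes"], smoothed, label="smoothed reward")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```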

Binary file not shown.

Binary file not shown.

@@ -1,106 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2022-08-29 23:34:20
@Discription:
@Environment: python 3.7.7
'''
'''off-policy
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import copy
import random
import math
import numpy as np
class DoubleDQN:
def __init__(self,models, memories, cfg):
self.n_actions = cfg['n_actions']
self.device = torch.device(cfg['device'])
self.gamma = cfg['gamma']
## e-greedy parameters
self.sample_count = 0 # sample count for epsilon decay
self.epsilon_start = cfg['epsilon_start']
self.epsilon_end = cfg['epsilon_end']
self.epsilon_decay = cfg['epsilon_decay']
self.batch_size = cfg['batch_size']
self.policy_net = models['Qnet'].to(self.device)
self.target_net = copy.deepcopy(models['Qnet']).to(self.device) # deep copy, otherwise the policy and target nets would share the same parameters
# target_net copy from policy_net
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
target_param.data.copy_(param.data)
# self.target_net.eval() # donnot use BatchNormalization or Dropout
# the difference between parameters() and state_dict() is that parameters() require_grad=True
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg['lr'])
self.memory = memories['Memory']
self.update_flag = False
def sample_action(self, state):
''' sample action
'''
self.sample_count += 1
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * math.exp(-1. * self.sample_count / self.epsilon_decay)
if random.random() > self.epsilon:
with torch.no_grad():
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(0)
q_value = self.policy_net(state)
action = q_value.max(1)[1].item()
else:
action = random.randrange(self.n_actions)
return action
def predict_action(self, state):
''' predict action
'''
with torch.no_grad():
state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(0)
q_value = self.policy_net(state)
action = q_value.max(1)[1].item()
return action
def update(self):
if len(self.memory) < self.batch_size: # do not update until the replay buffer holds at least one batch of transitions
return
else:
if not self.update_flag:
print("Begin to update!")
self.update_flag = True
# sample a batch of transitions from replay buffer
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(self.batch_size)
# convert to tensor
state_batch = torch.tensor(np.array(state_batch), device=self.device, dtype=torch.float)
action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(1) # shape(batchsize,1)
reward_batch = torch.tensor(reward_batch, device=self.device, dtype=torch.float).unsqueeze(1) # shape(batchsize,1)
next_state_batch = torch.tensor(np.array(next_state_batch), device=self.device, dtype=torch.float)
done_batch = torch.tensor(np.float32(done_batch), device=self.device).unsqueeze(1) # shape(batchsize,1)
# compute current Q(s_t|a=a_t)
q_value_batch = self.policy_net(state_batch).gather(dim=1, index=action_batch) # shape(batchsize,1),requires_grad=True
next_q_value_batch = self.policy_net(next_state_batch)
'''The following is how Double DQN computes expected_q_value, slightly different from Nature DQN'''
next_target_value_batch = self.target_net(next_state_batch)
# select a* = argmax_a Q(s_t+1, a) with the policy net, then gather the target net's value for it, i.e. Q_tar(s_t+1, a*)
next_target_q_value_batch = next_target_value_batch.gather(1, torch.max(next_q_value_batch, 1)[1].unsqueeze(1)) # shape(batchsize,1)
expected_q_value_batch = reward_batch + self.gamma * next_target_q_value_batch * (1-done_batch)
loss = nn.MSELoss()(q_value_batch , expected_q_value_batch)
self.optimizer.zero_grad()
loss.backward()
# clip to avoid gradient explosion
for param in self.policy_net.parameters():
param.grad.data.clamp_(-1, 1)
self.optimizer.step()
def save_model(self,path):
from pathlib import Path
# create path
Path(path).mkdir(parents=True, exist_ok=True)
torch.save(self.target_net.state_dict(), path+'checkpoint.pth')
def load_model(self,path):
self.target_net.load_state_dict(torch.load(path+'checkpoint.pth'))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)

@@ -1,129 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2021-11-07 18:10:37
LastEditor: JiangJi
LastEditTime: 2022-08-29 23:33:31
Discription:
'''
import sys,os
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add to system path
import gym
import datetime
import argparse
from common.utils import all_seed
from common.models import MLP
from common.memories import ReplayBufferQue
from DoubleDQN.double_dqn import DoubleDQN
from common.launcher import Launcher
from envs.register import register_env
class Main(Launcher):
def get_args(self):
''' hyperparameters
'''
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='DoubleDQN',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CartPole-v0',type=str,help="name of environment")
parser.add_argument('--train_eps',default=200,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
parser.add_argument('--ep_max_steps',default = 100000,type=int,help="steps per episode, much larger value can simulate infinite steps")
parser.add_argument('--gamma',default=0.95,type=float,help="discounted factor")
parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon")
parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon")
parser.add_argument('--epsilon_decay',default=500,type=int,help="decay rate of epsilon")
parser.add_argument('--lr',default=0.0001,type=float,help="learning rate")
parser.add_argument('--memory_capacity',default=100000,type=int,help="memory capacity")
parser.add_argument('--batch_size',default=64,type=int)
parser.add_argument('--target_update',default=4,type=int)
parser.add_argument('--hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=1,type=int,help="seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
args = parser.parse_args()
default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
}
args = {**vars(args),**default_args} # type(dict)
return args
def env_agent_config(self,cfg):
''' create env and agent
'''
register_env(cfg['env_name'])
env = gym.make(cfg['env_name'])
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
try: # state dimension
n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
except AttributeError:
n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
n_actions = env.action_space.n # action dimension
print(f"n_states: {n_states}, n_actions: {n_actions}")
cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
models = {'Qnet':MLP(n_states,n_actions,hidden_dim=cfg['hidden_dim'])}
memories = {'Memory':ReplayBufferQue(cfg['memory_capacity'])}
agent = DoubleDQN(models,memories,cfg)
return env,agent
def train(self,cfg,env,agent):
print("Start training!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = []
for i_ep in range(cfg["train_eps"]):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg['ep_max_steps']):
ep_step += 1 # count the steps actually taken in this episode
action = agent.sample_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push((state, action, reward, next_state, done))
state = next_state
agent.update()
if done:
break
if i_ep % cfg['target_update'] == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
steps.append(ep_step)
rewards.append(ep_reward)
if (i_ep+1)%10 == 0:
print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Epsilon: {agent.epsilon:.3f}')
print("Finish training!")
env.close()
res_dic = {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
return res_dic
def test(self,cfg,env,agent):
print("Start testing!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = []
for i_ep in range(cfg['test_eps']):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
for _ in range(cfg['ep_max_steps']):
ep_step += 1 # count the steps actually taken in this episode
action = agent.predict_action(state)
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
if done:
break
steps.append(ep_step)
rewards.append(ep_reward)
print(f"Episode: {i_ep+1}/{cfg['test_eps']}Reward: {ep_reward:.2f}")
print("Finish testing!")
env.close()
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
main = Main()
main.run()

@@ -1 +0,0 @@
{"algo_name": "DoubleDQN", "env_name": "CartPole-v0", "train_eps": 200, "test_eps": 20, "ep_max_steps": 100000, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 500, "lr": 0.0001, "memory_capacity": 100000, "batch_size": 64, "target_update": 4, "hidden_dim": 256, "device": "cpu", "seed": 1, "show_fig": false, "save_fig": true, "result_path": "c:\\Users\\24438\\Desktop\\rl-tutorials\\codes\\DoubleDQN/outputs/CartPole-v0/20220829-233435/results/", "model_path": "c:\\Users\\24438\\Desktop\\rl-tutorials\\codes\\DoubleDQN/outputs/CartPole-v0/20220829-233435/models/", "n_states": 4, "n_actions": 2}

Binary file not shown.

@@ -1,21 +0,0 @@
episodes,rewards,steps
0,145.0,0
1,166.0,0
2,171.0,0
3,200.0,0
4,139.0,0
5,200.0,0
6,200.0,0
7,141.0,0
8,200.0,0
9,187.0,0
10,166.0,0
11,172.0,0
12,121.0,0
13,200.0,0
14,200.0,0
15,149.0,0
16,128.0,0
17,200.0,0
18,178.0,0
19,185.0,0

Binary file not shown.

@@ -1,201 +0,0 @@
episodes,rewards,steps
0,19.0,0
1,16.0,0
2,17.0,0
3,11.0,0
4,10.0,0
5,27.0,0
6,16.0,0
7,9.0,0
8,20.0,0
9,21.0,0
10,15.0,0
11,10.0,0
12,14.0,0
13,37.0,0
14,12.0,0
15,10.0,0
16,27.0,0
17,33.0,0
18,19.0,0
19,13.0,0
20,26.0,0
21,15.0,0
22,29.0,0
23,11.0,0
24,20.0,0
25,23.0,0
26,23.0,0
27,26.0,0
28,17.0,0
29,33.0,0
30,16.0,0
31,48.0,0
32,48.0,0
33,69.0,0
34,58.0,0
35,24.0,0
36,18.0,0
37,28.0,0
38,12.0,0
39,12.0,0
40,18.0,0
41,12.0,0
42,13.0,0
43,21.0,0
44,30.0,0
45,32.0,0
46,22.0,0
47,18.0,0
48,12.0,0
49,12.0,0
50,20.0,0
51,32.0,0
52,15.0,0
53,100.0,0
54,26.0,0
55,25.0,0
56,18.0,0
57,15.0,0
58,35.0,0
59,12.0,0
60,65.0,0
61,27.0,0
62,29.0,0
63,22.0,0
64,83.0,0
65,24.0,0
66,28.0,0
67,15.0,0
68,43.0,0
69,13.0,0
70,22.0,0
71,46.0,0
72,14.0,0
73,32.0,0
74,44.0,0
75,53.0,0
76,31.0,0
77,51.0,0
78,61.0,0
79,30.0,0
80,36.0,0
81,30.0,0
82,48.0,0
83,26.0,0
84,27.0,0
85,43.0,0
86,20.0,0
87,87.0,0
88,71.0,0
89,43.0,0
90,57.0,0
91,40.0,0
92,37.0,0
93,43.0,0
94,31.0,0
95,45.0,0
96,47.0,0
97,52.0,0
98,48.0,0
99,98.0,0
100,49.0,0
101,98.0,0
102,68.0,0
103,70.0,0
104,74.0,0
105,73.0,0
106,127.0,0
107,92.0,0
108,70.0,0
109,97.0,0
110,66.0,0
111,112.0,0
112,138.0,0
113,81.0,0
114,74.0,0
115,153.0,0
116,113.0,0
117,88.0,0
118,138.0,0
119,200.0,0
120,84.0,0
121,123.0,0
122,158.0,0
123,171.0,0
124,137.0,0
125,143.0,0
126,170.0,0
127,127.0,0
128,118.0,0
129,200.0,0
130,189.0,0
131,149.0,0
132,137.0,0
133,115.0,0
134,153.0,0
135,136.0,0
136,140.0,0
137,169.0,0
138,187.0,0
139,200.0,0
140,196.0,0
141,200.0,0
142,200.0,0
143,137.0,0
144,200.0,0
145,185.0,0
146,200.0,0
147,164.0,0
148,200.0,0
149,143.0,0
150,143.0,0
151,112.0,0
152,192.0,0
153,200.0,0
154,144.0,0
155,188.0,0
156,200.0,0
157,133.0,0
158,200.0,0
159,143.0,0
160,158.0,0
161,161.0,0
162,169.0,0
163,176.0,0
164,200.0,0
165,149.0,0
166,156.0,0
167,200.0,0
168,200.0,0
169,200.0,0
170,134.0,0
171,171.0,0
172,200.0,0
173,200.0,0
174,200.0,0
175,194.0,0
176,200.0,0
177,138.0,0
178,159.0,0
179,187.0,0
180,200.0,0
181,192.0,0
182,200.0,0
183,200.0,0
184,200.0,0
185,173.0,0
186,200.0,0
187,178.0,0
188,176.0,0
189,196.0,0
190,200.0,0
191,195.0,0
192,158.0,0
193,156.0,0
194,200.0,0
195,200.0,0
196,200.0,0
197,200.0,0
198,193.0,0
199,200.0,0

@@ -1 +0,0 @@
{"algo_name": "DoubleDQN", "env_name": "CartPole-v0", "train_eps": 200, "test_eps": 20, "ep_max_steps": 100000, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 500, "lr": 0.0001, "memory_capacity": 100000, "batch_size": 64, "target_update": 4, "hidden_dim": 256, "device": "cuda", "seed": 1, "show_fig": false, "save_fig": true, "result_path": "C:\\Users\\24438\\Desktop\\rl-tutorials\\codes\\DoubleDQN/outputs/CartPole-v0/20220829-233635/results/", "model_path": "C:\\Users\\24438\\Desktop\\rl-tutorials\\codes\\DoubleDQN/outputs/CartPole-v0/20220829-233635/models/", "n_states": 4, "n_actions": 2}

Binary file not shown.

@@ -1,21 +0,0 @@
episodes,rewards,steps
0,200.0,0
1,200.0,0
2,200.0,0
3,200.0,0
4,191.0,0
5,200.0,0
6,200.0,0
7,179.0,0
8,200.0,0
9,200.0,0
10,200.0,0
11,190.0,0
12,147.0,0
13,197.0,0
14,200.0,0
15,200.0,0
16,167.0,0
17,200.0,0
18,200.0,0
19,200.0,0

Binary file not shown.

@@ -1,201 +0,0 @@
episodes,rewards,steps
0,19.0,0
1,16.0,0
2,17.0,0
3,11.0,0
4,10.0,0
5,27.0,0
6,55.0,0
7,17.0,0
8,23.0,0
9,9.0,0
10,17.0,0
11,14.0,0
12,17.0,0
13,12.0,0
14,14.0,0
15,16.0,0
16,27.0,0
17,36.0,0
18,17.0,0
19,17.0,0
20,21.0,0
21,23.0,0
22,13.0,0
23,12.0,0
24,17.0,0
25,26.0,0
26,25.0,0
27,17.0,0
28,10.0,0
29,16.0,0
30,14.0,0
31,19.0,0
32,23.0,0
33,37.0,0
34,29.0,0
35,22.0,0
36,29.0,0
37,15.0,0
38,16.0,0
39,18.0,0
40,23.0,0
41,16.0,0
42,26.0,0
43,13.0,0
44,24.0,0
45,39.0,0
46,23.0,0
47,32.0,0
48,123.0,0
49,18.0,0
50,39.0,0
51,17.0,0
52,28.0,0
53,34.0,0
54,26.0,0
55,61.0,0
56,28.0,0
57,16.0,0
58,45.0,0
59,41.0,0
60,49.0,0
61,18.0,0
62,40.0,0
63,24.0,0
64,37.0,0
65,26.0,0
66,51.0,0
67,17.0,0
68,152.0,0
69,17.0,0
70,29.0,0
71,37.0,0
72,15.0,0
73,55.0,0
74,152.0,0
75,23.0,0
76,45.0,0
77,30.0,0
78,39.0,0
79,20.0,0
80,53.0,0
81,49.0,0
82,71.0,0
83,115.0,0
84,41.0,0
85,52.0,0
86,52.0,0
87,36.0,0
88,84.0,0
89,122.0,0
90,49.0,0
91,200.0,0
92,67.0,0
93,87.0,0
94,183.0,0
95,132.0,0
96,76.0,0
97,200.0,0
98,200.0,0
99,200.0,0
100,200.0,0
101,200.0,0
102,106.0,0
103,192.0,0
104,111.0,0
105,95.0,0
106,200.0,0
107,200.0,0
108,148.0,0
109,200.0,0
110,97.0,0
111,200.0,0
112,200.0,0
113,105.0,0
114,135.0,0
115,200.0,0
116,144.0,0
117,156.0,0
118,200.0,0
119,200.0,0
120,166.0,0
121,200.0,0
122,200.0,0
123,200.0,0
124,200.0,0
125,200.0,0
126,200.0,0
127,158.0,0
128,139.0,0
129,200.0,0
130,200.0,0
131,200.0,0
132,200.0,0
133,122.0,0
134,200.0,0
135,188.0,0
136,200.0,0
137,183.0,0
138,200.0,0
139,200.0,0
140,200.0,0
141,200.0,0
142,200.0,0
143,158.0,0
144,200.0,0
145,200.0,0
146,200.0,0
147,191.0,0
148,200.0,0
149,194.0,0
150,178.0,0
151,200.0,0
152,200.0,0
153,200.0,0
154,162.0,0
155,200.0,0
156,200.0,0
157,128.0,0
158,200.0,0
159,184.0,0
160,194.0,0
161,200.0,0
162,200.0,0
163,200.0,0
164,200.0,0
165,160.0,0
166,163.0,0
167,200.0,0
168,200.0,0
169,200.0,0
170,141.0,0
171,200.0,0
172,200.0,0
173,200.0,0
174,200.0,0
175,200.0,0
176,200.0,0
177,157.0,0
178,164.0,0
179,200.0,0
180,200.0,0
181,200.0,0
182,200.0,0
183,200.0,0
184,200.0,0
185,193.0,0
186,182.0,0
187,200.0,0
188,200.0,0
189,200.0,0
190,200.0,0
191,200.0,0
192,174.0,0
193,178.0,0
194,200.0,0
195,200.0,0
196,200.0,0
197,200.0,0
198,200.0,0
199,200.0,0

Some files were not shown because too many files have changed in this diff.