hot update

johnjim0816
2022-08-24 11:33:06 +08:00
parent ad65dd17cd
commit 62a7364c72
40 changed files with 2129 additions and 179 deletions

View File

@@ -23,52 +23,71 @@
Note: clicking an algorithm name jumps to the corresponding implementation under [codes](./codes/); for other versions, please browse the repository yourself.
| Algorithm | Reference | Environment | Notes |
| :-----------------------: | :----------------------------------------------------------: | :--: | :--: |
| | | | |
| DQN-CNN | | | TBD |
| [SoftQ](codes/SoftQ) | [Soft Q-learning paper](https://arxiv.org/abs/1702.08165) | | |
| [SAC](codes/SAC) | [SAC paper](https://arxiv.org/pdf/1812.05905.pdf) | | |
| [SAC-Discrete](codes/SAC) | [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf) | | |
| SAC-V | [SAC-V paper](https://arxiv.org/abs/1801.01290) | | |
| DSAC | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | | TBD |
| Algorithm | Reference | Notes |
| :-----------------------: | :----------------------------------------------------------: | :--: |
| | | |
| DQN-CNN | | TBD |
| [SoftQ](codes/SoftQ) | [Soft Q-learning paper](https://arxiv.org/abs/1702.08165) | |
| [SAC](codes/SAC) | [SAC paper](https://arxiv.org/pdf/1812.05905.pdf) | |
| [SAC-Discrete](codes/SAC) | [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf) | |
| SAC-S | [SAC-S paper](https://arxiv.org/abs/1801.01290) | |
| DSAC | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | TBD |
## 3. Algorithm Environments
For a description of the algorithm environments, see [env](./codes/envs/README.md).
## 3. Runtime Environment
## 4. Runtime Environment
Python 3.7, PyTorch 1.10.0, Gym 0.21.0
Main dependencies: Python 3.7, PyTorch 1.10.0, Gym 0.21.0
Run the following commands from the project root to reproduce the environment:
### 4.1. Create a Conda Environment
```bash
conda create -n easyrl python=3.7
conda activate easyrl # activate the environment
```
### 4.2. Install Torch
Install the CPU version:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cpuonly -c pytorch
```
Install the CUDA version:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
```
If the Torch installation needs a mirror for speed, open the [Tsinghua mirror](https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/), select your operating system (e.g. ```win-64```), copy the link, and run:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/win-64/
```
You can also install with pip (CUDA version only):
```bash
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
### 4.3. Install Other Dependencies
From the project root, run:
```bash
pip install -r requirements.txt
```
If you need CUDA, also install ```cudatoolkit``` separately; CUDA ```10.2``` or ```11.3``` is recommended, for example:
```bash
conda install cudatoolkit=11.3 -c pytorch
```
If conda needs a mirror for faster installation, open the [Tsinghua mirror](https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/), select your operating system (e.g. ```win-64```), copy the link, and run:
```bash
conda install cudatoolkit=11.3 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/win-64/
```
Run the following Python snippet; if it prints True, CUDA is installed correctly:
### 4.4. Verify the CUDA Build of Torch
Skip this step for the CPU build. Run the following Python snippet; if it prints True, the CUDA build is installed correctly:
```python
import torch
print(torch.cuda.is_available())
```
If it still fails, you can install with pip:
```bash
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
## 4. Usage
## 5. Usage
For [codes](./codes/):
* Run the py scripts with task in their names
* Run the corresponding ```main.py``` script
* Run the matching Bash script under [scripts](codes\scripts), e.g. ```sh codes/scripts/DQN_task0.sh```. Creating a conda environment named "easyrl" is recommended; otherwise you need to edit the sh scripts accordingly. On Windows, install Git (keep the default install path, otherwise VS Code may not detect Git Bash) and use the Git Bash terminal rather than PowerShell or cmd.
For [Jupyter Notebook](./notebooks/):
* Just run the corresponding ipynb file
## 5. Friendly Tips
## 6. Friendly Tips
VS Code is recommended for getting started with the project; see the [VS Code getting-started guide](https://blog.csdn.net/JohnJim0/article/details/126366454).

View File

@@ -28,6 +28,8 @@
\@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{6}{algorithm.}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {6}SoftQ算法}{7}{section.6}\protected@file@percent }
\@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{7}{algorithm.}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {7}SAC算法}{8}{section.7}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {7}SAC-S算法}{8}{section.7}\protected@file@percent }
\@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{8}{algorithm.}\protected@file@percent }
\gdef \@abspage@last{8}
\@writefile{toc}{\contentsline {section}{\numberline {8}SAC算法}{9}{section.8}\protected@file@percent }
\@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{9}{algorithm.}\protected@file@percent }
\gdef \@abspage@last{9}

View File

@@ -1,4 +1,4 @@
This is XeTeX, Version 3.141592653-2.6-0.999993 (TeX Live 2021) (preloaded format=xelatex 2021.8.22) 22 AUG 2022 16:54
This is XeTeX, Version 3.141592653-2.6-0.999993 (TeX Live 2021) (preloaded format=xelatex 2021.8.22) 23 AUG 2022 19:26
entering extended mode
restricted \write18 enabled.
file:line:error style messages enabled.
@@ -415,85 +415,85 @@ Package: titlesec 2019/10/16 v2.13 Sectioning titles
) (./pseudocodes.aux)
\openout1 = `pseudocodes.aux'.
LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for TU/lmr/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for PD1/pdf/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for PU/pdf/m/n on input line 13.
LaTeX Font Info: ... okay on input line 13.
LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for TU/lmr/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for PD1/pdf/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
LaTeX Font Info: Checking defaults for PU/pdf/m/n on input line 14.
LaTeX Font Info: ... okay on input line 14.
ABD: EverySelectfont initializing macros
LaTeX Info: Redefining \selectfont on input line 13.
LaTeX Info: Redefining \selectfont on input line 14.
Package fontspec Info: Adjusting the maths setup (use [no-math] to avoid
(fontspec) this).
\symlegacymaths=\mathgroup6
LaTeX Font Info: Overwriting symbol font `legacymaths' in version `bold'
(Font) OT1/cmr/m/n --> OT1/cmr/bx/n on input line 13.
LaTeX Font Info: Redeclaring math accent \acute on input line 13.
LaTeX Font Info: Redeclaring math accent \grave on input line 13.
LaTeX Font Info: Redeclaring math accent \ddot on input line 13.
LaTeX Font Info: Redeclaring math accent \tilde on input line 13.
LaTeX Font Info: Redeclaring math accent \bar on input line 13.
LaTeX Font Info: Redeclaring math accent \breve on input line 13.
LaTeX Font Info: Redeclaring math accent \check on input line 13.
LaTeX Font Info: Redeclaring math accent \hat on input line 13.
LaTeX Font Info: Redeclaring math accent \dot on input line 13.
LaTeX Font Info: Redeclaring math accent \mathring on input line 13.
LaTeX Font Info: Redeclaring math symbol \Gamma on input line 13.
LaTeX Font Info: Redeclaring math symbol \Delta on input line 13.
LaTeX Font Info: Redeclaring math symbol \Theta on input line 13.
LaTeX Font Info: Redeclaring math symbol \Lambda on input line 13.
LaTeX Font Info: Redeclaring math symbol \Xi on input line 13.
LaTeX Font Info: Redeclaring math symbol \Pi on input line 13.
LaTeX Font Info: Redeclaring math symbol \Sigma on input line 13.
LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 13.
LaTeX Font Info: Redeclaring math symbol \Phi on input line 13.
LaTeX Font Info: Redeclaring math symbol \Psi on input line 13.
LaTeX Font Info: Redeclaring math symbol \Omega on input line 13.
LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 13.
LaTeX Font Info: Redeclaring symbol font `operators' on input line 13.
(Font) OT1/cmr/m/n --> OT1/cmr/bx/n on input line 14.
LaTeX Font Info: Redeclaring math accent \acute on input line 14.
LaTeX Font Info: Redeclaring math accent \grave on input line 14.
LaTeX Font Info: Redeclaring math accent \ddot on input line 14.
LaTeX Font Info: Redeclaring math accent \tilde on input line 14.
LaTeX Font Info: Redeclaring math accent \bar on input line 14.
LaTeX Font Info: Redeclaring math accent \breve on input line 14.
LaTeX Font Info: Redeclaring math accent \check on input line 14.
LaTeX Font Info: Redeclaring math accent \hat on input line 14.
LaTeX Font Info: Redeclaring math accent \dot on input line 14.
LaTeX Font Info: Redeclaring math accent \mathring on input line 14.
LaTeX Font Info: Redeclaring math symbol \Gamma on input line 14.
LaTeX Font Info: Redeclaring math symbol \Delta on input line 14.
LaTeX Font Info: Redeclaring math symbol \Theta on input line 14.
LaTeX Font Info: Redeclaring math symbol \Lambda on input line 14.
LaTeX Font Info: Redeclaring math symbol \Xi on input line 14.
LaTeX Font Info: Redeclaring math symbol \Pi on input line 14.
LaTeX Font Info: Redeclaring math symbol \Sigma on input line 14.
LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 14.
LaTeX Font Info: Redeclaring math symbol \Phi on input line 14.
LaTeX Font Info: Redeclaring math symbol \Psi on input line 14.
LaTeX Font Info: Redeclaring math symbol \Omega on input line 14.
LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 14.
LaTeX Font Info: Redeclaring symbol font `operators' on input line 14.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `normal' on input line 13.
(Font) `operators' in the math version `normal' on input line 14.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) OT1/cmr/m/n --> TU/lmr/m/n on input line 13.
(Font) OT1/cmr/m/n --> TU/lmr/m/n on input line 14.
LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font
(Font) `operators' in the math version `bold' on input line 13.
(Font) `operators' in the math version `bold' on input line 14.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) OT1/cmr/bx/n --> TU/lmr/m/n on input line 13.
(Font) OT1/cmr/bx/n --> TU/lmr/m/n on input line 14.
LaTeX Font Info: Overwriting symbol font `operators' in version `normal'
(Font) TU/lmr/m/n --> TU/lmr/m/n on input line 13.
(Font) TU/lmr/m/n --> TU/lmr/m/n on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `normal'
(Font) OT1/cmr/m/it --> TU/lmr/m/it on input line 13.
(Font) OT1/cmr/m/it --> TU/lmr/m/it on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathbf' in version `normal'
(Font) OT1/cmr/bx/n --> TU/lmr/b/n on input line 13.
(Font) OT1/cmr/bx/n --> TU/lmr/b/n on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `normal'
(Font) OT1/cmss/m/n --> TU/lmss/m/n on input line 13.
(Font) OT1/cmss/m/n --> TU/lmss/m/n on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `normal'
(Font) OT1/cmtt/m/n --> TU/lmtt/m/n on input line 13.
(Font) OT1/cmtt/m/n --> TU/lmtt/m/n on input line 14.
LaTeX Font Info: Overwriting symbol font `operators' in version `bold'
(Font) TU/lmr/m/n --> TU/lmr/b/n on input line 13.
(Font) TU/lmr/m/n --> TU/lmr/b/n on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathit' in version `bold'
(Font) OT1/cmr/bx/it --> TU/lmr/b/it on input line 13.
(Font) OT1/cmr/bx/it --> TU/lmr/b/it on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `bold'
(Font) OT1/cmss/bx/n --> TU/lmss/b/n on input line 13.
(Font) OT1/cmss/bx/n --> TU/lmss/b/n on input line 14.
LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `bold'
(Font) OT1/cmtt/m/n --> TU/lmtt/b/n on input line 13.
Package hyperref Info: Link coloring OFF on input line 13.
(Font) OT1/cmtt/m/n --> TU/lmtt/b/n on input line 14.
Package hyperref Info: Link coloring OFF on input line 14.
(/usr/local/texlive/2021/texmf-dist/tex/latex/hyperref/nameref.sty
Package: nameref 2021-04-02 v2.47 Cross-referencing by name of section
(/usr/local/texlive/2021/texmf-dist/tex/latex/refcount/refcount.sty
@@ -503,9 +503,9 @@ Package: gettitlestring 2019/12/15 v1.6 Cleanup title references (HO)
)
\c@section@level=\count313
)
LaTeX Info: Redefining \ref on input line 13.
LaTeX Info: Redefining \pageref on input line 13.
LaTeX Info: Redefining \nameref on input line 13.
LaTeX Info: Redefining \ref on input line 14.
LaTeX Info: Redefining \pageref on input line 14.
LaTeX Info: Redefining \nameref on input line 14.
(./pseudocodes.out) (./pseudocodes.out)
\@outlinefile=\write3
\openout3 = `pseudocodes.out'.
@@ -515,19 +515,19 @@ LaTeX Info: Redefining \nameref on input line 13.
\openout4 = `pseudocodes.toc'.
LaTeX Font Info: Font shape `TU/SongtiSCLight(0)/m/sl' in size <10.95> not available
(Font) Font shape `TU/SongtiSCLight(0)/m/it' tried instead on input line 16.
(Font) Font shape `TU/SongtiSCLight(0)/m/it' tried instead on input line 17.
[1
]
Package hyperref Info: bookmark level for unknown algorithm defaults to 0 on input line 21.
Package hyperref Info: bookmark level for unknown algorithm defaults to 0 on input line 22.
[2
]
LaTeX Font Info: Trying to load font information for U+msa on input line 31.
LaTeX Font Info: Trying to load font information for U+msa on input line 32.
(/usr/local/texlive/2021/texmf-dist/tex/latex/amsfonts/umsa.fd
File: umsa.fd 2013/01/14 v3.01 AMS symbols A
)
LaTeX Font Info: Trying to load font information for U+msb on input line 31.
LaTeX Font Info: Trying to load font information for U+msb on input line 32.
(/usr/local/texlive/2021/texmf-dist/tex/latex/amsfonts/umsb.fd
File: umsb.fd 2013/01/14 v3.01 AMS symbols B
) [3
@@ -536,38 +536,35 @@ File: umsb.fd 2013/01/14 v3.01 AMS symbols B
] [5
]
Underfull \hbox (badness 10000) in paragraph at lines 111--112
[] []\TU/SongtiSCLight(0)/m/n/10.95 计 算 实 际 的 $\OML/cmm/m/it/10.95 Q$ \TU/SongtiSCLight(0)/m/n/10.95 值,| 即 $\OML/cmm/m/it/10.95 y[] \OT1/cmr/m/n/10.95 =
[]
[6
] [6
] [7
] [8
]
Overfull \hbox (32.54117pt too wide) in paragraph at lines 183--183
Overfull \hbox (32.54117pt too wide) in paragraph at lines 212--212
[][]$[]\OML/cmm/m/it/9 J[]\OT1/cmr/m/n/9 (\OML/cmm/m/it/9 ^^R\OT1/cmr/m/n/9 ) = \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 Q[] [] []$|
[]
Overfull \hbox (15.41673pt too wide) in paragraph at lines 184--184
Overfull \hbox (15.41673pt too wide) in paragraph at lines 213--213
[][]$[]\OML/cmm/m/it/9 J[]\OT1/cmr/m/n/9 (\OML/cmm/m/it/9 ^^^\OT1/cmr/m/n/9 ) = \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 ^^K [] [] \OT1/cmr/m/n/9 + [] \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 f[] []$\TU/lmr/m/n/9 ,$[][] \OT1/cmr/m/n/9 =
[]
[8
[9
] (./pseudocodes.aux)
Package rerunfilecheck Info: File `pseudocodes.out' has not changed.
(rerunfilecheck) Checksum: 4575BA7458AA23D6E696EFFE39D05727;640.
(rerunfilecheck) Checksum: 35B5A79A86EF3BC70F1A0B3BCBEBAA13;724.
)
Here is how much of TeX's memory you used:
14813 strings out of 476919
312635 string characters out of 5821840
653471 words of memory out of 5000000
34563 multiletter control sequences out of 15000+600000
413601 words of font info for 90 fonts, out of 8000000 for 9000
14827 strings out of 476919
313456 string characters out of 5821840
653576 words of memory out of 5000000
34576 multiletter control sequences out of 15000+600000
413609 words of font info for 91 fonts, out of 8000000 for 9000
1348 hyphenation exceptions out of 8191
101i,13n,104p,676b,736s stack positions out of 5000i,500n,10000p,200000b,80000s
101i,13n,104p,676b,697s stack positions out of 5000i,500n,10000p,200000b,80000s
Output written on pseudocodes.pdf (8 pages).
Output written on pseudocodes.pdf (9 pages).

View File

@@ -4,4 +4,5 @@
\BOOKMARK [1][-]{section.4}{\376\377\000P\000o\000l\000i\000c\000y\000\040\000G\000r\000a\000d\000i\000e\000n\000t\173\227\154\325}{}% 4
\BOOKMARK [1][-]{section.5}{\376\377\000D\000Q\000N\173\227\154\325}{}% 5
\BOOKMARK [1][-]{section.6}{\376\377\000S\000o\000f\000t\000Q\173\227\154\325}{}% 6
\BOOKMARK [1][-]{section.7}{\376\377\000S\000A\000C\173\227\154\325}{}% 7
\BOOKMARK [1][-]{section.7}{\376\377\000S\000A\000C\000-\000S\173\227\154\325}{}% 7
\BOOKMARK [1][-]{section.8}{\376\377\000S\000A\000C\173\227\154\325}{}% 8

View File

@@ -10,6 +10,7 @@
\usepackage{titlesec}
\usepackage{float} % 调用该包能够使用[H]
% \pagestyle{plain} % 去除页眉但是保留页脚编号都去掉plain换empty
\begin{document}
\tableofcontents % 目录注意要运行两下或者vscode保存两下才能显示
% \singlespacing
@@ -88,7 +89,7 @@
\clearpage
\section{DQN算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{DQN算法}}
\floatname{algorithm}{{DQN算法}{\hypersetup{linkcolor=white}\footnotemark}}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\renewcommand{\algorithmicrequire}{\textbf{输入:}}
@@ -108,13 +109,17 @@
\STATE 更新环境状态$s_{t+1} \leftarrow s_t$
\STATE {\bfseries 更新策略:}
\STATE$D$中采样一个batch的transition
\STATE 计算实际的$Q$值,即$y_{j}= \begin{cases}r_{j} & \text {对于终止状态} s_{j+1} \\ r_{j}+\gamma \max _{a^{\prime}} Q\left(s_{j+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{j+1}\end{cases}$
\STATE 对损失 $\left(y_{j}-Q\left(s_{j}, a_{j} ; \theta\right)\right)^{2}$关于参数$\theta$做随机梯度下降
\STATE 计算实际的$Q$值,即$y_{j}${\hypersetup{linkcolor=white}\footnotemark}
\STATE 对损失 $L(\theta)=\left(y_{i}-Q\left(s_{i}, a_{i} ; \theta\right)\right)^{2}$关于参数$\theta$做随机梯度下降{\hypersetup{linkcolor=white}\footnotemark}
\ENDFOR
\STATE$C$个回合复制参数$\hat{Q}\leftarrow Q$(此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定)
\STATE$C$个回合复制参数$\hat{Q}\leftarrow Q${\hypersetup{linkcolor=white}\footnotemark}
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Playing Atari with Deep Reinforcement Learning}
\footnotetext[2]{$y_{i}= \begin{cases}r_{i} & \text {对于终止状态} s_{i+1} \\ r_{i}+\gamma \max _{a^{\prime}} Q\left(s_{i+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{i+1}\end{cases}$}
\footnotetext[3]{$\theta_i \leftarrow \theta_i - \lambda \nabla_{\theta_{i}} L_{i}\left(\theta_{i}\right)$}
\footnotetext[4]{此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定}
\clearpage
\section{SoftQ算法}
@@ -153,13 +158,37 @@
\footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\mathbf{s}_{t} \sim q_{\mathbf{s}_{t}}, \mathbf{a}_{t} \sim q_{\mathbf{a}_{t}}}\left[\frac{1}{2}\left(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$}
\footnotetext[3]{$\begin{aligned} \Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)=& \mathbb{E}_{\mathbf{a}_{t} \sim \pi^{\phi}}\left[\left.\kappa\left(\mathbf{a}_{t}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right) \nabla_{\mathbf{a}^{\prime}} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right.\\ &\left.+\left.\alpha \nabla_{\mathbf{a}^{\prime}} \kappa\left(\mathbf{a}^{\prime}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right] \end{aligned}$}
\clearpage
\section{SAC-S算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{SAC-S算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1] % [1]显示步数
\STATE 初始化参数$\psi, \bar{\psi}, \theta, \phi$
\FOR {回合数 = $1,M$}
\FOR {时步 = $1,t$}
\STATE 根据$\boldsymbol{a}_{t} \sim \pi_{\phi}\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t}\right)$采样动作$a_t$
\STATE 环境反馈奖励和下一个状态,$\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$
\STATE 存储transition到经验回放中$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$
\STATE 更新环境状态$s_{t+1} \leftarrow s_t$
\STATE {\bfseries 更新策略:}
\STATE $\psi \leftarrow \psi-\lambda_{V} \hat{\nabla}_{\psi} J_{V}(\psi)$
\STATE $\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$
\STATE $\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$
\STATE $\bar{\psi} \leftarrow \tau \psi+(1-\tau) \bar{\psi}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor}
\clearpage
\section{SAC算法}
\begin{algorithm}[H] % [H]固定位置
\floatname{algorithm}{{Soft Actor Critic算法}}
\floatname{algorithm}{{SAC算法}\footnotemark[1]}
\renewcommand{\thealgorithm}{} % 去掉算法标号
\caption{}
\begin{algorithmic}[1]
\STATE 初始化两个Actor的网络参数$\theta_1,\theta_2$以及一个Critic网络参数$\phi$ % 初始化
\STATE 初始化网络参数$\theta_1,\theta_2$以及$\phi$ % 初始化
\STATE 复制参数到目标网络$\bar{\theta_1} \leftarrow \theta_1,\bar{\theta_2} \leftarrow \theta_2,$
\STATE 初始化经验回放$D$
\FOR {回合数 = $1,M$}
@@ -170,18 +199,18 @@
\STATE 存储transition到经验回放中$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$
\STATE 更新环境状态$s_{t+1} \leftarrow s_t$
\STATE {\bfseries 更新策略:}
\STATE 更新$Q$函数,$\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$\footnotemark[1]\footnotemark[2]
\STATE 更新策略权重,$\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ \footnotemark[3]
\STATE 调整temperature$\alpha \leftarrow \alpha-\lambda \hat{\nabla}_{\alpha} J(\alpha)$ \footnotemark[4]
\STATE 更新$Q$函数,$\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$\footnotemark[2]\footnotemark[3]
\STATE 更新策略权重,$\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ \footnotemark[4]
\STATE 调整temperature$\alpha \leftarrow \alpha-\lambda \hat{\nabla}_{\alpha} J(\alpha)$ \footnotemark[5]
\STATE 更新目标网络权重,$\bar{\theta}_{i} \leftarrow \tau \theta_{i}+(1-\tau) \bar{\theta}_{i}$ for $i \in\{1,2\}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithmic}
\end{algorithm}
\footnotetext[1]{$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\right)^{2}\right]$}
\footnotetext[2]{$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma\left(Q_{\bar{\theta}}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)-\alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t+1} \mid \mathbf{s}_{t+1}\right)\right)\right)\right)\right.$}
\footnotetext[3]{$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)+\left(\nabla_{\mathbf{a}_{t}} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$,$\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$}
\footnotetext[4]{$J(\alpha)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{t}}\left[-\alpha \log \pi_{t}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\alpha \overline{\mathcal{H}}\right]$}
\footnotetext[2]{Soft Actor-Critic Algorithms and Applications}
\footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\right)^{2}\right]$}
\footnotetext[3]{$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma\left(Q_{\bar{\theta}}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)-\alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t+1} \mid \mathbf{s}_{t+1}\right)\right)\right)\right)\right.$}
\footnotetext[4]{$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)+\left(\nabla_{\mathbf{a}_{t}} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$,$\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$}
\footnotetext[5]{$J(\alpha)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{t}}\left[-\alpha \log \pi_{t}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\alpha \overline{\mathcal{H}}\right]$}
\clearpage
\end{document}
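The DQN update written in the pseudocode above (footnotes 2 and 3) corresponds to only a few lines of PyTorch. Below is a minimal sketch, assuming `policy_net`/`target_net` are the two Q-networks and the batch tensors come from the replay buffer; the names are illustrative and not taken from the repository:
```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma=0.95):
    # batch tensors: states/next_states [B, n_states], actions [B, 1], rewards/dones [B, 1]
    states, actions, rewards, next_states, dones = batch
    q_sa = policy_net(states).gather(1, actions)               # Q(s_i, a_i; theta)
    with torch.no_grad():
        next_q = target_net(next_states).max(1, keepdim=True)[0]
    # y_i = r_i for terminal s_{i+1}, otherwise r_i + gamma * max_a' Q_hat(s_{i+1}, a')
    y = rewards + gamma * next_q * (1 - dones)
    return F.mse_loss(q_sa, y)                                  # L(theta) = (y_i - Q(s_i, a_i; theta))^2
```
A gradient step on this loss is the stochastic gradient descent step of footnote 3, and copying the weights of `policy_net` into `target_net` every C episodes matches the last step of the outer loop.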

View File

@@ -4,4 +4,5 @@
\contentsline {section}{\numberline {4}Policy Gradient算法}{5}{section.4}%
\contentsline {section}{\numberline {5}DQN算法}{6}{section.5}%
\contentsline {section}{\numberline {6}SoftQ算法}{7}{section.6}%
\contentsline {section}{\numberline {7}SAC算法}{8}{section.7}%
\contentsline {section}{\numberline {7}SAC-S算法}{8}{section.7}%
\contentsline {section}{\numberline {8}SAC算法}{9}{section.8}%

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2022-08-18 14:27:18
LastEditTime: 2022-08-23 23:59:54
@Discription:
@Environment: python 3.7.7
'''
@@ -20,26 +20,26 @@ import math
import numpy as np
class DQN:
def __init__(self,n_actions,model,memory,cfg):
def __init__(self,model,memory,cfg):
self.n_actions = n_actions
self.device = torch.device(cfg.device)
self.gamma = cfg.gamma
self.n_actions = cfg['n_actions']
self.device = torch.device(cfg['device'])
self.gamma = cfg['gamma']
## e-greedy parameters
self.sample_count = 0 # sample count for epsilon decay
self.epsilon = cfg.epsilon_start
self.epsilon = cfg['epsilon_start']
self.sample_count = 0
self.epsilon_start = cfg.epsilon_start
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.batch_size = cfg.batch_size
self.epsilon_start = cfg['epsilon_start']
self.epsilon_end = cfg['epsilon_end']
self.epsilon_decay = cfg['epsilon_decay']
self.batch_size = cfg['batch_size']
self.policy_net = model.to(self.device)
self.target_net = model.to(self.device)
## copy parameters from policy net to target net
for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()):
target_param.data.copy_(param.data)
# self.target_net.load_state_dict(self.policy_net.state_dict()) # or use this to copy parameters
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg['lr'])
self.memory = memory
self.update_flag = False
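The epsilon-greedy parameters read from `cfg` above are consumed when the agent picks actions. The `sample_action` method itself is not part of this hunk, so the following is only a hedged sketch of how the exponential decay schedule is typically wired to those fields, not the repository's exact code:
```python
import math
import random
import torch

def sample_action(self, state):  # sketch of a DQN method using the cfg fields above
    self.sample_count += 1
    # decay epsilon exponentially from epsilon_start towards epsilon_end
    self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
        math.exp(-1. * self.sample_count / self.epsilon_decay)
    if random.random() < self.epsilon:
        return random.randrange(self.n_actions)  # explore: uniform random action
    with torch.no_grad():
        state = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        return self.policy_net(state).argmax(dim=1).item()  # exploit: greedy action
```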

137
projects/codes/DQN/main.py Normal file
View File

@@ -0,0 +1,137 @@
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
import torch
import datetime
import numpy as np
import argparse
from common.utils import save_results,all_seed
from common.utils import plot_rewards,save_args
from common.models import MLP
from common.memories import ReplayBuffer
from dqn import DQN
def get_args():
""" hyperparameters
"""
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='DQN',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CartPole-v0',type=str,help="name of environment")
parser.add_argument('--train_eps',default=200,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
parser.add_argument('--gamma',default=0.95,type=float,help="discounted factor")
parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon")
parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon")
parser.add_argument('--epsilon_decay',default=500,type=int,help="decay rate of epsilon")
parser.add_argument('--lr',default=0.0001,type=float,help="learning rate")
parser.add_argument('--memory_capacity',default=100000,type=int,help="memory capacity")
parser.add_argument('--batch_size',default=64,type=int)
parser.add_argument('--target_update',default=4,type=int)
parser.add_argument('--hidden_dim',default=256,type=int)
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=10,type=int,help="seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
# please manually change the following args in this script if you want
parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/results' )
parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
'/' + curr_time + '/models' )
args = parser.parse_args()
args = {**vars(args)} # type(dict)
return args
def env_agent_config(cfg):
''' create env and agent
'''
env = gym.make(cfg['env_name']) # create env
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
n_states = env.observation_space.shape[0] # state dimension
n_actions = env.action_space.n # action dimension
print(f"n_states: {n_states}, n_actions: {n_actions}")
    cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
model = MLP(n_states,n_actions,hidden_dim=cfg["hidden_dim"])
memory = ReplayBuffer(cfg["memory_capacity"]) # replay buffer
agent = DQN(model,memory,cfg) # create agent
return env, agent
def train(cfg, env, agent):
    ''' train the agent
'''
print("start training!")
print(f"Env: {cfg['env_name']}, Algo: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = []
for i_ep in range(cfg["train_eps"]):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
while True:
ep_step += 1
action = agent.sample_action(state) # sample action
next_state, reward, done, _ = env.step(action) # update env and return transitions
agent.memory.push(state, action, reward,
next_state, done) # save transitions
state = next_state # update next state for env
agent.update() # update agent
ep_reward += reward #
if done:
break
        if (i_ep + 1) % cfg["target_update"] == 0: # target net update; target_update is "C" in the pseudocode
agent.target_net.load_state_dict(agent.policy_net.state_dict())
steps.append(ep_step)
rewards.append(ep_reward)
if (i_ep + 1) % 10 == 0:
            print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Epsilon: {agent.epsilon:.3f}')
print("finish training!")
env.close()
res_dic = {'episodes':range(len(rewards)),'rewards':rewards}
return res_dic
def test(cfg, env, agent):
print("start testing!")
print(f"Env: {cfg.env_name}, Algo: {cfg.algo_name}, Device: {cfg.device}")
rewards = [] # record rewards for all episodes
steps = []
    for i_ep in range(cfg['test_eps']):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
while True:
ep_step+=1
action = agent.predict_action(state) # predict action
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
if done:
break
steps.append(ep_step)
rewards.append(ep_reward)
        print(f'Episode: {i_ep+1}/{cfg["test_eps"]}, Reward: {ep_reward:.2f}')
print("finish testing!")
env.close()
return {'episodes':range(len(rewards)),'rewards':rewards}
if __name__ == "__main__":
cfg = get_args()
# training
env, agent = env_agent_config(cfg)
res_dic = train(cfg, env, agent)
save_args(cfg,path = cfg['result_path']) # save parameters
agent.save_model(path = cfg['model_path']) # save models
save_results(res_dic, tag = 'train', path = cfg['result_path']) # save results
plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "train") # plot results
# testing
env, agent = env_agent_config(cfg) # create new env for testing, sometimes can ignore this step
agent.load_model(path = cfg['model_path']) # load model
res_dic = test(cfg, env, agent)
save_results(res_dic, tag='test',
path = cfg['result_path'])
plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "test")
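`main.py` relies on `all_seed(env, seed=...)` from `common.utils`, which is not included in this diff. A minimal sketch of what such a helper usually does under Gym 0.21 (an assumption, not the repository's implementation):
```python
import random
import numpy as np
import torch

def all_seed(env, seed=10):
    # hypothetical helper: seed the env plus every other source of randomness
    env.seed(seed)               # Gym <= 0.21 API, matching the version pinned above
    env.action_space.seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```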

View File

@@ -0,0 +1 @@
{"algo_name": "DQN", "env_name": "CartPole-v0", "train_eps": 200, "test_eps": 20, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 500, "lr": 0.0001, "memory_capacity": 100000, "batch_size": 64, "target_update": 4, "hidden_dim": 256, "device": "cpu", "seed": 10, "result_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/results", "model_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/models", "show_fig": false, "save_fig": true}

Binary file not shown.


View File

@@ -0,0 +1,21 @@
episodes,rewards
0,200.0
1,200.0
2,200.0
3,200.0
4,200.0
5,200.0
6,200.0
7,200.0
8,200.0
9,200.0
10,200.0
11,200.0
12,200.0
13,200.0
14,200.0
15,200.0
16,200.0
17,200.0
18,200.0
19,200.0

Binary file not shown.


View File

@@ -0,0 +1,201 @@
episodes,rewards
0,38.0
1,16.0
2,37.0
3,15.0
4,22.0
5,34.0
6,20.0
7,12.0
8,16.0
9,14.0
10,13.0
11,21.0
12,14.0
13,12.0
14,17.0
15,12.0
16,10.0
17,14.0
18,10.0
19,10.0
20,16.0
21,9.0
22,14.0
23,13.0
24,10.0
25,9.0
26,12.0
27,12.0
28,14.0
29,11.0
30,9.0
31,8.0
32,9.0
33,11.0
34,12.0
35,10.0
36,11.0
37,10.0
38,10.0
39,18.0
40,13.0
41,15.0
42,10.0
43,9.0
44,14.0
45,14.0
46,23.0
47,17.0
48,15.0
49,15.0
50,20.0
51,28.0
52,36.0
53,36.0
54,23.0
55,27.0
56,53.0
57,19.0
58,35.0
59,62.0
60,57.0
61,38.0
62,61.0
63,65.0
64,58.0
65,43.0
66,67.0
67,56.0
68,91.0
69,128.0
70,71.0
71,126.0
72,100.0
73,200.0
74,200.0
75,200.0
76,200.0
77,200.0
78,200.0
79,200.0
80,200.0
81,200.0
82,200.0
83,200.0
84,200.0
85,200.0
86,200.0
87,200.0
88,200.0
89,200.0
90,200.0
91,200.0
92,200.0
93,200.0
94,200.0
95,200.0
96,200.0
97,200.0
98,200.0
99,200.0
100,200.0
101,200.0
102,200.0
103,200.0
104,200.0
105,200.0
106,200.0
107,200.0
108,200.0
109,200.0
110,200.0
111,200.0
112,200.0
113,200.0
114,200.0
115,200.0
116,200.0
117,200.0
118,200.0
119,200.0
120,200.0
121,200.0
122,200.0
123,200.0
124,200.0
125,200.0
126,200.0
127,200.0
128,200.0
129,200.0
130,200.0
131,200.0
132,200.0
133,200.0
134,200.0
135,200.0
136,200.0
137,200.0
138,200.0
139,200.0
140,200.0
141,200.0
142,200.0
143,200.0
144,200.0
145,200.0
146,200.0
147,200.0
148,200.0
149,200.0
150,200.0
151,200.0
152,200.0
153,200.0
154,200.0
155,200.0
156,200.0
157,200.0
158,200.0
159,200.0
160,200.0
161,200.0
162,200.0
163,200.0
164,200.0
165,200.0
166,200.0
167,200.0
168,200.0
169,200.0
170,200.0
171,200.0
172,200.0
173,200.0
174,200.0
175,200.0
176,200.0
177,200.0
178,200.0
179,200.0
180,200.0
181,200.0
182,200.0
183,200.0
184,200.0
185,200.0
186,200.0
187,200.0
188,200.0
189,200.0
190,200.0
191,200.0
192,200.0
193,200.0
194,200.0
195,200.0
196,200.0
197,200.0
198,200.0
199,200.0

View File

@@ -0,0 +1,153 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2022-08-24 11:27:01
Discription:
Environment:
'''
import sys,os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path
parent_path = os.path.dirname(curr_path) # parent path
sys.path.append(parent_path) # add path to system path
import gym
import datetime
import argparse
from envs.gridworld_env import CliffWalkingWapper,FrozenLakeWapper
from qlearning import QLearning
from common.utils import plot_rewards,save_args,all_seed
from common.utils import save_results,make_dir
def get_args():
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
parser = argparse.ArgumentParser(description="hyperparameters")
parser.add_argument('--algo_name',default='Q-learning',type=str,help="name of algorithm")
parser.add_argument('--env_name',default='CliffWalking-v0',type=str,help="name of environment")
parser.add_argument('--train_eps',default=400,type=int,help="episodes of training")
parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
parser.add_argument('--gamma',default=0.90,type=float,help="discounted factor")
parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon")
parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon")
parser.add_argument('--epsilon_decay',default=300,type=int,help="decay rate of epsilon")
parser.add_argument('--lr',default=0.1,type=float,help="learning rate")
parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda")
parser.add_argument('--seed',default=10,type=int,help="seed")
parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")
parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")
args = parser.parse_args()
default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
}
args = {**vars(args),**default_args} # type(dict)
return args
def env_agent_config(cfg):
''' create env and agent
'''
if cfg['env_name'] == 'CliffWalking-v0':
env = gym.make(cfg['env_name'])
env = CliffWalkingWapper(env)
if cfg['env_name'] == 'FrozenLake-v1':
env = gym.make(cfg['env_name'],is_slippery=False)
if cfg['seed'] !=0: # set random seed
all_seed(env,seed=cfg["seed"])
n_states = env.observation_space.n # state dimension
n_actions = env.action_space.n # action dimension
print(f"n_states: {n_states}, n_actions: {n_actions}")
    cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
agent = QLearning(cfg)
return env,agent
def main(cfg,env,agent,tag = 'train'):
print(f"Start {tag}ing!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
    rewards = [] # record rewards
    for i_ep in range(cfg['train_eps']):
        ep_reward = 0 # reward per episode
        state = env.reset() # reset the environment, i.e. start a new episode
        while True:
            if tag == 'train': action = agent.sample_action(state) # sample an action from the policy
            else: action = agent.predict_action(state) # act greedily when testing
            next_state, reward, done, _ = env.step(action) # take one step in the environment
            if tag == 'train': agent.update(state, action, reward, next_state, done) # Q-learning update
            state = next_state # update the state
ep_reward += reward
if done:
break
rewards.append(ep_reward)
print(f"回合:{i_ep+1}/{cfg.train_eps},奖励:{ep_reward:.1f}Epsilon{agent.epsilon}")
print(f"Finish {tag}ing!")
return {"rewards":rewards}
def train(cfg,env,agent):
print("Start training!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = [] # record steps for all episodes
for i_ep in range(cfg['train_eps']):
ep_reward = 0 # reward per episode
ep_step = 0 # step per episode
state = env.reset() # reset and obtain initial state
while True:
action = agent.sample_action(state) # sample action
next_state, reward, done, _ = env.step(action) # update env and return transitions
agent.update(state, action, reward, next_state, done) # update agent
state = next_state # update state
ep_reward += reward
ep_step += 1
if done:
break
rewards.append(ep_reward)
steps.append(ep_step)
if (i_ep+1)%10==0:
            print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Steps: {ep_step}, Epsilon: {agent.epsilon:.3f}')
print("Finish training!")
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
def test(cfg,env,agent):
print("Start testing!")
print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
rewards = [] # record rewards for all episodes
steps = [] # record steps for all episodes
for i_ep in range(cfg['test_eps']):
ep_reward = 0 # reward per episode
ep_step = 0
state = env.reset() # reset and obtain initial state
while True:
action = agent.predict_action(state) # predict action
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
ep_step += 1
if done:
break
rewards.append(ep_reward)
steps.append(ep_step)
print(f"Episode: {i_ep+1}/{cfg['test_eps']}, Steps:{ep_step}, Reward: {ep_reward:.2f}")
print("Finish testing!")
return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
if __name__ == "__main__":
cfg = get_args()
# training
env, agent = env_agent_config(cfg)
res_dic = train(cfg, env, agent)
save_args(cfg,path = cfg['result_path']) # save parameters
agent.save_model(path = cfg['model_path']) # save models
save_results(res_dic, tag = 'train', path = cfg['result_path']) # save results
plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "train") # plot results
# testing
env, agent = env_agent_config(cfg) # create new env for testing, sometimes can ignore this step
agent.load_model(path = cfg['model_path']) # load model
res_dic = test(cfg, env, agent)
save_results(res_dic, tag='test',
path = cfg['result_path'])
plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "test")
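The `QLearning` agent imported from `qlearning.py` is not shown in this commit. Its `update` call presumably performs the standard tabular Q-learning rule; the class below is an illustrative sketch with a `defaultdict` Q-table, not the repository's implementation:
```python
from collections import defaultdict
import numpy as np

class QLearningSketch:  # hypothetical stand-in for qlearning.QLearning
    def __init__(self, cfg):
        self.lr, self.gamma = cfg['lr'], cfg['gamma']
        self.n_actions = cfg['n_actions']
        self.Q = defaultdict(lambda: np.zeros(self.n_actions))  # Q-table keyed by state

    def update(self, state, action, reward, next_state, done):
        # TD target: reward only for terminal states, otherwise bootstrap from max_a Q(s', a)
        target = reward if done else reward + self.gamma * np.max(self.Q[next_state])
        # move Q(s, a) a learning-rate-sized step towards the target
        self.Q[state][action] += self.lr * (target - self.Q[state][action])
```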

View File

@@ -0,0 +1,19 @@
{
"algo_name": "Q-learning",
"env_name": "CliffWalking-v0",
"train_eps": 400,
"test_eps": 20,
"gamma": 0.9,
"epsilon_start": 0.95,
"epsilon_end": 0.01,
"epsilon_decay": 300,
"lr": 0.1,
"device": "cpu",
"seed": 10,
"show_fig": false,
"save_fig": true,
"result_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/",
"model_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/models/",
"n_states": 48,
"n_actions": 4
}

Binary file not shown.


View File

@@ -0,0 +1,21 @@
episodes,rewards
0,-13
1,-13
2,-13
3,-13
4,-13
5,-13
6,-13
7,-13
8,-13
9,-13
10,-13
11,-13
12,-13
13,-13
14,-13
15,-13
16,-13
17,-13
18,-13
19,-13

Binary file not shown.


View File

@@ -0,0 +1,401 @@
episodes,rewards
0,-2131
1,-1086
2,-586
3,-220
4,-154
5,-122
6,-150
7,-159
8,-164
9,-88
10,-195
11,-114
12,-60
13,-179
14,-101
15,-304
16,-96
17,-119
18,-113
19,-98
20,-106
21,-105
22,-77
23,-51
24,-105
25,-136
26,-100
27,-29
28,-79
29,-114
30,-82
31,-70
32,-75
33,-51
34,-94
35,-52
36,-93
37,-71
38,-73
39,-48
40,-52
41,-96
42,-46
43,-65
44,-57
45,-41
46,-104
47,-51
48,-181
49,-229
50,-39
51,-69
52,-53
53,-59
54,-26
55,-75
56,-31
57,-60
58,-63
59,-40
60,-35
61,-79
62,-42
63,-22
64,-73
65,-71
66,-18
67,-55
68,-29
69,-43
70,-70
71,-49
72,-42
73,-29
74,-81
75,-36
76,-38
77,-36
78,-52
79,-28
80,-42
81,-52
82,-66
83,-31
84,-27
85,-49
86,-28
87,-54
88,-34
89,-35
90,-50
91,-36
92,-36
93,-46
94,-34
95,-135
96,-39
97,-36
98,-26
99,-56
100,-40
101,-40
102,-26
103,-28
104,-31
105,-35
106,-26
107,-57
108,-44
109,-41
110,-31
111,-26
112,-25
113,-41
114,-32
115,-44
116,-30
117,-32
118,-30
119,-25
120,-23
121,-47
122,-24
123,-45
124,-39
125,-21
126,-43
127,-143
128,-26
129,-20
130,-32
131,-16
132,-24
133,-42
134,-25
135,-36
136,-19
137,-29
138,-43
139,-17
140,-150
141,-32
142,-34
143,-19
144,-26
145,-30
146,-31
147,-49
148,-33
149,-21
150,-17
151,-48
152,-34
153,-20
154,-20
155,-26
156,-21
157,-13
158,-40
159,-22
160,-26
161,-30
162,-29
163,-25
164,-26
165,-27
166,-21
167,-29
168,-24
169,-17
170,-22
171,-35
172,-35
173,-18
174,-135
175,-15
176,-23
177,-28
178,-25
179,-24
180,-29
181,-31
182,-24
183,-129
184,-45
185,-24
186,-17
187,-20
188,-21
189,-23
190,-15
191,-32
192,-22
193,-19
194,-17
195,-45
196,-15
197,-14
198,-14
199,-37
200,-23
201,-17
202,-19
203,-21
204,-23
205,-27
206,-14
207,-18
208,-23
209,-34
210,-23
211,-13
212,-25
213,-17
214,-13
215,-21
216,-29
217,-18
218,-24
219,-15
220,-27
221,-25
222,-21
223,-19
224,-17
225,-18
226,-13
227,-22
228,-14
229,-13
230,-29
231,-23
232,-15
233,-15
234,-14
235,-28
236,-25
237,-17
238,-23
239,-29
240,-15
241,-14
242,-15
243,-23
244,-15
245,-16
246,-19
247,-13
248,-16
249,-17
250,-25
251,-30
252,-13
253,-14
254,-15
255,-22
256,-14
257,-17
258,-126
259,-15
260,-21
261,-16
262,-23
263,-14
264,-13
265,-13
266,-19
267,-13
268,-19
269,-17
270,-17
271,-13
272,-19
273,-13
274,-13
275,-16
276,-22
277,-14
278,-15
279,-19
280,-34
281,-13
282,-15
283,-32
284,-13
285,-13
286,-13
287,-14
288,-16
289,-13
290,-13
291,-17
292,-13
293,-13
294,-22
295,-14
296,-15
297,-13
298,-13
299,-13
300,-16
301,-13
302,-14
303,-13
304,-13
305,-13
306,-24
307,-13
308,-13
309,-15
310,-13
311,-13
312,-13
313,-15
314,-13
315,-19
316,-15
317,-17
318,-13
319,-13
320,-13
321,-13
322,-13
323,-15
324,-13
325,-13
326,-13
327,-123
328,-13
329,-13
330,-13
331,-13
332,-13
333,-13
334,-13
335,-13
336,-16
337,-13
338,-23
339,-13
340,-13
341,-13
342,-13
343,-13
344,-13
345,-13
346,-13
347,-13
348,-13
349,-13
350,-134
351,-13
352,-13
353,-13
354,-13
355,-13
356,-13
357,-13
358,-13
359,-13
360,-15
361,-13
362,-13
363,-13
364,-13
365,-13
366,-13
367,-13
368,-13
369,-14
370,-13
371,-13
372,-13
373,-13
374,-13
375,-13
376,-13
377,-124
378,-13
379,-13
380,-13
381,-13
382,-13
383,-13
384,-13
385,-13
386,-13
387,-13
388,-13
389,-121
390,-13
391,-13
392,-13
393,-13
394,-13
395,-13
396,-13
397,-13
398,-17
399,-13

View File

@@ -0,0 +1,19 @@
{
"algo_name": "Q-learning",
"env_name": "FrozenLake-v1",
"train_eps": 800,
"test_eps": 20,
"gamma": 0.9,
"epsilon_start": 0.7,
"epsilon_end": 0.1,
"epsilon_decay": 2000,
"lr": 0.9,
"device": "cpu",
"seed": 10,
"show_fig": false,
"save_fig": true,
"result_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/",
"model_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/models/",
"n_states": 16,
"n_actions": 4
}

Binary file not shown.


View File

@@ -0,0 +1,21 @@
episodes,rewards,steps
0,1.0,6
1,1.0,6
2,1.0,6
3,1.0,6
4,1.0,6
5,1.0,6
6,1.0,6
7,1.0,6
8,1.0,6
9,1.0,6
10,1.0,6
11,1.0,6
12,1.0,6
13,1.0,6
14,1.0,6
15,1.0,6
16,1.0,6
17,1.0,6
18,1.0,6
19,1.0,6

Binary file not shown.


View File

@@ -0,0 +1,801 @@
episodes,rewards,steps
0,0.0,20
1,0.0,14
2,0.0,13
3,0.0,9
4,0.0,10
5,0.0,6
6,0.0,11
7,0.0,6
8,0.0,3
9,0.0,9
10,0.0,11
11,0.0,22
12,0.0,5
13,0.0,16
14,0.0,4
15,0.0,9
16,0.0,18
17,0.0,2
18,0.0,4
19,0.0,8
20,0.0,7
21,0.0,4
22,0.0,22
23,0.0,15
24,0.0,5
25,0.0,16
26,0.0,7
27,0.0,19
28,0.0,22
29,0.0,16
30,0.0,11
31,0.0,22
32,0.0,28
33,0.0,23
34,0.0,4
35,0.0,11
36,0.0,8
37,0.0,15
38,0.0,5
39,0.0,7
40,0.0,9
41,0.0,4
42,0.0,3
43,0.0,6
44,0.0,41
45,0.0,9
46,0.0,23
47,0.0,3
48,1.0,38
49,0.0,29
50,0.0,17
51,0.0,4
52,0.0,2
53,0.0,25
54,0.0,6
55,0.0,2
56,0.0,30
57,0.0,6
58,0.0,7
59,0.0,11
60,0.0,9
61,0.0,8
62,0.0,23
63,0.0,10
64,0.0,3
65,0.0,5
66,0.0,7
67,0.0,18
68,0.0,8
69,0.0,26
70,0.0,6
71,0.0,14
72,0.0,4
73,0.0,25
74,0.0,21
75,0.0,13
76,0.0,4
77,0.0,29
78,0.0,21
79,0.0,6
80,0.0,6
81,0.0,11
82,0.0,21
83,0.0,9
84,0.0,9
85,0.0,7
86,0.0,48
87,0.0,23
88,0.0,100
89,0.0,60
90,0.0,7
91,0.0,10
92,0.0,24
93,0.0,4
94,0.0,7
95,0.0,17
96,0.0,87
97,0.0,28
98,0.0,7
99,0.0,5
100,0.0,12
101,0.0,14
102,0.0,6
103,0.0,13
104,0.0,93
105,0.0,4
106,0.0,50
107,0.0,8
108,0.0,12
109,0.0,43
110,0.0,30
111,0.0,15
112,0.0,19
113,0.0,100
114,0.0,82
115,0.0,40
116,0.0,88
117,0.0,19
118,0.0,30
119,0.0,27
120,0.0,5
121,0.0,87
122,0.0,9
123,0.0,64
124,0.0,27
125,0.0,68
126,0.0,81
127,0.0,86
128,0.0,100
129,0.0,100
130,0.0,27
131,0.0,41
132,0.0,70
133,0.0,27
134,0.0,6
135,0.0,18
136,0.0,38
137,0.0,26
138,0.0,36
139,0.0,3
140,0.0,61
141,0.0,100
142,0.0,4
143,0.0,39
144,0.0,18
145,0.0,33
146,0.0,29
147,0.0,49
148,0.0,88
149,0.0,22
150,0.0,65
151,0.0,36
152,0.0,30
153,0.0,58
154,0.0,43
155,0.0,53
156,0.0,43
157,0.0,13
158,0.0,8
159,0.0,39
160,0.0,29
161,0.0,26
162,0.0,60
163,0.0,100
164,0.0,31
165,0.0,22
166,0.0,100
167,0.0,46
168,0.0,23
169,0.0,54
170,0.0,8
171,0.0,58
172,0.0,3
173,0.0,47
174,0.0,16
175,0.0,21
176,0.0,44
177,0.0,29
178,0.0,100
179,0.0,100
180,0.0,62
181,0.0,83
182,0.0,26
183,0.0,24
184,0.0,10
185,0.0,12
186,0.0,40
187,0.0,25
188,0.0,18
189,0.0,60
190,0.0,100
191,0.0,100
192,0.0,24
193,0.0,56
194,0.0,71
195,0.0,19
196,0.0,100
197,0.0,44
198,0.0,41
199,0.0,41
200,0.0,60
201,0.0,31
202,0.0,34
203,0.0,35
204,0.0,59
205,0.0,51
206,0.0,100
207,0.0,100
208,0.0,100
209,0.0,100
210,0.0,37
211,0.0,68
212,0.0,40
213,0.0,17
214,0.0,79
215,0.0,100
216,0.0,26
217,0.0,61
218,0.0,25
219,0.0,18
220,0.0,27
221,0.0,13
222,0.0,100
223,0.0,87
224,0.0,100
225,0.0,92
226,0.0,100
227,0.0,8
228,0.0,100
229,0.0,64
230,0.0,17
231,0.0,82
232,0.0,100
233,0.0,94
234,0.0,7
235,0.0,36
236,0.0,100
237,0.0,56
238,0.0,17
239,0.0,100
240,0.0,83
241,0.0,100
242,0.0,100
243,0.0,43
244,0.0,87
245,0.0,42
246,0.0,80
247,0.0,54
248,0.0,82
249,0.0,97
250,0.0,65
251,0.0,83
252,0.0,100
253,0.0,59
254,0.0,100
255,0.0,78
256,0.0,100
257,0.0,100
258,0.0,43
259,0.0,80
260,0.0,100
261,0.0,70
262,0.0,94
263,0.0,100
264,0.0,100
265,0.0,37
266,0.0,11
267,0.0,31
268,0.0,100
269,0.0,34
270,0.0,32
271,0.0,58
272,0.0,38
273,0.0,28
274,0.0,100
275,0.0,59
276,0.0,100
277,0.0,82
278,0.0,51
279,0.0,25
280,0.0,73
281,0.0,56
282,0.0,55
283,0.0,38
284,0.0,100
285,0.0,100
286,0.0,92
287,0.0,100
288,0.0,100
289,0.0,100
290,0.0,37
291,0.0,100
292,0.0,66
293,0.0,24
294,0.0,17
295,0.0,100
296,0.0,59
297,0.0,25
298,0.0,73
299,0.0,100
300,0.0,29
301,0.0,100
302,0.0,72
303,0.0,6
304,1.0,57
305,0.0,47
306,0.0,48
307,0.0,13
308,0.0,100
309,0.0,38
310,0.0,100
311,0.0,20
312,0.0,100
313,0.0,100
314,0.0,5
315,0.0,39
316,0.0,11
317,0.0,83
318,0.0,42
319,0.0,100
320,0.0,99
321,0.0,83
322,0.0,28
323,0.0,46
324,0.0,100
325,0.0,100
326,0.0,62
327,0.0,100
328,0.0,23
329,0.0,91
330,0.0,53
331,0.0,19
332,0.0,26
333,0.0,93
334,0.0,38
335,0.0,22
336,0.0,43
337,0.0,100
338,0.0,90
339,0.0,18
340,0.0,45
341,0.0,65
342,1.0,22
343,0.0,100
344,1.0,15
345,1.0,72
346,0.0,5
347,1.0,6
348,1.0,6
349,1.0,9
350,1.0,8
351,1.0,9
352,1.0,8
353,1.0,6
354,1.0,6
355,1.0,10
356,1.0,6
357,0.0,5
358,0.0,3
359,1.0,6
360,1.0,6
361,1.0,6
362,1.0,6
363,1.0,8
364,1.0,6
365,1.0,8
366,1.0,6
367,1.0,6
368,1.0,8
369,1.0,6
370,1.0,6
371,0.0,5
372,1.0,6
373,0.0,6
374,1.0,6
375,1.0,12
376,1.0,6
377,1.0,6
378,1.0,9
379,1.0,6
380,1.0,6
381,0.0,2
382,0.0,3
383,0.0,2
384,0.0,4
385,0.0,3
386,1.0,7
387,1.0,6
388,1.0,6
389,1.0,8
390,1.0,9
391,1.0,8
392,1.0,8
393,1.0,6
394,1.0,6
395,1.0,7
396,1.0,6
397,0.0,5
398,0.0,5
399,1.0,10
400,1.0,6
401,0.0,3
402,1.0,6
403,1.0,7
404,1.0,6
405,1.0,6
406,1.0,6
407,1.0,6
408,1.0,6
409,1.0,6
410,1.0,6
411,0.0,5
412,1.0,6
413,1.0,6
414,0.0,2
415,1.0,6
416,1.0,6
417,1.0,6
418,1.0,6
419,1.0,6
420,1.0,8
421,1.0,6
422,1.0,6
423,1.0,6
424,1.0,6
425,1.0,7
426,0.0,5
427,1.0,6
428,1.0,6
429,1.0,6
430,1.0,8
431,1.0,6
432,1.0,6
433,1.0,6
434,1.0,6
435,0.0,2
436,1.0,8
437,1.0,7
438,1.0,6
439,1.0,7
440,1.0,6
441,1.0,6
442,0.0,3
443,0.0,4
444,1.0,6
445,1.0,6
446,1.0,7
447,1.0,6
448,1.0,6
449,1.0,6
450,1.0,6
451,1.0,6
452,1.0,6
453,1.0,8
454,1.0,6
455,1.0,6
456,1.0,6
457,1.0,6
458,1.0,6
459,1.0,7
460,1.0,8
461,1.0,6
462,1.0,7
463,1.0,6
464,1.0,6
465,1.0,6
466,1.0,6
467,1.0,8
468,1.0,6
469,1.0,6
470,1.0,8
471,1.0,6
472,1.0,11
473,1.0,6
474,1.0,6
475,1.0,6
476,1.0,8
477,0.0,2
478,1.0,7
479,1.0,6
480,1.0,6
481,1.0,7
482,1.0,6
483,1.0,6
484,1.0,6
485,1.0,6
486,0.0,3
487,1.0,7
488,1.0,6
489,1.0,6
490,1.0,6
491,0.0,3
492,1.0,6
493,1.0,7
494,1.0,12
495,1.0,6
496,0.0,9
497,1.0,6
498,1.0,6
499,0.0,8
500,1.0,6
501,0.0,3
502,0.0,5
503,0.0,3
504,1.0,6
505,1.0,6
506,1.0,6
507,1.0,6
508,1.0,6
509,1.0,6
510,1.0,6
511,1.0,6
512,1.0,6
513,1.0,6
514,0.0,2
515,1.0,7
516,1.0,6
517,1.0,6
518,1.0,6
519,1.0,6
520,1.0,6
521,1.0,7
522,0.0,4
523,1.0,6
524,0.0,5
525,1.0,6
526,1.0,6
527,1.0,6
528,1.0,6
529,0.0,3
530,1.0,6
531,1.0,6
532,1.0,6
533,1.0,7
534,1.0,8
535,1.0,6
536,1.0,6
537,1.0,6
538,1.0,6
539,1.0,7
540,1.0,7
541,1.0,7
542,1.0,8
543,1.0,6
544,1.0,10
545,1.0,6
546,1.0,6
547,1.0,6
548,1.0,8
549,1.0,6
550,1.0,6
551,1.0,8
552,1.0,6
553,1.0,7
554,1.0,6
555,1.0,7
556,1.0,6
557,1.0,6
558,1.0,7
559,1.0,7
560,1.0,7
561,1.0,6
562,1.0,6
563,1.0,6
564,1.0,6
565,1.0,6
566,1.0,6
567,1.0,6
568,1.0,7
569,0.0,4
570,1.0,8
571,1.0,8
572,1.0,7
573,1.0,6
574,1.0,8
575,1.0,6
576,1.0,6
577,1.0,7
578,1.0,6
579,1.0,6
580,1.0,8
581,1.0,7
582,1.0,6
583,1.0,6
584,0.0,3
585,1.0,11
586,1.0,6
587,1.0,8
588,0.0,2
589,1.0,6
590,1.0,6
591,1.0,6
592,1.0,6
593,1.0,8
594,1.0,6
595,1.0,7
596,1.0,6
597,1.0,7
598,1.0,6
599,1.0,8
600,0.0,2
601,1.0,6
602,1.0,7
603,1.0,6
604,1.0,6
605,1.0,10
606,1.0,7
607,1.0,6
608,1.0,6
609,1.0,6
610,1.0,6
611,1.0,6
612,1.0,7
613,0.0,4
614,1.0,7
615,1.0,6
616,1.0,8
617,0.0,3
618,1.0,6
619,1.0,6
620,1.0,6
621,1.0,6
622,0.0,2
623,1.0,6
624,1.0,6
625,1.0,6
626,1.0,6
627,1.0,6
628,1.0,7
629,1.0,6
630,1.0,6
631,1.0,7
632,1.0,6
633,1.0,6
634,1.0,6
635,1.0,6
636,1.0,6
637,1.0,6
638,1.0,6
639,1.0,8
640,1.0,6
641,1.0,8
642,1.0,7
643,1.0,6
644,0.0,3
645,1.0,6
646,1.0,7
647,1.0,6
648,1.0,6
649,1.0,6
650,1.0,10
651,1.0,6
652,1.0,6
653,1.0,6
654,1.0,6
655,1.0,10
656,1.0,6
657,1.0,8
658,1.0,8
659,1.0,7
660,1.0,6
661,0.0,5
662,0.0,2
663,1.0,8
664,1.0,6
665,1.0,10
666,1.0,6
667,1.0,8
668,1.0,10
669,1.0,6
670,1.0,6
671,1.0,6
672,1.0,10
673,1.0,6
674,0.0,4
675,1.0,6
676,1.0,6
677,1.0,6
678,1.0,15
679,1.0,6
680,1.0,6
681,1.0,6
682,1.0,6
683,1.0,6
684,1.0,6
685,1.0,8
686,1.0,6
687,1.0,7
688,1.0,6
689,1.0,6
690,1.0,8
691,1.0,6
692,1.0,6
693,1.0,8
694,1.0,8
695,1.0,6
696,1.0,6
697,1.0,6
698,1.0,10
699,1.0,6
700,1.0,6
701,1.0,6
702,1.0,6
703,1.0,6
704,1.0,6
705,1.0,6
706,1.0,8
707,1.0,8
708,1.0,6
709,1.0,6
710,0.0,2
711,1.0,6
712,1.0,6
713,1.0,6
714,1.0,8
715,1.0,6
716,1.0,6
717,1.0,6
718,1.0,6
719,1.0,6
720,1.0,6
721,1.0,6
722,1.0,6
723,1.0,6
724,1.0,7
725,0.0,3
726,1.0,7
727,1.0,6
728,1.0,6
729,1.0,6
730,0.0,2
731,1.0,6
732,1.0,8
733,1.0,6
734,1.0,6
735,1.0,6
736,1.0,6
737,1.0,9
738,1.0,6
739,1.0,6
740,1.0,6
741,1.0,6
742,1.0,6
743,1.0,6
744,1.0,9
745,1.0,7
746,0.0,4
747,1.0,6
748,1.0,8
749,1.0,11
750,1.0,6
751,1.0,6
752,1.0,6
753,1.0,6
754,1.0,6
755,1.0,8
756,1.0,6
757,1.0,6
758,1.0,8
759,1.0,7
760,1.0,6
761,1.0,8
762,1.0,6
763,0.0,5
764,1.0,9
765,1.0,8
766,1.0,8
767,1.0,6
768,1.0,8
769,1.0,8
770,1.0,6
771,0.0,5
772,0.0,3
773,0.0,2
774,1.0,8
775,1.0,6
776,1.0,6
777,1.0,6
778,1.0,6
779,1.0,6
780,1.0,6
781,1.0,6
782,1.0,6
783,1.0,6
784,1.0,6
785,1.0,6
786,1.0,6
787,1.0,6
788,1.0,6
789,0.0,2
790,1.0,6
791,0.0,4
792,1.0,6
793,1.0,6
794,1.0,6
795,1.0,6
796,1.0,6
797,1.0,8
798,0.0,5
799,1.0,6

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-12-22 10:54:57
LastEditTime: 2022-08-24 10:31:04
Description: use defaultdict to define Q table
Environment:
'''
@@ -15,50 +15,52 @@ import torch
from collections import defaultdict
class QLearning(object):
def __init__(self,
n_actions,cfg):
self.n_actions = n_actions
self.lr = cfg.lr # learning rate
self.gamma = cfg.gamma
self.epsilon = cfg.epsilon_start
def __init__(self,cfg):
self.n_actions = cfg['n_actions']
self.lr = cfg['lr']
self.gamma = cfg['gamma']
self.epsilon = cfg['epsilon_start']
self.sample_count = 0
self.epsilon_start = cfg.epsilon_start
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.Q_table = defaultdict(lambda: np.zeros(n_actions)) # nested dict storing the state -> action -> Q(s,a) mapping, i.e. the Q table
def sample(self, state):
''' sample an action, used during training
self.epsilon_start = cfg['epsilon_start']
self.epsilon_end = cfg['epsilon_end']
self.epsilon_decay = cfg['epsilon_decay']
self.Q_table = defaultdict(lambda: np.zeros(self.n_actions)) # use a nested dictionary to represent Q(s,a); all Q(s,a) are initialized to 0 here, unlike in the pseudocode
def sample_action(self, state):
''' sample action with e-greedy policy while training
'''
self.sample_count += 1
# epsilon must decay (linearly, exponentially, etc.) to balance exploration and exploitation
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.sample_count / self.epsilon_decay) # epsilon decays over time; exponential decay is used here
# e-greedy policy
math.exp(-1. * self.sample_count / self.epsilon_decay)
if np.random.uniform(0, 1) > self.epsilon:
action = np.argmax(self.Q_table[str(state)]) # choose the action with the largest Q(s,a)
action = np.argmax(self.Q_table[str(state)]) # choose action corresponding to the maximum q value
else:
action = np.random.choice(self.n_actions) # choose an action at random
action = np.random.choice(self.n_actions) # choose action randomly
return action
def predict(self,state):
''' predict (choose) an action, used during testing
def predict_action(self,state):
''' predict action while testing
'''
action = np.argmax(self.Q_table[str(state)])
return action
def update(self, state, action, reward, next_state, done):
Q_predict = self.Q_table[str(state)][action]
if done: # terminal state
if done: # terminal state
Q_target = reward
else:
Q_target = reward + self.gamma * np.max(self.Q_table[str(next_state)])
self.Q_table[str(state)][action] += self.lr * (Q_target - Q_predict)
def save(self,path):
def save_model(self,path):
import dill
from pathlib import Path
# create path
Path(path).mkdir(parents=True, exist_ok=True)
torch.save(
obj=self.Q_table,
f=path+"Qleaning_model.pkl",
pickle_module=dill
)
print("保存模型成功!")
def load(self, path):
print("Model saved!")
def load_model(self, path):
import dill
self.Q_table =torch.load(f=path+'Qleaning_model.pkl',pickle_module=dill)
print("加载模型成功!")
print("Mode loaded!")

View File

@@ -0,0 +1,27 @@
import torch
import torch.optim as optim
import torch.nn as nn
import numpy as np
class SAC:
def __init__(self,n_actions,models,memory,cfg):
self.n_actions = n_actions
self.memory = memory # replay buffer sampled in update()
self.batch_size = cfg.batch_size # batch size for sampling; assumed to be provided in cfg
self.device = cfg.device
self.value_net = models['ValueNet'].to(self.device) # $\psi$
self.target_value_net = models['ValueNet'].to(self.device) # $\bar{\psi}$
self.soft_q_net = models['SoftQNet'].to(self.device) # $\theta$
self.policy_net = models['PolicyNet'].to(self.device) # $\phi$
self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=cfg.value_lr)
self.soft_q_optimizer = optim.Adam(self.soft_q_net.parameters(), lr=cfg.soft_q_lr)
self.policy_optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.policy_lr)
for target_param, param in zip(self.target_value_net.parameters(), self.value_net.parameters()):
target_param.data.copy_(param.data)
self.value_criterion = nn.MSELoss()
self.soft_q_criterion = nn.MSELoss()
def update(self):
# sample a batch of transitions from replay buffer
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(
self.batch_size)
state_batch = torch.tensor(np.array(state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states)
action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(1) # shape(batchsize,1)
reward_batch = torch.tensor(reward_batch, device=self.device, dtype=torch.float).unsqueeze(1) # shape(batchsize,1)
next_state_batch = torch.tensor(np.array(next_state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states)
done_batch = torch.tensor(np.float32(done_batch), device=self.device).unsqueeze(1) # shape(batchsize,1)
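The update() above breaks off once the batch has been assembled (this agent is evidently still a work in progress). For reference, the value-function variant of soft actor-critic that the four networks here correspond to (value net ψ, target value net ψ̄, soft Q net θ, policy net φ) optimizes objectives roughly as follows; the (1 − d) terminal masking and the sampling of the inner expectations are standard implementation practice rather than anything read from this commit:

```latex
% soft Q-function loss, bootstrapped through the target value network
J_Q(\theta) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1},d_t)\sim\mathcal{D}}
  \Big[\tfrac{1}{2}\big(Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t)\big)^2\Big],
\qquad \hat{Q}(s_t,a_t) = r_t + \gamma\,(1-d_t)\,V_{\bar{\psi}}(s_{t+1})

% value-function loss: soft value = expected Q minus log-probability under the policy
J_V(\psi) = \mathbb{E}_{s_t\sim\mathcal{D}}
  \Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t\sim\pi_\phi}\big[Q_\theta(s_t,a_t) - \log\pi_\phi(a_t\mid s_t)\big]\big)^2\Big]

% policy loss (reparameterized): push the policy toward high soft Q-values
J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\,a_t\sim\pi_\phi}
  \big[\log\pi_\phi(a_t\mid s_t) - Q_\theta(s_t,a_t)\big]
```

After each gradient step the target value network ψ̄ is typically refreshed as a slow exponential moving average of ψ, which is presumably what the parameter-copy loop in __init__ is preparing for.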

View File

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-12 16:02:24
LastEditor: John
LastEditTime: 2022-08-22 17:41:28
LastEditTime: 2022-08-24 10:31:30
Description:
Environment:
'''
@@ -64,14 +64,14 @@ def smooth(data, weight=0.9):
def plot_rewards(rewards,cfg,path=None,tag='train'):
sns.set()
plt.figure() # create a figure instance so several curves can be drawn in one figure
plt.title(f"{tag}ing curve on {cfg.device} of {cfg.algo_name} for {cfg.env_name}")
plt.title(f"{tag}ing curve on {cfg['device']} of {cfg['algo_name']} for {cfg['env_name']}")
plt.xlabel('episodes')
plt.plot(rewards, label='rewards')
plt.plot(smooth(rewards), label='smoothed')
plt.legend()
if cfg.save_fig:
if cfg['save_fig']:
plt.savefig(f"{path}/{tag}ing_curve.png")
if cfg.show_fig:
if cfg['show_fig']:
plt.show()
def plot_losses(losses, algo="DQN", save=True, path='./'):
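plot_rewards above overlays smooth(rewards) on the raw reward curve. The body of smooth is not part of this hunk; a typical exponential-moving-average implementation consistent with its signature would look something like the following sketch (the exact weighting semantics are an assumption):

```python
def smooth(data, weight=0.9):
    ''' exponential moving average, used only to make curves easier to read;
        a weight closer to 1 means heavier smoothing (assumed semantics) '''
    smoothed = []
    last = data[0] if len(data) > 0 else 0.0
    for point in data:
        value = last * weight + (1 - weight) * point  # EMA update
        smoothed.append(value)
        last = value
    return smoothed
```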
@@ -110,12 +110,21 @@ def del_empty_dir(*paths):
if not os.listdir(os.path.join(path, dir)):
os.removedirs(os.path.join(path, dir))
class NpEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, np.integer):
return int(obj)
if isinstance(obj, np.floating):
return float(obj)
if isinstance(obj, np.ndarray):
return obj.tolist()
return json.JSONEncoder.default(self, obj)
def save_args(args,path=None):
# save parameters
args_dict = vars(args)
# save parameters
Path(path).mkdir(parents=True, exist_ok=True)
with open(f"{path}/params.json", 'w') as fp:
json.dump(args_dict, fp)
json.dump(args, fp,cls=NpEncoder)
print("Parameters saved!")
def all_seed(env,seed = 1):
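The NpEncoder added above is what lets save_args dump configs containing numpy scalars or arrays (for example n_states / n_actions derived from the env spaces), which plain json.dump would reject. A standalone check, duplicating only the encoder from this diff:

```python
import json
import numpy as np

class NpEncoder(json.JSONEncoder):
    ''' copy of the encoder in the diff above, for a self-contained check '''
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

cfg = {"n_states": np.int64(16), "n_actions": np.int64(4)}
print(json.dumps(cfg, cls=NpEncoder))  # {"n_states": 16, "n_actions": 4}
```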

View File

@@ -1,6 +1,18 @@
## Environment overview
# Overview of environment notes
## S, A and R at a glance
Note: S, A and R stand for state (S), action (A) and reward (R). In the table below, Reward Range is the range of total reward obtainable in a single episode and Steps is the maximum number of steps per episode in the environment (the snippet after this file's diff shows how to query these properties directly from Gym).
| Environment ID | Observation Space | Action Space | Reward Range | Steps |
| :--------------------------------: | :---------------: | :----------: | :----------: | :------: |
| CartPole-v0 | Box(4,) | Discrete(2) | [0,200] | 200 |
| CartPole-v1 | Box(4,) | Discrete(2) | [0,500] | 500 |
| CliffWalking-v0 | Discrete(48) | Discrete(4) | [-inf,-13] | [13,inf] |
| FrozenLake-v1(*is_slippery*=False) | Discrete(16) | Discrete(4) | 0 or 1 | [6,inf] |
## Environment descriptions
[OpenAI Gym](./gym_info.md)
[MuJoCo](./mujoco_info.md)

View File

@@ -0,0 +1,15 @@
# run DQN on CartPole-v0
# source conda; if you are already in the proper conda environment, comment out the lines below up to "conda activate easyrl"
if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/anaconda3/etc/profile.d/conda.sh"
source ~/anaconda3/etc/profile.d/conda.sh
elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh"
source ~/opt/anaconda3/etc/profile.d/conda.sh
else
echo 'please manually configure the conda source path'
fi
conda activate easyrl # easyrl here can be changed to another name of conda env that you have created
codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path
python $codes_dir/DQN/main.py

View File

@@ -0,0 +1,16 @@
#
# run DQN on CartPole-v1, not finished yet
#
# source conda; if you are already in the proper conda environment, comment out the lines below up to "conda activate easyrl"
if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/anaconda3/etc/profile.d/conda.sh"
source ~/anaconda3/etc/profile.d/conda.sh
elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh"
source ~/opt/anaconda3/etc/profile.d/conda.sh
else
echo 'please manually configure the conda source path'
fi
conda activate easyrl # easyrl here can be changed to another name of conda env that you have created
codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path
python $codes_dir/DQN/main.py --env_name CartPole-v1 --train_eps 500 --epsilon_decay 1000 --memory_capacity 200000 --batch_size 128 --device cuda

View File

@@ -0,0 +1,14 @@
# source conda; if you are already in the proper conda environment, comment out the lines below up to "conda activate easyrl"
if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/anaconda3/etc/profile.d/conda.sh"
source ~/anaconda3/etc/profile.d/conda.sh
elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh"
source ~/opt/anaconda3/etc/profile.d/conda.sh
else
echo 'please manually configure the conda source path'
fi
conda activate easyrl # easyrl here can be changed to another name of conda env that you have created
codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path
python $codes_dir/QLearning/main.py --device cpu

View File

@@ -0,0 +1,14 @@
# source conda; if you are already in the proper conda environment, comment out the lines below up to "conda activate easyrl"
if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/anaconda3/etc/profile.d/conda.sh"
source ~/anaconda3/etc/profile.d/conda.sh
elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then
echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh"
source ~/opt/anaconda3/etc/profile.d/conda.sh
else
echo 'please manually configure the conda source path'
fi
conda activate easyrl # easyrl here can be changed to another name of conda env that you have created
codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path
python $codes_dir/QLearning/main.py --env_name FrozenLake-v1 --train_eps 800 --epsilon_start 0.70 --epsilon_end 0.1 --epsilon_decay 2000 --gamma 0.9 --lr 0.9 --device cpu

Binary file not shown.

Before

Width:  |  Height:  |  Size: 317 KiB

After

Width:  |  Height:  |  Size: 235 KiB

View File

@@ -1,11 +1,8 @@
gym==0.21.0
torch==1.10.0
torchvision==0.11.0
torchaudio==0.10.0
ipykernel==6.15.1
jupyter==1.0.0
matplotlib==3.5.2
seaborn==0.11.2
dill==0.3.5.1
argparse==1.4.0
pandas==1.3.5
pandas==1.3.5