diff --git a/projects/README.md b/projects/README.md
index c8c5d74..bd112d9 100644
--- a/projects/README.md
+++ b/projects/README.md
@@ -23,52 +23,71 @@
 注:点击对应的名称会跳到[codes](./codes/)下对应的算法中,其他版本还请读者自行翻阅
-| 算法名称 | 参考文献 | 环境 | 备注 |
-| :-----------------------: | :----------------------------------------------------------: | :--: | :--: |
-| | | | |
-| DQN-CNN | | | 待更 |
-| [SoftQ](codes/SoftQ) | [Soft Q-learning paper](https://arxiv.org/abs/1702.08165) | | |
-| [SAC](codes/SAC) | [SAC paper](https://arxiv.org/pdf/1812.05905.pdf) | | |
-| [SAC-Discrete](codes/SAC) | [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf) | | |
-| SAC-V | [SAC-V paper](https://arxiv.org/abs/1801.01290) | | |
-| DSAC | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | | 待更 |
+| 算法名称 | 参考文献 | 备注 |
+| :-----------------------: | :----------------------------------------------------------: | :--: |
+| | | |
+| DQN-CNN | | 待更 |
+| [SoftQ](codes/SoftQ) | [Soft Q-learning paper](https://arxiv.org/abs/1702.08165) | |
+| [SAC](codes/SAC) | [SAC paper](https://arxiv.org/pdf/1812.05905.pdf) | |
+| [SAC-Discrete](codes/SAC) | [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf) | |
+| SAC-S | [SAC-S paper](https://arxiv.org/abs/1801.01290) | |
+| DSAC | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | 待更 |
+## 3、算法环境
+算法环境说明请参见[env](./codes/envs/README.md)
-## 3、运行环境
+## 4、运行环境
-Python 3.7、PyTorch 1.10.0、Gym 0.21.0
+主要依赖:Python 3.7、PyTorch 1.10.0、Gym 0.21.0。
-在项目根目录下执行以下命令复现环境:
+### 4.1、创建Conda环境
+```bash
+conda create -n easyrl python=3.7
+conda activate easyrl # 激活环境
+```
+### 4.2、安装Torch
+
+安装CPU版本:
+```bash
+conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cpuonly -c pytorch
+```
+安装CUDA版本:
+```bash
+conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
+```
+如果安装Torch需要镜像加速的话,点击[清华镜像链接](https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/),选择对应的操作系统,如```win-64```,然后复制链接,执行:
+```bash
+conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/win-64/
+```
+也可以使用pip安装(仅限CUDA版本):
+```bash
+pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
+```
+### 4.3、安装其他依赖
+
+在项目根目录下执行:
 ```bash
 pip install -r requirements.txt
 ```
-如果需要使用CUDA,则需另外安装```cudatoolkit```,推荐```10.2```或者```11.3```版本的CUDA,如下:
-```bash
-conda install cudatoolkit=11.3 -c pytorch
-```
-如果conda需要镜像加速安装的话,点击[该清华镜像链接](https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/),选择对应的操作系统,比如```win-64```,然后复制链接,执行如下命令:
-```bash
-conda install cudatoolkit=11.3 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/win-64/
-```
-执行以下Python脚本,如果返回True说明cuda安装成功:
+### 4.4、检验CUDA版本Torch安装
+
+CPU版本Torch可忽略此步。执行如下Python脚本,如果返回True,说明CUDA版本的Torch安装成功:
 ```python
 import torch
 print(torch.cuda.is_available())
 ```
-如果还是不成功,可以使用pip安装:
-```bash
-pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
-```
-## 4、使用说明
+
+## 5、使用说明
 
 对于[codes](./codes/):
 
-* 运行带有task的py脚本
+* 直接运行对应算法目录下的```main.py```脚本
+* 或执行[scripts](./codes/scripts)下对应的Bash脚本,例如```sh codes/scripts/DQN_task0.sh```。推荐创建名为"easyrl"的conda环境,否则需要相应修改sh脚本中的环境名等信息。对于Windows系统,建议安装Git(不要更改默认安装路径,否则VS Code可能不会显示Git Bash终端选项),然后使用Git Bash终端执行,而非PowerShell或者cmd终端。
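+
+其中Bash脚本的内容大致如下(以DQN为例,仅作示意,环境名与路径均为假设,具体请以仓库中scripts目录下的实际脚本为准):
+```bash
+conda activate easyrl # 激活前面创建的Conda环境(假设环境名为easyrl)
+python codes/DQN/main.py # 在项目根目录下运行对应算法的main.py
+```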
对于[Jupyter Notebook](./notebooks/): * 直接运行对应的ipynb文件就行 -## 5、友情说明 +## 6、友情说明 推荐使用VS Code做项目,入门可参考[VSCode上手指南](https://blog.csdn.net/JohnJim0/article/details/126366454) \ No newline at end of file diff --git a/projects/assets/pseudocodes/pseudocodes.aux b/projects/assets/pseudocodes/pseudocodes.aux index 67d3e69..403c058 100644 --- a/projects/assets/pseudocodes/pseudocodes.aux +++ b/projects/assets/pseudocodes/pseudocodes.aux @@ -28,6 +28,8 @@ \@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{6}{algorithm.}\protected@file@percent } \@writefile{toc}{\contentsline {section}{\numberline {6}SoftQ算法}{7}{section.6}\protected@file@percent } \@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{7}{algorithm.}\protected@file@percent } -\@writefile{toc}{\contentsline {section}{\numberline {7}SAC算法}{8}{section.7}\protected@file@percent } +\@writefile{toc}{\contentsline {section}{\numberline {7}SAC-S算法}{8}{section.7}\protected@file@percent } \@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{8}{algorithm.}\protected@file@percent } -\gdef \@abspage@last{8} +\@writefile{toc}{\contentsline {section}{\numberline {8}SAC算法}{9}{section.8}\protected@file@percent } +\@writefile{loa}{\contentsline {algorithm}{\numberline {}{\ignorespaces }}{9}{algorithm.}\protected@file@percent } +\gdef \@abspage@last{9} diff --git a/projects/assets/pseudocodes/pseudocodes.log b/projects/assets/pseudocodes/pseudocodes.log index 865cabe..096a0ed 100644 --- a/projects/assets/pseudocodes/pseudocodes.log +++ b/projects/assets/pseudocodes/pseudocodes.log @@ -1,4 +1,4 @@ -This is XeTeX, Version 3.141592653-2.6-0.999993 (TeX Live 2021) (preloaded format=xelatex 2021.8.22) 22 AUG 2022 16:54 +This is XeTeX, Version 3.141592653-2.6-0.999993 (TeX Live 2021) (preloaded format=xelatex 2021.8.22) 23 AUG 2022 19:26 entering extended mode restricted \write18 enabled. file:line:error style messages enabled. @@ -415,85 +415,85 @@ Package: titlesec 2019/10/16 v2.13 Sectioning titles ) (./pseudocodes.aux) \openout1 = `pseudocodes.aux'. -LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for TU/lmr/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for PD1/pdf/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. -LaTeX Font Info: Checking defaults for PU/pdf/m/n on input line 13. -LaTeX Font Info: ... okay on input line 13. +LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 14. +LaTeX Font Info: ... 
okay on input line 14. +LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for TS1/cmr/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for TU/lmr/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for PD1/pdf/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. +LaTeX Font Info: Checking defaults for PU/pdf/m/n on input line 14. +LaTeX Font Info: ... okay on input line 14. ABD: EverySelectfont initializing macros -LaTeX Info: Redefining \selectfont on input line 13. +LaTeX Info: Redefining \selectfont on input line 14. Package fontspec Info: Adjusting the maths setup (use [no-math] to avoid (fontspec) this). \symlegacymaths=\mathgroup6 LaTeX Font Info: Overwriting symbol font `legacymaths' in version `bold' -(Font) OT1/cmr/m/n --> OT1/cmr/bx/n on input line 13. -LaTeX Font Info: Redeclaring math accent \acute on input line 13. -LaTeX Font Info: Redeclaring math accent \grave on input line 13. -LaTeX Font Info: Redeclaring math accent \ddot on input line 13. -LaTeX Font Info: Redeclaring math accent \tilde on input line 13. -LaTeX Font Info: Redeclaring math accent \bar on input line 13. -LaTeX Font Info: Redeclaring math accent \breve on input line 13. -LaTeX Font Info: Redeclaring math accent \check on input line 13. -LaTeX Font Info: Redeclaring math accent \hat on input line 13. -LaTeX Font Info: Redeclaring math accent \dot on input line 13. -LaTeX Font Info: Redeclaring math accent \mathring on input line 13. -LaTeX Font Info: Redeclaring math symbol \Gamma on input line 13. -LaTeX Font Info: Redeclaring math symbol \Delta on input line 13. -LaTeX Font Info: Redeclaring math symbol \Theta on input line 13. -LaTeX Font Info: Redeclaring math symbol \Lambda on input line 13. -LaTeX Font Info: Redeclaring math symbol \Xi on input line 13. -LaTeX Font Info: Redeclaring math symbol \Pi on input line 13. -LaTeX Font Info: Redeclaring math symbol \Sigma on input line 13. -LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 13. -LaTeX Font Info: Redeclaring math symbol \Phi on input line 13. -LaTeX Font Info: Redeclaring math symbol \Psi on input line 13. -LaTeX Font Info: Redeclaring math symbol \Omega on input line 13. -LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 13. -LaTeX Font Info: Redeclaring symbol font `operators' on input line 13. +(Font) OT1/cmr/m/n --> OT1/cmr/bx/n on input line 14. +LaTeX Font Info: Redeclaring math accent \acute on input line 14. +LaTeX Font Info: Redeclaring math accent \grave on input line 14. +LaTeX Font Info: Redeclaring math accent \ddot on input line 14. +LaTeX Font Info: Redeclaring math accent \tilde on input line 14. +LaTeX Font Info: Redeclaring math accent \bar on input line 14. +LaTeX Font Info: Redeclaring math accent \breve on input line 14. +LaTeX Font Info: Redeclaring math accent \check on input line 14. +LaTeX Font Info: Redeclaring math accent \hat on input line 14. +LaTeX Font Info: Redeclaring math accent \dot on input line 14. +LaTeX Font Info: Redeclaring math accent \mathring on input line 14. +LaTeX Font Info: Redeclaring math symbol \Gamma on input line 14. 
+LaTeX Font Info: Redeclaring math symbol \Delta on input line 14. +LaTeX Font Info: Redeclaring math symbol \Theta on input line 14. +LaTeX Font Info: Redeclaring math symbol \Lambda on input line 14. +LaTeX Font Info: Redeclaring math symbol \Xi on input line 14. +LaTeX Font Info: Redeclaring math symbol \Pi on input line 14. +LaTeX Font Info: Redeclaring math symbol \Sigma on input line 14. +LaTeX Font Info: Redeclaring math symbol \Upsilon on input line 14. +LaTeX Font Info: Redeclaring math symbol \Phi on input line 14. +LaTeX Font Info: Redeclaring math symbol \Psi on input line 14. +LaTeX Font Info: Redeclaring math symbol \Omega on input line 14. +LaTeX Font Info: Redeclaring math symbol \mathdollar on input line 14. +LaTeX Font Info: Redeclaring symbol font `operators' on input line 14. LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font -(Font) `operators' in the math version `normal' on input line 13. +(Font) `operators' in the math version `normal' on input line 14. LaTeX Font Info: Overwriting symbol font `operators' in version `normal' -(Font) OT1/cmr/m/n --> TU/lmr/m/n on input line 13. +(Font) OT1/cmr/m/n --> TU/lmr/m/n on input line 14. LaTeX Font Info: Encoding `OT1' has changed to `TU' for symbol font -(Font) `operators' in the math version `bold' on input line 13. +(Font) `operators' in the math version `bold' on input line 14. LaTeX Font Info: Overwriting symbol font `operators' in version `bold' -(Font) OT1/cmr/bx/n --> TU/lmr/m/n on input line 13. +(Font) OT1/cmr/bx/n --> TU/lmr/m/n on input line 14. LaTeX Font Info: Overwriting symbol font `operators' in version `normal' -(Font) TU/lmr/m/n --> TU/lmr/m/n on input line 13. +(Font) TU/lmr/m/n --> TU/lmr/m/n on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathit' in version `normal' -(Font) OT1/cmr/m/it --> TU/lmr/m/it on input line 13. +(Font) OT1/cmr/m/it --> TU/lmr/m/it on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathbf' in version `normal' -(Font) OT1/cmr/bx/n --> TU/lmr/b/n on input line 13. +(Font) OT1/cmr/bx/n --> TU/lmr/b/n on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `normal' -(Font) OT1/cmss/m/n --> TU/lmss/m/n on input line 13. +(Font) OT1/cmss/m/n --> TU/lmss/m/n on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `normal' -(Font) OT1/cmtt/m/n --> TU/lmtt/m/n on input line 13. +(Font) OT1/cmtt/m/n --> TU/lmtt/m/n on input line 14. LaTeX Font Info: Overwriting symbol font `operators' in version `bold' -(Font) TU/lmr/m/n --> TU/lmr/b/n on input line 13. +(Font) TU/lmr/m/n --> TU/lmr/b/n on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathit' in version `bold' -(Font) OT1/cmr/bx/it --> TU/lmr/b/it on input line 13. +(Font) OT1/cmr/bx/it --> TU/lmr/b/it on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathsf' in version `bold' -(Font) OT1/cmss/bx/n --> TU/lmss/b/n on input line 13. +(Font) OT1/cmss/bx/n --> TU/lmss/b/n on input line 14. LaTeX Font Info: Overwriting math alphabet `\mathtt' in version `bold' -(Font) OT1/cmtt/m/n --> TU/lmtt/b/n on input line 13. -Package hyperref Info: Link coloring OFF on input line 13. +(Font) OT1/cmtt/m/n --> TU/lmtt/b/n on input line 14. +Package hyperref Info: Link coloring OFF on input line 14. 
(/usr/local/texlive/2021/texmf-dist/tex/latex/hyperref/nameref.sty Package: nameref 2021-04-02 v2.47 Cross-referencing by name of section (/usr/local/texlive/2021/texmf-dist/tex/latex/refcount/refcount.sty @@ -503,9 +503,9 @@ Package: gettitlestring 2019/12/15 v1.6 Cleanup title references (HO) ) \c@section@level=\count313 ) -LaTeX Info: Redefining \ref on input line 13. -LaTeX Info: Redefining \pageref on input line 13. -LaTeX Info: Redefining \nameref on input line 13. +LaTeX Info: Redefining \ref on input line 14. +LaTeX Info: Redefining \pageref on input line 14. +LaTeX Info: Redefining \nameref on input line 14. (./pseudocodes.out) (./pseudocodes.out) \@outlinefile=\write3 \openout3 = `pseudocodes.out'. @@ -515,19 +515,19 @@ LaTeX Info: Redefining \nameref on input line 13. \openout4 = `pseudocodes.toc'. LaTeX Font Info: Font shape `TU/SongtiSCLight(0)/m/sl' in size <10.95> not available -(Font) Font shape `TU/SongtiSCLight(0)/m/it' tried instead on input line 16. +(Font) Font shape `TU/SongtiSCLight(0)/m/it' tried instead on input line 17. [1 ] -Package hyperref Info: bookmark level for unknown algorithm defaults to 0 on input line 21. +Package hyperref Info: bookmark level for unknown algorithm defaults to 0 on input line 22. [2 ] -LaTeX Font Info: Trying to load font information for U+msa on input line 31. +LaTeX Font Info: Trying to load font information for U+msa on input line 32. (/usr/local/texlive/2021/texmf-dist/tex/latex/amsfonts/umsa.fd File: umsa.fd 2013/01/14 v3.01 AMS symbols A ) -LaTeX Font Info: Trying to load font information for U+msb on input line 31. +LaTeX Font Info: Trying to load font information for U+msb on input line 32. (/usr/local/texlive/2021/texmf-dist/tex/latex/amsfonts/umsb.fd File: umsb.fd 2013/01/14 v3.01 AMS symbols B ) [3 @@ -536,38 +536,35 @@ File: umsb.fd 2013/01/14 v3.01 AMS symbols B ] [5 -] -Underfull \hbox (badness 10000) in paragraph at lines 111--112 -[] []\TU/SongtiSCLight(0)/m/n/10.95 计 算 实 际 的 $\OML/cmm/m/it/10.95 Q$ \TU/SongtiSCLight(0)/m/n/10.95 值,| 即 $\OML/cmm/m/it/10.95 y[] \OT1/cmr/m/n/10.95 = - [] - -[6 +] [6 ] [7 +] [8 + ] -Overfull \hbox (32.54117pt too wide) in paragraph at lines 183--183 +Overfull \hbox (32.54117pt too wide) in paragraph at lines 212--212 [][]$[]\OML/cmm/m/it/9 J[]\OT1/cmr/m/n/9 (\OML/cmm/m/it/9 ^^R\OT1/cmr/m/n/9 ) = \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 Q[] [] []$| [] -Overfull \hbox (15.41673pt too wide) in paragraph at lines 184--184 +Overfull \hbox (15.41673pt too wide) in paragraph at lines 213--213 [][]$[]\OML/cmm/m/it/9 J[]\OT1/cmr/m/n/9 (\OML/cmm/m/it/9 ^^^\OT1/cmr/m/n/9 ) = \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 ^^K [] [] \OT1/cmr/m/n/9 + [] \OMS/cmsy/m/n/9 r[]\OML/cmm/m/it/9 f[] []$\TU/lmr/m/n/9 ,$[][] \OT1/cmr/m/n/9 = [] -[8 +[9 ] (./pseudocodes.aux) Package rerunfilecheck Info: File `pseudocodes.out' has not changed. -(rerunfilecheck) Checksum: 4575BA7458AA23D6E696EFFE39D05727;640. +(rerunfilecheck) Checksum: 35B5A79A86EF3BC70F1A0B3BCBEBAA13;724. 
) Here is how much of TeX's memory you used: - 14813 strings out of 476919 - 312635 string characters out of 5821840 - 653471 words of memory out of 5000000 - 34563 multiletter control sequences out of 15000+600000 - 413601 words of font info for 90 fonts, out of 8000000 for 9000 + 14827 strings out of 476919 + 313456 string characters out of 5821840 + 653576 words of memory out of 5000000 + 34576 multiletter control sequences out of 15000+600000 + 413609 words of font info for 91 fonts, out of 8000000 for 9000 1348 hyphenation exceptions out of 8191 - 101i,13n,104p,676b,736s stack positions out of 5000i,500n,10000p,200000b,80000s + 101i,13n,104p,676b,697s stack positions out of 5000i,500n,10000p,200000b,80000s -Output written on pseudocodes.pdf (8 pages). +Output written on pseudocodes.pdf (9 pages). diff --git a/projects/assets/pseudocodes/pseudocodes.out b/projects/assets/pseudocodes/pseudocodes.out index de023a5..38f7e61 100644 --- a/projects/assets/pseudocodes/pseudocodes.out +++ b/projects/assets/pseudocodes/pseudocodes.out @@ -4,4 +4,5 @@ \BOOKMARK [1][-]{section.4}{\376\377\000P\000o\000l\000i\000c\000y\000\040\000G\000r\000a\000d\000i\000e\000n\000t\173\227\154\325}{}% 4 \BOOKMARK [1][-]{section.5}{\376\377\000D\000Q\000N\173\227\154\325}{}% 5 \BOOKMARK [1][-]{section.6}{\376\377\000S\000o\000f\000t\000Q\173\227\154\325}{}% 6 -\BOOKMARK [1][-]{section.7}{\376\377\000S\000A\000C\173\227\154\325}{}% 7 +\BOOKMARK [1][-]{section.7}{\376\377\000S\000A\000C\000-\000S\173\227\154\325}{}% 7 +\BOOKMARK [1][-]{section.8}{\376\377\000S\000A\000C\173\227\154\325}{}% 8 diff --git a/projects/assets/pseudocodes/pseudocodes.pdf b/projects/assets/pseudocodes/pseudocodes.pdf index e1852d6..c476940 100644 Binary files a/projects/assets/pseudocodes/pseudocodes.pdf and b/projects/assets/pseudocodes/pseudocodes.pdf differ diff --git a/projects/assets/pseudocodes/pseudocodes.synctex.gz b/projects/assets/pseudocodes/pseudocodes.synctex.gz index 07d025c..4e7cc3f 100644 Binary files a/projects/assets/pseudocodes/pseudocodes.synctex.gz and b/projects/assets/pseudocodes/pseudocodes.synctex.gz differ diff --git a/projects/assets/pseudocodes/pseudocodes.tex b/projects/assets/pseudocodes/pseudocodes.tex index 3cc47ab..d48ab65 100644 --- a/projects/assets/pseudocodes/pseudocodes.tex +++ b/projects/assets/pseudocodes/pseudocodes.tex @@ -10,6 +10,7 @@ \usepackage{titlesec} \usepackage{float} % 调用该包能够使用[H] % \pagestyle{plain} % 去除页眉,但是保留页脚编号,都去掉plain换empty + \begin{document} \tableofcontents % 目录,注意要运行两下或者vscode保存两下才能显示 % \singlespacing @@ -88,7 +89,7 @@ \clearpage \section{DQN算法} \begin{algorithm}[H] % [H]固定位置 - \floatname{algorithm}{{DQN算法}} + \floatname{algorithm}{{DQN算法}{\hypersetup{linkcolor=white}\footnotemark}} \renewcommand{\thealgorithm}{} % 去掉算法标号 \caption{} \renewcommand{\algorithmicrequire}{\textbf{输入:}} @@ -108,13 +109,17 @@ \STATE 更新环境状态$s_{t+1} \leftarrow s_t$ \STATE {\bfseries 更新策略:} \STATE 从$D$中采样一个batch的transition - \STATE 计算实际的$Q$值,即$y_{j}= \begin{cases}r_{j} & \text {对于终止状态} s_{j+1} \\ r_{j}+\gamma \max _{a^{\prime}} Q\left(s_{j+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{j+1}\end{cases}$ - \STATE 对损失 $\left(y_{j}-Q\left(s_{j}, a_{j} ; \theta\right)\right)^{2}$关于参数$\theta$做随机梯度下降 + \STATE 计算实际的$Q$值,即$y_{j}${\hypersetup{linkcolor=white}\footnotemark} + \STATE 对损失 $L(\theta)=\left(y_{i}-Q\left(s_{i}, a_{i} ; \theta\right)\right)^{2}$关于参数$\theta$做随机梯度下降{\hypersetup{linkcolor=white}\footnotemark} \ENDFOR - \STATE 每$C$个回合复制参数$\hat{Q}\leftarrow Q$(此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定) + \STATE 
每$C$个回合复制参数$\hat{Q}\leftarrow Q${\hypersetup{linkcolor=white}\footnotemark} \ENDFOR \end{algorithmic} \end{algorithm} +\footnotetext[1]{Playing Atari with Deep Reinforcement Learning} +\footnotetext[2]{$y_{i}= \begin{cases}r_{i} & \text {对于终止状态} s_{i+1} \\ r_{i}+\gamma \max _{a^{\prime}} Q\left(s_{i+1}, a^{\prime} ; \theta\right) & \text {对于非终止状态} s_{i+1}\end{cases}$} +\footnotetext[3]{$\theta_i \leftarrow \theta_i - \lambda \nabla_{\theta_{i}} L_{i}\left(\theta_{i}\right)$} +\footnotetext[4]{此处也可像原论文中放到小循环中改成每$C$步,但没有每$C$个回合稳定} \clearpage \section{SoftQ算法} @@ -153,13 +158,37 @@ \footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\mathbf{s}_{t} \sim q_{\mathbf{s}_{t}}, \mathbf{a}_{t} \sim q_{\mathbf{a}_{t}}}\left[\frac{1}{2}\left(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$} \footnotetext[3]{$\begin{aligned} \Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)=& \mathbb{E}_{\mathbf{a}_{t} \sim \pi^{\phi}}\left[\left.\kappa\left(\mathbf{a}_{t}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right) \nabla_{\mathbf{a}^{\prime}} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right.\\ &\left.+\left.\alpha \nabla_{\mathbf{a}^{\prime}} \kappa\left(\mathbf{a}^{\prime}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right)\right|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right] \end{aligned}$} \clearpage +\section{SAC-S算法} +\begin{algorithm}[H] % [H]固定位置 + \floatname{algorithm}{{SAC-S算法}\footnotemark[1]} + \renewcommand{\thealgorithm}{} % 去掉算法标号 + \caption{} + \begin{algorithmic}[1] % [1]显示步数 + \STATE 初始化参数$\psi, \bar{\psi}, \theta, \phi$ + \FOR {回合数 = $1,M$} + \FOR {时步 = $1,t$} + \STATE 根据$\boldsymbol{a}_{t} \sim \pi_{\phi}\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t}\right)$采样动作$a_t$ + \STATE 环境反馈奖励和下一个状态,$\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ + \STATE 存储transition到经验回放中,$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$ + \STATE 更新环境状态$s_{t+1} \leftarrow s_t$ + \STATE {\bfseries 更新策略:} + \STATE $\psi \leftarrow \psi-\lambda_{V} \hat{\nabla}_{\psi} J_{V}(\psi)$ + \STATE $\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$ + \STATE $\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ + \STATE $\bar{\psi} \leftarrow \tau \psi+(1-\tau) \bar{\psi}$ + \ENDFOR + \ENDFOR + \end{algorithmic} +\end{algorithm} +\footnotetext[1]{Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor} +\clearpage \section{SAC算法} \begin{algorithm}[H] % [H]固定位置 - \floatname{algorithm}{{Soft Actor Critic算法}} + \floatname{algorithm}{{SAC算法}\footnotemark[1]} \renewcommand{\thealgorithm}{} % 去掉算法标号 \caption{} \begin{algorithmic}[1] - \STATE 初始化两个Actor的网络参数$\theta_1,\theta_2$以及一个Critic网络参数$\phi$ % 初始化 + \STATE 初始化网络参数$\theta_1,\theta_2$以及$\phi$ % 初始化 \STATE 复制参数到目标网络$\bar{\theta_1} \leftarrow \theta_1,\bar{\theta_2} \leftarrow \theta_2,$ \STATE 初始化经验回放$D$ \FOR {回合数 = $1,M$} @@ -170,18 +199,18 @@ \STATE 存储transition到经验回放中,$\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_{t}, \mathbf{a}_{t}, r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right), \mathbf{s}_{t+1}\right)\right\}$ \STATE 更新环境状态$s_{t+1} \leftarrow s_t$ \STATE {\bfseries 更新策略:} - \STATE 
更新$Q$函数,$\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$\footnotemark[1]\footnotemark[2]
-    \STATE 更新策略权重,$\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ \footnotemark[3]
-    \STATE 调整temperature,$\alpha \leftarrow \alpha-\lambda \hat{\nabla}_{\alpha} J(\alpha)$ \footnotemark[4]
+    \STATE 更新$Q$函数,$\theta_{i} \leftarrow \theta_{i}-\lambda_{Q} \hat{\nabla}_{\theta_{i}} J_{Q}\left(\theta_{i}\right)$ for $i \in\{1,2\}$\footnotemark[2]\footnotemark[3]
+    \STATE 更新策略权重,$\phi \leftarrow \phi-\lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$ \footnotemark[4]
+    \STATE 调整temperature,$\alpha \leftarrow \alpha-\lambda \hat{\nabla}_{\alpha} J(\alpha)$ \footnotemark[5]
     \STATE 更新目标网络权重,$\bar{\theta}_{i} \leftarrow \tau \theta_{i}+(1-\tau) \bar{\theta}_{i}$ for $i \in\{1,2\}$
     \ENDFOR
     \ENDFOR
-    \end{algorithmic}
-    
+    \end{algorithmic}
 \end{algorithm}
-\footnotetext[1]{$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\right)^{2}\right]$}
-\footnotetext[2]{$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma\left(Q_{\bar{\theta}}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)-\alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t+1} \mid \mathbf{s}_{t+1}\right)\right)\right)\right)\right.$}
-\footnotetext[3]{$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)+\left(\nabla_{\mathbf{a}_{t}} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$,$\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$}
-\footnotetext[4]{$J(\alpha)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{t}}\left[-\alpha \log \pi_{t}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\alpha \overline{\mathcal{H}}\right]$}
+\footnotetext[1]{Soft Actor-Critic Algorithms and Applications}
+\footnotetext[2]{$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\right)^{2}\right]$}
+\footnotetext[3]{$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma\left(Q_{\bar{\theta}}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)-\alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t+1} \mid \mathbf{s}_{t+1}\right)\right)\right)\right)\right.$}
+\footnotetext[4]{$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)+\left(\nabla_{\mathbf{a}_{t}} \alpha \log \left(\pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t},
\mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$,$\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$} +\footnotetext[5]{$J(\alpha)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{t}}\left[-\alpha \log \pi_{t}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\alpha \overline{\mathcal{H}}\right]$} \clearpage \end{document} \ No newline at end of file diff --git a/projects/assets/pseudocodes/pseudocodes.toc b/projects/assets/pseudocodes/pseudocodes.toc index 85e0526..e33ad0b 100644 --- a/projects/assets/pseudocodes/pseudocodes.toc +++ b/projects/assets/pseudocodes/pseudocodes.toc @@ -4,4 +4,5 @@ \contentsline {section}{\numberline {4}Policy Gradient算法}{5}{section.4}% \contentsline {section}{\numberline {5}DQN算法}{6}{section.5}% \contentsline {section}{\numberline {6}SoftQ算法}{7}{section.6}% -\contentsline {section}{\numberline {7}SAC算法}{8}{section.7}% +\contentsline {section}{\numberline {7}SAC-S算法}{8}{section.7}% +\contentsline {section}{\numberline {8}SAC算法}{9}{section.8}% diff --git a/projects/codes/DQN/dqn.py b/projects/codes/DQN/dqn.py index de071cc..ecfcc36 100644 --- a/projects/codes/DQN/dqn.py +++ b/projects/codes/DQN/dqn.py @@ -5,7 +5,7 @@ @Email: johnjim0816@gmail.com @Date: 2020-06-12 00:50:49 @LastEditor: John -LastEditTime: 2022-08-18 14:27:18 +LastEditTime: 2022-08-23 23:59:54 @Discription: @Environment: python 3.7.7 ''' @@ -20,26 +20,26 @@ import math import numpy as np class DQN: - def __init__(self,n_actions,model,memory,cfg): + def __init__(self,model,memory,cfg): - self.n_actions = n_actions - self.device = torch.device(cfg.device) - self.gamma = cfg.gamma + self.n_actions = cfg['n_actions'] + self.device = torch.device(cfg['device']) + self.gamma = cfg['gamma'] ## e-greedy parameters self.sample_count = 0 # sample count for epsilon decay - self.epsilon = cfg.epsilon_start + self.epsilon = cfg['epsilon_start'] self.sample_count = 0 - self.epsilon_start = cfg.epsilon_start - self.epsilon_end = cfg.epsilon_end - self.epsilon_decay = cfg.epsilon_decay - self.batch_size = cfg.batch_size + self.epsilon_start = cfg['epsilon_start'] + self.epsilon_end = cfg['epsilon_end'] + self.epsilon_decay = cfg['epsilon_decay'] + self.batch_size = cfg['batch_size'] self.policy_net = model.to(self.device) self.target_net = model.to(self.device) ## copy parameters from policy net to target net for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()): target_param.data.copy_(param.data) # self.target_net.load_state_dict(self.policy_net.state_dict()) # or use this to copy parameters - self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr) + self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg['lr']) self.memory = memory self.update_flag = False @@ -71,7 +71,7 @@ class DQN: return else: if not self.update_flag: - print("begin to update!") + print("Begin to update!") self.update_flag = True # sample a batch of transitions from replay buffer state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample( diff --git a/projects/codes/DQN/task0.py b/projects/codes/DQN/main.py similarity index 68% rename from projects/codes/DQN/task0.py rename to projects/codes/DQN/main.py index 5a6f9a5..ecf281d 100644 --- a/projects/codes/DQN/task0.py +++ b/projects/codes/DQN/main.py @@ -1,4 +1,5 @@ import sys,os +os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized." 
curr_path = os.path.dirname(os.path.abspath(__file__)) # current path parent_path = os.path.dirname(curr_path) # parent path sys.path.append(parent_path) # add path to system path @@ -26,7 +27,7 @@ def get_args(): parser.add_argument('--gamma',default=0.95,type=float,help="discounted factor") parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon") parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon") - parser.add_argument('--epsilon_decay',default=500,type=int,help="decay rate of epsilon") + parser.add_argument('--epsilon_decay',default=500,type=int,help="decay rate of epsilon, the higher value, the slower decay") parser.add_argument('--lr',default=0.0001,type=float,help="learning rate") parser.add_argument('--memory_capacity',default=100000,type=int,help="memory capacity") parser.add_argument('--batch_size',default=64,type=int) @@ -34,37 +35,40 @@ def get_args(): parser.add_argument('--hidden_dim',default=256,type=int) parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") parser.add_argument('--seed',default=10,type=int,help="seed") + parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not") + parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not") + # please manually change the following args in this script if you want parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \ '/' + curr_time + '/results' ) parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \ - '/' + curr_time + '/models' ) - parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not") - parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not") - args = parser.parse_args() + '/' + curr_time + '/models' ) + args = parser.parse_args() + args = {**vars(args)} # type(dict) return args def env_agent_config(cfg): ''' create env and agent ''' - env = gym.make(cfg.env_name) # create env - if cfg.seed !=0: # set random seed - all_seed(env,seed=cfg.seed) + env = gym.make(cfg['env_name']) # create env + if cfg['seed'] !=0: # set random seed + all_seed(env,seed=cfg["seed"]) n_states = env.observation_space.shape[0] # state dimension n_actions = env.action_space.n # action dimension - print(f"state dim: {n_states}, action dim: {n_actions}") - model = MLP(n_states,n_actions,hidden_dim=cfg.hidden_dim) - memory = ReplayBuffer(cfg.memory_capacity) # replay buffer - agent = DQN(n_actions,model,memory,cfg) # create agent + print(f"n_states: {n_states}, n_actions: {n_actions}") + cfg.update({"n_states":n_states,"n_actions":n_actions}) # update to cfg paramters + model = MLP(n_states,n_actions,hidden_dim=cfg["hidden_dim"]) + memory = ReplayBuffer(cfg["memory_capacity"]) # replay buffer + agent = DQN(model,memory,cfg) # create agent return env, agent def train(cfg, env, agent): ''' 训练 ''' - print("start training!") - print(f"Env: {cfg.env_name}, Algo: {cfg.algo_name}, Device: {cfg.device}") + print("Start training!") + print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}") rewards = [] # record rewards for all episodes steps = [] - for i_ep in range(cfg.train_eps): + for i_ep in range(cfg["train_eps"]): ep_reward = 0 # reward per episode ep_step = 0 state = env.reset() # reset and obtain initial state @@ -79,23 +83,23 @@ def train(cfg, env, agent): ep_reward += reward # if done: break - if (i_ep + 1) % 
cfg.target_update == 0: # target net update, target_update means "C" in pseucodes + if (i_ep + 1) % cfg["target_update"] == 0: # target net update, target_update means "C" in pseucodes agent.target_net.load_state_dict(agent.policy_net.state_dict()) steps.append(ep_step) rewards.append(ep_reward) if (i_ep + 1) % 10 == 0: - print(f'Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}: Epislon: {agent.epsilon:.3f}') - print("finish training!") + print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}: Epislon: {agent.epsilon:.3f}') + print("Finish training!") env.close() res_dic = {'episodes':range(len(rewards)),'rewards':rewards} return res_dic def test(cfg, env, agent): - print("start testing!") - print(f"Env: {cfg.env_name}, Algo: {cfg.algo_name}, Device: {cfg.device}") + print("Start testing!") + print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}") rewards = [] # record rewards for all episodes steps = [] - for i_ep in range(cfg.test_eps): + for i_ep in range(cfg['test_eps']): ep_reward = 0 # reward per episode ep_step = 0 state = env.reset() # reset and obtain initial state @@ -109,8 +113,8 @@ def test(cfg, env, agent): break steps.append(ep_step) rewards.append(ep_reward) - print(f'Episode: {i_ep+1}/{cfg.test_eps},Reward: {ep_reward:.2f}') - print("finish testing!") + print(f"Episode: {i_ep+1}/{cfg['test_eps']},Reward: {ep_reward:.2f}") + print("Finish testing!") env.close() return {'episodes':range(len(rewards)),'rewards':rewards} @@ -120,14 +124,14 @@ if __name__ == "__main__": # training env, agent = env_agent_config(cfg) res_dic = train(cfg, env, agent) - save_args(cfg,path = cfg.result_path) # save parameters - agent.save_model(path = cfg.model_path) # save models - save_results(res_dic, tag = 'train', path = cfg.result_path) # save results - plot_rewards(res_dic['rewards'], cfg, path = cfg.result_path,tag = "train") # plot results + save_args(cfg,path = cfg['result_path']) # save parameters + agent.save_model(path = cfg['model_path']) # save models + save_results(res_dic, tag = 'train', path = cfg['result_path']) # save results + plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "train") # plot results # testing env, agent = env_agent_config(cfg) # create new env for testing, sometimes can ignore this step - agent.load_model(path = cfg.model_path) # load model + agent.load_model(path = cfg['model_path']) # load model res_dic = test(cfg, env, agent) save_results(res_dic, tag='test', - path = cfg.result_path) - plot_rewards(res_dic['rewards'], cfg, path = cfg.result_path,tag = "test") + path = cfg['result_path']) + plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "test") diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/models/checkpoint.pt b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/models/checkpoint.pt new file mode 100644 index 0000000..c27c438 Binary files /dev/null and b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/models/checkpoint.pt differ diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/params.json b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/params.json new file mode 100644 index 0000000..0501731 --- /dev/null +++ b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/params.json @@ -0,0 +1 @@ +{"algo_name": "DQN", "env_name": "Acrobot-v1", "train_eps": 100, "test_eps": 20, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 1500, "lr": 0.002, "memory_capacity": 
200000, "batch_size": 128, "target_update": 4, "hidden_dim": 256, "device": "cuda", "seed": 10, "show_fig": false, "save_fig": true, "result_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/Acrobot-v1/20220824-124401/results", "model_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/Acrobot-v1/20220824-124401/models", "n_states": 6, "n_actions": 3} \ No newline at end of file diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_curve.png b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_curve.png new file mode 100644 index 0000000..067e301 Binary files /dev/null and b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_curve.png differ diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_results.csv b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_results.csv new file mode 100644 index 0000000..65499c3 --- /dev/null +++ b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/testing_results.csv @@ -0,0 +1,21 @@ +episodes,rewards +0,-79.0 +1,-113.0 +2,-81.0 +3,-132.0 +4,-110.0 +5,-114.0 +6,-80.0 +7,-101.0 +8,-78.0 +9,-91.0 +10,-107.0 +11,-87.0 +12,-105.0 +13,-91.0 +14,-128.0 +15,-132.0 +16,-119.0 +17,-77.0 +18,-89.0 +19,-134.0 diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_curve.png b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_curve.png new file mode 100644 index 0000000..9dbeb09 Binary files /dev/null and b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_curve.png differ diff --git a/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_results.csv b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_results.csv new file mode 100644 index 0000000..c5afa37 --- /dev/null +++ b/projects/codes/DQN/outputs/Acrobot-v1/20220824-124401/results/training_results.csv @@ -0,0 +1,101 @@ +episodes,rewards +0,-500.0 +1,-500.0 +2,-500.0 +3,-370.0 +4,-449.0 +5,-500.0 +6,-312.0 +7,-374.0 +8,-180.0 +9,-154.0 +10,-137.0 +11,-185.0 +12,-135.0 +13,-302.0 +14,-146.0 +15,-137.0 +16,-119.0 +17,-149.0 +18,-217.0 +19,-191.0 +20,-157.0 +21,-166.0 +22,-138.0 +23,-135.0 +24,-182.0 +25,-130.0 +26,-175.0 +27,-222.0 +28,-133.0 +29,-108.0 +30,-250.0 +31,-119.0 +32,-135.0 +33,-148.0 +34,-194.0 +35,-194.0 +36,-186.0 +37,-131.0 +38,-185.0 +39,-79.0 +40,-129.0 +41,-271.0 +42,-117.0 +43,-159.0 +44,-156.0 +45,-117.0 +46,-158.0 +47,-153.0 +48,-119.0 +49,-164.0 +50,-134.0 +51,-231.0 +52,-117.0 +53,-119.0 +54,-136.0 +55,-173.0 +56,-202.0 +57,-133.0 +58,-142.0 +59,-169.0 +60,-137.0 +61,-123.0 +62,-205.0 +63,-107.0 +64,-194.0 +65,-150.0 +66,-143.0 +67,-218.0 +68,-145.0 +69,-90.0 +70,-107.0 +71,-169.0 +72,-125.0 +73,-142.0 +74,-145.0 +75,-94.0 +76,-150.0 +77,-134.0 +78,-159.0 +79,-137.0 +80,-146.0 +81,-191.0 +82,-242.0 +83,-117.0 +84,-92.0 +85,-193.0 +86,-239.0 +87,-173.0 +88,-140.0 +89,-157.0 +90,-133.0 +91,-148.0 +92,-87.0 +93,-398.0 +94,-98.0 +95,-121.0 +96,-102.0 +97,-120.0 +98,-195.0 +99,-219.0 diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/models/checkpoint.pt b/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/models/checkpoint.pt deleted file mode 100644 index c7455d1..0000000 Binary files a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/models/checkpoint.pt and /dev/null differ diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/params.json 
b/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/params.json deleted file mode 100644 index dbd1ff7..0000000 --- a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/params.json +++ /dev/null @@ -1 +0,0 @@ -{"algo_name": "DQN", "env_name": "CartPole-v0", "train_eps": 200, "test_eps": 20, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 500, "lr": 0.0001, "memory_capacity": 100000, "batch_size": 64, "target_update": 4, "hidden_dim": 256, "device": "cpu", "seed": 10, "result_path": "/Users/jj/Desktop/rl-tutorials/codes/DQN/outputs/CartPole-v0/20220818-143132/results", "model_path": "/Users/jj/Desktop/rl-tutorials/codes/DQN/outputs/CartPole-v0/20220818-143132/models", "show_fig": false, "save_fig": true} \ No newline at end of file diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/models/checkpoint.pt b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/models/checkpoint.pt new file mode 100644 index 0000000..db7288b Binary files /dev/null and b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/models/checkpoint.pt differ diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/params.json b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/params.json new file mode 100644 index 0000000..afad0d9 --- /dev/null +++ b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/params.json @@ -0,0 +1 @@ +{"algo_name": "DQN", "env_name": "CartPole-v0", "train_eps": 200, "test_eps": 20, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 500, "lr": 0.0001, "memory_capacity": 100000, "batch_size": 64, "target_update": 4, "hidden_dim": 256, "device": "cpu", "seed": 10, "result_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/results", "model_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/models", "show_fig": false, "save_fig": true} \ No newline at end of file diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/testing_curve.png b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/testing_curve.png similarity index 100% rename from projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/testing_curve.png rename to projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/testing_curve.png diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/testing_results.csv b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/testing_results.csv similarity index 100% rename from projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/testing_results.csv rename to projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/testing_results.csv diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/training_curve.png b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/training_curve.png similarity index 100% rename from projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/training_curve.png rename to projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/training_curve.png diff --git a/projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/training_results.csv b/projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/training_results.csv similarity index 100% rename from projects/codes/DQN/outputs/CartPole-v0/20220818-143132/results/training_results.csv rename to 
projects/codes/DQN/outputs/CartPole-v0/20220823-173936/results/training_results.csv diff --git a/projects/codes/QLearning/main.py b/projects/codes/QLearning/main.py new file mode 100644 index 0000000..5939722 --- /dev/null +++ b/projects/codes/QLearning/main.py @@ -0,0 +1,153 @@ +#!/usr/bin/env python +# coding=utf-8 +''' +Author: John +Email: johnjim0816@gmail.com +Date: 2020-09-11 23:03:00 +LastEditor: John +LastEditTime: 2022-08-24 11:27:01 +Discription: +Environment: +''' +import sys,os +os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized." +curr_path = os.path.dirname(os.path.abspath(__file__)) # current path +parent_path = os.path.dirname(curr_path) # parent path +sys.path.append(parent_path) # add path to system path + +import gym +import datetime +import argparse +from envs.gridworld_env import CliffWalkingWapper,FrozenLakeWapper +from qlearning import QLearning +from common.utils import plot_rewards,save_args,all_seed +from common.utils import save_results,make_dir + +def get_args(): + curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time + parser = argparse.ArgumentParser(description="hyperparameters") + parser.add_argument('--algo_name',default='Q-learning',type=str,help="name of algorithm") + parser.add_argument('--env_name',default='CliffWalking-v0',type=str,help="name of environment") + parser.add_argument('--train_eps',default=400,type=int,help="episodes of training") + parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing") + parser.add_argument('--gamma',default=0.90,type=float,help="discounted factor") + parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon") + parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon") + parser.add_argument('--epsilon_decay',default=300,type=int,help="decay rate of epsilon") + parser.add_argument('--lr',default=0.1,type=float,help="learning rate") + parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") + parser.add_argument('--seed',default=10,type=int,help="seed") + parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not") + parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not") + args = parser.parse_args() + default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/", + 'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/", + } + args = {**vars(args),**default_args} # type(dict) + return args +def env_agent_config(cfg): + ''' create env and agent + ''' + if cfg['env_name'] == 'CliffWalking-v0': + env = gym.make(cfg['env_name']) + env = CliffWalkingWapper(env) + if cfg['env_name'] == 'FrozenLake-v1': + env = gym.make(cfg['env_name'],is_slippery=False) + if cfg['seed'] !=0: # set random seed + all_seed(env,seed=cfg["seed"]) + n_states = env.observation_space.n # state dimension + n_actions = env.action_space.n # action dimension + print(f"n_states: {n_states}, n_actions: {n_actions}") + cfg.update({"n_states":n_states,"n_actions":n_actions}) # update to cfg paramters + agent = QLearning(cfg) + return env,agent + +def main(cfg,env,agent,tag = 'train'): + print(f"Start {tag}ing!") + print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}") + rewards = [] # 记录奖励 + for i_ep in range(cfg.train_eps): + ep_reward = 0 # 记录每个回合的奖励 + state = env.reset() # 重置环境,即开始新的回合 + while 
True: + if tag == 'train':action = agent.sample_action(state) # 根据算法采样一个动作 + else: agent.predict_action(state) + next_state, reward, done, _ = env.step(action) # 与环境进行一次动作交互 + if tag == 'train':agent.update(state, action, reward, next_state, done) # Q学习算法更新 + state = next_state # 更新状态 + ep_reward += reward + if done: + break + rewards.append(ep_reward) + print(f"回合:{i_ep+1}/{cfg.train_eps},奖励:{ep_reward:.1f},Epsilon:{agent.epsilon}") + print(f"Finish {tag}ing!") + return {"rewards":rewards} + +def train(cfg,env,agent): + print("Start training!") + print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}") + rewards = [] # record rewards for all episodes + steps = [] # record steps for all episodes + for i_ep in range(cfg['train_eps']): + ep_reward = 0 # reward per episode + ep_step = 0 # step per episode + state = env.reset() # reset and obtain initial state + while True: + action = agent.sample_action(state) # sample action + next_state, reward, done, _ = env.step(action) # update env and return transitions + agent.update(state, action, reward, next_state, done) # update agent + state = next_state # update state + ep_reward += reward + ep_step += 1 + if done: + break + rewards.append(ep_reward) + steps.append(ep_step) + if (i_ep+1)%10==0: + print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Steps:{ep_step}, Epislon: {agent.epsilon:.3f}') + print("Finish training!") + return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps} + +def test(cfg,env,agent): + print("Start testing!") + print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}") + rewards = [] # record rewards for all episodes + steps = [] # record steps for all episodes + for i_ep in range(cfg['test_eps']): + ep_reward = 0 # reward per episode + ep_step = 0 + state = env.reset() # reset and obtain initial state + while True: + action = agent.predict_action(state) # predict action + next_state, reward, done, _ = env.step(action) + state = next_state + ep_reward += reward + ep_step += 1 + if done: + break + rewards.append(ep_reward) + steps.append(ep_step) + print(f"Episode: {i_ep+1}/{cfg['test_eps']}, Steps:{ep_step}, Reward: {ep_reward:.2f}") + print("Finish testing!") + return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps} + + +if __name__ == "__main__": + cfg = get_args() + # training + env, agent = env_agent_config(cfg) + res_dic = train(cfg, env, agent) + save_args(cfg,path = cfg['result_path']) # save parameters + agent.save_model(path = cfg['model_path']) # save models + save_results(res_dic, tag = 'train', path = cfg['result_path']) # save results + plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "train") # plot results + # testing + env, agent = env_agent_config(cfg) # create new env for testing, sometimes can ignore this step + agent.load_model(path = cfg['model_path']) # load model + res_dic = test(cfg, env, agent) + save_results(res_dic, tag='test', + path = cfg['result_path']) + plot_rewards(res_dic['rewards'], cfg, path = cfg['result_path'],tag = "test") + + + diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/models/Qleaning_model.pkl b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/models/Qleaning_model.pkl deleted file mode 100644 index b28eedd..0000000 Binary files a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/models/Qleaning_model.pkl and /dev/null differ diff --git 
a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/params.json b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/params.json deleted file mode 100644 index 232925a..0000000 --- a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/params.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "algo_name": "Q-learning", - "env_name": "CliffWalking-v0", - "train_eps": 400, - "test_eps": 20, - "gamma": 0.9, - "epsilon_start": 0.95, - "epsilon_end": 0.01, - "epsilon_decay": 300, - "lr": 0.1, - "device": "cpu", - "result_path": "/root/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/", - "model_path": "/root/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/models/", - "save_fig": true -} \ No newline at end of file diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards.npy b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards.npy deleted file mode 100644 index 8aeb5dd..0000000 Binary files a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards.npy and /dev/null differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards_curve.png b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards_curve.png deleted file mode 100644 index 72171e3..0000000 Binary files a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/test_rewards_curve.png and /dev/null differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards.npy b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards.npy deleted file mode 100644 index 0f18270..0000000 Binary files a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards.npy and /dev/null differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards_curve.png b/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards_curve.png deleted file mode 100644 index 92992c3..0000000 Binary files a/projects/codes/QLearning/outputs/CliffWalking-v0/20220802-163256/results/train_rewards_curve.png and /dev/null differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/models/Qleaning_model.pkl b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/models/Qleaning_model.pkl new file mode 100644 index 0000000..a328ce9 Binary files /dev/null and b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/models/Qleaning_model.pkl differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/params.json b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/params.json new file mode 100644 index 0000000..0bdad3a --- /dev/null +++ b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/params.json @@ -0,0 +1,19 @@ +{ + "algo_name": "Q-learning", + "env_name": "CliffWalking-v0", + "train_eps": 400, + "test_eps": 20, + "gamma": 0.9, + "epsilon_start": 0.95, + "epsilon_end": 0.01, + "epsilon_decay": 300, + "lr": 0.1, + "device": "cpu", + "seed": 10, + "show_fig": false, + "save_fig": true, + "result_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/", + "model_path": 
"/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/models/", + "n_states": 48, + "n_actions": 4 +} \ No newline at end of file diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_curve.png b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_curve.png new file mode 100644 index 0000000..527fff3 Binary files /dev/null and b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_curve.png differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_results.csv b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_results.csv new file mode 100644 index 0000000..86359a6 --- /dev/null +++ b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/testing_results.csv @@ -0,0 +1,21 @@ +episodes,rewards +0,-13 +1,-13 +2,-13 +3,-13 +4,-13 +5,-13 +6,-13 +7,-13 +8,-13 +9,-13 +10,-13 +11,-13 +12,-13 +13,-13 +14,-13 +15,-13 +16,-13 +17,-13 +18,-13 +19,-13 diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_curve.png b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_curve.png new file mode 100644 index 0000000..819e106 Binary files /dev/null and b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_curve.png differ diff --git a/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_results.csv b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_results.csv new file mode 100644 index 0000000..99b7e5f --- /dev/null +++ b/projects/codes/QLearning/outputs/CliffWalking-v0/20220824-103255/results/training_results.csv @@ -0,0 +1,401 @@ +episodes,rewards +0,-2131 +1,-1086 +2,-586 +3,-220 +4,-154 +5,-122 +6,-150 +7,-159 +8,-164 +9,-88 +10,-195 +11,-114 +12,-60 +13,-179 +14,-101 +15,-304 +16,-96 +17,-119 +18,-113 +19,-98 +20,-106 +21,-105 +22,-77 +23,-51 +24,-105 +25,-136 +26,-100 +27,-29 +28,-79 +29,-114 +30,-82 +31,-70 +32,-75 +33,-51 +34,-94 +35,-52 +36,-93 +37,-71 +38,-73 +39,-48 +40,-52 +41,-96 +42,-46 +43,-65 +44,-57 +45,-41 +46,-104 +47,-51 +48,-181 +49,-229 +50,-39 +51,-69 +52,-53 +53,-59 +54,-26 +55,-75 +56,-31 +57,-60 +58,-63 +59,-40 +60,-35 +61,-79 +62,-42 +63,-22 +64,-73 +65,-71 +66,-18 +67,-55 +68,-29 +69,-43 +70,-70 +71,-49 +72,-42 +73,-29 +74,-81 +75,-36 +76,-38 +77,-36 +78,-52 +79,-28 +80,-42 +81,-52 +82,-66 +83,-31 +84,-27 +85,-49 +86,-28 +87,-54 +88,-34 +89,-35 +90,-50 +91,-36 +92,-36 +93,-46 +94,-34 +95,-135 +96,-39 +97,-36 +98,-26 +99,-56 +100,-40 +101,-40 +102,-26 +103,-28 +104,-31 +105,-35 +106,-26 +107,-57 +108,-44 +109,-41 +110,-31 +111,-26 +112,-25 +113,-41 +114,-32 +115,-44 +116,-30 +117,-32 +118,-30 +119,-25 +120,-23 +121,-47 +122,-24 +123,-45 +124,-39 +125,-21 +126,-43 +127,-143 +128,-26 +129,-20 +130,-32 +131,-16 +132,-24 +133,-42 +134,-25 +135,-36 +136,-19 +137,-29 +138,-43 +139,-17 +140,-150 +141,-32 +142,-34 +143,-19 +144,-26 +145,-30 +146,-31 +147,-49 +148,-33 +149,-21 +150,-17 +151,-48 +152,-34 +153,-20 +154,-20 +155,-26 +156,-21 +157,-13 +158,-40 +159,-22 +160,-26 +161,-30 +162,-29 +163,-25 +164,-26 +165,-27 +166,-21 +167,-29 +168,-24 +169,-17 +170,-22 +171,-35 +172,-35 +173,-18 +174,-135 +175,-15 +176,-23 +177,-28 +178,-25 +179,-24 +180,-29 +181,-31 +182,-24 +183,-129 +184,-45 +185,-24 +186,-17 +187,-20 +188,-21 +189,-23 +190,-15 +191,-32 +192,-22 +193,-19 +194,-17 +195,-45 +196,-15 +197,-14 
+198,-14 +199,-37 +200,-23 +201,-17 +202,-19 +203,-21 +204,-23 +205,-27 +206,-14 +207,-18 +208,-23 +209,-34 +210,-23 +211,-13 +212,-25 +213,-17 +214,-13 +215,-21 +216,-29 +217,-18 +218,-24 +219,-15 +220,-27 +221,-25 +222,-21 +223,-19 +224,-17 +225,-18 +226,-13 +227,-22 +228,-14 +229,-13 +230,-29 +231,-23 +232,-15 +233,-15 +234,-14 +235,-28 +236,-25 +237,-17 +238,-23 +239,-29 +240,-15 +241,-14 +242,-15 +243,-23 +244,-15 +245,-16 +246,-19 +247,-13 +248,-16 +249,-17 +250,-25 +251,-30 +252,-13 +253,-14 +254,-15 +255,-22 +256,-14 +257,-17 +258,-126 +259,-15 +260,-21 +261,-16 +262,-23 +263,-14 +264,-13 +265,-13 +266,-19 +267,-13 +268,-19 +269,-17 +270,-17 +271,-13 +272,-19 +273,-13 +274,-13 +275,-16 +276,-22 +277,-14 +278,-15 +279,-19 +280,-34 +281,-13 +282,-15 +283,-32 +284,-13 +285,-13 +286,-13 +287,-14 +288,-16 +289,-13 +290,-13 +291,-17 +292,-13 +293,-13 +294,-22 +295,-14 +296,-15 +297,-13 +298,-13 +299,-13 +300,-16 +301,-13 +302,-14 +303,-13 +304,-13 +305,-13 +306,-24 +307,-13 +308,-13 +309,-15 +310,-13 +311,-13 +312,-13 +313,-15 +314,-13 +315,-19 +316,-15 +317,-17 +318,-13 +319,-13 +320,-13 +321,-13 +322,-13 +323,-15 +324,-13 +325,-13 +326,-13 +327,-123 +328,-13 +329,-13 +330,-13 +331,-13 +332,-13 +333,-13 +334,-13 +335,-13 +336,-16 +337,-13 +338,-23 +339,-13 +340,-13 +341,-13 +342,-13 +343,-13 +344,-13 +345,-13 +346,-13 +347,-13 +348,-13 +349,-13 +350,-134 +351,-13 +352,-13 +353,-13 +354,-13 +355,-13 +356,-13 +357,-13 +358,-13 +359,-13 +360,-15 +361,-13 +362,-13 +363,-13 +364,-13 +365,-13 +366,-13 +367,-13 +368,-13 +369,-14 +370,-13 +371,-13 +372,-13 +373,-13 +374,-13 +375,-13 +376,-13 +377,-124 +378,-13 +379,-13 +380,-13 +381,-13 +382,-13 +383,-13 +384,-13 +385,-13 +386,-13 +387,-13 +388,-13 +389,-121 +390,-13 +391,-13 +392,-13 +393,-13 +394,-13 +395,-13 +396,-13 +397,-13 +398,-17 +399,-13 diff --git a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/models/Qleaning_model.pkl b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/models/Qleaning_model.pkl new file mode 100644 index 0000000..8524501 Binary files /dev/null and b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/models/Qleaning_model.pkl differ diff --git a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/params.json b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/params.json new file mode 100644 index 0000000..0c647d0 --- /dev/null +++ b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/params.json @@ -0,0 +1,19 @@ +{ + "algo_name": "Q-learning", + "env_name": "FrozenLake-v1", + "train_eps": 800, + "test_eps": 20, + "gamma": 0.9, + "epsilon_start": 0.7, + "epsilon_end": 0.1, + "epsilon_decay": 2000, + "lr": 0.9, + "device": "cpu", + "seed": 10, + "show_fig": false, + "save_fig": true, + "result_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/", + "model_path": "/Users/jj/Desktop/rl-tutorials/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/models/", + "n_states": 16, + "n_actions": 4 +} \ No newline at end of file diff --git a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_curve.png b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_curve.png new file mode 100644 index 0000000..3c9e22b Binary files /dev/null and b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_curve.png differ diff --git 
a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_results.csv b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_results.csv new file mode 100644 index 0000000..b871b84 --- /dev/null +++ b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/testing_results.csv @@ -0,0 +1,21 @@ +episodes,rewards,steps +0,1.0,6 +1,1.0,6 +2,1.0,6 +3,1.0,6 +4,1.0,6 +5,1.0,6 +6,1.0,6 +7,1.0,6 +8,1.0,6 +9,1.0,6 +10,1.0,6 +11,1.0,6 +12,1.0,6 +13,1.0,6 +14,1.0,6 +15,1.0,6 +16,1.0,6 +17,1.0,6 +18,1.0,6 +19,1.0,6 diff --git a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_curve.png b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_curve.png new file mode 100644 index 0000000..d9e399e Binary files /dev/null and b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_curve.png differ diff --git a/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_results.csv b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_results.csv new file mode 100644 index 0000000..aebae85 --- /dev/null +++ b/projects/codes/QLearning/outputs/FrozenLake-v1/20220824-112735/results/training_results.csv @@ -0,0 +1,801 @@ +episodes,rewards,steps +0,0.0,20 +1,0.0,14 +2,0.0,13 +3,0.0,9 +4,0.0,10 +5,0.0,6 +6,0.0,11 +7,0.0,6 +8,0.0,3 +9,0.0,9 +10,0.0,11 +11,0.0,22 +12,0.0,5 +13,0.0,16 +14,0.0,4 +15,0.0,9 +16,0.0,18 +17,0.0,2 +18,0.0,4 +19,0.0,8 +20,0.0,7 +21,0.0,4 +22,0.0,22 +23,0.0,15 +24,0.0,5 +25,0.0,16 +26,0.0,7 +27,0.0,19 +28,0.0,22 +29,0.0,16 +30,0.0,11 +31,0.0,22 +32,0.0,28 +33,0.0,23 +34,0.0,4 +35,0.0,11 +36,0.0,8 +37,0.0,15 +38,0.0,5 +39,0.0,7 +40,0.0,9 +41,0.0,4 +42,0.0,3 +43,0.0,6 +44,0.0,41 +45,0.0,9 +46,0.0,23 +47,0.0,3 +48,1.0,38 +49,0.0,29 +50,0.0,17 +51,0.0,4 +52,0.0,2 +53,0.0,25 +54,0.0,6 +55,0.0,2 +56,0.0,30 +57,0.0,6 +58,0.0,7 +59,0.0,11 +60,0.0,9 +61,0.0,8 +62,0.0,23 +63,0.0,10 +64,0.0,3 +65,0.0,5 +66,0.0,7 +67,0.0,18 +68,0.0,8 +69,0.0,26 +70,0.0,6 +71,0.0,14 +72,0.0,4 +73,0.0,25 +74,0.0,21 +75,0.0,13 +76,0.0,4 +77,0.0,29 +78,0.0,21 +79,0.0,6 +80,0.0,6 +81,0.0,11 +82,0.0,21 +83,0.0,9 +84,0.0,9 +85,0.0,7 +86,0.0,48 +87,0.0,23 +88,0.0,100 +89,0.0,60 +90,0.0,7 +91,0.0,10 +92,0.0,24 +93,0.0,4 +94,0.0,7 +95,0.0,17 +96,0.0,87 +97,0.0,28 +98,0.0,7 +99,0.0,5 +100,0.0,12 +101,0.0,14 +102,0.0,6 +103,0.0,13 +104,0.0,93 +105,0.0,4 +106,0.0,50 +107,0.0,8 +108,0.0,12 +109,0.0,43 +110,0.0,30 +111,0.0,15 +112,0.0,19 +113,0.0,100 +114,0.0,82 +115,0.0,40 +116,0.0,88 +117,0.0,19 +118,0.0,30 +119,0.0,27 +120,0.0,5 +121,0.0,87 +122,0.0,9 +123,0.0,64 +124,0.0,27 +125,0.0,68 +126,0.0,81 +127,0.0,86 +128,0.0,100 +129,0.0,100 +130,0.0,27 +131,0.0,41 +132,0.0,70 +133,0.0,27 +134,0.0,6 +135,0.0,18 +136,0.0,38 +137,0.0,26 +138,0.0,36 +139,0.0,3 +140,0.0,61 +141,0.0,100 +142,0.0,4 +143,0.0,39 +144,0.0,18 +145,0.0,33 +146,0.0,29 +147,0.0,49 +148,0.0,88 +149,0.0,22 +150,0.0,65 +151,0.0,36 +152,0.0,30 +153,0.0,58 +154,0.0,43 +155,0.0,53 +156,0.0,43 +157,0.0,13 +158,0.0,8 +159,0.0,39 +160,0.0,29 +161,0.0,26 +162,0.0,60 +163,0.0,100 +164,0.0,31 +165,0.0,22 +166,0.0,100 +167,0.0,46 +168,0.0,23 +169,0.0,54 +170,0.0,8 +171,0.0,58 +172,0.0,3 +173,0.0,47 +174,0.0,16 +175,0.0,21 +176,0.0,44 +177,0.0,29 +178,0.0,100 +179,0.0,100 +180,0.0,62 +181,0.0,83 +182,0.0,26 +183,0.0,24 +184,0.0,10 +185,0.0,12 +186,0.0,40 +187,0.0,25 +188,0.0,18 +189,0.0,60 +190,0.0,100 +191,0.0,100 +192,0.0,24 +193,0.0,56 +194,0.0,71 +195,0.0,19 +196,0.0,100 +197,0.0,44 
+198,0.0,41 +199,0.0,41 +200,0.0,60 +201,0.0,31 +202,0.0,34 +203,0.0,35 +204,0.0,59 +205,0.0,51 +206,0.0,100 +207,0.0,100 +208,0.0,100 +209,0.0,100 +210,0.0,37 +211,0.0,68 +212,0.0,40 +213,0.0,17 +214,0.0,79 +215,0.0,100 +216,0.0,26 +217,0.0,61 +218,0.0,25 +219,0.0,18 +220,0.0,27 +221,0.0,13 +222,0.0,100 +223,0.0,87 +224,0.0,100 +225,0.0,92 +226,0.0,100 +227,0.0,8 +228,0.0,100 +229,0.0,64 +230,0.0,17 +231,0.0,82 +232,0.0,100 +233,0.0,94 +234,0.0,7 +235,0.0,36 +236,0.0,100 +237,0.0,56 +238,0.0,17 +239,0.0,100 +240,0.0,83 +241,0.0,100 +242,0.0,100 +243,0.0,43 +244,0.0,87 +245,0.0,42 +246,0.0,80 +247,0.0,54 +248,0.0,82 +249,0.0,97 +250,0.0,65 +251,0.0,83 +252,0.0,100 +253,0.0,59 +254,0.0,100 +255,0.0,78 +256,0.0,100 +257,0.0,100 +258,0.0,43 +259,0.0,80 +260,0.0,100 +261,0.0,70 +262,0.0,94 +263,0.0,100 +264,0.0,100 +265,0.0,37 +266,0.0,11 +267,0.0,31 +268,0.0,100 +269,0.0,34 +270,0.0,32 +271,0.0,58 +272,0.0,38 +273,0.0,28 +274,0.0,100 +275,0.0,59 +276,0.0,100 +277,0.0,82 +278,0.0,51 +279,0.0,25 +280,0.0,73 +281,0.0,56 +282,0.0,55 +283,0.0,38 +284,0.0,100 +285,0.0,100 +286,0.0,92 +287,0.0,100 +288,0.0,100 +289,0.0,100 +290,0.0,37 +291,0.0,100 +292,0.0,66 +293,0.0,24 +294,0.0,17 +295,0.0,100 +296,0.0,59 +297,0.0,25 +298,0.0,73 +299,0.0,100 +300,0.0,29 +301,0.0,100 +302,0.0,72 +303,0.0,6 +304,1.0,57 +305,0.0,47 +306,0.0,48 +307,0.0,13 +308,0.0,100 +309,0.0,38 +310,0.0,100 +311,0.0,20 +312,0.0,100 +313,0.0,100 +314,0.0,5 +315,0.0,39 +316,0.0,11 +317,0.0,83 +318,0.0,42 +319,0.0,100 +320,0.0,99 +321,0.0,83 +322,0.0,28 +323,0.0,46 +324,0.0,100 +325,0.0,100 +326,0.0,62 +327,0.0,100 +328,0.0,23 +329,0.0,91 +330,0.0,53 +331,0.0,19 +332,0.0,26 +333,0.0,93 +334,0.0,38 +335,0.0,22 +336,0.0,43 +337,0.0,100 +338,0.0,90 +339,0.0,18 +340,0.0,45 +341,0.0,65 +342,1.0,22 +343,0.0,100 +344,1.0,15 +345,1.0,72 +346,0.0,5 +347,1.0,6 +348,1.0,6 +349,1.0,9 +350,1.0,8 +351,1.0,9 +352,1.0,8 +353,1.0,6 +354,1.0,6 +355,1.0,10 +356,1.0,6 +357,0.0,5 +358,0.0,3 +359,1.0,6 +360,1.0,6 +361,1.0,6 +362,1.0,6 +363,1.0,8 +364,1.0,6 +365,1.0,8 +366,1.0,6 +367,1.0,6 +368,1.0,8 +369,1.0,6 +370,1.0,6 +371,0.0,5 +372,1.0,6 +373,0.0,6 +374,1.0,6 +375,1.0,12 +376,1.0,6 +377,1.0,6 +378,1.0,9 +379,1.0,6 +380,1.0,6 +381,0.0,2 +382,0.0,3 +383,0.0,2 +384,0.0,4 +385,0.0,3 +386,1.0,7 +387,1.0,6 +388,1.0,6 +389,1.0,8 +390,1.0,9 +391,1.0,8 +392,1.0,8 +393,1.0,6 +394,1.0,6 +395,1.0,7 +396,1.0,6 +397,0.0,5 +398,0.0,5 +399,1.0,10 +400,1.0,6 +401,0.0,3 +402,1.0,6 +403,1.0,7 +404,1.0,6 +405,1.0,6 +406,1.0,6 +407,1.0,6 +408,1.0,6 +409,1.0,6 +410,1.0,6 +411,0.0,5 +412,1.0,6 +413,1.0,6 +414,0.0,2 +415,1.0,6 +416,1.0,6 +417,1.0,6 +418,1.0,6 +419,1.0,6 +420,1.0,8 +421,1.0,6 +422,1.0,6 +423,1.0,6 +424,1.0,6 +425,1.0,7 +426,0.0,5 +427,1.0,6 +428,1.0,6 +429,1.0,6 +430,1.0,8 +431,1.0,6 +432,1.0,6 +433,1.0,6 +434,1.0,6 +435,0.0,2 +436,1.0,8 +437,1.0,7 +438,1.0,6 +439,1.0,7 +440,1.0,6 +441,1.0,6 +442,0.0,3 +443,0.0,4 +444,1.0,6 +445,1.0,6 +446,1.0,7 +447,1.0,6 +448,1.0,6 +449,1.0,6 +450,1.0,6 +451,1.0,6 +452,1.0,6 +453,1.0,8 +454,1.0,6 +455,1.0,6 +456,1.0,6 +457,1.0,6 +458,1.0,6 +459,1.0,7 +460,1.0,8 +461,1.0,6 +462,1.0,7 +463,1.0,6 +464,1.0,6 +465,1.0,6 +466,1.0,6 +467,1.0,8 +468,1.0,6 +469,1.0,6 +470,1.0,8 +471,1.0,6 +472,1.0,11 +473,1.0,6 +474,1.0,6 +475,1.0,6 +476,1.0,8 +477,0.0,2 +478,1.0,7 +479,1.0,6 +480,1.0,6 +481,1.0,7 +482,1.0,6 +483,1.0,6 +484,1.0,6 +485,1.0,6 +486,0.0,3 +487,1.0,7 +488,1.0,6 +489,1.0,6 +490,1.0,6 +491,0.0,3 +492,1.0,6 +493,1.0,7 +494,1.0,12 +495,1.0,6 +496,0.0,9 +497,1.0,6 +498,1.0,6 +499,0.0,8 +500,1.0,6 +501,0.0,3 +502,0.0,5 
+503,0.0,3 +504,1.0,6 +505,1.0,6 +506,1.0,6 +507,1.0,6 +508,1.0,6 +509,1.0,6 +510,1.0,6 +511,1.0,6 +512,1.0,6 +513,1.0,6 +514,0.0,2 +515,1.0,7 +516,1.0,6 +517,1.0,6 +518,1.0,6 +519,1.0,6 +520,1.0,6 +521,1.0,7 +522,0.0,4 +523,1.0,6 +524,0.0,5 +525,1.0,6 +526,1.0,6 +527,1.0,6 +528,1.0,6 +529,0.0,3 +530,1.0,6 +531,1.0,6 +532,1.0,6 +533,1.0,7 +534,1.0,8 +535,1.0,6 +536,1.0,6 +537,1.0,6 +538,1.0,6 +539,1.0,7 +540,1.0,7 +541,1.0,7 +542,1.0,8 +543,1.0,6 +544,1.0,10 +545,1.0,6 +546,1.0,6 +547,1.0,6 +548,1.0,8 +549,1.0,6 +550,1.0,6 +551,1.0,8 +552,1.0,6 +553,1.0,7 +554,1.0,6 +555,1.0,7 +556,1.0,6 +557,1.0,6 +558,1.0,7 +559,1.0,7 +560,1.0,7 +561,1.0,6 +562,1.0,6 +563,1.0,6 +564,1.0,6 +565,1.0,6 +566,1.0,6 +567,1.0,6 +568,1.0,7 +569,0.0,4 +570,1.0,8 +571,1.0,8 +572,1.0,7 +573,1.0,6 +574,1.0,8 +575,1.0,6 +576,1.0,6 +577,1.0,7 +578,1.0,6 +579,1.0,6 +580,1.0,8 +581,1.0,7 +582,1.0,6 +583,1.0,6 +584,0.0,3 +585,1.0,11 +586,1.0,6 +587,1.0,8 +588,0.0,2 +589,1.0,6 +590,1.0,6 +591,1.0,6 +592,1.0,6 +593,1.0,8 +594,1.0,6 +595,1.0,7 +596,1.0,6 +597,1.0,7 +598,1.0,6 +599,1.0,8 +600,0.0,2 +601,1.0,6 +602,1.0,7 +603,1.0,6 +604,1.0,6 +605,1.0,10 +606,1.0,7 +607,1.0,6 +608,1.0,6 +609,1.0,6 +610,1.0,6 +611,1.0,6 +612,1.0,7 +613,0.0,4 +614,1.0,7 +615,1.0,6 +616,1.0,8 +617,0.0,3 +618,1.0,6 +619,1.0,6 +620,1.0,6 +621,1.0,6 +622,0.0,2 +623,1.0,6 +624,1.0,6 +625,1.0,6 +626,1.0,6 +627,1.0,6 +628,1.0,7 +629,1.0,6 +630,1.0,6 +631,1.0,7 +632,1.0,6 +633,1.0,6 +634,1.0,6 +635,1.0,6 +636,1.0,6 +637,1.0,6 +638,1.0,6 +639,1.0,8 +640,1.0,6 +641,1.0,8 +642,1.0,7 +643,1.0,6 +644,0.0,3 +645,1.0,6 +646,1.0,7 +647,1.0,6 +648,1.0,6 +649,1.0,6 +650,1.0,10 +651,1.0,6 +652,1.0,6 +653,1.0,6 +654,1.0,6 +655,1.0,10 +656,1.0,6 +657,1.0,8 +658,1.0,8 +659,1.0,7 +660,1.0,6 +661,0.0,5 +662,0.0,2 +663,1.0,8 +664,1.0,6 +665,1.0,10 +666,1.0,6 +667,1.0,8 +668,1.0,10 +669,1.0,6 +670,1.0,6 +671,1.0,6 +672,1.0,10 +673,1.0,6 +674,0.0,4 +675,1.0,6 +676,1.0,6 +677,1.0,6 +678,1.0,15 +679,1.0,6 +680,1.0,6 +681,1.0,6 +682,1.0,6 +683,1.0,6 +684,1.0,6 +685,1.0,8 +686,1.0,6 +687,1.0,7 +688,1.0,6 +689,1.0,6 +690,1.0,8 +691,1.0,6 +692,1.0,6 +693,1.0,8 +694,1.0,8 +695,1.0,6 +696,1.0,6 +697,1.0,6 +698,1.0,10 +699,1.0,6 +700,1.0,6 +701,1.0,6 +702,1.0,6 +703,1.0,6 +704,1.0,6 +705,1.0,6 +706,1.0,8 +707,1.0,8 +708,1.0,6 +709,1.0,6 +710,0.0,2 +711,1.0,6 +712,1.0,6 +713,1.0,6 +714,1.0,8 +715,1.0,6 +716,1.0,6 +717,1.0,6 +718,1.0,6 +719,1.0,6 +720,1.0,6 +721,1.0,6 +722,1.0,6 +723,1.0,6 +724,1.0,7 +725,0.0,3 +726,1.0,7 +727,1.0,6 +728,1.0,6 +729,1.0,6 +730,0.0,2 +731,1.0,6 +732,1.0,8 +733,1.0,6 +734,1.0,6 +735,1.0,6 +736,1.0,6 +737,1.0,9 +738,1.0,6 +739,1.0,6 +740,1.0,6 +741,1.0,6 +742,1.0,6 +743,1.0,6 +744,1.0,9 +745,1.0,7 +746,0.0,4 +747,1.0,6 +748,1.0,8 +749,1.0,11 +750,1.0,6 +751,1.0,6 +752,1.0,6 +753,1.0,6 +754,1.0,6 +755,1.0,8 +756,1.0,6 +757,1.0,6 +758,1.0,8 +759,1.0,7 +760,1.0,6 +761,1.0,8 +762,1.0,6 +763,0.0,5 +764,1.0,9 +765,1.0,8 +766,1.0,8 +767,1.0,6 +768,1.0,8 +769,1.0,8 +770,1.0,6 +771,0.0,5 +772,0.0,3 +773,0.0,2 +774,1.0,8 +775,1.0,6 +776,1.0,6 +777,1.0,6 +778,1.0,6 +779,1.0,6 +780,1.0,6 +781,1.0,6 +782,1.0,6 +783,1.0,6 +784,1.0,6 +785,1.0,6 +786,1.0,6 +787,1.0,6 +788,1.0,6 +789,0.0,2 +790,1.0,6 +791,0.0,4 +792,1.0,6 +793,1.0,6 +794,1.0,6 +795,1.0,6 +796,1.0,6 +797,1.0,8 +798,0.0,5 +799,1.0,6 diff --git a/projects/codes/QLearning/qlearning.py b/projects/codes/QLearning/qlearning.py index e97a594..c987242 100644 --- a/projects/codes/QLearning/qlearning.py +++ b/projects/codes/QLearning/qlearning.py @@ -5,7 +5,7 @@ Author: John Email: johnjim0816@gmail.com Date: 
2020-09-11 23:03:00 LastEditor: John -LastEditTime: 2021-12-22 10:54:57 +LastEditTime: 2022-08-24 10:31:04 Discription: use defaultdict to define Q table Environment: ''' @@ -15,50 +15,52 @@ import torch from collections import defaultdict class QLearning(object): - def __init__(self, - n_actions,cfg): - self.n_actions = n_actions - self.lr = cfg.lr # 学习率 - self.gamma = cfg.gamma - self.epsilon = cfg.epsilon_start + def __init__(self,cfg): + self.n_actions = cfg['n_actions'] + self.lr = cfg['lr'] + self.gamma = cfg['gamma'] + self.epsilon = cfg['epsilon_start'] self.sample_count = 0 - self.epsilon_start = cfg.epsilon_start - self.epsilon_end = cfg.epsilon_end - self.epsilon_decay = cfg.epsilon_decay - self.Q_table = defaultdict(lambda: np.zeros(n_actions)) # 用嵌套字典存放状态->动作->状态-动作值(Q值)的映射,即Q表 - def sample(self, state): - ''' 采样动作,训练时用 + self.epsilon_start = cfg['epsilon_start'] + self.epsilon_end = cfg['epsilon_end'] + self.epsilon_decay = cfg['epsilon_decay'] + self.Q_table = defaultdict(lambda: np.zeros(self.n_actions)) # use a nested dict to represent Q(s,a); all values are initialized to 0 here, unlike in the pseudocode + def sample_action(self, state): + ''' sample action with e-greedy policy while training ''' self.sample_count += 1 + # epsilon must decay (linearly, exponentially, etc.) to balance exploration and exploitation self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \ - math.exp(-1. * self.sample_count / self.epsilon_decay) # epsilon是会递减的,这里选择指数递减 - # e-greedy 策略 + math.exp(-1. * self.sample_count / self.epsilon_decay) if np.random.uniform(0, 1) > self.epsilon: - action = np.argmax(self.Q_table[str(state)]) # 选择Q(s,a)最大对应的动作 + action = np.argmax(self.Q_table[str(state)]) # choose action corresponding to the maximum q value else: - action = np.random.choice(self.n_actions) # 随机选择动作 + action = np.random.choice(self.n_actions) # choose action randomly return action - def predict(self,state): - ''' 预测或选择动作,测试时用 + def predict_action(self,state): + ''' predict action while testing ''' action = np.argmax(self.Q_table[str(state)]) return action def update(self, state, action, reward, next_state, done): Q_predict = self.Q_table[str(state)][action] - if done: # 终止状态 + if done: # terminal state Q_target = reward else: Q_target = reward + self.gamma * np.max(self.Q_table[str(next_state)]) self.Q_table[str(state)][action] += self.lr * (Q_target - Q_predict) - def save(self,path): + def save_model(self,path): import dill + from pathlib import Path + # create path + Path(path).mkdir(parents=True, exist_ok=True) torch.save( obj=self.Q_table, f=path+"Qleaning_model.pkl", pickle_module=dill ) - print("保存模型成功!") - def load(self, path): + print("Model saved!") + def load_model(self, path): import dill self.Q_table =torch.load(f=path+'Qleaning_model.pkl',pickle_module=dill) - print("加载模型成功!") \ No newline at end of file + print("Model loaded!") \ No newline at end of file diff --git a/projects/codes/QLearning/task0.py deleted file mode 100644 index 15100d5..0000000 --- a/projects/codes/QLearning/task0.py +++ /dev/null @@ -1,127 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -''' -Author: John -Email: johnjim0816@gmail.com -Date: 2020-09-11 23:03:00 -LastEditor: John -LastEditTime: 2022-08-10 11:25:56 -Discription: -Environment: -''' -import sys -import os -curr_path = os.path.dirname(os.path.abspath(__file__)) # 当前文件所在绝对路径 -parent_path = os.path.dirname(curr_path) # 父路径 -sys.path.append(parent_path) # 添加路径到系统路径 - -import gym -import torch
-import datetime -import argparse -from envs.gridworld_env import CliffWalkingWapper -from qlearning import QLearning -from common.utils import plot_rewards,save_args -from common.utils import save_results,make_dir - -def get_args(): - """ - """ - curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间 - parser = argparse.ArgumentParser(description="hyperparameters") - parser.add_argument('--algo_name',default='Q-learning',type=str,help="name of algorithm") - parser.add_argument('--env_name',default='CliffWalking-v0',type=str,help="name of environment") - parser.add_argument('--train_eps',default=400,type=int,help="episodes of training") # 训练的回合数 - parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing") # 测试的回合数 - parser.add_argument('--gamma',default=0.90,type=float,help="discounted factor") # 折扣因子 - parser.add_argument('--epsilon_start',default=0.95,type=float,help="initial value of epsilon") # e-greedy策略中初始epsilon - parser.add_argument('--epsilon_end',default=0.01,type=float,help="final value of epsilon") # e-greedy策略中的终止epsilon - parser.add_argument('--epsilon_decay',default=300,type=int,help="decay rate of epsilon") # e-greedy策略中epsilon的衰减率 - parser.add_argument('--lr',default=0.1,type=float,help="learning rate") - parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") - parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \ - '/' + curr_time + '/results/',type=str ) - parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \ - '/' + curr_time + '/models/',type=str,help="path to save models") - parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not") - args = parser.parse_args() - return args -curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间 - -def train(cfg,env,agent): - print('开始训练!') - print(f'环境:{cfg.env_name}, 算法:{cfg.algo_name}, 设备:{cfg.device}') - rewards = [] # 记录奖励 - for i_ep in range(cfg.train_eps): - ep_reward = 0 # 记录每个回合的奖励 - state = env.reset() # 重置环境,即开始新的回合 - while True: - action = agent.sample(state) # 根据算法采样一个动作 - next_state, reward, done, _ = env.step(action) # 与环境进行一次动作交互 - agent.update(state, action, reward, next_state, done) # Q学习算法更新 - state = next_state # 更新状态 - ep_reward += reward - if done: - break - rewards.append(ep_reward) - print(f"回合:{i_ep+1}/{cfg.train_eps},奖励:{ep_reward:.1f},Epsilon:{agent.epsilon}") - print('完成训练!') - return {"rewards":rewards} - -def test(cfg,env,agent): - print('开始测试!') - print(f'环境:{cfg.env_name}, 算法:{cfg.algo_name}, 设备:{cfg.device}') - rewards = [] # 记录所有回合的奖励 - for i_ep in range(cfg.test_eps): - ep_reward = 0 # 记录每个episode的reward - state = env.reset() # 重置环境, 重新开一局(即开始新的一个回合) - while True: - action = agent.predict(state) # 根据算法选择一个动作 - next_state, reward, done, _ = env.step(action) # 与环境进行一个交互 - state = next_state # 更新状态 - ep_reward += reward - if done: - break - rewards.append(ep_reward) - print(f"回合数:{i_ep+1}/{cfg.test_eps}, 奖励:{ep_reward:.1f}") - print('完成测试!') - return {"rewards":rewards} - -def env_agent_config(cfg,seed=1): - '''创建环境和智能体 - Args: - cfg ([type]): [description] - seed (int, optional): 随机种子. Defaults to 1. 
- Returns: - env [type]: 环境 - agent : 智能体 - ''' - env = gym.make(cfg.env_name) - env = CliffWalkingWapper(env) - env.seed(seed) # 设置随机种子 - n_states = env.observation_space.n # 状态维度 - n_actions = env.action_space.n # 动作维度 - print(f"状态数:{n_states},动作数:{n_actions}") - agent = QLearning(n_actions,cfg) - return env,agent -if __name__ == "__main__": - cfg = get_args() - # 训练 - env, agent = env_agent_config(cfg) - res_dic = train(cfg, env, agent) - make_dir(cfg.result_path, cfg.model_path) - save_args(cfg) # save parameters - agent.save(path=cfg.model_path) # save model - save_results(res_dic, tag='train', - path=cfg.result_path) - plot_rewards(res_dic['rewards'], cfg, tag="train") - # 测试 - env, agent = env_agent_config(cfg) - agent.load(path=cfg.model_path) # 导入模型 - res_dic = test(cfg, env, agent) - save_results(res_dic, tag='test', - path=cfg.result_path) # 保存结果 - plot_rewards(res_dic['rewards'], cfg, tag="test") # 画出结果 - - - diff --git a/projects/codes/SAC-S/sac.py b/projects/codes/SAC-S/sac.py new file mode 100644 index 0000000..6351c3d --- /dev/null +++ b/projects/codes/SAC-S/sac.py @@ -0,0 +1,27 @@ +import torch +import torch.optim as optim +import torch.nn as nn +import numpy as np +class SAC: + def __init__(self,n_actions,models,memory,cfg): + self.device = cfg.device + self.value_net = models['ValueNet'].to(self.device) # $\psi$ + self.target_value_net = models['ValueNet'].to(self.device) # $\bar{\psi}$ + self.soft_q_net = models['SoftQNet'].to(self.device) # $\theta$ + self.policy_net = models['PolicyNet'].to(self.device) # $\phi$ + self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=cfg.value_lr) + self.soft_q_optimizer = optim.Adam(self.soft_q_net.parameters(), lr=cfg.soft_q_lr) + self.policy_optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.policy_lr) + for target_param, param in zip(self.target_value_net.parameters(), self.value_net.parameters()): + target_param.data.copy_(param.data) + self.value_criterion = nn.MSELoss() + self.soft_q_criterion = nn.MSELoss() + def update(self): + # sample a batch of transitions from replay buffer + state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample( + self.batch_size) + state_batch = torch.tensor(np.array(state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states) + action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(1) # shape(batchsize,1) + reward_batch = torch.tensor(reward_batch, device=self.device, dtype=torch.float).unsqueeze(1) # shape(batchsize) + next_state_batch = torch.tensor(np.array(next_state_batch), device=self.device, dtype=torch.float) # shape(batchsize,n_states) + done_batch = torch.tensor(np.float32(done_batch), device=self.device).unsqueeze(1) # shape(batchsize,1) diff --git a/projects/codes/SAC/sac.py b/projects/codes/SAC/sac.py deleted file mode 100644 index e907520..0000000 --- a/projects/codes/SAC/sac.py +++ /dev/null @@ -1,4 +0,0 @@ - -class SAC: - def __init__(self,n_actions,model,memory,cfg): - pass \ No newline at end of file diff --git a/projects/codes/SAC/sacd_cnn.py b/projects/codes/SAC/sacd_cnn.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/codes/common/utils.py b/projects/codes/common/utils.py index e63d1e6..e4f75ed 100644 --- a/projects/codes/common/utils.py +++ b/projects/codes/common/utils.py @@ -5,7 +5,7 @@ Author: John Email: johnjim0816@gmail.com Date: 2021-03-12 16:02:24 LastEditor: John -LastEditTime: 2022-08-22 17:41:28 +LastEditTime: 2022-08-24 10:31:30 Discription: Environment: 
''' @@ -64,14 +64,14 @@ def smooth(data, weight=0.9): def plot_rewards(rewards,cfg,path=None,tag='train'): sns.set() plt.figure() # 创建一个图形实例,方便同时多画几个图 - plt.title(f"{tag}ing curve on {cfg.device} of {cfg.algo_name} for {cfg.env_name}") + plt.title(f"{tag}ing curve on {cfg['device']} of {cfg['algo_name']} for {cfg['env_name']}") plt.xlabel('epsiodes') plt.plot(rewards, label='rewards') plt.plot(smooth(rewards), label='smoothed') plt.legend() - if cfg.save_fig: + if cfg['save_fig']: plt.savefig(f"{path}/{tag}ing_curve.png") - if cfg.show_fig: + if cfg['show_fig']: plt.show() def plot_losses(losses, algo="DQN", save=True, path='./'): @@ -110,12 +110,21 @@ del_empty_dir(*paths): if not os.listdir(os.path.join(path, dir)): os.removedirs(os.path.join(path, dir)) +class NpEncoder(json.JSONEncoder): + def default(self, obj): + if isinstance(obj, np.integer): + return int(obj) + if isinstance(obj, np.floating): + return float(obj) + if isinstance(obj, np.ndarray): + return obj.tolist() + return json.JSONEncoder.default(self, obj) + def save_args(args,path=None): - # 保存参数 - args_dict = vars(args) + # save parameters Path(path).mkdir(parents=True, exist_ok=True) with open(f"{path}/params.json", 'w') as fp: - json.dump(args_dict, fp) + json.dump(args, fp,cls=NpEncoder) print("Parameters saved!") def all_seed(env,seed = 1): diff --git a/projects/codes/envs/README.md index e93fba0..d30725b 100644 --- a/projects/codes/envs/README.md +++ b/projects/codes/envs/README.md @@ -1,6 +1,18 @@ -## 环境汇总 +# 环境说明汇总 + +## 算法SAR一览 + +说明:SAR分别指状态(S)、动作(A)以及奖励(R),下表的Reward Range表示每回合能获得的奖励范围,Steps表示环境中每回合的最大步数 + +| Environment ID | Observation Space | Action Space | Reward Range | Steps | +| :--------------------------------: | :---------------: | :----------: | :----------: | :------: | +| CartPole-v0 | Box(4,) | Discrete(2) | [0,200] | 200 | +| CartPole-v1 | Box(4,) | Discrete(2) | [0,500] | 500 | +| CliffWalking-v0 | Discrete(48) | Discrete(4) | [-inf,-13] | [13,inf] | +| FrozenLake-v1(*is_slippery*=False) | Discrete(16) | Discrete(4) | 0 or 1 | [6,inf] | + +## 环境描述 [OpenAI Gym](./gym_info.md) [MuJoCo](./mujoco_info.md) - diff --git a/projects/codes/scripts/DQN_task0.sh new file mode 100644 index 0000000..ead51b9 --- /dev/null +++ b/projects/codes/scripts/DQN_task0.sh @@ -0,0 +1,15 @@ +# run DQN on CartPole-v0 +# source conda, if you are already in proper conda environment, then comment the codes util "conda activate easyrl" + +if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/anaconda3/etc/profile.d/conda.sh" + source ~/anaconda3/etc/profile.d/conda.sh +elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh" + source ~/opt/anaconda3/etc/profile.d/conda.sh +else + echo 'please manually config the conda source path' +fi +conda activate easyrl # easyrl here can be changed to another name of conda env that you have created +codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path +python $codes_dir/DQN/main.py \ No newline at end of file diff --git a/projects/codes/scripts/DQN_task1.sh new file mode 100644 index 0000000..d9ca67d --- /dev/null +++ b/projects/codes/scripts/DQN_task1.sh @@ -0,0 +1,14 @@ +# run DQN on CartPole-v1, not finished yet +# source conda, if you are already in proper conda environment, then comment the codes util "conda activate easyrl" +if [ -f 
"$HOME/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/anaconda3/etc/profile.d/conda.sh" + source ~/anaconda3/etc/profile.d/conda.sh +elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh" + source ~/opt/anaconda3/etc/profile.d/conda.sh +else + echo 'please manually config the conda source path' +fi +conda activate easyrl # easyrl here can be changed to another name of conda env that you have created +codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path +python $codes_dir/DQN/main.py --env_name CartPole-v1 --train_eps 500 --epsilon_decay 1000 --memory_capacity 200000 --batch_size 128 --device cuda \ No newline at end of file diff --git a/projects/codes/scripts/DQN_task2.sh b/projects/codes/scripts/DQN_task2.sh new file mode 100644 index 0000000..83cd297 --- /dev/null +++ b/projects/codes/scripts/DQN_task2.sh @@ -0,0 +1,15 @@ +# run DQN on Acrobot-v1, not the best tuned parameters + +# source conda, if you are already in proper conda environment, then comment the codes util "conda activate easyrl" +if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/anaconda3/etc/profile.d/conda.sh" + source ~/anaconda3/etc/profile.d/conda.sh +elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh" + source ~/opt/anaconda3/etc/profile.d/conda.sh +else + echo 'please manually config the conda source path' +fi +conda activate easyrl # easyrl here can be changed to another name of conda env that you have created +codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path +python $codes_dir/DQN/main.py --env_name Acrobot-v1 --train_eps 100 --epsilon_decay 1500 --lr 0.002 --memory_capacity 200000 --batch_size 128 --device cuda \ No newline at end of file diff --git a/projects/codes/scripts/Qlearning_task0.sh b/projects/codes/scripts/Qlearning_task0.sh new file mode 100644 index 0000000..7ed9089 --- /dev/null +++ b/projects/codes/scripts/Qlearning_task0.sh @@ -0,0 +1,14 @@ + +# source conda, if you are already in proper conda environment, then comment the codes util "conda activate easyrl" +if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/anaconda3/etc/profile.d/conda.sh" + source ~/anaconda3/etc/profile.d/conda.sh +elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh" + source ~/opt/anaconda3/etc/profile.d/conda.sh +else + echo 'please manually config the conda source path' +fi +conda activate easyrl # easyrl here can be changed to another name of conda env that you have created +codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path +python $codes_dir/QLearning/main.py --device cpu \ No newline at end of file diff --git a/projects/codes/scripts/Qlearning_task1.sh b/projects/codes/scripts/Qlearning_task1.sh new file mode 100644 index 0000000..57cc23d --- /dev/null +++ b/projects/codes/scripts/Qlearning_task1.sh @@ -0,0 +1,14 @@ + +# source conda, if you are already in proper conda environment, then comment the codes util "conda activate easyrl" +if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/anaconda3/etc/profile.d/conda.sh" + source ~/anaconda3/etc/profile.d/conda.sh +elif [ -f "$HOME/opt/anaconda3/etc/profile.d/conda.sh" ]; then + echo "source file at ~/opt/anaconda3/etc/profile.d/conda.sh" + source ~/opt/anaconda3/etc/profile.d/conda.sh +else + echo 
'please manually config the conda source path' +fi +conda activate easyrl # easyrl here can be changed to another name of conda env that you have created +codes_dir=$(dirname $(dirname $(readlink -f "$0"))) # "codes" path +python $codes_dir/QLearning/main.py --env_name FrozenLake-v1 --train_eps 800 --epsilon_start 0.70 --epsilon_end 0.1 --epsilon_decay 2000 --gamma 0.9 --lr 0.9 --device cpu \ No newline at end of file diff --git a/projects/notebooks/figs/dqn_pseu.png b/projects/notebooks/figs/dqn_pseu.png index 7ad89f5..591405b 100644 Binary files a/projects/notebooks/figs/dqn_pseu.png and b/projects/notebooks/figs/dqn_pseu.png differ diff --git a/projects/requirements.txt b/projects/requirements.txt index 8db643c..13ee9d0 100644 --- a/projects/requirements.txt +++ b/projects/requirements.txt @@ -1,11 +1,8 @@ gym==0.21.0 -torch==1.10.0 -torchvision==0.11.0 -torchaudio==0.10.0 ipykernel==6.15.1 jupyter==1.0.0 matplotlib==3.5.2 seaborn==0.11.2 dill==0.3.5.1 argparse==1.4.0 -pandas==1.3.5 \ No newline at end of file +pandas==1.3.5
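The change set above deletes `QLearning/task0.py` and has the new `Qlearning_task0.sh` script call a `QLearning/main.py` entry point that is not included in this diff. As a rough, illustrative sketch of how the refactored, dict-configured `QLearning` class would be driven (the training-loop wiring, the output path, and the omission of the `CliffWalkingWapper` wrapper are assumptions for illustration, not the actual contents of `main.py`):

```python
import gym
from qlearning import QLearning  # the refactored class above; run from projects/codes/QLearning/

# Config dict mirroring the keys recorded in params.json for the CliffWalking-v0 run.
cfg = {
    "env_name": "CliffWalking-v0", "train_eps": 400,
    "gamma": 0.9, "lr": 0.1,
    "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 300,
}

env = gym.make(cfg["env_name"])  # the deleted task0.py also wrapped this with CliffWalkingWapper; skipped here
cfg["n_states"] = env.observation_space.n   # 48
cfg["n_actions"] = env.action_space.n       # 4
agent = QLearning(cfg)

for i_ep in range(cfg["train_eps"]):
    state = env.reset()                      # gym==0.21.0: reset() returns only the observation
    ep_reward = 0
    while True:
        action = agent.sample_action(state)                     # e-greedy with exponentially decaying epsilon
        next_state, reward, done, _ = env.step(action)          # gym==0.21.0: 4-tuple step API
        agent.update(state, action, reward, next_state, done)   # tabular Q-learning update
        state = next_state
        ep_reward += reward
        if done:
            break
    print(f"Episode {i_ep + 1}/{cfg['train_eps']}: reward {ep_reward:.1f}, epsilon {agent.epsilon:.3f}")

agent.save_model(path="./outputs/CliffWalking-v0/models/")  # illustrative path; creates it and saves Qleaning_model.pkl
```

Passing a plain dict instead of an argparse `Namespace` is what lets the updated `save_args` dump the parameters directly via `json.dump(args, fp, cls=NpEncoder)`, including the `n_states`/`n_actions` entries that are only known after the environment is created.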