Merge branch 'master' of github.com:datawhalechina/easy-rl

This commit is contained in:
qiwang067
2021-05-10 13:55:00 +08:00
211 changed files with 2683 additions and 2993 deletions

3
.gitignore vendored
View File

@@ -1,3 +1,4 @@
.DS_STORE
__pycache__
.vscode
.vscode
test.py

437
LICENSE
View File

@@ -1,437 +0,0 @@
Attribution-NonCommercial-ShareAlike 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International Public License
("Public License"). To the extent this Public License may be
interpreted as a contract, You are granted the Licensed Rights in
consideration of Your acceptance of these terms and conditions, and the
Licensor grants You such rights in consideration of benefits the
Licensor receives from making the Licensed Material available under
these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. BY-NC-SA Compatible License means a license listed at
creativecommons.org/compatiblelicenses, approved by Creative
Commons as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
e. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name
of a Creative Commons Public License. The License Elements of this
Public License are Attribution, NonCommercial, and ShareAlike.
h. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
i. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
k. NonCommercial means not primarily intended for or directed towards
commercial advantage or monetary compensation. For purposes of
this Public License, the exchange of the Licensed Material for
other material subject to Copyright and Similar Rights by digital
file-sharing or similar means is NonCommercial provided there is
no payment of monetary compensation in connection with the
exchange.
l. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
m. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
n. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part, for NonCommercial purposes only; and
b. produce, reproduce, and Share Adapted Material for
NonCommercial purposes only.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. Additional offer from the Licensor -- Adapted Material.
Every recipient of Adapted Material from You
automatically receives an offer from the Licensor to
exercise the Licensed Rights in the Adapted Material
under the conditions of the Adapter's License You apply.
c. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties, including when
the Licensed Material is used other than for NonCommercial
purposes.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike.
In addition to the conditions in Section 3(a), if You Share
Adapted Material You produce, the following conditions also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-NC-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database for NonCommercial purposes
only;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material,
including for purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.

View File

@@ -1,69 +0,0 @@
# Easy-RL
李宏毅老师的《深度强化学习》是强化学习领域经典的中文视频之一。李老师幽默风趣的上课风格让晦涩难懂的强化学习理论变得轻松易懂,他会通过很多有趣的例子来讲解强化学习理论。比如老师经常会用玩 Atari 游戏的例子来讲解强化学习算法。此外,为了教程的完整性,我们整理了周博磊老师的《强化学习纲要》、李科浇老师的《百度强化学习》以及多个强化学习的经典资料作为补充。对于想入门强化学习又想看中文讲解的人来说绝对是非常推荐的。
## 使用说明
* 第 4 章到第 11 章为[李宏毅《深度强化学习》](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html)的部分;
* 第 1 章和第 2 章根据[《强化学习纲要》](https://github.com/zhoubolei/introRL)整理而来;
* 第 3 章和第 12 章根据[《百度强化学习》](https://aistudio.baidu.com/aistudio/education/group/info/1335) 整理而来。
## 在线阅读(内容实时更新)
地址https://datawhalechina.github.io/easy-rl/
## 内容导航
| 章节 | 习题 | 相关项目 |
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| [第一章 强化学习概述](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1) | [第一章 习题](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1_questions&keywords) | |
| [第二章 马尔可夫决策过程 (MDP)](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2) | [第二章 习题](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2_questions&keywords) | |
| [第三章 表格型方法](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3) | [第三章 习题](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3_questions&keywords) | [Q-learning算法实战](https://datawhalechina.github.io/easy-rl/#/chapter3/project1) |
| [第四章 策略梯度](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4) | [第四章 习题](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4_questions&keywords) | |
| [第五章 近端策略优化 (PPO) 算法](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5) | [第五章 习题](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5_questions&keywords) | |
| [第六章 DQN (基本概念)](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6) | [第六章 习题](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6_questions&keywords) | |
| [第七章 DQN (进阶技巧)](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7) | [第七章 习题](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7_questions&keywords) | [DQN算法实战](https://datawhalechina.github.io/easy-rl/#/chapter7/project2) |
| [第八章 DQN (连续动作)](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8) | [第八章 习题](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8_questions&keywords) | |
| [第九章 演员-评论家算法](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9) | [第九章 习题](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9_questions&keywords) | |
| [第十章 稀疏奖励](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10) | [第十章 习题](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10_questions&keywords) | |
| [第十一章 模仿学习](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11) | [第十一章 习题](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11_questions&keywords) | |
| [第十二章 深度确定性策略梯度 (DDPG) 算法](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12) | [第十二章 习题](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12_questions&keywords) | [DDPG算法实战](https://datawhalechina.github.io/easy-rl/#/chapter12/project3) |
| [第十三章 AlphaStar 论文解读](https://datawhalechina.github.io/easy-rl/#/chapter13/chapter13) | | |
## 算法实战
[点击](https://github.com/datawhalechina/easy-rl/tree/master/codes)或者跳转```codes```文件夹下进入算法实战
## 贡献者
<table border="0">
<tbody>
<tr align="center" >
<td>
<a href="https://github.com/qiwang067"><img width="70" height="70" src="https://github.com/qiwang067.png?s=40" alt="pic"></a><br>
<a href="https://github.com/qiwang067">Qi Wang</a>
<p>教程设计(第1~12章)<br> 中国科学院大学</p>
</td>
<td>
<a href="https://github.com/yyysjz1997"><img width="70" height="70" src="https://github.com/yyysjz1997.png?s=40" alt="pic"></a><br>
<a href="https://github.com/yyysjz1997">David Young</a>
<p>习题设计&第13章 <br> 清华大学</p>
</td>
<td>
<a href="https://github.com/JohnJim0816"><img width="70" height="70" src="https://github.com/JohnJim0816.png?s=40" alt="pic"></a><br>
<a href="https://github.com/JohnJim0816">John Jim</a>
<p>算法实战<br> 北京大学</p>
</td>
</tr>
</tbody>
</table>
## 致谢
特别感谢 [@Sm1les](https://github.com/Sm1les)、[@LSGOMYP](https://github.com/LSGOMYP) 对本项目的帮助与支持。
## 关注我们
<div align=center><img src="https://raw.githubusercontent.com/datawhalechina/pumpkin-book/master/res/qrcode.jpeg" width = "250" height = "270" alt="Datawhale是一个专注AI领域的开源组织以“for the learner和学习者一起成长”为愿景构建对学习者最有价值的开源学习社区。关注我们一起学习成长。"></div>
## LICENSE
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://img.shields.io/badge/license-CC%20BY--NC--SA%204.0-lightgrey" /></a><br />本作品采用<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议</a>进行许可。

View File

@@ -0,0 +1,5 @@
## A2C
https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

View File

@@ -1,32 +1,27 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2020-11-03 20:47:09
LastEditor: John
LastEditTime: 2021-03-20 17:41:21
Date: 2021-05-03 22:16:08
LastEditor: JiangJi
LastEditTime: 2021-05-03 22:23:48
Discription:
Environment:
'''
from A2C.model import ActorCritic
import torch.optim as optim
from A2C.model import ActorCritic
class A2C:
def __init__(self,state_dim, action_dim, cfg):
self.gamma = 0.99
self.model = ActorCritic(state_dim, action_dim, hidden_dim=cfg.hidden_dim).to(cfg.device)
self.optimizer = optim.Adam(self.model.parameters(),lr=cfg.lr)
def choose_action(self, state):
dist, value = self.model(state)
action = dist.sample()
return action
def __init__(self,state_dim,action_dim,cfg) -> None:
self.gamma = cfg.gamma
self.device = cfg.device
self.model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(self.device)
self.optimizer = optim.Adam(self.model.parameters())
def compute_returns(self,next_value, rewards, masks):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + self.gamma * R * masks[step]
returns.insert(0, R)
return returns
def update(self):
pass
return returns

View File

@@ -1,48 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-30 15:39:37
LastEditor: John
LastEditTime: 2021-03-17 20:19:14
Discription:
Environment:
'''
import gym
from A2C.multiprocessing_env import SubprocVecEnv
# num_envs = 16
# env = "Pendulum-v0"
def make_envs(num_envs=16,env="Pendulum-v0"):
''' 创建多个子环境
'''
num_envs = 16
env = "CartPole-v0"
def make_env():
def _thunk():
env = gym.make(env)
return env
return _thunk
envs = [make_env() for i in range(num_envs)]
envs = SubprocVecEnv(envs)
return envs
# if __name__ == "__main__":
# num_envs = 16
# env = "CartPole-v0"
# def make_env():
# def _thunk():
# env = gym.make(env)
# return env
# return _thunk
# envs = [make_env() for i in range(num_envs)]
# envs = SubprocVecEnv(envs)
if __name__ == "__main__":
envs = make_envs(num_envs=16,env="CartPole-v0")

View File

@@ -1,106 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
LastEditTime: 2021-04-05 11:14:39
@Discription:
@Environment: python 3.7.9
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import torch
import gym
import datetime
from A2C.agent import A2C
from common.utils import save_results,make_dir,del_empty_dir
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # 生成保存的模型路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # 存储reward的路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
class A2CConfig:
def __init__(self):
self.gamma = 0.99
self.lr = 3e-4 # learnning rate
self.actor_lr = 1e-4 # learnning rate of actor network
self.memory_capacity = 10000 # capacity of replay memory
self.batch_size = 128
self.train_eps = 200
self.train_steps = 200
self.eval_eps = 200
self.eval_steps = 200
self.target_update = 4
self.hidden_dim=256
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def train(cfg,env,agent):
print('Start to train ! ')
for i_episode in range(cfg.train_eps):
state = env.reset()
log_probs = []
values = []
rewards = []
masks = []
entropy = 0
ep_reward = 0
for i_step in range(cfg.train_steps):
state = torch.FloatTensor(state).to(cfg.device)
dist, value = agent.model(state)
action = dist.sample()
next_state, reward, done, _ = env.step(action.cpu().numpy())
ep_reward+=reward
state = next_state
log_prob = dist.log_prob(action)
entropy += dist.entropy().mean()
log_probs.append(log_prob)
values.append(value)
rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device))
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device))
if done:
break
print('Episode:{}/{}, Reward:{}, Steps:{}, Done:{}'.format(i_episode+1,cfg.train_eps,ep_reward,i_step+1,done))
next_state = torch.FloatTensor(next_state).to(cfg.device)
_, next_value =agent.model(next_state)
returns = agent.compute_returns(next_value, rewards, masks)
log_probs = torch.cat(log_probs)
returns = torch.cat(returns).detach()
values = torch.cat(values)
advantage = returns - values
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy
agent.optimizer.zero_grad()
loss.backward()
agent.optimizer.step()
print('Complete training')
if __name__ == "__main__":
cfg = A2CConfig()
env = gym.make('CartPole-v0')
env.seed(1) # set random seed for env
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = A2C(state_dim, action_dim, cfg)
train(cfg,env,agent)

View File

@@ -1,36 +1,36 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2020-11-03 20:45:25
LastEditor: John
LastEditTime: 2021-03-20 17:41:33
Date: 2021-05-03 21:38:54
LastEditor: JiangJi
LastEditTime: 2021-05-03 21:40:06
Discription:
Environment:
'''
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim, hidden_dim=256):
def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
super(ActorCritic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
nn.Linear(hidden_size, 1)
)
self.actor = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
nn.Linear(hidden_size, num_outputs),
nn.Softmax(dim=1),
)
def forward(self, x):
value = self.critic(x)
print(x)
probs = self.actor(x)
dist = Categorical(probs)
return dist, value

View File

@@ -1,153 +0,0 @@
#This code is from openai baseline
#https://github.com/openai/baselines/tree/master/baselines/common/vec_env
import numpy as np
from multiprocessing import Process, Pipe
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
class VecEnv(object):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
def close(self):
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.nenvs = nenvs
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def __len__(self):
return self.nenvs

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB

120
codes/A2C/task0_train.py Normal file
View File

@@ -0,0 +1,120 @@
import sys,os
curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import numpy as np
import torch
import torch.optim as optim
import datetime
from common.multiprocessing_env import SubprocVecEnv
from A2C.model import ActorCritic
from common.utils import save_results, make_dir
from common.plot import plot_rewards
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
class A2CConfig:
def __init__(self) -> None:
self.algo='A2C'
self.env= 'CartPole-v0'
self.result_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
self.n_envs = 8
self.gamma = 0.99
self.hidden_size = 256
self.lr = 1e-3 # learning rate
self.max_frames = 30000
self.n_steps = 5
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def make_envs(env_name):
def _thunk():
env = gym.make(env_name)
env.seed(2)
return env
return _thunk
def test_env(env,model,vis=False):
state = env.reset()
if vis: env.render()
done = False
total_reward = 0
while not done:
state = torch.FloatTensor(state).unsqueeze(0).to(cfg.device)
dist, _ = model(state)
next_state, reward, done, _ = env.step(dist.sample().cpu().numpy()[0])
state = next_state
if vis: env.render()
total_reward += reward
return total_reward
def compute_returns(next_value, rewards, masks, gamma=0.99):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + gamma * R * masks[step]
returns.insert(0, R)
return returns
def train(cfg,envs):
env = gym.make(cfg.env) # a single env
env.seed(10)
state_dim = envs.observation_space.shape[0]
action_dim = envs.action_space.n
model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(cfg.device)
optimizer = optim.Adam(model.parameters())
frame_idx = 0
test_rewards = []
test_ma_rewards = []
state = envs.reset()
while frame_idx < cfg.max_frames:
log_probs = []
values = []
rewards = []
masks = []
entropy = 0
# rollout trajectory
for _ in range(cfg.n_steps):
state = torch.FloatTensor(state).to(cfg.device)
dist, value = model(state)
action = dist.sample()
next_state, reward, done, _ = envs.step(action.cpu().numpy())
log_prob = dist.log_prob(action)
entropy += dist.entropy().mean()
log_probs.append(log_prob)
values.append(value)
rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device))
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device))
state = next_state
frame_idx += 1
if frame_idx % 100 == 0:
test_reward = np.mean([test_env(env,model) for _ in range(10)])
print(f"frame_idx:{frame_idx}, test_reward:{test_reward}")
test_rewards.append(test_reward)
if test_ma_rewards:
test_ma_rewards.append(0.9*test_ma_rewards[-1]+0.1*test_reward)
else:
test_ma_rewards.append(test_reward)
# plot(frame_idx, test_rewards)
next_state = torch.FloatTensor(next_state).to(cfg.device)
_, next_value = model(next_state)
returns = compute_returns(next_value, rewards, masks)
log_probs = torch.cat(log_probs)
returns = torch.cat(returns).detach()
values = torch.cat(values)
advantage = returns - values
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
return test_rewards, test_ma_rewards
if __name__ == "__main__":
cfg = A2CConfig()
envs = [make_envs(cfg.env) for i in range(cfg.n_envs)]
envs = SubprocVecEnv(envs) # 8 env
rewards,ma_rewards = train(cfg,envs)
make_dir(cfg.result_path,cfg.model_path)
save_results(rewards,ma_rewards,tag='train',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="train",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -1,162 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-20 17:43:17
LastEditor: John
LastEditTime: 2021-04-05 11:19:20
Discription:
Environment:
'''
import sys
import torch
import gym
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
import matplotlib.pyplot as plt
import pandas as pd
learning_rate = 3e-4
# Constants
GAMMA = 0.99
class A2CConfig:
''' hyperparameters
'''
def __init__(self):
self.gamma = 0.99
self.lr = 3e-4 # learnning rate
self.actor_lr = 1e-4 # learnning rate of actor network
self.memory_capacity = 10000 # capacity of replay memory
self.batch_size = 128
self.train_eps = 3000
self.train_steps = 200
self.eval_eps = 200
self.eval_steps = 200
self.target_update = 4
self.hidden_dim = 256
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class ActorCritic(nn.Module):
def __init__(self, n_states, n_actions, hidden_dim, learning_rate=3e-4):
super(ActorCritic, self).__init__()
self.n_actions = n_actions
self.critic_linear1 = nn.Linear(n_states, hidden_dim)
self.critic_linear2 = nn.Linear(hidden_dim, 1)
self.actor_linear1 = nn.Linear(n_states, hidden_dim)
self.actor_linear2 = nn.Linear(hidden_dim, n_actions)
def forward(self, state):
state = Variable(torch.from_numpy(state).float().unsqueeze(0))
value = F.relu(self.critic_linear1(state))
value = self.critic_linear2(value)
policy_dist = F.relu(self.actor_linear1(state))
policy_dist = F.softmax(self.actor_linear2(policy_dist), dim=1)
return value, policy_dist
class A2C:
def __init__(self,n_states,n_actions,cfg):
self.model = ActorCritic(n_states, n_actions, cfg.hidden_dim)
self.optimizer = optim.Adam(self.model.parameters(), lr=cfg.lr)
def choose_action(self,state):
pass
def update(self):
pass
def train(cfg,env,agent):
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n
actor_critic = ActorCritic(n_states, n_actions, cfg.hidden_dim)
ac_optimizer = optim.Adam(actor_critic.parameters(), lr=learning_rate)
all_lengths = []
average_lengths = []
all_rewards = []
entropy_term = 0
for episode in range(cfg.train_eps):
log_probs = []
values = []
rewards = []
state = env.reset()
for steps in range(cfg.train_steps):
value, policy_dist = actor_critic.forward(state)
value = value.detach().numpy()[0,0]
dist = policy_dist.detach().numpy()
action = np.random.choice(n_actions, p=np.squeeze(dist))
log_prob = torch.log(policy_dist.squeeze(0)[action])
entropy = -np.sum(np.mean(dist) * np.log(dist))
new_state, reward, done, _ = env.step(action)
rewards.append(reward)
values.append(value)
log_probs.append(log_prob)
entropy_term += entropy
state = new_state
if done or steps == cfg.train_steps-1:
Qval, _ = actor_critic.forward(new_state)
Qval = Qval.detach().numpy()[0,0]
all_rewards.append(np.sum(rewards))
all_lengths.append(steps)
average_lengths.append(np.mean(all_lengths[-10:]))
if episode % 10 == 0:
sys.stdout.write("episode: {}, reward: {}, total length: {}, average length: {} \n".format(episode, np.sum(rewards), steps+1, average_lengths[-1]))
break
# compute Q values
Qvals = np.zeros_like(values)
for t in reversed(range(len(rewards))):
Qval = rewards[t] + GAMMA * Qval
Qvals[t] = Qval
#update actor critic
values = torch.FloatTensor(values)
Qvals = torch.FloatTensor(Qvals)
log_probs = torch.stack(log_probs)
advantage = Qvals - values
actor_loss = (-log_probs * advantage).mean()
critic_loss = 0.5 * advantage.pow(2).mean()
ac_loss = actor_loss + critic_loss + 0.001 * entropy_term
ac_optimizer.zero_grad()
ac_loss.backward()
ac_optimizer.step()
# Plot results
smoothed_rewards = pd.Series.rolling(pd.Series(all_rewards), 10).mean()
smoothed_rewards = [elem for elem in smoothed_rewards]
plt.plot(all_rewards)
plt.plot(smoothed_rewards)
plt.plot()
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.show()
plt.plot(all_lengths)
plt.plot(average_lengths)
plt.xlabel('Episode')
plt.ylabel('Episode length')
plt.show()
if __name__ == "__main__":
cfg = A2CConfig()
env = gym.make("CartPole-v0")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n
agent = A2C(n_states,n_actions,cfg)
train(cfg,env,agent)

View File

@@ -1,32 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-15 21:31:19
LastEditor: John
LastEditTime: 2020-11-03 17:05:48
Discription:
Environment:
'''
import os
import numpy as np
import datetime
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/'
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/'
def save_results(rewards,moving_average_rewards,ep_steps,path=RESULT_PATH):
if not os.path.exists(path): # 检测是否存在文件夹
os.mkdir(path)
np.save(RESULT_PATH+'rewards_train.npy', rewards)
np.save(RESULT_PATH+'moving_average_rewards_train.npy', moving_average_rewards)
np.save(RESULT_PATH+'steps_train.npy',ep_steps )
def save_model(agent,model_path='./saved_model'):
if not os.path.exists(model_path): # 检测是否存在文件夹
os.mkdir(model_path)
agent.save_model(model_path+'checkpoint.pth')
print('model saved')

View File

@@ -1,5 +1,7 @@
# DDPG
#TODO
## 伪代码
![image-20210320151900695](assets/image-20210320151900695.png)

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-09 20:25:52
@LastEditor: John
LastEditTime: 2021-03-31 00:56:32
LastEditTime: 2021-05-04 14:50:17
@Discription:
@Environment: python 3.7.7
'''
@@ -26,6 +26,7 @@ class DDPG:
self.target_critic = Critic(state_dim, action_dim, cfg.hidden_dim).to(cfg.device)
self.target_actor = Actor(state_dim, action_dim, cfg.hidden_dim).to(cfg.device)
# copy parameters to target net
for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
target_param.data.copy_(param.data)
for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
@@ -42,7 +43,6 @@ class DDPG:
def choose_action(self, state):
state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
action = self.actor(state)
# torch.detach()用于切断反向传播
return action.detach().cpu().numpy()[0, 0]
def update(self):
@@ -50,13 +50,13 @@ class DDPG:
return
state, action, reward, next_state, done = self.memory.sample(
self.batch_size)
# 将所有变量转为张量
# convert variables to Tensor
state = torch.FloatTensor(state).to(self.device)
next_state = torch.FloatTensor(next_state).to(self.device)
action = torch.FloatTensor(action).to(self.device)
reward = torch.FloatTensor(reward).unsqueeze(1).to(self.device)
done = torch.FloatTensor(np.float32(done)).unsqueeze(1).to(self.device)
# 注意critic将(s_t,a)作为输入
policy_loss = self.critic(state, self.actor(state))
policy_loss = -policy_loss.mean()
next_action = self.target_actor(next_state)

View File

@@ -1,94 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
LastEditTime: 2021-04-29 01:58:50
@Discription:
@Environment: python 3.7.7
'''
import sys,os
from pathlib import Path
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import torch
import gym
import numpy as np
import datetime
from DDPG.agent import DDPG
from DDPG.env import NormalizedActions,OUNoise
from common.plot import plot_rewards
from common.utils import save_results
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
SAVED_MODEL_PATH = curr_path+"/saved_model/"+SEQUENCE+'/' # path to save model
if not os.path.exists(curr_path+"/saved_model/"): os.mkdir(curr_path+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH): os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = curr_path+"/results/"+SEQUENCE+'/' # path to save rewards
if not os.path.exists(curr_path+"/results/"): os.mkdir(curr_path+"/results/")
if not os.path.exists(RESULT_PATH): os.mkdir(RESULT_PATH)
class DDPGConfig:
def __init__(self):
self.env = 'Pendulum-v0'
self.algo = 'DDPG'
self.gamma = 0.99
self.critic_lr = 1e-3
self.actor_lr = 1e-4
self.memory_capacity = 10000
self.batch_size = 128
self.train_eps =300
self.eval_eps = 200
self.eval_steps = 200
self.target_update = 4
self.hidden_dim = 30
self.soft_tau=1e-2
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def train(cfg,env,agent):
print('Start to train ! ')
ou_noise = OUNoise(env.action_space) # action noise
rewards = []
ma_rewards = [] # moving average rewards
ep_steps = []
for i_episode in range(cfg.train_eps):
state = env.reset()
ou_noise.reset()
done = False
ep_reward = 0
i_step = 0
while not done:
i_step += 1
action = agent.choose_action(state)
action = ou_noise.get_action(action, i_step) # 即paper中的random process
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done)
agent.update()
state = next_state
print('Episode:{}/{}, Reward:{}'.format(i_episode+1,cfg.train_eps,ep_reward))
ep_steps.append(i_step)
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete training')
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DDPGConfig()
env = NormalizedActions(gym.make("Pendulum-v0"))
env.seed(1) # 设置env随机种子
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
agent = DDPG(state_dim,action_dim,cfg)
rewards,ma_rewards = train(cfg,env,agent)
agent.save(path=SAVED_MODEL_PATH)
save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
plot_rewards(rewards,ma_rewards,tag="train",algo = cfg.algo,path=RESULT_PATH)

Binary file not shown.

After

Width:  |  Height:  |  Size: 61 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 69 KiB

135
codes/DDPG/task0_train.py Normal file
View File

@@ -0,0 +1,135 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
LastEditTime: 2021-05-04 14:49:45
@Discription:
@Environment: python 3.7.7
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import datetime
import gym
import torch
from DDPG.env import NormalizedActions, OUNoise
from DDPG.agent import DDPG
from common.utils import save_results,make_dir
from common.plot import plot_rewards
curr_time = datetime.datetime.now().strftime(
"%Y%m%d-%H%M%S") # obtain current time
class DDPGConfig:
def __init__(self):
self.algo = 'DDPG'
self.env = 'Pendulum-v0' # env name
self.result_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save results
self.gamma = 0.99
self.critic_lr = 1e-3
self.actor_lr = 1e-4
self.memory_capacity = 10000
self.batch_size = 128
self.train_eps = 300
self.eval_eps = 50
self.eval_steps = 200
self.target_update = 4
self.hidden_dim = 30
self.soft_tau = 1e-2
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu")
def env_agent_config(cfg,seed=1):
env = NormalizedActions(gym.make(cfg.env))
env.seed(seed)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
agent = DDPG(state_dim,action_dim,cfg)
return env,agent
def train(cfg, env, agent):
print('Start to train ! ')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
ou_noise = OUNoise(env.action_space) # action noise
rewards = []
ma_rewards = [] # moving average rewards
for i_episode in range(cfg.train_eps):
state = env.reset()
ou_noise.reset()
done = False
ep_reward = 0
i_step = 0
while not done:
i_step += 1
action = agent.choose_action(state)
action = ou_noise.get_action(
action, i_step) # 即paper中的random process
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done)
agent.update()
state = next_state
print('Episode:{}/{}, Reward:{}'.format(i_episode+1, cfg.train_eps, ep_reward))
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete training')
return rewards, ma_rewards
def eval(cfg, env, agent):
print('Start to Eval ! ')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moving average rewards
for i_episode in range(cfg.eval_eps):
state = env.reset()
done = False
ep_reward = 0
i_step = 0
while not done:
i_step += 1
action = agent.choose_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
state = next_state
print('Episode:{}/{}, Reward:{}'.format(i_episode+1, cfg.train_eps, ep_reward))
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete Eval')
return rewards, ma_rewards
if __name__ == "__main__":
cfg = DDPGConfig()
# train
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
# eval
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -1,6 +1,7 @@
# DQN
## 原理简介
DQN是Q-leanning算法的优化和延伸Q-leaning中使用有限的Q表存储值的信息而DQN中则用神经网络替代Q表存储信息这样更适用于高维的情况相关知识基础可参考[datawhale李宏毅笔记-Q学习](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6)。
论文方面主要可以参考两篇一篇就是2013年谷歌DeepMind团队的[Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)一篇是也是他们团队后来在Nature杂志上发表的[Human-level control through deep reinforcement learning](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf)。后者在算法层面增加target q-net也可以叫做Nature DQN。
@@ -15,7 +16,7 @@ https://blog.csdn.net/JohnJim0/article/details/109557173)
<img src="assets/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0pvaG5KaW0w,size_16,color_FFFFFF,t_70.png" alt="img" style="zoom:50%;" />
## 代码实
## 代码实
### RL接口
@@ -24,23 +25,26 @@ https://blog.csdn.net/JohnJim0/article/details/109557173)
for i_episode in range(MAX_EPISODES):
state = env.reset() # reset环境状态
for i_step in range(MAX_STEPS):
action = agent.choose_action(state) # 根据当前环境state选择action
next_state, reward, done, _ = env.step(action) # 更新环境参数
agent.memory.push(state, action, reward, next_state, done) # 将state等这些transition存入memory
agent.update() # 每步更新网络
state = next_state # 跳转到下一个状态
if done:
break
action = agent.choose_action(state) # 根据当前环境state选择action
next_state, reward, done, _ = env.step(action) # 更新环境参数
agent.memory.push(state, action, reward, next_state, done) # 将state等这些transition存入memory
agent.update() # 每步更新网络
state = next_state # 跳转到下一个状态
if done:
break
```
如上首先需要循环多个episode训练在每个episode中首先需要重置环境然后开始探索每个episode加一个MAX_STEPS(也可以使用while not done, 加这个max_steps有时是因为比如gym环境训练目标就是在200个step下达到200的reward),接下来的流程如下:
每个episode加一个MAX_STEPS也可以使用while not done, 加这个max_steps有时是因为比如gym环境训练目标就是在200个step下达到200的reward或者是当完成一个episode的步数较多时也可以设置基本流程跟所有伪代码一致如下:
1. agent选择动作
2. 环境根据agent的动作反馈出新的state和reward
2. 环境根据agent的动作反馈出next_state和reward
3. agent进行更新如有memory就会将transition(包含staterewardaction等)存入memory中
4. 跳转到下一个状态
如果提前done了就跳出for循环进行下一个episode的训练。
5. 如果done了就跳出循环进行下一个episode的训练。
想要实现完整的算法还需要创建QnetReplaybuffer等类
### 两个Q网络
前面讲了Nature DQN中有两个Q网络一个是policy_net一个是延时更新的target_net两个网络的结构是一模一样的如下(见```model.py```)
上文讲了Nature DQN中有两个Q网络一个是policy_net一个是延时更新的target_net两个网络的结构是一模一样的如下(见```model.py```)注意DQN使用的Qnet就是全连接网络即FCH
```python
import torch.nn as nn
import torch.nn.functional as F
@@ -62,30 +66,12 @@ class FCN(nn.Module):
x = F.relu(self.fc2(x))
return self.fc3(x)
```
输入为state输出为action注意根据state和action的维度调整隐藏层的层数这里设为128
输入为state_dim输出为action_dim包含一个128维度的隐藏层这里根据需要可增加隐藏层维度和数量然后一般使用relu激活函数这里跟深度学习的网路设置是一样的。
### Replay Buffer
然后就是Replay Memory了其作用主要是是克服经验数据的相关性correlated data和非平稳分布non-stationary distribution问题实现如下(见```memory.py```)
在```agent.py```中我们定义强化学习算法,包括```choose_action```和```update```两个主要函数,初始化中:
```python
self.policy_net = FCN(state_dim, action_dim).to(self.device)
self.target_net = FCN(state_dim, action_dim).to(self.device)
# target_net的初始模型参数完全复制policy_net
self.target_net.load_state_dict(self.policy_net.state_dict())
self.target_net.eval() # 不启用 BatchNormalization 和 Dropout
# 可查parameters()与state_dict()的区别前者require_grad=True
```
可以看到policy_net跟target_net结构和初始参数一样但在更新的时候target是每隔一段episode更新的如下(见```main.py```)
```python
# 更新target network复制DQN中的所有weights and biases
if i_episode % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
```
可以调整```cfg.target_update```注意该变量不要调得太大否则会收敛很慢我们最后保存的模型也是这个target_net如下(见```agent.py```)
```python
def save_model(self,path):
torch.save(self.target_net.state_dict(), path)
```
### Replay Memory
然后就是Replay Memory了如下(见```memory.py```)
```python
import random
import numpy as np
@@ -111,11 +97,120 @@ class ReplayBuffer:
def __len__(self):
return len(self.buffer)
```
其实比较简单主要包括push和sample两个步骤push是将transitions放到memory中sample是从memory随机抽取一些transition。
最后结果如下:
参数capacity表示buffer的容量主要包括push和sample两个步骤push是将transitions放到memory中sample是从memory随机抽取一些transition。
![rewards_curve_train](assets/rewards_curve_train.png)
### Agent类
在```agent.py```中我们定义强化学习算法类,包括```choose_action```(选择动作使用e-greedy策略时会多一个```predict```函数,下面会将到)和```update```(更新)等函数。
在类中建立两个网络以及optimizer和memory
```python
self.policy_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
self.target_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()): # copy params from policy net
target_param.data.copy_(param.data)
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.memory = ReplayBuffer(cfg.memory_capacity)
```
然后是选择action
```python
def choose_action(self, state):
'''选择动作
'''
self.frame_idx += 1
if random.random() > self.epsilon(self.frame_idx):
action = self.predict(state)
else:
action = random.randrange(self.action_dim)
return action
```
这里使用e-greedy策略即设置一个参数epsilon如果生成的随机数大于epsilon就根据网络预测的选择action否则还是随机选择action这个epsilon是会逐渐减小的可以使用线性或者指数减小的方式但不会减小到零这样在训练稳定时还能保持一定的探索这部分可以学习探索与利用(exploration and exploition)相关知识。
上面讲到的预测函数其实就是根据state选取q值最大的action如下
```python
def predict(self,state):
with torch.no_grad():
state = torch.tensor([state], device=self.device, dtype=torch.float32)
q_values = self.policy_net(state)
action = q_values.max(1)[1].item()
```
然后是更新函数了:
```python
def update(self):
if len(self.memory) < self.batch_size:
return
# 从memory中随机采样transition
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(
self.batch_size)
'''转为张量
例如tensor([[-4.5543e-02, -2.3910e-01, 1.8344e-02, 2.3158e-01],...,[-1.8615e-02, -2.3921e-01, -1.1791e-02, 2.3400e-01]])'''
state_batch = torch.tensor(
state_batch, device=self.device, dtype=torch.float)
action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(
1) # 例如tensor([[1],...,[0]])
reward_batch = torch.tensor(
reward_batch, device=self.device, dtype=torch.float) # tensor([1., 1.,...,1])
next_state_batch = torch.tensor(
next_state_batch, device=self.device, dtype=torch.float)
done_batch = torch.tensor(np.float32(
done_batch), device=self.device)
'''计算当前(s_t,a)对应的Q(s_t, a)'''
'''torch.gather:对于a=torch.Tensor([[1,2],[3,4]]),那么a.gather(1,torch.Tensor([[0],[1]]))=torch.Tensor([[1],[3]])'''
q_values = self.policy_net(state_batch).gather(
dim=1, index=action_batch) # 等价于self.forward
# 计算所有next states的V(s_{t+1})即通过target_net中选取reward最大的对应states
next_q_values = self.target_net(next_state_batch).max(
1)[0].detach() # 比如tensor([ 0.0060, -0.0171,...,])
# 计算 expected_q_value
# 对于终止状态此时done_batch[0]=1, 对应的expected_q_value等于reward
expected_q_values = reward_batch + \
self.gamma * next_q_values * (1-done_batch)
# self.loss = F.smooth_l1_loss(q_values,expected_q_values.unsqueeze(1)) # 计算 Huber loss
loss = nn.MSELoss()(q_values, expected_q_values.unsqueeze(1)) # 计算 均方误差loss
# 优化模型
self.optimizer.zero_grad() # zero_grad清除上一步所有旧的gradients from the last step
# loss.backward()使用backpropagation计算loss相对于所有parameters(需要gradients)的微分
loss.backward()
# for param in self.policy_net.parameters(): # clip防止梯度爆炸
# param.grad.data.clamp_(-1, 1)
self.optimizer.step() # 更新模型
```
更新遵循伪代码的以下部分:
<img src="assets/image-20210507162813393.png" alt="image-20210507162813393" style="zoom:50%;" />
首先从replay buffer中选取一个batch的数据计算loss然后进行minibatch SGD。
然后是保存与加载模型的部分,如下:
```python
def save(self, path):
torch.save(self.target_net.state_dict(), path+'dqn_checkpoint.pth')
def load(self, path):
self.target_net.load_state_dict(torch.load(path+'dqn_checkpoint.pth'))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)
```
### 实验结果
训练结果如下:
<img src="assets/train_rewards_curve.png" alt="train_rewards_curve" style="zoom: 67%;" />
<img src="assets/eval_rewards_curve.png" alt="eval_rewards_curve" style="zoom:67%;" />
## 参考

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2021-03-30 17:01:26
LastEditTime: 2021-05-07 16:30:05
@Discription:
@Environment: python 3.7.7
'''
@@ -35,34 +35,29 @@ class DQN:
(cfg.epsilon_start - cfg.epsilon_end) * \
math.exp(-1. * frame_idx / cfg.epsilon_decay)
self.batch_size = cfg.batch_size
self.policy_net = MLP(state_dim, action_dim,
hidden_dim=cfg.hidden_dim).to(self.device)
self.target_net = MLP(state_dim, action_dim,
hidden_dim=cfg.hidden_dim).to(self.device)
self.policy_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
self.target_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()): # copy params from policy net
target_param.data.copy_(param.data)
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.loss = 0
self.memory = ReplayBuffer(cfg.memory_capacity)
def choose_action(self, state):
'''选择动作
'''
self.frame_idx += 1
if random.random() > self.epsilon(self.frame_idx):
with torch.no_grad():
# 先转为张量便于丢给神经网络,state元素数据原本为float64
# 注意state=torch.tensor(state).unsqueeze(0)跟state=torch.tensor([state])等价
state = torch.tensor(
[state], device=self.device, dtype=torch.float32)
# 如tensor([[-0.0798, -0.0079]], grad_fn=<AddmmBackward>)
q_value = self.policy_net(state)
# tensor.max(1)返回每行的最大值以及对应的下标,
# 如torch.return_types.max(values=tensor([10.3587]),indices=tensor([0]))
# 所以tensor.max(1)[1]返回最大值对应的下标即action
action = q_value.max(1)[1].item()
action = self.predict(state)
else:
action = random.randrange(self.action_dim)
return action
def predict(self,state):
with torch.no_grad():
state = torch.tensor([state], device=self.device, dtype=torch.float32)
q_values = self.policy_net(state)
action = q_values.max(1)[1].item()
return action
def update(self):
if len(self.memory) < self.batch_size:
@@ -95,11 +90,11 @@ class DQN:
expected_q_values = reward_batch + \
self.gamma * next_q_values * (1-done_batch)
# self.loss = F.smooth_l1_loss(q_values,expected_q_values.unsqueeze(1)) # 计算 Huber loss
self.loss = nn.MSELoss()(q_values, expected_q_values.unsqueeze(1)) # 计算 均方误差loss
loss = nn.MSELoss()(q_values, expected_q_values.unsqueeze(1)) # 计算 均方误差loss
# 优化模型
self.optimizer.zero_grad() # zero_grad清除上一步所有旧的gradients from the last step
# loss.backward()使用backpropagation计算loss相对于所有parameters(需要gradients)的微分
self.loss.backward()
loss.backward()
# for param in self.policy_net.parameters(): # clip防止梯度爆炸
# param.grad.data.clamp_(-1, 1)
self.optimizer.step() # 更新模型
@@ -109,3 +104,5 @@ class DQN:
def load(self, path):
self.target_net.load_state_dict(torch.load(path+'dqn_checkpoint.pth'))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)

Binary file not shown.

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 76 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 37 KiB

File diff suppressed because one or more lines are too long

Binary file not shown.

Before

Width:  |  Height:  |  Size: 45 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 37 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

270
codes/DQN/task0_train.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:48:57
@LastEditor: John
LastEditTime: 2021-04-29 02:02:12
LastEditTime: 2021-05-05 16:49:15
@Discription:
@Environment: python 3.7.7
'''
@@ -14,20 +14,17 @@ curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import datetime
import torch
import gym
import torch
import datetime
from common.utils import save_results, make_dir, del_empty_dir
from common.utils import save_results, make_dir
from common.plot import plot_rewards
from DQN.agent import DQN
curr_time = datetime.datetime.now().strftime(
"%Y%m%d-%H%M%S") # obtain current time
class DQNConfig:
def __init__(self):
self.algo = "DQN" # name of algo
@@ -35,43 +32,53 @@ class DQNConfig:
self.result_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save results
'/'+curr_time+'/models/' # path to save models
self.train_eps = 300 # max trainng episodes
self.eval_eps = 50 # number of episodes for evaluating
self.gamma = 0.95
self.epsilon_start = 1 # e-greedy策略的初始epsilon
self.epsilon_start = 0.90 # start epsilon of e-greedy policy
self.epsilon_end = 0.01
self.epsilon_decay = 500
self.lr = 0.0001 # learning rate
self.memory_capacity = 10000 # Replay Memory容量
self.batch_size = 32
self.train_eps = 300 # 训练的episode数目
self.target_update = 2 # target net的更新频率
self.eval_eps = 20 # 测试的episode数目
self.memory_capacity = 100000 # capacity of Replay Memory
self.batch_size = 64
self.target_update = 4 # update frequency of target net
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu") # 检测gpu
self.hidden_dim = 256 # 神经网络隐藏层维度
"cuda" if torch.cuda.is_available() else "cpu") # check gpu
self.hidden_dim = 256 # hidden size of net
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env.seed(seed)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DQN(state_dim,action_dim,cfg)
return env,agent
def train(cfg, env, agent):
print('Start to train !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moveing average reward
for i_episode in range(cfg.train_eps):
for i_ep in range(cfg.train_eps):
state = env.reset()
done = False
ep_reward = 0
while not done:
while True:
action = agent.choose_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done)
state = next_state
agent.update()
if i_episode % cfg.target_update == 0:
if done:
break
if (i_ep+1) % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
print('Episode:{}/{}, Reward:{}'.format(i_episode+1, cfg.train_eps, ep_reward))
if (i_ep+1)%10 == 0:
print('Episode:{}/{}, Reward:{}'.format(i_ep+1, cfg.train_eps, ep_reward))
rewards.append(ep_reward)
# 计算滑动窗口的reward
# save ma rewards
if ma_rewards:
ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
else:
@@ -79,17 +86,45 @@ def train(cfg, env, agent):
print('Complete training')
return rewards, ma_rewards
def eval(cfg,env,agent):
print('Start to eval !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moving average rewards
for i_ep in range(cfg.eval_eps):
ep_reward = 0 # reward per episode
state = env.reset()
while True:
action = agent.predict(state)
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
if (i_ep+1)%10 == 10:
print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
print('Complete evaling')
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DQNConfig()
env = gym.make(cfg.env)
env.seed(1)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DQN(state_dim, action_dim, cfg)
# train
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
# eval
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -1,2 +0,0 @@
# DQN with cnn
原理与[DQN](../DQN)相同,只是将神经网络换成卷积神经网络,用于二维观测信息(state或obervation)

View File

@@ -1,107 +0,0 @@
import random
import math
import torch
import torch.optim as optim
import torch.nn.functional as F
from DQN_cnn.memory import ReplayBuffer
from DQN_cnn.model import CNN
class DQNcnn:
def __init__(self, screen_height,screen_width, action_dim, cfg):
self.device = cfg.device
self.action_dim = action_dim
self.gamma = cfg.gamma
# e-greedy策略相关参数
self.actions_count = 0
self.epsilon = 0
self.epsilon_start = cfg.epsilon_start
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.batch_size = cfg.batch_size
self.policy_net = CNN(screen_height, screen_width,
action_dim).to(self.device)
self.target_net = CNN(screen_height, screen_width,
action_dim).to(self.device)
self.target_net.load_state_dict(self.policy_net.state_dict()) # target_net的初始模型参数完全复制policy_net
self.target_net.eval() # 不启用 BatchNormalization 和 Dropout
self.optimizer = optim.RMSprop(self.policy_net.parameters(),lr = cfg.lr) # 可查parameters()与state_dict()的区别前者require_grad=True
self.loss = 0
self.memory = ReplayBuffer(cfg.memory_capacity)
def choose_action(self, state):
'''选择动作
Args:
state [array]: [description]
Returns:
action [array]: [description]
'''
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.actions_count / self.epsilon_decay)
self.actions_count += 1
if random.random() > self.epsilon:
with torch.no_grad():
q_value = self.policy_net(state) # q_value比如tensor([[-0.2522, 0.3887]])
# tensor.max(1)返回每行的最大值以及对应的下标,
# 如torch.return_types.max(values=tensor([10.3587]),indices=tensor([0]))
# 所以tensor.max(1)[1]返回最大值对应的下标即action
action = q_value.max(1)[1].view(1, 1) # 注意这里action是个张量如tensor([1])
return action
else:
return torch.tensor([[random.randrange(self.action_dim)]], device=self.device, dtype=torch.long)
def update(self):
if len(self.memory) < self.batch_size:
return
transitions = self.memory.sample(self.batch_size)
# Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
# detailed explanation). This converts batch-array of Transitions
# to Transition of batch-arrays.
batch = self.memory.Transition(*zip(*transitions))
# Compute a mask of non-final states and concatenate the batch elements
# (a final state would've been the one after which simulation ended)
non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
batch.state_)), device=self.device, dtype=torch.bool)
non_final_state_s = torch.cat([s for s in batch.state_
if s is not None])
state_batch = torch.cat(batch.state)
action_batch = torch.cat(batch.action)
reward_batch = torch.cat(batch.reward) # tensor([1., 1.,...,])
# Compute Q(s_t, a) - the model computes Q(s_t), then we select the
# columns of actions taken. These are the actions which would've been taken
# for each batch state according to policy_net
state_action_values = self.policy_net(
state_batch).gather(1, action_batch) #tensor([[ 1.1217],...,[ 0.8314]])
# Compute V(s_{t+1}) for all next states.
# Expected values of actions for non_final_state_s are computed based
# on the "older" target_net; selecting their best reward with max(1)[0].
# This is merged based on the mask, such that we'll have either the expected
# state value or 0 in case the state was final.
state__values = torch.zeros(self.batch_size, device=self.device)
state__values[non_final_mask] = self.target_net(
non_final_state_s).max(1)[0].detach()
# Compute the expected Q values
expected_state_action_values = (state__values * self.gamma) + reward_batch # tensor([0.9685, 0.9683,...,])
# Compute Huber loss
self.loss = F.smooth_l1_loss(
state_action_values, expected_state_action_values.unsqueeze(1)) # .unsqueeze增加一个维度
# Optimize the model
self.optimizer.zero_grad() # zero_grad clears old gradients from the last step (otherwise youd just accumulate the gradients from all loss.backward() calls).
self.loss.backward() # loss.backward() computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.
for param in self.policy_net.parameters(): # clip防止梯度爆炸
param.grad.data.clamp_(-1, 1)
self.optimizer.step() # causes the optimizer to take a step based on the gradients of the parameters.
if __name__ == "__main__":
dqn = DQN()

View File

@@ -1,66 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 10:02:35
@LastEditor: John
@LastEditTime: 2020-06-11 16:57:34
@Discription:
@Environment: python 3.7.7
'''
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
resize = T.Compose([T.ToPILImage(),
T.Resize(40, interpolation=Image.CUBIC),
T.ToTensor()])
def get_cart_location(env,screen_width):
world_width = env.x_threshold * 2
scale = screen_width / world_width
return int(env.state[0] * scale + screen_width / 2.0) # MIDDLE OF CART
def get_screen(env,device):
# Returned screen requested by gym is 400x600x3, but is sometimes larger
# such as 800x1200x3. Transpose it into torch order (CHW).
screen = env.render(mode='rgb_array').transpose((2, 0, 1))
# Cart is in the lower half, so strip off the top and bottom of the screen
_, screen_height, screen_width = screen.shape
screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
view_width = int(screen_width * 0.6)
cart_location = get_cart_location(env,screen_width)
if cart_location < view_width // 2:
slice_range = slice(view_width)
elif cart_location > (screen_width - view_width // 2):
slice_range = slice(-view_width, None)
else:
slice_range = slice(cart_location - view_width // 2,
cart_location + view_width // 2)
# Strip off the edges, so that we have a square image centered on a cart
screen = screen[:, :, slice_range]
# Convert to float, rescale, convert to torch tensor
# (this doesn't require a copy)
screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
screen = torch.from_numpy(screen)
# Resize, and add a batch dimension (BCHW)
return resize(screen).unsqueeze(0).to(device)
if __name__ == "__main__":
import gym
env = gym.make('CartPole-v0').unwrapped
# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env.reset()
import matplotlib.pyplot as plt
plt.figure()
plt.imshow(get_screen(env,device).cpu().squeeze(0).permute(1, 2, 0).numpy(),
interpolation='none')
plt.title('Example extracted screen')
plt.show()

View File

@@ -1,112 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 10:01:09
@LastEditor: John
LastEditTime: 2021-04-05 11:06:23
@Discription:
@Environment: python 3.7.7
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import torch
import datetime
from DQN_cnn.env import get_screen
from DQN_cnn.agent import DQNcnn
from common.plot import plot_rewards
from common.utils import save_results
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
SAVED_MODEL_PATH = curr_path+"/saved_model/"+SEQUENCE+'/' # path to save model
if not os.path.exists(curr_path+"/saved_model/"):
os.mkdir(curr_path+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = curr_path+"/results/"+SEQUENCE+'/' # path to save rewards
if not os.path.exists(curr_path+"/results/"):
os.mkdir(curr_path+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
class DQNcnnConfig:
def __init__(self) -> None:
self.algo = "DQN_cnn" # name of algo
self.gamma = 0.99
self.epsilon_start = 0.95 # e-greedy策略的初始epsilon
self.epsilon_end = 0.05
self.epsilon_decay = 200
self.lr = 0.01 # leanring rate
self.memory_capacity = 10000 # Replay Memory容量
self.batch_size = 64
self.train_eps = 250 # 训练的episode数目
self.train_steps = 200 # 训练每个episode的最大长度
self.target_update = 4 # target net的更新频率
self.eval_eps = 20 # 测试的episode数目
self.eval_steps = 200 # 测试每个episode的最大长度
self.hidden_dim = 128 # 神经网络隐藏层维度
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu") # if gpu is to be used
def train(cfg, env, agent):
rewards = []
ma_rewards = []
for i_episode in range(cfg.train_eps):
# Initialize the environment and state
env.reset()
last_screen = get_screen(env, cfg.device)
current_screen = get_screen(env, cfg.device)
state = current_screen - last_screen
ep_reward = 0
for i_step in range(cfg.train_steps+1):
# Select and perform an action
action = agent.choose_action(state)
_, reward, done, _ = env.step(action.item())
ep_reward += reward
reward = torch.tensor([reward], device=cfg.device)
# Observe new state
last_screen = current_screen
current_screen = get_screen(env, cfg.device)
if done:
break
state_ = current_screen - last_screen
# Store the transition in memory
agent.memory.push(state, action, state_, reward)
# Move to the next state
state = state_
# Perform one step of the optimization (on the target network)
agent.update()
# Update the target network, copying all weights and biases in DQN
if i_episode % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
print('Episode:{}/{}, Reward:{}, Steps:{}, Explore:{:.2f}, Done:{}'.format(i_episode+1,cfg.train_eps,ep_reward,i_step+1,agent.epsilon,done))
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DQNcnnConfig()
# Get screen size so that we can initialize layers correctly based on shape
# returned from AI gym. Typical dimensions at this point are close to 3x40x90
# which is the result of a clamped and down-scaled render buffer in get_screen(env,device)
# 因为这里环境的state需要从默认的向量改为图像所以要unwrapped更改state
env = gym.make('CartPole-v0').unwrapped
env.reset()
init_screen = get_screen(env, cfg.device)
_, _, screen_height, screen_width = init_screen.shape
# Get number of actions from gym action space
action_dim = env.action_space.n
agent = DQNcnn(screen_height, screen_width,
action_dim, cfg)
rewards,ma_rewards = train(cfg,env,agent)
save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
plot_rewards(rewards,ma_rewards,tag="train",algo = cfg.algo,path=RESULT_PATH)

View File

@@ -1,35 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 09:42:44
@LastEditor: John
LastEditTime: 2021-03-23 20:38:41
@Discription:
@Environment: python 3.7.7
'''
from collections import namedtuple
import random
class ReplayBuffer(object):
def __init__(self, capacity):
self.capacity = capacity
self.buffer = []
self.position = 0
self.Transition = namedtuple('Transition',
('state', 'action', 'state_', 'reward'))
def push(self, *args):
"""Saves a transition."""
if len(self.buffer) < self.capacity:
self.buffer.append(None)
self.buffer[self.position] = self.Transition(*args)
self.position = (self.position + 1) % self.capacity
def sample(self, batch_size):
return random.sample(self.buffer, batch_size)
def __len__(self):
return len(self.buffer)

View File

@@ -1,41 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 12:18:12
@LastEditor: John
@LastEditTime: 2020-06-11 17:23:45
@Discription:
@Environment: python 3.7.7
'''
import torch.nn as nn
import torch.nn.functional as F
class CNN(nn.Module):
def __init__(self, h, w, n_outputs):
super(CNN, self).__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
self.bn1 = nn.BatchNorm2d(16)
self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
self.bn2 = nn.BatchNorm2d(32)
self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
self.bn3 = nn.BatchNorm2d(32)
# Number of Linear input connections depends on output of conv2d layers
# and therefore the input image size, so compute it.
def conv2d_size_out(size, kernel_size = 5, stride = 2):
return (size - (kernel_size - 1) - 1) // stride + 1
convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
linear_input_size = convw * convh * 32
self.head = nn.Linear(linear_input_size, n_outputs)
# Called with either one element to determine next action, or a batch
# during optimization. Returns tensor([[left0exp,right0exp]...]).
def forward(self, x):
x = F.relu(self.bn1(self.conv1(x)))
x = F.relu(self.bn2(self.conv2(x)))
x = F.relu(self.bn3(self.conv3(x)))
return self.head(x.view(x.size(0), -1))

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2021-03-28 11:07:35
LastEditTime: 2021-05-04 22:28:06
@Discription:
@Environment: python 3.7.7
'''
@@ -35,22 +35,16 @@ class DoubleDQN:
self.batch_size = cfg.batch_size
self.policy_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
self.target_net = MLP(state_dim, action_dim,hidden_dim=cfg.hidden_dim).to(self.device)
# target_net的初始模型参数完全复制policy_net
self.target_net.load_state_dict(self.policy_net.state_dict())
self.target_net.eval() # 不启用 BatchNormalization 和 Dropout
# target_net copy from policy_net
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
target_param.data.copy_(param.data)
# self.target_net.eval() # 不启用 BatchNormalization 和 Dropout
# 可查parameters()与state_dict()的区别前者require_grad=True
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.loss = 0
self.memory = ReplayBuffer(cfg.memory_capacity)
def choose_action(self, state):
'''选择动作
'''
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.actions_count / self.epsilon_decay)
self.actions_count += 1
if random.random() > self.epsilon:
with torch.no_grad():
def predict(self,state):
with torch.no_grad():
# 先转为张量便于丢给神经网络,state元素数据原本为float64
# 注意state=torch.tensor(state).unsqueeze(0)跟state=torch.tensor([state])等价
state = torch.tensor(
@@ -61,6 +55,15 @@ class DoubleDQN:
# 如torch.return_types.max(values=tensor([10.3587]),indices=tensor([0]))
# 所以tensor.max(1)[1]返回最大值对应的下标即action
action = q_value.max(1)[1].item()
return action
def choose_action(self, state):
'''选择动作
'''
self.actions_count += 1
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.actions_count / self.epsilon_decay)
if random.random() > self.epsilon:
action = self.predict(state)
else:
action = random.randrange(self.action_dim)
return action
@@ -71,7 +74,7 @@ class DoubleDQN:
# 从memory中随机采样transition
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(
self.batch_size)
### 转为张量 ###
# convert to tensor
state_batch = torch.tensor(
state_batch, device=self.device, dtype=torch.float)
action_batch = torch.tensor(action_batch, device=self.device).unsqueeze(
@@ -82,8 +85,7 @@ class DoubleDQN:
next_state_batch, device=self.device, dtype=torch.float)
done_batch = torch.tensor(np.float32(
done_batch), device=self.device).unsqueeze(1) # 将bool转为float然后转为张量
done_batch), device=self.device) # 将bool转为float然后转为张量
# 计算当前(s_t,a)对应的Q(s_t, a)
q_values = self.policy_net(state_batch)
next_q_values = self.policy_net(next_state_batch)
@@ -102,7 +104,7 @@ class DoubleDQN:
next_state_batch)
# 选出Q(s_t, a)对应的action代入到next_target_values获得target net对应的next_q_value即Q(s_t|a=argmax Q(s_t, a))
next_target_q_value = next_target_values.gather(1, torch.max(next_q_values, 1)[1].unsqueeze(1)).squeeze(1)
q_target = reward_batch + self.gamma * next_target_q_value * (1-done_batch[0])
q_target = reward_batch + self.gamma * next_target_q_value * (1-done_batch)
self.loss = nn.MSELoss()(q_value, q_target.unsqueeze(1)) # 计算 均方误差loss
# 优化模型
self.optimizer.zero_grad() # zero_grad清除上一步所有旧的gradients from the last step
@@ -113,7 +115,9 @@ class DoubleDQN:
self.optimizer.step() # 更新模型
def save(self,path):
torch.save(self.target_net.state_dict(), path+'DoubleDQN_checkpoint.pth')
torch.save(self.target_net.state_dict(), path+'checkpoint.pth')
def load(self,path):
self.target_net.load_state_dict(torch.load(path+'DoubleDQN_checkpoint.pth'))
self.target_net.load_state_dict(torch.load(path+'checkpoint.pth'))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)

View File

@@ -1,93 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:48:57
@LastEditor: John
LastEditTime: 2021-03-28 11:05:14
@Discription:
@Environment: python 3.7.7
'''
import sys,os
sys.path.append(os.getcwd()) # add current terminal path
import gym
import torch
import datetime
from DoubleDQN.agent import DoubleDQN
from common.plot import plot_rewards
from common.utils import save_results
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # 生成保存的模型路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # 存储reward的路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
class DoubleDQNConfig:
def __init__(self):
self.algo = "Double DQN" # name of algo
self.gamma = 0.99
self.epsilon_start = 0.9 # e-greedy策略的初始epsilon
self.epsilon_end = 0.01
self.epsilon_decay = 200
self.lr = 0.01 # 学习率
self.memory_capacity = 10000 # Replay Memory容量
self.batch_size = 128
self.train_eps = 300 # 训练的episode数目
self.train_steps = 200 # 训练每个episode的最大长度
self.target_update = 2 # target net的更新频率
self.eval_eps = 20 # 测试的episode数目
self.eval_steps = 200 # 测试每个episode的最大长度
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # 检测gpu
self.hidden_dim = 128 # 神经网络隐藏层维度
def train(cfg,env,agent):
print('Start to train !')
rewards,ma_rewards = [],[]
ep_steps = []
for i_episode in range(cfg.train_eps):
state = env.reset() # reset环境状态
ep_reward = 0
for i_step in range(cfg.train_steps):
action = agent.choose_action(state) # 根据当前环境state选择action
next_state, reward, done, _ = env.step(action) # 更新环境参数
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done) # 将state等这些transition存入memory
state = next_state # 跳转到下一个状态
agent.update() # 每步更新网络
if done:
break
# 更新target network复制DQN中的所有weights and biases
if i_episode % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
print('Episode:{}/{}, Reward:{}, Steps:{}, Done:{}'.format(i_episode+1,cfg.train_eps,ep_reward,i_step,done))
ep_steps.append(i_step)
rewards.append(ep_reward)
# 计算滑动窗口的reward
if ma_rewards:
ma_rewards.append(
0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete training')
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DoubleDQNConfig()
env = gym.make('CartPole-v0').unwrapped # 可google为什么unwrapped gym此处一般不需要
env.seed(1) # 设置env随机种子
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DoubleDQN(state_dim,action_dim,cfg)
rewards,ma_rewards = train(cfg,env,agent)
agent.save(path=SAVED_MODEL_PATH)
save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
plot_rewards(rewards,ma_rewards,tag="train",algo = cfg.algo,path=RESULT_PATH)

Binary file not shown.

After

Width:  |  Height:  |  Size: 47 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 57 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 55 KiB

View File

@@ -0,0 +1,194 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python3710jvsc74a57bd0366e1054dee9d4501b0eb8f87335afd3c67fc62db6ee611bbc7f8f5a1fefe232",
"display_name": "Python 3.7.10 64-bit ('py37': conda)"
},
"metadata": {
"interpreter": {
"hash": "366e1054dee9d4501b0eb8f87335afd3c67fc62db6ee611bbc7f8f5a1fefe232"
}
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"curr_path = str(Path().absolute())\n",
"parent_path = str(Path().absolute().parent)\n",
"sys.path.append(parent_path) # add current terminal path to sys.path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"import torch\n",
"import datetime\n",
"from DoubleDQN.agent import DoubleDQN\n",
"from common.plot import plot_rewards\n",
"from common.utils import save_results, make_dir\n",
"\n",
"curr_time = datetime.datetime.now().strftime(\n",
" \"%Y%m%d-%H%M%S\") # obtain current time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class DoubleDQNConfig:\n",
" def __init__(self):\n",
" self.algo = \"DoubleDQN\" # name of algo\n",
" self.env = 'CartPole-v0' # env name\n",
" self.result_path = curr_path+\"/outputs/\" + self.env + \\\n",
" '/'+curr_time+'/results/' # path to save results\n",
" self.model_path = curr_path+\"/outputs/\" + self.env + \\\n",
" '/'+curr_time+'/models/' # path to save models\n",
" self.train_eps = 200 # max tranng episodes\n",
" self.eval_eps = 50 # max evaling episodes\n",
" self.gamma = 0.95\n",
" self.epsilon_start = 1 # start epsilon of e-greedy policy\n",
" self.epsilon_end = 0.01 \n",
" self.epsilon_decay = 500\n",
" self.lr = 0.001 # learning rate\n",
" self.memory_capacity = 100000 # capacity of Replay Memory\n",
" self.batch_size = 64\n",
" self.target_update = 2 # update frequency of target net\n",
" self.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # check gpu\n",
" self.hidden_dim = 256 # hidden size of net"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def env_agent_config(cfg,seed=1):\n",
" env = gym.make(cfg.env) \n",
" env.seed(seed)\n",
" state_dim = env.observation_space.shape[0]\n",
" action_dim = env.action_space.n\n",
" agent = DoubleDQN(state_dim,action_dim,cfg)\n",
" return env,agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def train(cfg,env,agent):\n",
" print('Start to train !')\n",
" rewards,ma_rewards = [],[]\n",
" for i_ep in range(cfg.train_eps):\n",
" state = env.reset() \n",
" ep_reward = 0\n",
" while True:\n",
" action = agent.choose_action(state) \n",
" next_state, reward, done, _ = env.step(action)\n",
" ep_reward += reward\n",
" agent.memory.push(state, action, reward, next_state, done) \n",
" state = next_state \n",
" agent.update() \n",
" if done:\n",
" break\n",
" if i_ep % cfg.target_update == 0:\n",
" agent.target_net.load_state_dict(agent.policy_net.state_dict())\n",
" if (i_ep+1)%10 == 0:\n",
" print(f'Episode:{i_ep+1}/{cfg.train_eps}, Reward:{ep_reward}')\n",
" rewards.append(ep_reward)\n",
" if ma_rewards:\n",
" ma_rewards.append(\n",
" 0.9*ma_rewards[-1]+0.1*ep_reward)\n",
" else:\n",
" ma_rewards.append(ep_reward) \n",
" print('Complete training')\n",
" return rewards,ma_rewards"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def eval(cfg,env,agent):\n",
" print('Start to eval !')\n",
" print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')\n",
" rewards = [] \n",
" ma_rewards = []\n",
" for i_ep in range(cfg.eval_eps):\n",
" state = env.reset() \n",
" ep_reward = 0 \n",
" while True:\n",
" action = agent.predict(state) \n",
" next_state, reward, done, _ = env.step(action) \n",
" state = next_state \n",
" ep_reward += reward\n",
" if done:\n",
" break\n",
" rewards.append(ep_reward)\n",
" if ma_rewards:\n",
" ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)\n",
" else:\n",
" ma_rewards.append(ep_reward)\n",
" print(f\"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}\")\n",
" print('Complete evaling')\n",
" return rewards,ma_rewards "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" cfg = DoubleDQNConfig()\n",
" # train\n",
" env,agent = env_agent_config(cfg,seed=1)\n",
" rewards, ma_rewards = train(cfg, env, agent)\n",
" make_dir(cfg.result_path, cfg.model_path)\n",
" agent.save(path=cfg.model_path)\n",
" save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)\n",
" plot_rewards(rewards, ma_rewards, tag=\"train\",\n",
" algo=cfg.algo, path=cfg.result_path)\n",
"\n",
" # eval\n",
" env,agent = env_agent_config(cfg,seed=10)\n",
" agent.load(path=cfg.model_path)\n",
" rewards,ma_rewards = eval(cfg,env,agent)\n",
" save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)\n",
" plot_rewards(rewards,ma_rewards,tag=\"eval\",env=cfg.env,algo = cfg.algo,path=cfg.result_path)"
]
}
]
}

View File

@@ -0,0 +1,123 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:48:57
@LastEditor: John
LastEditTime: 2021-05-04 22:26:59
@Discription:
@Environment: python 3.7.7
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import torch
import datetime
from DoubleDQN.agent import DoubleDQN
from common.plot import plot_rewards
from common.utils import save_results, make_dir
curr_time = datetime.datetime.now().strftime(
"%Y%m%d-%H%M%S") # obtain current time
class DoubleDQNConfig:
def __init__(self):
self.algo = "DoubleDQN" # name of algo
self.env = 'CartPole-v0' # env name
self.result_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save models
self.train_eps = 200 # max tranng episodes
self.eval_eps = 50 # max evaling episodes
self.gamma = 0.95
self.epsilon_start = 1 # start epsilon of e-greedy policy
self.epsilon_end = 0.01
self.epsilon_decay = 500
self.lr = 0.001 # learning rate
self.memory_capacity = 100000 # capacity of Replay Memory
self.batch_size = 64
self.target_update = 2 # update frequency of target net
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # check gpu
self.hidden_dim = 256 # hidden size of net
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env.seed(seed)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DoubleDQN(state_dim,action_dim,cfg)
return env,agent
def train(cfg,env,agent):
print('Start to train !')
rewards,ma_rewards = [],[]
for i_ep in range(cfg.train_eps):
state = env.reset()
ep_reward = 0
while True:
action = agent.choose_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done)
state = next_state
agent.update()
if done:
break
if i_ep % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
print(f'Episode:{i_ep+1}/{cfg.train_eps}, Reward:{ep_reward},Epsilon:{agent.epsilon:.2f}')
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(
0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete training')
return rewards,ma_rewards
def eval(cfg,env,agent):
print('Start to eval !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = []
for i_ep in range(cfg.eval_eps):
state = env.reset()
ep_reward = 0
while True:
action = agent.predict(state)
next_state, reward, done, _ = env.step(action)
state = next_state
ep_reward += reward
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
print('Complete evaling')
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DoubleDQNConfig()
# train
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
# eval
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -1,21 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-15 21:28:00
LastEditor: John
LastEditTime: 2020-10-15 21:50:30
Discription:
Environment:
'''
import os
import numpy as np
def save_results(rewards,moving_average_rewards,ep_steps,tag='train',result_path='./results'):
if not os.path.exists(result_path): # 检测是否存在文件夹
os.mkdir(result_path)
np.save(result_path+'rewards_'+tag+'.npy', rewards)
np.save(result_path+'moving_average_rewards_'+tag+'.npy', moving_average_rewards)
np.save(result_path+'steps_'+tag+'.npy',ep_steps )

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-24 22:18:18
LastEditor: John
LastEditTime: 2021-03-31 14:51:09
LastEditTime: 2021-05-04 22:39:34
Discription:
Environment:
'''
@@ -65,11 +65,11 @@ class HierarchicalDQN:
if self.batch_size > len(self.memory):
return
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(self.batch_size)
state_batch = torch.tensor(state_batch,dtype=torch.float)
action_batch = torch.tensor(action_batch,dtype=torch.int64).unsqueeze(1)
reward_batch = torch.tensor(reward_batch,dtype=torch.float)
next_state_batch = torch.tensor(next_state_batch, dtype=torch.float)
done_batch = torch.tensor(np.float32(done_batch))
state_batch = torch.tensor(state_batch,device=self.device,dtype=torch.float)
action_batch = torch.tensor(action_batch,device=self.device,dtype=torch.int64).unsqueeze(1)
reward_batch = torch.tensor(reward_batch,device=self.device,dtype=torch.float)
next_state_batch = torch.tensor(next_state_batch,device=self.device, dtype=torch.float)
done_batch = torch.tensor(np.float32(done_batch),device=self.device)
q_values = self.policy_net(state_batch).gather(dim=1, index=action_batch).squeeze(1)
next_state_values = self.policy_net(next_state_batch).max(1)[0].detach()
expected_q_values = reward_batch + 0.99 * next_state_values * (1-done_batch)
@@ -79,17 +79,17 @@ class HierarchicalDQN:
for param in self.policy_net.parameters(): # clip防止梯度爆炸
param.grad.data.clamp_(-1, 1)
self.optimizer.step()
self.loss_numpy = loss.detach().numpy()
self.loss_numpy = loss.detach().cpu().numpy()
self.losses.append(self.loss_numpy)
def update_meta(self):
if self.batch_size > len(self.meta_memory):
return
state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.meta_memory.sample(self.batch_size)
state_batch = torch.tensor(state_batch,dtype=torch.float)
action_batch = torch.tensor(action_batch,dtype=torch.int64).unsqueeze(1)
reward_batch = torch.tensor(reward_batch,dtype=torch.float)
next_state_batch = torch.tensor(next_state_batch, dtype=torch.float)
done_batch = torch.tensor(np.float32(done_batch))
state_batch = torch.tensor(state_batch,device=self.device,dtype=torch.float)
action_batch = torch.tensor(action_batch,device=self.device,dtype=torch.int64).unsqueeze(1)
reward_batch = torch.tensor(reward_batch,device=self.device,dtype=torch.float)
next_state_batch = torch.tensor(next_state_batch,device=self.device, dtype=torch.float)
done_batch = torch.tensor(np.float32(done_batch),device=self.device)
q_values = self.meta_policy_net(state_batch).gather(dim=1, index=action_batch).squeeze(1)
next_state_values = self.meta_policy_net(next_state_batch).max(1)[0].detach()
expected_q_values = reward_batch + 0.99 * next_state_values * (1-done_batch)
@@ -99,7 +99,7 @@ class HierarchicalDQN:
for param in self.meta_policy_net.parameters(): # clip防止梯度爆炸
param.grad.data.clamp_(-1, 1)
self.meta_optimizer.step()
self.meta_loss_numpy = meta_loss.detach().numpy()
self.meta_loss_numpy = meta_loss.detach().cpu().numpy()
self.meta_losses.append(self.meta_loss_numpy)
def save(self, path):

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-29 10:37:32
LastEditor: John
LastEditTime: 2021-03-31 14:58:49
LastEditTime: 2021-05-04 22:35:56
Discription:
Environment:
'''
@@ -21,27 +21,23 @@ import numpy as np
import torch
import gym
from common.utils import save_results
from common.plot import plot_rewards,plot_losses
from common.utils import save_results,make_dir
from common.plot import plot_rewards
from HierarchicalDQN.agent import HierarchicalDQN
SEQUENCE = datetime.datetime.now().strftime(
curr_time = datetime.datetime.now().strftime(
"%Y%m%d-%H%M%S") # obtain current time
SAVED_MODEL_PATH = curr_path+"/saved_model/"+SEQUENCE+'/' # path to save model
if not os.path.exists(curr_path+"/saved_model/"):
os.mkdir(curr_path+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = curr_path+"/results/"+SEQUENCE+'/' # path to save rewards
if not os.path.exists(curr_path+"/results/"):
os.mkdir(curr_path+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
class HierarchicalDQNConfig:
def __init__(self):
self.algo = "H-DQN" # name of algo
self.env = 'CartPole-v0'
self.result_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save models
self.train_eps = 300 # 训练的episode数目
self.eval_eps = 50 # 测试的episode数目
self.gamma = 0.99
self.epsilon_start = 1 # start epsilon of e-greedy policy
self.epsilon_end = 0.01
@@ -49,19 +45,25 @@ class HierarchicalDQNConfig:
self.lr = 0.0001 # learning rate
self.memory_capacity = 10000 # Replay Memory capacity
self.batch_size = 32
self.train_eps = 300 # 训练的episode数目
self.target_update = 2 # target net的更新频率
self.eval_eps = 20 # 测试的episode数目
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu") # 检测gpu
self.hidden_dim = 256 # dimension of hidden layer
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env.seed(seed)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = HierarchicalDQN(state_dim,action_dim,cfg)
return env,agent
def train(cfg, env, agent):
print('Start to train !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moveing average reward
for i_episode in range(cfg.train_eps):
for i_ep in range(cfg.train_eps):
state = env.reset()
done = False
ep_reward = 0
@@ -83,7 +85,7 @@ def train(cfg, env, agent):
state = next_state
agent.update()
agent.meta_memory.push(meta_state, goal, extrinsic_reward, state, done)
print('Episode:{}/{}, Reward:{}, Loss:{:.2f}, Meta_Loss:{:.2f}'.format(i_episode+1, cfg.train_eps, ep_reward,agent.loss_numpy ,agent.meta_loss_numpy ))
print('Episode:{}/{}, Reward:{}, Loss:{:.2f}, Meta_Loss:{:.2f}'.format(i_ep+1, cfg.train_eps, ep_reward,agent.loss_numpy ,agent.meta_loss_numpy ))
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(
@@ -93,18 +95,52 @@ def train(cfg, env, agent):
print('Complete training')
return rewards, ma_rewards
def eval(cfg, env, agent):
print('Start to eval !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moveing average reward
for i_ep in range(cfg.train_eps):
state = env.reset()
done = False
ep_reward = 0
while not done:
goal = agent.set_goal(state)
onehot_goal = agent.to_onehot(goal)
extrinsic_reward = 0
while not done and goal != np.argmax(state):
goal_state = np.concatenate([state, onehot_goal])
action = agent.choose_action(goal_state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
extrinsic_reward += reward
state = next_state
agent.update()
print(f'Episode:{i_ep+1}/{cfg.train_eps}, Reward:{ep_reward}, Loss:{agent.loss_numpy:.2f}, Meta_Loss:{agent.meta_loss_numpy:.2f}')
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(
0.9*ma_rewards[-1]+0.1*ep_reward)
else:
ma_rewards.append(ep_reward)
print('Complete training')
return rewards, ma_rewards
if __name__ == "__main__":
env = gym.make('CartPole-v0')
env.seed(1)
cfg = HierarchicalDQNConfig()
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = HierarchicalDQN(state_dim, action_dim, cfg)
rewards, ma_rewards = train(cfg, env, agent)
agent.save(path=SAVED_MODEL_PATH)
save_results(rewards, ma_rewards, tag='train', path=RESULT_PATH)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=RESULT_PATH)
plot_losses(agent.losses,algo=cfg.algo, path=RESULT_PATH)
# train
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
# eval
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -5,13 +5,14 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-12 16:14:34
LastEditor: John
LastEditTime: 2021-03-17 12:35:06
LastEditTime: 2021-05-05 16:58:39
Discription:
Environment:
'''
import numpy as np
from collections import defaultdict
import torch
import dill
class FisrtVisitMC:
''' On-Policy First-Visit MC Control
@@ -20,14 +21,14 @@ class FisrtVisitMC:
self.action_dim = action_dim
self.epsilon = cfg.epsilon
self.gamma = cfg.gamma
self.Q = defaultdict(lambda: np.zeros(action_dim))
self.Q_table = defaultdict(lambda: np.zeros(action_dim))
self.returns_sum = defaultdict(float) # sum of returns
self.returns_count = defaultdict(float)
def choose_action(self,state):
''' e-greed policy '''
if state in self.Q.keys():
best_action = np.argmax(self.Q[state])
if state in self.Q_table.keys():
best_action = np.argmax(self.Q_table[state])
action_probs = np.ones(self.action_dim, dtype=float) * self.epsilon / self.action_dim
action_probs[best_action] += (1.0 - self.epsilon)
action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
@@ -48,19 +49,17 @@ class FisrtVisitMC:
# Calculate average return for this state over all sampled episodes
self.returns_sum[sa_pair] += G
self.returns_count[sa_pair] += 1.0
self.Q[state][action] = self.returns_sum[sa_pair] / self.returns_count[sa_pair]
self.Q_table[state][action] = self.returns_sum[sa_pair] / self.returns_count[sa_pair]
def save(self,path):
'''把 Q表格 的数据保存到文件中
'''
import dill
torch.save(
obj=self.Q,
f=path,
obj=self.Q_table,
f=path+"Q_table",
pickle_module=dill
)
def load(self, path):
'''从文件中读取数据到 Q表格
'''
import dill
self.Q =torch.load(f=path,pickle_module=dill)
self.Q_table =torch.load(f=path+"Q_table",pickle_module=dill)

View File

@@ -1,88 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-11 14:26:44
LastEditor: John
LastEditTime: 2021-03-17 12:35:36
Discription:
Environment:
'''
import sys,os
sys.path.append(os.getcwd())
import argparse
import datetime
from envs.racetrack_env import RacetrackEnv
from MonteCarlo.agent import FisrtVisitMC
from common.plot import plot_rewards
from common.utils import save_results
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # 生成保存的模型路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"): # 检测是否存在文件夹
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH): # 检测是否存在文件夹
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # 存储reward的路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"): # 检测是否存在文件夹
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/")
if not os.path.exists(RESULT_PATH): # 检测是否存在文件夹
os.mkdir(RESULT_PATH)
class MCConfig:
def __init__(self):
self.epsilon = 0.15 # epsilon: The probability to select a random action .
self.gamma = 0.9 # gamma: Gamma discount factor.
self.n_episodes = 150
self.n_steps = 2000
def get_mc_args():
'''set parameters
'''
parser = argparse.ArgumentParser()
parser.add_argument("--epsilon", default=0.15, type=float) # epsilon: The probability to select a random action . float between 0 and 1.
parser.add_argument("--gamma", default=0.9, type=float) # gamma: Gamma discount factor.
parser.add_argument("--n_episodes", default=150, type=int)
parser.add_argument("--n_steps", default=2000, type=int)
mc_cfg = parser.parse_args()
return mc_cfg
def mc_train(cfg,env,agent):
rewards = []
ma_rewards = [] # moving average rewards
for i_episode in range(cfg.n_episodes):
one_ep_transition = []
state = env.reset()
ep_reward = 0
while True:
# for t in range(cfg.n_steps):
action = agent.choose_action(state)
next_state, reward, done = env.step(action)
ep_reward+=reward
one_ep_transition.append((state, action, reward))
state = next_state
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
agent.update(one_ep_transition)
if (i_episode+1)%10==0:
print("Episode:{}/{}: Reward:{}".format(i_episode+1, mc_cfg.n_episodes,ep_reward))
return rewards,ma_rewards
if __name__ == "__main__":
mc_cfg = MCConfig()
env = RacetrackEnv()
action_dim=9
agent = FisrtVisitMC(action_dim,mc_cfg)
rewards,ma_rewards= mc_train(mc_cfg,env,agent)
save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
plot_rewards(rewards,ma_rewards,tag="train",algo = "On-Policy First-Visit MC Control",path=RESULT_PATH)

Binary file not shown.

After

Width:  |  Height:  |  Size: 79 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 45 KiB

View File

@@ -0,0 +1,118 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-11 14:26:44
LastEditor: John
LastEditTime: 2021-05-05 17:27:50
Discription:
Environment:
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import torch
import datetime
from common.utils import save_results,make_dir
from common.plot import plot_rewards
from MonteCarlo.agent import FisrtVisitMC
from envs.racetrack_env import RacetrackEnv
curr_time = datetime.datetime.now().strftime(
"%Y%m%d-%H%M%S") # obtain current time
class MCConfig:
def __init__(self):
self.algo = "MC" # name of algo
self.env = 'Racetrack'
self.result_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save models
# epsilon: The probability to select a random action .
self.epsilon = 0.15
self.gamma = 0.9 # gamma: Gamma discount factor.
self.train_eps = 200
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu") # check gpu
def env_agent_config(cfg,seed=1):
env = RacetrackEnv()
action_dim = 9
agent = FisrtVisitMC(action_dim, cfg)
return env,agent
def train(cfg, env, agent):
print('Start to eval !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moving average rewards
for i_ep in range(cfg.train_eps):
state = env.reset()
ep_reward = 0
one_ep_transition = []
while True:
action = agent.choose_action(state)
next_state, reward, done = env.step(action)
ep_reward += reward
one_ep_transition.append((state, action, reward))
state = next_state
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
agent.update(one_ep_transition)
if (i_ep+1) % 10 == 0:
print(f"Episode:{i_ep+1}/{cfg.train_eps}: Reward:{ep_reward}")
print('Complete training')
return rewards, ma_rewards
def eval(cfg, env, agent):
print('Start to eval !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
rewards = []
ma_rewards = [] # moving average rewards
for i_ep in range(cfg.train_eps):
state = env.reset()
ep_reward = 0
while True:
action = agent.choose_action(state)
next_state, reward, done = env.step(action)
ep_reward += reward
state = next_state
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
if (i_ep+1) % 10 == 0:
print(f"Episode:{i_ep+1}/{cfg.train_eps}: Reward:{ep_reward}")
return rewards, ma_rewards
if __name__ == "__main__":
cfg = MCConfig()
# train
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
# eval
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

Binary file not shown.

Before

Width:  |  Height:  |  Size: 65 KiB

Some files were not shown because too many files have changed in this diff Show More