This commit is contained in:
johnjim0816
2021-05-03 23:00:01 +08:00
parent 895094a893
commit 8028f7145e
67 changed files with 738 additions and 1137 deletions

437
LICENSE
View File

@@ -1,437 +0,0 @@
Attribution-NonCommercial-ShareAlike 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International Public License
("Public License"). To the extent this Public License may be
interpreted as a contract, You are granted the Licensed Rights in
consideration of Your acceptance of these terms and conditions, and the
Licensor grants You such rights in consideration of benefits the
Licensor receives from making the Licensed Material available under
these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. BY-NC-SA Compatible License means a license listed at
creativecommons.org/compatiblelicenses, approved by Creative
Commons as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
e. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name
of a Creative Commons Public License. The License Elements of this
Public License are Attribution, NonCommercial, and ShareAlike.
h. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
i. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
k. NonCommercial means not primarily intended for or directed towards
commercial advantage or monetary compensation. For purposes of
this Public License, the exchange of the Licensed Material for
other material subject to Copyright and Similar Rights by digital
file-sharing or similar means is NonCommercial provided there is
no payment of monetary compensation in connection with the
exchange.
l. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
m. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
n. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part, for NonCommercial purposes only; and
b. produce, reproduce, and Share Adapted Material for
NonCommercial purposes only.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. Additional offer from the Licensor -- Adapted Material.
Every recipient of Adapted Material from You
automatically receives an offer from the Licensor to
exercise the Licensed Rights in the Adapted Material
under the conditions of the Adapter's License You apply.
c. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties, including when
the Licensed Material is used other than for NonCommercial
purposes.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike.
In addition to the conditions in Section 3(a), if You Share
Adapted Material You produce, the following conditions also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-NC-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database for NonCommercial purposes
only;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material,
including for purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.

View File

@@ -1,69 +0,0 @@
# Easy-RL
李宏毅老师的《深度强化学习》是强化学习领域经典的中文视频之一。李老师幽默风趣的上课风格让晦涩难懂的强化学习理论变得轻松易懂,他会通过很多有趣的例子来讲解强化学习理论。比如老师经常会用玩 Atari 游戏的例子来讲解强化学习算法。此外,为了教程的完整性,我们整理了周博磊老师的《强化学习纲要》、李科浇老师的《百度强化学习》以及多个强化学习的经典资料作为补充。对于想入门强化学习又想看中文讲解的人来说绝对是非常推荐的。
## 使用说明
* 第 4 章到第 11 章为[李宏毅《深度强化学习》](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html)的部分;
* 第 1 章和第 2 章根据[《强化学习纲要》](https://github.com/zhoubolei/introRL)整理而来;
* 第 3 章和第 12 章根据[《百度强化学习》](https://aistudio.baidu.com/aistudio/education/group/info/1335) 整理而来。
## 在线阅读(内容实时更新)
地址https://datawhalechina.github.io/easy-rl/
## 内容导航
| 章节 | 习题 | 相关项目 |
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| [第一章 强化学习概述](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1) | [第一章 习题](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1_questions&keywords) | |
| [第二章 马尔可夫决策过程 (MDP)](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2) | [第二章 习题](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2_questions&keywords) | |
| [第三章 表格型方法](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3) | [第三章 习题](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3_questions&keywords) | [Q-learning算法实战](https://datawhalechina.github.io/easy-rl/#/chapter3/project1) |
| [第四章 策略梯度](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4) | [第四章 习题](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4_questions&keywords) | |
| [第五章 近端策略优化 (PPO) 算法](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5) | [第五章 习题](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5_questions&keywords) | |
| [第六章 DQN (基本概念)](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6) | [第六章 习题](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6_questions&keywords) | |
| [第七章 DQN (进阶技巧)](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7) | [第七章 习题](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7_questions&keywords) | [DQN算法实战](https://datawhalechina.github.io/easy-rl/#/chapter7/project2) |
| [第八章 DQN (连续动作)](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8) | [第八章 习题](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8_questions&keywords) | |
| [第九章 演员-评论家算法](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9) | [第九章 习题](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9_questions&keywords) | |
| [第十章 稀疏奖励](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10) | [第十章 习题](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10_questions&keywords) | |
| [第十一章 模仿学习](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11) | [第十一章 习题](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11_questions&keywords) | |
| [第十二章 深度确定性策略梯度 (DDPG) 算法](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12) | [第十二章 习题](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12_questions&keywords) | [DDPG算法实战](https://datawhalechina.github.io/easy-rl/#/chapter12/project3) |
| [第十三章 AlphaStar 论文解读](https://datawhalechina.github.io/easy-rl/#/chapter13/chapter13) | | |
## 算法实战
[点击](https://github.com/datawhalechina/easy-rl/tree/master/codes)或者跳转```codes```文件夹下进入算法实战
## 贡献者
<table border="0">
<tbody>
<tr align="center" >
<td>
<a href="https://github.com/qiwang067"><img width="70" height="70" src="https://github.com/qiwang067.png?s=40" alt="pic"></a><br>
<a href="https://github.com/qiwang067">Qi Wang</a>
<p>教程设计(第1~12章)<br> 中国科学院大学</p>
</td>
<td>
<a href="https://github.com/yyysjz1997"><img width="70" height="70" src="https://github.com/yyysjz1997.png?s=40" alt="pic"></a><br>
<a href="https://github.com/yyysjz1997">David Young</a>
<p>习题设计&第13章 <br> 清华大学</p>
</td>
<td>
<a href="https://github.com/JohnJim0816"><img width="70" height="70" src="https://github.com/JohnJim0816.png?s=40" alt="pic"></a><br>
<a href="https://github.com/JohnJim0816">John Jim</a>
<p>算法实战<br> 北京大学</p>
</td>
</tr>
</tbody>
</table>
## 致谢
特别感谢 [@Sm1les](https://github.com/Sm1les)、[@LSGOMYP](https://github.com/LSGOMYP) 对本项目的帮助与支持。
## 关注我们
<div align=center><img src="https://raw.githubusercontent.com/datawhalechina/pumpkin-book/master/res/qrcode.jpeg" width = "250" height = "270" alt="Datawhale是一个专注AI领域的开源组织以“for the learner和学习者一起成长”为愿景构建对学习者最有价值的开源学习社区。关注我们一起学习成长。"></div>
## LICENSE
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://img.shields.io/badge/license-CC%20BY--NC--SA%204.0-lightgrey" /></a><br />本作品采用<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议</a>进行许可。

View File

@@ -0,0 +1,5 @@
## A2C
https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

View File

@@ -1,32 +1,27 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2020-11-03 20:47:09
LastEditor: John
LastEditTime: 2021-03-20 17:41:21
Date: 2021-05-03 22:16:08
LastEditor: JiangJi
LastEditTime: 2021-05-03 22:23:48
Discription:
Environment:
'''
from A2C.model import ActorCritic
import torch.optim as optim
from A2C.model import ActorCritic
class A2C:
def __init__(self,state_dim, action_dim, cfg):
self.gamma = 0.99
self.model = ActorCritic(state_dim, action_dim, hidden_dim=cfg.hidden_dim).to(cfg.device)
self.optimizer = optim.Adam(self.model.parameters(),lr=cfg.lr)
def choose_action(self, state):
dist, value = self.model(state)
action = dist.sample()
return action
def __init__(self,state_dim,action_dim,cfg) -> None:
self.gamma = cfg.gamma
self.device = cfg.device
self.model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(self.device)
self.optimizer = optim.Adam(self.model.parameters())
def compute_returns(self,next_value, rewards, masks):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + self.gamma * R * masks[step]
returns.insert(0, R)
return returns
def update(self):
pass
return returns

View File

@@ -1,48 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-30 15:39:37
LastEditor: John
LastEditTime: 2021-03-17 20:19:14
Discription:
Environment:
'''
import gym
from A2C.multiprocessing_env import SubprocVecEnv
# num_envs = 16
# env = "Pendulum-v0"
def make_envs(num_envs=16,env="Pendulum-v0"):
''' 创建多个子环境
'''
num_envs = 16
env = "CartPole-v0"
def make_env():
def _thunk():
env = gym.make(env)
return env
return _thunk
envs = [make_env() for i in range(num_envs)]
envs = SubprocVecEnv(envs)
return envs
# if __name__ == "__main__":
# num_envs = 16
# env = "CartPole-v0"
# def make_env():
# def _thunk():
# env = gym.make(env)
# return env
# return _thunk
# envs = [make_env() for i in range(num_envs)]
# envs = SubprocVecEnv(envs)
if __name__ == "__main__":
envs = make_envs(num_envs=16,env="CartPole-v0")

View File

@@ -1,106 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
LastEditTime: 2021-04-05 11:14:39
@Discription:
@Environment: python 3.7.9
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import torch
import gym
import datetime
from A2C.agent import A2C
from common.utils import save_results,make_dir,del_empty_dir
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # 生成保存的模型路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # 存储reward的路径
if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"):
os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
class A2CConfig:
def __init__(self):
self.gamma = 0.99
self.lr = 3e-4 # learnning rate
self.actor_lr = 1e-4 # learnning rate of actor network
self.memory_capacity = 10000 # capacity of replay memory
self.batch_size = 128
self.train_eps = 200
self.train_steps = 200
self.eval_eps = 200
self.eval_steps = 200
self.target_update = 4
self.hidden_dim=256
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def train(cfg,env,agent):
print('Start to train ! ')
for i_episode in range(cfg.train_eps):
state = env.reset()
log_probs = []
values = []
rewards = []
masks = []
entropy = 0
ep_reward = 0
for i_step in range(cfg.train_steps):
state = torch.FloatTensor(state).to(cfg.device)
dist, value = agent.model(state)
action = dist.sample()
next_state, reward, done, _ = env.step(action.cpu().numpy())
ep_reward+=reward
state = next_state
log_prob = dist.log_prob(action)
entropy += dist.entropy().mean()
log_probs.append(log_prob)
values.append(value)
rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device))
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device))
if done:
break
print('Episode:{}/{}, Reward:{}, Steps:{}, Done:{}'.format(i_episode+1,cfg.train_eps,ep_reward,i_step+1,done))
next_state = torch.FloatTensor(next_state).to(cfg.device)
_, next_value =agent.model(next_state)
returns = agent.compute_returns(next_value, rewards, masks)
log_probs = torch.cat(log_probs)
returns = torch.cat(returns).detach()
values = torch.cat(values)
advantage = returns - values
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy
agent.optimizer.zero_grad()
loss.backward()
agent.optimizer.step()
print('Complete training')
if __name__ == "__main__":
cfg = A2CConfig()
env = gym.make('CartPole-v0')
env.seed(1) # set random seed for env
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = A2C(state_dim, action_dim, cfg)
train(cfg,env,agent)

View File

@@ -1,36 +1,36 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Author: JiangJi
Email: johnjim0816@gmail.com
Date: 2020-11-03 20:45:25
LastEditor: John
LastEditTime: 2021-03-20 17:41:33
Date: 2021-05-03 21:38:54
LastEditor: JiangJi
LastEditTime: 2021-05-03 21:40:06
Discription:
Environment:
'''
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim, hidden_dim=256):
def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
super(ActorCritic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
nn.Linear(hidden_size, 1)
)
self.actor = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
nn.Linear(hidden_size, num_outputs),
nn.Softmax(dim=1),
)
def forward(self, x):
value = self.critic(x)
print(x)
probs = self.actor(x)
dist = Categorical(probs)
return dist, value

View File

@@ -1,153 +0,0 @@
#This code is from openai baseline
#https://github.com/openai/baselines/tree/master/baselines/common/vec_env
import numpy as np
from multiprocessing import Process, Pipe
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
class VecEnv(object):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
def close(self):
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.nenvs = nenvs
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def __len__(self):
return self.nenvs

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB

120
codes/A2C/task0_train.py Normal file
View File

@@ -0,0 +1,120 @@
import sys,os
curr_path = os.path.dirname(__file__)
parent_path = os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import numpy as np
import torch
import torch.optim as optim
import datetime
from common.multiprocessing_env import SubprocVecEnv
from A2C.model import ActorCritic
from common.utils import save_results, make_dir
from common.plot import plot_rewards
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
class A2CConfig:
def __init__(self) -> None:
self.algo='A2C'
self.env= 'CartPole-v0'
self.result_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
self.n_envs = 8
self.gamma = 0.99
self.hidden_size = 256
self.lr = 1e-3 # learning rate
self.max_frames = 30000
self.n_steps = 5
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def make_envs(env_name):
def _thunk():
env = gym.make(env_name)
env.seed(2)
return env
return _thunk
def test_env(env,model,vis=False):
state = env.reset()
if vis: env.render()
done = False
total_reward = 0
while not done:
state = torch.FloatTensor(state).unsqueeze(0).to(cfg.device)
dist, _ = model(state)
next_state, reward, done, _ = env.step(dist.sample().cpu().numpy()[0])
state = next_state
if vis: env.render()
total_reward += reward
return total_reward
def compute_returns(next_value, rewards, masks, gamma=0.99):
R = next_value
returns = []
for step in reversed(range(len(rewards))):
R = rewards[step] + gamma * R * masks[step]
returns.insert(0, R)
return returns
def train(cfg,envs):
env = gym.make(cfg.env) # a single env
env.seed(10)
state_dim = envs.observation_space.shape[0]
action_dim = envs.action_space.n
model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(cfg.device)
optimizer = optim.Adam(model.parameters())
frame_idx = 0
test_rewards = []
test_ma_rewards = []
state = envs.reset()
while frame_idx < cfg.max_frames:
log_probs = []
values = []
rewards = []
masks = []
entropy = 0
# rollout trajectory
for _ in range(cfg.n_steps):
state = torch.FloatTensor(state).to(cfg.device)
dist, value = model(state)
action = dist.sample()
next_state, reward, done, _ = envs.step(action.cpu().numpy())
log_prob = dist.log_prob(action)
entropy += dist.entropy().mean()
log_probs.append(log_prob)
values.append(value)
rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device))
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device))
state = next_state
frame_idx += 1
if frame_idx % 100 == 0:
test_reward = np.mean([test_env(env,model) for _ in range(10)])
print(f"frame_idx:{frame_idx}, test_reward:{test_reward}")
test_rewards.append(test_reward)
if test_ma_rewards:
test_ma_rewards.append(0.9*test_ma_rewards[-1]+0.1*test_reward)
else:
test_ma_rewards.append(test_reward)
# plot(frame_idx, test_rewards)
next_state = torch.FloatTensor(next_state).to(cfg.device)
_, next_value = model(next_state)
returns = compute_returns(next_value, rewards, masks)
log_probs = torch.cat(log_probs)
returns = torch.cat(returns).detach()
values = torch.cat(values)
advantage = returns - values
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
return test_rewards, test_ma_rewards
if __name__ == "__main__":
cfg = A2CConfig()
envs = [make_envs(cfg.env) for i in range(cfg.n_envs)]
envs = SubprocVecEnv(envs) # 8 env
rewards,ma_rewards = train(cfg,envs)
make_dir(cfg.result_path,cfg.model_path)
save_results(rewards,ma_rewards,tag='train',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="train",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -1,32 +0,0 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-15 21:31:19
LastEditor: John
LastEditTime: 2020-11-03 17:05:48
Discription:
Environment:
'''
import os
import numpy as np
import datetime
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/'
RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/'
def save_results(rewards,moving_average_rewards,ep_steps,path=RESULT_PATH):
if not os.path.exists(path): # 检测是否存在文件夹
os.mkdir(path)
np.save(RESULT_PATH+'rewards_train.npy', rewards)
np.save(RESULT_PATH+'moving_average_rewards_train.npy', moving_average_rewards)
np.save(RESULT_PATH+'steps_train.npy',ep_steps )
def save_model(agent,model_path='./saved_model'):
if not os.path.exists(model_path): # 检测是否存在文件夹
os.mkdir(model_path)
agent.save_model(model_path+'checkpoint.pth')
print('model saved')

View File

@@ -1,5 +1,7 @@
# DDPG
#TODO
## 伪代码
![image-20210320151900695](assets/image-20210320151900695.png)

View File

@@ -1,5 +1,5 @@
# DQN
#TODO
## 原理简介
DQN是Q-leanning算法的优化和延伸Q-leaning中使用有限的Q表存储值的信息而DQN中则用神经网络替代Q表存储信息这样更适用于高维的情况相关知识基础可参考[datawhale李宏毅笔记-Q学习](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6)。

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
LastEditTime: 2021-03-30 17:01:26
LastEditTime: 2021-04-29 22:19:18
@Discription:
@Environment: python 3.7.7
'''
@@ -39,6 +39,8 @@ class DQN:
hidden_dim=cfg.hidden_dim).to(self.device)
self.target_net = MLP(state_dim, action_dim,
hidden_dim=cfg.hidden_dim).to(self.device)
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
target_param.data.copy_(param.data)
self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr)
self.loss = 0
self.memory = ReplayBuffer(cfg.memory_capacity)
@@ -48,21 +50,16 @@ class DQN:
'''
self.frame_idx += 1
if random.random() > self.epsilon(self.frame_idx):
with torch.no_grad():
# 先转为张量便于丢给神经网络,state元素数据原本为float64
# 注意state=torch.tensor(state).unsqueeze(0)跟state=torch.tensor([state])等价
state = torch.tensor(
[state], device=self.device, dtype=torch.float32)
# 如tensor([[-0.0798, -0.0079]], grad_fn=<AddmmBackward>)
q_value = self.policy_net(state)
# tensor.max(1)返回每行的最大值以及对应的下标,
# 如torch.return_types.max(values=tensor([10.3587]),indices=tensor([0]))
# 所以tensor.max(1)[1]返回最大值对应的下标即action
action = q_value.max(1)[1].item()
action = self.predict(state)
else:
action = random.randrange(self.action_dim)
return action
def predict(self,state):
with torch.no_grad():
state = torch.tensor([state], device=self.device, dtype=torch.float32)
q_values = self.policy_net(state)
action = q_values.max(1)[1].item()
return action
def update(self):
if len(self.memory) < self.batch_size:
@@ -109,3 +106,5 @@ class DQN:
def load(self, path):
self.target_net.load_state_dict(torch.load(path+'dqn_checkpoint.pth'))
for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
param.data.copy_(target_param.data)

Binary file not shown.

Before

Width:  |  Height:  |  Size: 45 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 37 KiB

View File

@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:48:57
@LastEditor: John
LastEditTime: 2021-04-29 02:02:12
LastEditTime: 2021-04-29 22:23:38
@Discription:
@Environment: python 3.7.7
'''
@@ -36,21 +36,28 @@ class DQNConfig:
'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" + self.env + \
'/'+curr_time+'/models/' # path to save results
self.train_eps = 300 # 训练的episode数目
self.eval_eps = 50 # number of episodes for evaluating
self.gamma = 0.95
self.epsilon_start = 1 # e-greedy策略的初始epsilon
self.epsilon_start = 0.90 # e-greedy策略的初始epsilon
self.epsilon_end = 0.01
self.epsilon_decay = 500
self.lr = 0.0001 # learning rate
self.memory_capacity = 10000 # Replay Memory容量
self.batch_size = 32
self.train_eps = 300 # 训练的episode数目
self.memory_capacity = 100000 # Replay Memory容量
self.batch_size = 64
self.target_update = 2 # target net的更新频率
self.eval_eps = 20 # 测试的episode数目
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu") # 检测gpu
self.hidden_dim = 256 # 神经网络隐藏层维度
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env.seed(seed)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DQN(state_dim,action_dim,cfg)
return env,agent
def train(cfg, env, agent):
print('Start to train !')
print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}')
@@ -60,13 +67,15 @@ def train(cfg, env, agent):
state = env.reset()
done = False
ep_reward = 0
while not done:
while True:
action = agent.choose_action(state)
next_state, reward, done, _ = env.step(action)
ep_reward += reward
agent.memory.push(state, action, reward, next_state, done)
state = next_state
agent.update()
if done:
break
if i_episode % cfg.target_update == 0:
agent.target_net.load_state_dict(agent.policy_net.state_dict())
print('Episode:{}/{}, Reward:{}'.format(i_episode+1, cfg.train_eps, ep_reward))
@@ -79,17 +88,39 @@ def train(cfg, env, agent):
print('Complete training')
return rewards, ma_rewards
def eval(cfg,env,agent):
rewards = [] # 记录所有episode的reward
ma_rewards = [] # 滑动平均的reward
for i_ep in range(cfg.eval_eps):
ep_reward = 0 # 记录每个episode的reward
state = env.reset() # 重置环境, 重新开一局即开始新的一个episode
while True:
action = agent.predict(state) # 根据算法选择一个动作
next_state, reward, done, _ = env.step(action) # 与环境进行一个交互
state = next_state # 存储上一个观察值
ep_reward += reward
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
return rewards,ma_rewards
if __name__ == "__main__":
cfg = DQNConfig()
env = gym.make(cfg.env)
env.seed(1)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DQN(state_dim, action_dim, cfg)
env,agent = env_agent_config(cfg,seed=1)
rewards, ma_rewards = train(cfg, env, agent)
make_dir(cfg.result_path, cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards, ma_rewards, tag='train', path=cfg.result_path)
plot_rewards(rewards, ma_rewards, tag="train",
algo=cfg.algo, path=cfg.result_path)
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -0,0 +1,3 @@
# Q-learning
#TODO

View File

@@ -5,8 +5,8 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-03-26 16:51:01
Discription:
LastEditTime: 2021-04-29 16:59:41
Discription: use defaultdict to define Q table
Environment:
'''
import numpy as np
@@ -15,7 +15,7 @@ import torch
from collections import defaultdict
class QLearning(object):
def __init__(self,
def __init__(self,state_dim,
action_dim,cfg):
self.action_dim = action_dim # dimension of acgtion
self.lr = cfg.lr # learning rate
@@ -26,17 +26,20 @@ class QLearning(object):
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.Q_table = defaultdict(lambda: np.zeros(action_dim)) # A nested dictionary that maps state -> (action -> action-value)
def choose_action(self, state):
self.sample_count += 1
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.sample_count / self.epsilon_decay)
# e-greedy policy
if np.random.uniform(0, 1) > self.epsilon:
action = np.argmax(self.Q_table[str(state)])
action = self.predict(state)
else:
action = np.random.choice(self.action_dim)
return action
def predict(self,state):
action = np.argmax(self.Q_table[str(state)])
return action
def update(self, state, action, reward, next_state, done):
Q_predict = self.Q_table[str(state)][action]
if done:

88
codes/QLearning/agent1.py Normal file
View File

@@ -0,0 +1,88 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-04-29 17:02:00
Discription:
Environment:
'''
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import math
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-04-29 16:45:33
Discription: use np array to define Q table
Environment:
'''
import numpy as np
import math
class QLearning(object):
def __init__(self,
state_dim,action_dim,cfg):
self.action_dim = action_dim # dimension of acgtion
self.lr = cfg.lr # learning rate
self.gamma = cfg.gamma
self.epsilon = 0
self.sample_count = 0
self.epsilon_start = cfg.epsilon_start
self.epsilon_end = cfg.epsilon_end
self.epsilon_decay = cfg.epsilon_decay
self.Q_table = np.zeros((state_dim, action_dim)) # Q表
def choose_action(self, state):
self.sample_count += 1
self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
math.exp(-1. * self.sample_count / self.epsilon_decay)
if np.random.uniform(0, 1) > self.epsilon: # 随机选取0-1之间的值如果大于epsilon就按照贪心策略选取action否则随机选取
action = self.predict(state)
else:
action = np.random.choice(self.action_dim) #有一定概率随机探索选取一个动作
return action
def predict(self, state):
'''根据输入观测值,采样输出的动作值,带探索,测试模型时使用
'''
Q_list = self.Q_table[state, :]
Q_max = np.max(Q_list)
action_list = np.where(Q_list == Q_max)[0]
action = np.random.choice(action_list) # Q_max可能对应多个 action ,可以随机抽取一个
return action
def update(self, state, action, reward, next_state, done):
Q_predict = self.Q_table[state, action]
if done:
Q_target = reward # 没有下一个状态了
else:
Q_target = reward + self.gamma * np.max(
self.Q_table[next_state, :]) # Q_table-learning
self.Q_table[state, action] += self.lr * (Q_target - Q_predict) # 修正q
def save(self,path):
np.save(path+"Q_table.npy", self.Q_table)
def load(self, path):
self.Q_table = np.load(path+"Q_table.npy")

File diff suppressed because one or more lines are too long

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 13 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 19 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 40 KiB

View File

@@ -0,0 +1,84 @@
#!/usr/bin/env python
# coding=utf-8
'''
Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-04-29 17:01:43
Discription:
Environment:
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import datetime
from envs.gridworld_env import CliffWalkingWapper
from QLearning.agent import QLearning
from common.plot import plot_rewards
from common.utils import save_results
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
class QlearningConfig:
'''训练相关参数'''
def __init__(self):
self.algo = 'Qlearning'
self.env = 'CliffWalking-v0' # 0 up, 1 right, 2 down, 3 left
self.result_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
self.train_eps = 300 # 训练的episode数目
self.eval_eps = 30
self.gamma = 0.9 # reward的衰减率
self.epsilon_start = 0.95 # e-greedy策略中初始epsilon
self.epsilon_end = 0.01 # e-greedy策略中的终止epsilon
self.epsilon_decay = 200 # e-greedy策略中epsilon的衰减率
self.lr = 0.1 # learning rate
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env = CliffWalkingWapper(env)
env.seed(seed)
state_dim = env.observation_space.n
action_dim = env.action_space.n
agent = QLearning(state_dim,action_dim,cfg)
return env,agent
def eval(cfg,env,agent):
# env = gym.make("FrozenLake-v0", is_slippery=False) # 0 left, 1 down, 2 right, 3 up
# env = FrozenLakeWapper(env)
rewards = [] # 记录所有episode的reward
ma_rewards = [] # 滑动平均的reward
for i_ep in range(cfg.eval_eps):
ep_reward = 0 # 记录每个episode的reward
state = env.reset() # 重置环境, 重新开一局即开始新的一个episode
while True:
action = agent.predict(state) # 根据算法选择一个动作
next_state, reward, done, _ = env.step(action) # 与环境进行一个交互
state = next_state # 存储上一个观察值
ep_reward += reward
if done:
break
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
return rewards,ma_rewards
if __name__ == "__main__":
cfg = QlearningConfig()
env,agent = env_agent_config(cfg,seed=15)
cfg.model_path = './'+'QLearning/outputs/CliffWalking-v0/20210429-165825/models'+'/'
cfg.result_path = './'+'QLearning/outputs/CliffWalking-v0/20210429-165825/results'+'/'
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

File diff suppressed because one or more lines are too long

View File

@@ -5,11 +5,10 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2020-09-11 23:03:00
LastEditor: John
LastEditTime: 2021-03-31 18:14:59
LastEditTime: 2021-04-29 17:01:08
Discription:
Environment:
'''
import sys,os
curr_path = os.path.dirname(__file__)
parent_path=os.path.dirname(curr_path)
@@ -18,40 +17,41 @@ sys.path.append(parent_path) # add current terminal path to sys.path
import gym
import datetime
from envs.gridworld_env import CliffWalkingWapper, FrozenLakeWapper
from envs.gridworld_env import CliffWalkingWapper
from QLearning.agent import QLearning
from common.plot import plot_rewards
from common.utils import save_results
SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
SAVED_MODEL_PATH = curr_path+"/saved_model/"+SEQUENCE+'/' # path to save model
if not os.path.exists(curr_path+"/saved_model/"):
os.mkdir(curr_path+"/saved_model/")
if not os.path.exists(SAVED_MODEL_PATH):
os.mkdir(SAVED_MODEL_PATH)
RESULT_PATH = curr_path+"/results/"+SEQUENCE+'/' # path to save rewards
if not os.path.exists(curr_path+"/results/"):
os.mkdir(curr_path+"/results/")
if not os.path.exists(RESULT_PATH):
os.mkdir(RESULT_PATH)
from common.utils import save_results,make_dir
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
class QlearningConfig:
'''训练相关参数'''
def __init__(self):
self.train_eps = 200 # 训练的episode数目
self.algo = 'Qlearning'
self.env = 'CliffWalking-v0' # 0 up, 1 right, 2 down, 3 left
self.result_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
self.train_eps = 300 # 训练的episode数目
self.eval_eps = 30
self.gamma = 0.9 # reward的衰减率
self.epsilon_start = 0.99 # e-greedy策略中初始epsilon
self.epsilon_start = 0.95 # e-greedy策略中初始epsilon
self.epsilon_end = 0.01 # e-greedy策略中的终止epsilon
self.epsilon_decay = 200 # e-greedy策略中epsilon的衰减率
self.lr = 0.1 # learning rate
def env_agent_config(cfg,seed=1):
env = gym.make(cfg.env)
env = CliffWalkingWapper(env)
env.seed(seed)
state_dim = env.observation_space.n
action_dim = env.action_space.n
agent = QLearning(state_dim,action_dim,cfg)
return env,agent
def train(cfg,env,agent):
rewards = []
ma_rewards = [] # moving average reward
steps = [] # 记录所有episode的steps
for i_episode in range(cfg.train_eps):
for i_ep in range(cfg.train_eps):
ep_reward = 0 # 记录每个episode的reward
ep_steps = 0 # 记录每个episode走了多少step
state = env.reset() # 重置环境, 重新开一局即开始新的一个episode
while True:
action = agent.choose_action(state) # 根据算法选择一个动作
@@ -59,55 +59,52 @@ def train(cfg,env,agent):
agent.update(state, action, reward, next_state, done) # Q-learning算法更新
state = next_state # 存储上一个观察值
ep_reward += reward
ep_steps += 1 # 计算step数
if done:
break
steps.append(ep_steps)
rewards.append(ep_reward)
if ma_rewards:
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
print("Episode:{}/{}: reward:{:.1f}".format(i_episode+1, cfg.train_eps,ep_reward))
print("Episode:{}/{}: reward:{:.1f}".format(i_ep+1, cfg.train_eps,ep_reward))
return rewards,ma_rewards
def eval(cfg,env,agent):
# env = gym.make("FrozenLake-v0", is_slippery=False) # 0 left, 1 down, 2 right, 3 up
# env = FrozenLakeWapper(env)
rewards = [] # 记录所有episode的reward
ma_rewards = [] # 滑动平均的reward
steps = [] # 记录所有episode的steps
for i_episode in range(cfg.train_eps):
for i_ep in range(cfg.eval_eps):
ep_reward = 0 # 记录每个episode的reward
ep_steps = 0 # 记录每个episode走了多少step
state = env.reset() # 重置环境, 重新开一局即开始新的一个episode
while True:
action = agent.choose_action(state) # 根据算法选择一个动作
action = agent.predict(state) # 根据算法选择一个动作
next_state, reward, done, _ = env.step(action) # 与环境进行一个交互
state = next_state # 存储上一个观察值
ep_reward += reward
ep_steps += 1 # 计算step数
if done:
break
steps.append(ep_steps)
rewards.append(ep_reward)
# 计算滑动平均的reward
if ma_rewards:
ma_rewards.append(rewards[-1]*0.9+ep_reward*0.1)
ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
else:
ma_rewards.append(ep_reward)
print("Episode:{}/{}: reward:{:.1f}".format(i_episode+1, cfg.train_eps,ep_reward))
print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
return rewards,ma_rewards
if __name__ == "__main__":
cfg = QlearningConfig()
env = gym.make("CliffWalking-v0") # 0 up, 1 right, 2 down, 3 left
env = CliffWalkingWapper(env)
action_dim = env.action_space.n
agent = QLearning(action_dim,cfg)
env,agent = env_agent_config(cfg,seed=1)
rewards,ma_rewards = train(cfg,env,agent)
agent.save(path=SAVED_MODEL_PATH)
save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
plot_rewards(rewards,ma_rewards,tag="train",algo = "On-Policy First-Visit MC Control",path=RESULT_PATH)
make_dir(cfg.result_path,cfg.model_path)
agent.save(path=cfg.model_path)
save_results(rewards,ma_rewards,tag='train',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="train",env=cfg.env,algo = cfg.algo,path=cfg.result_path)
env,agent = env_agent_config(cfg,seed=10)
agent.load(path=cfg.model_path)
rewards,ma_rewards = eval(cfg,env,agent)
save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)

View File

@@ -27,26 +27,25 @@ python 3.7、pytorch 1.6.0-1.7.1、gym 0.17.0-0.18.0
## 算法进度
| 算法名称 | 相关论文材料 | 环境 | 备注 |
| :--------------------------------------: | :----------------------------------------------------------: | ------------------------------------- | :--------------------------------: |
| [On-Policy First-Visit MC](./MonteCarlo) | | [Racetrack](./envs/racetrack_env.md) | |
| [Q-Learning](./QLearning) | | [CliffWalking-v0](./envs/gym_info.md) | |
| [Sarsa](./Sarsa) | | [Racetrack](./envs/racetrack_env.md) | |
| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf),[Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | 与DQN相比使用了CNN而不是全链接网络 |
| [DoubleDQN](./DoubleDQN) | | [CartPole-v0](./envs/gym_info.md) | |
| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
| [PolicyGradient](./PolicyGradient) | | [CartPole-v0](./envs/gym_info.md) | |
| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | HalfCheetah-v2 | |
| 算法名称 | 相关论文材料 | 环境 | 备注 |
| :--------------------------------------: | :----------------------------------------------------------: | ----------------------------------------- | :--------------------------------: |
| [On-Policy First-Visit MC](./MonteCarlo) | [medium blog](https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a) | [Racetrack](./envs/racetrack_env.md) | |
| [Q-Learning](./QLearning) | [towardsdatascience blog](https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56),[q learning paper](https://ieeexplore.ieee.org/document/8836506) | [CliffWalking-v0](./envs/gym_info.md) | |
| [Sarsa](./Sarsa) | [geeksforgeeks blog](https://www.geeksforgeeks.org/sarsa-reinforcement-learning/) | [Racetrack](./envs/racetrack_env.md) | |
| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf),[Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | 与DQN相比使用了CNN而不是全链接网络 |
| [DoubleDQN](./DoubleDQN) | [DoubleDQN Paper](https://arxiv.org/abs/1509.06461) | [CartPole-v0](./envs/gym_info.md) | |
| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
| [PolicyGradient](./PolicyGradient) | [Lil'log](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) | [CartPole-v0](./envs/gym_info.md) | |
| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | [HalfCheetah-v2]((./envs/mujoco_info.md)) | |
## Refs
[RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2)
[RL-Adventure](https://github.com/higgsfield/RL-Adventure)
[RL-Adventure](https://github.com/higgsfield/RL-Adventure)

View File

@@ -30,25 +30,26 @@ similar to file with ```eval```, which means to evaluate the agent.
## Schedule
| Name | Related materials | Used Envs | Notes |
| :--------------------------------------: | :----------------------------------------------------------: | ------------------------------------- | :---: |
| [On-Policy First-Visit MC](./MonteCarlo) | | [Racetrack](./envs/racetrack_env.md) | |
| [Q-Learning](./QLearning) | | [CliffWalking-v0](./envs/gym_info.md) | |
| [Sarsa](./Sarsa) | | [Racetrack](./envs/racetrack_env.md) | |
| [DQN](./DQN) | [DQN-paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf),[Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
| [DQN-cnn](./DQN_cnn) | [DQN-paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | |
| [DoubleDQN](./DoubleDQN) | | [CartPole-v0](./envs/gym_info.md) | |
| [Hierarchical DQN](HierarchicalDQN) | [Hierarchical DQN](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
| [PolicyGradient](./PolicyGradient) | | [CartPole-v0](./envs/gym_info.md) | |
| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | | |
| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | HalfCheetah-v2 | |
| Name | Related materials | Used Envs | Notes |
| :--------------------------------------: | :----------------------------------------------------------: | ----------------------------------------- | :---: |
| [On-Policy First-Visit MC](./MonteCarlo) | [medium blog](https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a) | [Racetrack](./envs/racetrack_env.md) | |
| [Q-Learning](./QLearning) | [towardsdatascience blog](https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56),[q learning paper](https://ieeexplore.ieee.org/document/8836506) | [CliffWalking-v0](./envs/gym_info.md) | |
| [Sarsa](./Sarsa) | [geeksforgeeks blog](https://www.geeksforgeeks.org/sarsa-reinforcement-learning/) | [Racetrack](./envs/racetrack_env.md) | |
| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf),[Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | 与DQN相比使用了CNN而不是全链接网络 |
| [DoubleDQN](./DoubleDQN) | [DoubleDQN Paper](https://arxiv.org/abs/1509.06461) | [CartPole-v0](./envs/gym_info.md) | |
| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
| [PolicyGradient](./PolicyGradient) | [Lil'log](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) | [CartPole-v0](./envs/gym_info.md) | |
| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | [HalfCheetah-v2]((./envs/mujoco_info.md)) | |
## Refs
[RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2)
[RL-Adventure](https://github.com/higgsfield/RL-Adventure)
[RL-Adventure](https://github.com/higgsfield/RL-Adventure)

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2020-10-07 20:57:11
LastEditor: John
LastEditTime: 2021-04-28 10:13:21
LastEditTime: 2021-04-29 15:41:48
Discription:
Environment:
'''
@@ -19,7 +19,7 @@ def plot_rewards(rewards,ma_rewards,tag="train",env='CartPole-v0',algo = "DQN",s
plt.plot(ma_rewards,label='ma rewards')
plt.legend()
if save:
plt.savefig(path+"rewards_curve_{}".format(tag))
plt.savefig(path+"{}_rewards_curve".format(tag))
plt.show()
# def plot_rewards(dic,tag="train",env='CartPole-v0',algo = "DQN",save=True,path='./'):
# sns.set()

View File

@@ -5,7 +5,7 @@ Author: John
Email: johnjim0816@gmail.com
Date: 2021-03-12 16:02:24
LastEditor: John
LastEditTime: 2021-04-13 18:34:20
LastEditTime: 2021-04-29 15:32:38
Discription:
Environment:
'''
@@ -18,8 +18,8 @@ from pathlib import Path
def save_results(rewards,ma_rewards,tag='train',path='./results'):
'''保存reward等结果
'''
np.save(path+'rewards_'+tag+'.npy', rewards)
np.save(path+'ma_rewards_'+tag+'.npy', ma_rewards)
np.save(path+'{}_rewards.npy'.format(tag), rewards)
np.save(path+'{}_ma_rewards.npy'.format(tag), ma_rewards)
print('results saved!')
def make_dir(*paths):

Binary file not shown.

After

Width:  |  Height:  |  Size: 767 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 510 KiB

42
codes/envs/mujoco_info.md Normal file
View File

@@ -0,0 +1,42 @@
# MuJoCo
MuJoCoMulti-Joint dynamics with Contact是一个物理模拟器可以用于机器人控制优化等研究。安装见[Mac安装MuJoCo以及mujoco_py](https://blog.csdn.net/JohnJim0/article/details/115656392?spm=1001.2014.3001.5501)
## HalfCheetah-v2
该环境基于mujoco仿真引擎该环境的目的是使一只两只脚的“猎豹”跑得越快越好(下面图谷歌HalfCheetah-v2的https://gym.openai.com/envs/HalfCheetah-v2/)。
<img src="assets/image-20210429150630806.png" alt="image-20210429150630806" style="zoom:50%;" />
动作空间Box(6,)一只脚需要控制三个关节一共6个关节每个关节的运动范围为[-1, 1]。
状态空间Box(17, ),包含各种状态,每个值的范围为![img](assets/9cd6ae68c9aad008ede4139da358ec26.svg),主要描述“猎豹”本身的姿态等信息。
回报定义:每一步的回报与这一步的中猎豹的速度和猎豹行动的消耗有关,定义回报的代码如下。
```python
def step(self, action):
xposbefore = self.sim.data.qpos[0]
self.do_simulation(action, self.frame_skip)
xposafter = self.sim.data.qpos[0]
ob = self._get_obs()
reward_ctrl = - 0.1 * np.square(action).sum()
reward_run = (xposafter - xposbefore)/self.dt
# =========== reward ===========
reward = reward_ctrl + reward_run
# =========== reward ===========
done = False
return ob, reward, done, dict(reward_run=reward_run, reward_ctrl=reward_ctrl)
```
当猎豹无法控制平衡而倒下时,一个回合(episode)结束。
但是这个环境有一些问题目前经过搜索并不知道一个回合的reward上限实验中训练好的episode能跑出平台之外
<img src="assets/image-20210429150622353.png" alt="image-20210429150622353" style="zoom:50%;" />
加上时间有限所以训练中reward一直处于一个平缓上升的状态本人猜测这可能是gym的一个bug。

View File

@@ -78,7 +78,6 @@ class Agent:
:param points: float, the current points from environment
:param dead: boolean, if the snake is dead
:return: the index of action. 0,1,2,3 indicates up,down,left,right separately
TODO: write your function here.
Return the index of action the snake needs to take, according to the state and points known from environment.
Tips: you need to discretize the state to the state space defined on the webpage first.
(Note that [adjoining_wall_x=0, adjoining_wall_y=0] is also the case when snake runs out of the 480x480 board)