diff --git a/LICENSE b/LICENSE deleted file mode 100644 index cbe5ad1..0000000 --- a/LICENSE +++ /dev/null @@ -1,437 +0,0 @@ -Attribution-NonCommercial-ShareAlike 4.0 International - -======================================================================= - -Creative Commons Corporation ("Creative Commons") is not a law firm and -does not provide legal services or legal advice. Distribution of -Creative Commons public licenses does not create a lawyer-client or -other relationship. Creative Commons makes its licenses and related -information available on an "as-is" basis. Creative Commons gives no -warranties regarding its licenses, any material licensed under their -terms and conditions, or any related information. Creative Commons -disclaims all liability for damages resulting from their use to the -fullest extent possible. - -Using Creative Commons Public Licenses - -Creative Commons public licenses provide a standard set of terms and -conditions that creators and other rights holders may use to share -original works of authorship and other material subject to copyright -and certain other rights specified in the public license below. The -following considerations are for informational purposes only, are not -exhaustive, and do not form part of our licenses. - - Considerations for licensors: Our public licenses are - intended for use by those authorized to give the public - permission to use material in ways otherwise restricted by - copyright and certain other rights. Our licenses are - irrevocable. Licensors should read and understand the terms - and conditions of the license they choose before applying it. - Licensors should also secure all rights necessary before - applying our licenses so that the public can reuse the - material as expected. Licensors should clearly mark any - material not subject to the license. This includes other CC- - licensed material, or material used under an exception or - limitation to copyright. 
More considerations for licensors: - wiki.creativecommons.org/Considerations_for_licensors - - Considerations for the public: By using one of our public - licenses, a licensor grants the public permission to use the - licensed material under specified terms and conditions. If - the licensor's permission is not necessary for any reason--for - example, because of any applicable exception or limitation to - copyright--then that use is not regulated by the license. Our - licenses grant only permissions under copyright and certain - other rights that a licensor has authority to grant. Use of - the licensed material may still be restricted for other - reasons, including because others have copyright or other - rights in the material. A licensor may make special requests, - such as asking that all changes be marked or described. - Although not required by our licenses, you are encouraged to - respect those requests where reasonable. More considerations - for the public: - wiki.creativecommons.org/Considerations_for_licensees - -======================================================================= - -Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International -Public License - -By exercising the Licensed Rights (defined below), You accept and agree -to be bound by the terms and conditions of this Creative Commons -Attribution-NonCommercial-ShareAlike 4.0 International Public License -("Public License"). To the extent this Public License may be -interpreted as a contract, You are granted the Licensed Rights in -consideration of Your acceptance of these terms and conditions, and the -Licensor grants You such rights in consideration of benefits the -Licensor receives from making the Licensed Material available under -these terms and conditions. - - -Section 1 -- Definitions. - - a. 
Adapted Material means material subject to Copyright and Similar - Rights that is derived from or based upon the Licensed Material - and in which the Licensed Material is translated, altered, - arranged, transformed, or otherwise modified in a manner requiring - permission under the Copyright and Similar Rights held by the - Licensor. For purposes of this Public License, where the Licensed - Material is a musical work, performance, or sound recording, - Adapted Material is always produced where the Licensed Material is - synched in timed relation with a moving image. - - b. Adapter's License means the license You apply to Your Copyright - and Similar Rights in Your contributions to Adapted Material in - accordance with the terms and conditions of this Public License. - - c. BY-NC-SA Compatible License means a license listed at - creativecommons.org/compatiblelicenses, approved by Creative - Commons as essentially the equivalent of this Public License. - - d. Copyright and Similar Rights means copyright and/or similar rights - closely related to copyright including, without limitation, - performance, broadcast, sound recording, and Sui Generis Database - Rights, without regard to how the rights are labeled or - categorized. For purposes of this Public License, the rights - specified in Section 2(b)(1)-(2) are not Copyright and Similar - Rights. - - e. Effective Technological Measures means those measures that, in the - absence of proper authority, may not be circumvented under laws - fulfilling obligations under Article 11 of the WIPO Copyright - Treaty adopted on December 20, 1996, and/or similar international - agreements. - - f. Exceptions and Limitations means fair use, fair dealing, and/or - any other exception or limitation to Copyright and Similar Rights - that applies to Your use of the Licensed Material. - - g. License Elements means the license attributes listed in the name - of a Creative Commons Public License. 
The License Elements of this - Public License are Attribution, NonCommercial, and ShareAlike. - - h. Licensed Material means the artistic or literary work, database, - or other material to which the Licensor applied this Public - License. - - i. Licensed Rights means the rights granted to You subject to the - terms and conditions of this Public License, which are limited to - all Copyright and Similar Rights that apply to Your use of the - Licensed Material and that the Licensor has authority to license. - - j. Licensor means the individual(s) or entity(ies) granting rights - under this Public License. - - k. NonCommercial means not primarily intended for or directed towards - commercial advantage or monetary compensation. For purposes of - this Public License, the exchange of the Licensed Material for - other material subject to Copyright and Similar Rights by digital - file-sharing or similar means is NonCommercial provided there is - no payment of monetary compensation in connection with the - exchange. - - l. Share means to provide material to the public by any means or - process that requires permission under the Licensed Rights, such - as reproduction, public display, public performance, distribution, - dissemination, communication, or importation, and to make material - available to the public including in ways that members of the - public may access the material from a place and at a time - individually chosen by them. - - m. Sui Generis Database Rights means rights other than copyright - resulting from Directive 96/9/EC of the European Parliament and of - the Council of 11 March 1996 on the legal protection of databases, - as amended and/or succeeded, as well as other essentially - equivalent rights anywhere in the world. - - n. You means the individual or entity exercising the Licensed Rights - under this Public License. Your has a corresponding meaning. - - -Section 2 -- Scope. - - a. License grant. - - 1. 
Subject to the terms and conditions of this Public License, - the Licensor hereby grants You a worldwide, royalty-free, - non-sublicensable, non-exclusive, irrevocable license to - exercise the Licensed Rights in the Licensed Material to: - - a. reproduce and Share the Licensed Material, in whole or - in part, for NonCommercial purposes only; and - - b. produce, reproduce, and Share Adapted Material for - NonCommercial purposes only. - - 2. Exceptions and Limitations. For the avoidance of doubt, where - Exceptions and Limitations apply to Your use, this Public - License does not apply, and You do not need to comply with - its terms and conditions. - - 3. Term. The term of this Public License is specified in Section - 6(a). - - 4. Media and formats; technical modifications allowed. The - Licensor authorizes You to exercise the Licensed Rights in - all media and formats whether now known or hereafter created, - and to make technical modifications necessary to do so. The - Licensor waives and/or agrees not to assert any right or - authority to forbid You from making technical modifications - necessary to exercise the Licensed Rights, including - technical modifications necessary to circumvent Effective - Technological Measures. For purposes of this Public License, - simply making modifications authorized by this Section 2(a) - (4) never produces Adapted Material. - - 5. Downstream recipients. - - a. Offer from the Licensor -- Licensed Material. Every - recipient of the Licensed Material automatically - receives an offer from the Licensor to exercise the - Licensed Rights under the terms and conditions of this - Public License. - - b. Additional offer from the Licensor -- Adapted Material. - Every recipient of Adapted Material from You - automatically receives an offer from the Licensor to - exercise the Licensed Rights in the Adapted Material - under the conditions of the Adapter's License You apply. - - c. No downstream restrictions. 
You may not offer or impose - any additional or different terms or conditions on, or - apply any Effective Technological Measures to, the - Licensed Material if doing so restricts exercise of the - Licensed Rights by any recipient of the Licensed - Material. - - 6. No endorsement. Nothing in this Public License constitutes or - may be construed as permission to assert or imply that You - are, or that Your use of the Licensed Material is, connected - with, or sponsored, endorsed, or granted official status by, - the Licensor or others designated to receive attribution as - provided in Section 3(a)(1)(A)(i). - - b. Other rights. - - 1. Moral rights, such as the right of integrity, are not - licensed under this Public License, nor are publicity, - privacy, and/or other similar personality rights; however, to - the extent possible, the Licensor waives and/or agrees not to - assert any such rights held by the Licensor to the limited - extent necessary to allow You to exercise the Licensed - Rights, but not otherwise. - - 2. Patent and trademark rights are not licensed under this - Public License. - - 3. To the extent possible, the Licensor waives any right to - collect royalties from You for the exercise of the Licensed - Rights, whether directly or through a collecting society - under any voluntary or waivable statutory or compulsory - licensing scheme. In all other cases the Licensor expressly - reserves any right to collect such royalties, including when - the Licensed Material is used other than for NonCommercial - purposes. - - -Section 3 -- License Conditions. - -Your exercise of the Licensed Rights is expressly made subject to the -following conditions. - - a. Attribution. - - 1. If You Share the Licensed Material (including in modified - form), You must: - - a. retain the following if it is supplied by the Licensor - with the Licensed Material: - - i. 
identification of the creator(s) of the Licensed - Material and any others designated to receive - attribution, in any reasonable manner requested by - the Licensor (including by pseudonym if - designated); - - ii. a copyright notice; - - iii. a notice that refers to this Public License; - - iv. a notice that refers to the disclaimer of - warranties; - - v. a URI or hyperlink to the Licensed Material to the - extent reasonably practicable; - - b. indicate if You modified the Licensed Material and - retain an indication of any previous modifications; and - - c. indicate the Licensed Material is licensed under this - Public License, and include the text of, or the URI or - hyperlink to, this Public License. - - 2. You may satisfy the conditions in Section 3(a)(1) in any - reasonable manner based on the medium, means, and context in - which You Share the Licensed Material. For example, it may be - reasonable to satisfy the conditions by providing a URI or - hyperlink to a resource that includes the required - information. - 3. If requested by the Licensor, You must remove any of the - information required by Section 3(a)(1)(A) to the extent - reasonably practicable. - - b. ShareAlike. - - In addition to the conditions in Section 3(a), if You Share - Adapted Material You produce, the following conditions also apply. - - 1. The Adapter's License You apply must be a Creative Commons - license with the same License Elements, this version or - later, or a BY-NC-SA Compatible License. - - 2. You must include the text of, or the URI or hyperlink to, the - Adapter's License You apply. You may satisfy this condition - in any reasonable manner based on the medium, means, and - context in which You Share Adapted Material. - - 3. You may not offer or impose any additional or different terms - or conditions on, or apply any Effective Technological - Measures to, Adapted Material that restrict exercise of the - rights granted under the Adapter's License You apply. 
- - -Section 4 -- Sui Generis Database Rights. - -Where the Licensed Rights include Sui Generis Database Rights that -apply to Your use of the Licensed Material: - - a. for the avoidance of doubt, Section 2(a)(1) grants You the right - to extract, reuse, reproduce, and Share all or a substantial - portion of the contents of the database for NonCommercial purposes - only; - - b. if You include all or a substantial portion of the database - contents in a database in which You have Sui Generis Database - Rights, then the database in which You have Sui Generis Database - Rights (but not its individual contents) is Adapted Material, - including for purposes of Section 3(b); and - - c. You must comply with the conditions in Section 3(a) if You Share - all or a substantial portion of the contents of the database. - -For the avoidance of doubt, this Section 4 supplements and does not -replace Your obligations under this Public License where the Licensed -Rights include other Copyright and Similar Rights. - - -Section 5 -- Disclaimer of Warranties and Limitation of Liability. - - a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE - EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS - AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF - ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, - IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, - WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR - PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, - ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT - KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT - ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. - - b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE - TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, - NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, - INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, - COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR - USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN - ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR - DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR - IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. - - c. The disclaimer of warranties and limitation of liability provided - above shall be interpreted in a manner that, to the extent - possible, most closely approximates an absolute disclaimer and - waiver of all liability. - - -Section 6 -- Term and Termination. - - a. This Public License applies for the term of the Copyright and - Similar Rights licensed here. However, if You fail to comply with - this Public License, then Your rights under this Public License - terminate automatically. - - b. Where Your right to use the Licensed Material has terminated under - Section 6(a), it reinstates: - - 1. automatically as of the date the violation is cured, provided - it is cured within 30 days of Your discovery of the - violation; or - - 2. upon express reinstatement by the Licensor. - - For the avoidance of doubt, this Section 6(b) does not affect any - right the Licensor may have to seek remedies for Your violations - of this Public License. - - c. For the avoidance of doubt, the Licensor may also offer the - Licensed Material under separate terms or conditions or stop - distributing the Licensed Material at any time; however, doing so - will not terminate this Public License. - - d. Sections 1, 5, 6, 7, and 8 survive termination of this Public - License. - - -Section 7 -- Other Terms and Conditions. - - a. 
The Licensor shall not be bound by any additional or different - terms or conditions communicated by You unless expressly agreed. - - b. Any arrangements, understandings, or agreements regarding the - Licensed Material not stated herein are separate from and - independent of the terms and conditions of this Public License. - - -Section 8 -- Interpretation. - - a. For the avoidance of doubt, this Public License does not, and - shall not be interpreted to, reduce, limit, restrict, or impose - conditions on any use of the Licensed Material that could lawfully - be made without permission under this Public License. - - b. To the extent possible, if any provision of this Public License is - deemed unenforceable, it shall be automatically reformed to the - minimum extent necessary to make it enforceable. If the provision - cannot be reformed, it shall be severed from this Public License - without affecting the enforceability of the remaining terms and - conditions. - - c. No term or condition of this Public License will be waived and no - failure to comply consented to unless expressly agreed to by the - Licensor. - - d. Nothing in this Public License constitutes or may be interpreted - as a limitation upon, or waiver of, any privileges and immunities - that apply to the Licensor or You, including from the legal - processes of any jurisdiction or authority. - -======================================================================= - -Creative Commons is not a party to its public -licenses. Notwithstanding, Creative Commons may elect to apply one of -its public licenses to material it publishes and in those instances -will be considered the “Licensor.” The text of the Creative Commons -public licenses is dedicated to the public domain under the CC0 Public -Domain Dedication. 
Except for the limited purpose of indicating that
-material is shared under a Creative Commons public license or as
-otherwise permitted by the Creative Commons policies published at
-creativecommons.org/policies, Creative Commons does not authorize the
-use of the trademark "Creative Commons" or any other trademark or logo
-of Creative Commons without its prior written consent including,
-without limitation, in connection with any unauthorized modifications
-to any of its public licenses or any other arrangements,
-understandings, or agreements concerning use of licensed material. For
-the avoidance of doubt, this paragraph does not form part of the
-public licenses.
-
-Creative Commons may be contacted at creativecommons.org.
diff --git a/README.md b/README.md
deleted file mode 100644
index 16b2c92..0000000
--- a/README.md
+++ /dev/null
@@ -1,69 +0,0 @@
-# Easy-RL
-
-Hung-yi Lee's (李宏毅) *Deep Reinforcement Learning* is one of the classic Chinese-language video courses on reinforcement learning. His humorous teaching style makes obscure reinforcement learning theory easy to follow, and he explains it through many entertaining examples, often using Atari games to illustrate the algorithms. For completeness, we have also compiled Bolei Zhou's (周博磊) *Intro to Reinforcement Learning*, Kejiao Li's (李科浇) *Baidu Reinforcement Learning*, and several other classic reinforcement learning resources as supplements. We highly recommend this tutorial to anyone who wants to get started with reinforcement learning and prefers explanations in Chinese.
-
-## Usage
-
-* Chapters 4 to 11 are based on [Hung-yi Lee's *Deep Reinforcement Learning*](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html);
-* Chapters 1 and 2 are compiled from [*Intro to Reinforcement Learning*](https://github.com/zhoubolei/introRL);
-* Chapters 3 and 12 are compiled from [*Baidu Reinforcement Learning*](https://aistudio.baidu.com/aistudio/education/group/info/1335).
-
-
-## Read online (updated continuously)
-URL: https://datawhalechina.github.io/easy-rl/
-
-## Contents
-| Chapter | Exercises | Related project |
-| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
-| [Chapter 1: Overview of Reinforcement Learning](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1) | [Chapter 1 exercises](https://datawhalechina.github.io/easy-rl/#/chapter1/chapter1_questions&keywords) | |
-| [Chapter 2: Markov Decision Process (MDP)](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2) | [Chapter 2 exercises](https://datawhalechina.github.io/easy-rl/#/chapter2/chapter2_questions&keywords) | |
-| [Chapter 3: Tabular Methods](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3) | [Chapter 3 exercises](https://datawhalechina.github.io/easy-rl/#/chapter3/chapter3_questions&keywords) | [Q-learning in practice](https://datawhalechina.github.io/easy-rl/#/chapter3/project1) |
-| [Chapter 4: Policy Gradient](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4) | [Chapter 4 exercises](https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4_questions&keywords) | |
-| [Chapter 5: Proximal Policy Optimization (PPO)](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5) | [Chapter 5 exercises](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5_questions&keywords) | |
-| [Chapter 6: DQN (Basic Concepts)](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6) | [Chapter 6 exercises](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6_questions&keywords) | |
-| [Chapter 7: DQN (Advanced Techniques)](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7) | [Chapter 7 exercises](https://datawhalechina.github.io/easy-rl/#/chapter7/chapter7_questions&keywords) | [DQN in practice](https://datawhalechina.github.io/easy-rl/#/chapter7/project2) |
-| [Chapter 8: DQN (Continuous Actions)](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8) | [Chapter 8 exercises](https://datawhalechina.github.io/easy-rl/#/chapter8/chapter8_questions&keywords) | |
-| [Chapter 9: Actor-Critic](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9) | [Chapter 9 exercises](https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9_questions&keywords) | |
-| [Chapter 10: Sparse Reward](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10) | [Chapter 10 exercises](https://datawhalechina.github.io/easy-rl/#/chapter10/chapter10_questions&keywords) | |
-| [Chapter 11: Imitation Learning](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11) | [Chapter 11 exercises](https://datawhalechina.github.io/easy-rl/#/chapter11/chapter11_questions&keywords) | |
-| [Chapter 12: Deep Deterministic Policy Gradient (DDPG)](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12) | [Chapter 12 exercises](https://datawhalechina.github.io/easy-rl/#/chapter12/chapter12_questions&keywords) | [DDPG in practice](https://datawhalechina.github.io/easy-rl/#/chapter12/project3) |
-| [Chapter 13: AlphaStar Paper Explained](https://datawhalechina.github.io/easy-rl/#/chapter13/chapter13) | | |
-## Hands-on algorithms
-
-[Click here](https://github.com/datawhalechina/easy-rl/tree/master/codes) or open the ```codes``` folder for the hands-on algorithm implementations.
-
-## Contributors
-
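- Qi Wang: tutorial design (chapters 1-12), University of Chinese Academy of Sciences
- David Young: exercise design and chapter 13, Tsinghua University
- John Jim: hands-on algorithm implementations, Peking University
-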
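-
-
-## Acknowledgements
-
-Special thanks to [@Sm1les](https://github.com/Sm1les) and [@LSGOMYP](https://github.com/LSGOMYP) for their help and support with this project.
-
-## Follow us
-Datawhale is an open-source organization focused on AI. With the vision of “for the learner, growing together with learners”, it builds the open-source learning community most valuable to learners. Follow us and learn and grow together.
-
-
-## LICENSE
-Creative Commons License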
本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。 - diff --git a/codes/A2C/README.md b/codes/A2C/README.md index e69de29..5856b80 100644 --- a/codes/A2C/README.md +++ b/codes/A2C/README.md @@ -0,0 +1,5 @@ +## A2C + + + +https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f \ No newline at end of file diff --git a/codes/A2C/agent.py b/codes/A2C/agent.py index aafe7c1..9de9aab 100644 --- a/codes/A2C/agent.py +++ b/codes/A2C/agent.py @@ -1,32 +1,27 @@ #!/usr/bin/env python # coding=utf-8 ''' -Author: John +Author: JiangJi Email: johnjim0816@gmail.com -Date: 2020-11-03 20:47:09 -LastEditor: John -LastEditTime: 2021-03-20 17:41:21 +Date: 2021-05-03 22:16:08 +LastEditor: JiangJi +LastEditTime: 2021-05-03 22:23:48 Discription: Environment: ''' -from A2C.model import ActorCritic import torch.optim as optim - +from A2C.model import ActorCritic class A2C: - def __init__(self,state_dim, action_dim, cfg): - self.gamma = 0.99 - self.model = ActorCritic(state_dim, action_dim, hidden_dim=cfg.hidden_dim).to(cfg.device) - self.optimizer = optim.Adam(self.model.parameters(),lr=cfg.lr) - def choose_action(self, state): - dist, value = self.model(state) - action = dist.sample() - return action + def __init__(self,state_dim,action_dim,cfg) -> None: + self.gamma = cfg.gamma + self.device = cfg.device + self.model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(self.device) + self.optimizer = optim.Adam(self.model.parameters()) + def compute_returns(self,next_value, rewards, masks): R = next_value returns = [] for step in reversed(range(len(rewards))): R = rewards[step] + self.gamma * R * masks[step] returns.insert(0, R) - return returns - def update(self): - pass \ No newline at end of file + return returns \ No newline at end of file diff --git a/codes/A2C/env.py b/codes/A2C/env.py deleted file mode 100644 index 652824b..0000000 --- a/codes/A2C/env.py +++ /dev/null @@ -1,48 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -''' -Author: John -Email: 
johnjim0816@gmail.com -Date: 2020-10-30 15:39:37 -LastEditor: John -LastEditTime: 2021-03-17 20:19:14 -Discription: -Environment: -''' - -import gym -from A2C.multiprocessing_env import SubprocVecEnv - -# num_envs = 16 -# env = "Pendulum-v0" - -def make_envs(num_envs=16,env="Pendulum-v0"): - ''' 创建多个子环境 - ''' - num_envs = 16 - env = "CartPole-v0" - def make_env(): - def _thunk(): - env = gym.make(env) - return env - - return _thunk - - envs = [make_env() for i in range(num_envs)] - envs = SubprocVecEnv(envs) - return envs -# if __name__ == "__main__": - -# num_envs = 16 -# env = "CartPole-v0" -# def make_env(): -# def _thunk(): -# env = gym.make(env) -# return env - -# return _thunk - -# envs = [make_env() for i in range(num_envs)] -# envs = SubprocVecEnv(envs) -if __name__ == "__main__": - envs = make_envs(num_envs=16,env="CartPole-v0") \ No newline at end of file diff --git a/codes/A2C/main.py b/codes/A2C/main.py deleted file mode 100644 index 9237f48..0000000 --- a/codes/A2C/main.py +++ /dev/null @@ -1,106 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -''' -@Author: John -@Email: johnjim0816@gmail.com -@Date: 2020-06-11 20:58:21 -@LastEditor: John -LastEditTime: 2021-04-05 11:14:39 -@Discription: -@Environment: python 3.7.9 -''' -import sys,os -curr_path = os.path.dirname(__file__) -parent_path=os.path.dirname(curr_path) -sys.path.append(parent_path) # add current terminal path to sys.path - -import torch -import gym -import datetime -from A2C.agent import A2C -from common.utils import save_results,make_dir,del_empty_dir - - - -SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # 获取当前时间 -SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # 生成保存的模型路径 -if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"): - os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/") -if not os.path.exists(SAVED_MODEL_PATH): - os.mkdir(SAVED_MODEL_PATH) -RESULT_PATH = 
os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # 存储reward的路径 -if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"): - os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/") -if not os.path.exists(RESULT_PATH): - os.mkdir(RESULT_PATH) - -class A2CConfig: - def __init__(self): - self.gamma = 0.99 - self.lr = 3e-4 # learnning rate - self.actor_lr = 1e-4 # learnning rate of actor network - self.memory_capacity = 10000 # capacity of replay memory - self.batch_size = 128 - self.train_eps = 200 - self.train_steps = 200 - self.eval_eps = 200 - self.eval_steps = 200 - self.target_update = 4 - self.hidden_dim=256 - self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - -def train(cfg,env,agent): - print('Start to train ! ') - for i_episode in range(cfg.train_eps): - state = env.reset() - log_probs = [] - values = [] - rewards = [] - masks = [] - entropy = 0 - ep_reward = 0 - for i_step in range(cfg.train_steps): - state = torch.FloatTensor(state).to(cfg.device) - dist, value = agent.model(state) - action = dist.sample() - next_state, reward, done, _ = env.step(action.cpu().numpy()) - ep_reward+=reward - state = next_state - log_prob = dist.log_prob(action) - entropy += dist.entropy().mean() - log_probs.append(log_prob) - values.append(value) - rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device)) - masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device)) - if done: - break - print('Episode:{}/{}, Reward:{}, Steps:{}, Done:{}'.format(i_episode+1,cfg.train_eps,ep_reward,i_step+1,done)) - next_state = torch.FloatTensor(next_state).to(cfg.device) - _, next_value =agent.model(next_state) - returns = agent.compute_returns(next_value, rewards, masks) - - log_probs = torch.cat(log_probs) - returns = torch.cat(returns).detach() - values = torch.cat(values) - advantage = returns - values - actor_loss = -(log_probs * advantage.detach()).mean() - critic_loss = 
advantage.pow(2).mean() - loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy - - agent.optimizer.zero_grad() - loss.backward() - agent.optimizer.step() - - print('Complete training!') - - - -if __name__ == "__main__": - cfg = A2CConfig() - env = gym.make('CartPole-v0') - env.seed(1) # set random seed for env - state_dim = env.observation_space.shape[0] - action_dim = env.action_space.n - agent = A2C(state_dim, action_dim, cfg) - train(cfg,env,agent) - diff --git a/codes/A2C/model.py b/codes/A2C/model.py index 46b59de..5e77d4d 100644 --- a/codes/A2C/model.py +++ b/codes/A2C/model.py @@ -1,36 +1,36 @@ #!/usr/bin/env python # coding=utf-8 ''' -Author: John +Author: JiangJi Email: johnjim0816@gmail.com -Date: 2020-11-03 20:45:25 -LastEditor: John -LastEditTime: 2021-03-20 17:41:33 +Date: 2021-05-03 21:38:54 +LastEditor: JiangJi +LastEditTime: 2021-05-03 21:40:06 Discription: Environment: ''' import torch.nn as nn +import torch.nn.functional as F from torch.distributions import Categorical - class ActorCritic(nn.Module): - def __init__(self, state_dim, action_dim, hidden_dim=256): + def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0): super(ActorCritic, self).__init__() + self.critic = nn.Sequential( - nn.Linear(state_dim, hidden_dim), + nn.Linear(num_inputs, hidden_size), nn.ReLU(), - nn.Linear(hidden_dim, 1) + nn.Linear(hidden_size, 1) ) self.actor = nn.Sequential( - nn.Linear(state_dim, hidden_dim), + nn.Linear(num_inputs, hidden_size), nn.ReLU(), - nn.Linear(hidden_dim, action_dim), + nn.Linear(hidden_size, num_outputs), nn.Softmax(dim=1), ) def forward(self, x): value = self.critic(x) - print(x) probs = self.actor(x) dist = Categorical(probs) return dist, value \ No newline at end of file diff --git a/codes/A2C/multiprocessing_env.py b/codes/A2C/multiprocessing_env.py deleted file mode 100644 index 46bbc08..0000000 --- a/codes/A2C/multiprocessing_env.py +++ /dev/null @@ -1,153 +0,0 @@ -#This code is from openai baseline 
-#https://github.com/openai/baselines/tree/master/baselines/common/vec_env - -import numpy as np -from multiprocessing import Process, Pipe - -def worker(remote, parent_remote, env_fn_wrapper): - parent_remote.close() - env = env_fn_wrapper.x() - while True: - cmd, data = remote.recv() - if cmd == 'step': - ob, reward, done, info = env.step(data) - if done: - ob = env.reset() - remote.send((ob, reward, done, info)) - elif cmd == 'reset': - ob = env.reset() - remote.send(ob) - elif cmd == 'reset_task': - ob = env.reset_task() - remote.send(ob) - elif cmd == 'close': - remote.close() - break - elif cmd == 'get_spaces': - remote.send((env.observation_space, env.action_space)) - else: - raise NotImplementedError - -class VecEnv(object): - """ - An abstract asynchronous, vectorized environment. - """ - def __init__(self, num_envs, observation_space, action_space): - self.num_envs = num_envs - self.observation_space = observation_space - self.action_space = action_space - - def reset(self): - """ - Reset all the environments and return an array of - observations, or a tuple of observation arrays. - If step_async is still doing work, that work will - be cancelled and step_wait() should not be called - until step_async() is invoked again. - """ - pass - - def step_async(self, actions): - """ - Tell all the environments to start taking a step - with the given actions. - Call step_wait() to get the results of the step. - You should not call this if a step_async run is - already pending. - """ - pass - - def step_wait(self): - """ - Wait for the step taken with step_async(). - Returns (obs, rews, dones, infos): - - obs: an array of observations, or a tuple of - arrays of observations. - - rews: an array of rewards - - dones: an array of "episode done" booleans - - infos: a sequence of info objects - """ - pass - - def close(self): - """ - Clean up the environments' resources. 
- """ - pass - - def step(self, actions): - self.step_async(actions) - return self.step_wait() - - -class CloudpickleWrapper(object): - """ - Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle) - """ - def __init__(self, x): - self.x = x - def __getstate__(self): - import cloudpickle - return cloudpickle.dumps(self.x) - def __setstate__(self, ob): - import pickle - self.x = pickle.loads(ob) - - -class SubprocVecEnv(VecEnv): - def __init__(self, env_fns, spaces=None): - """ - envs: list of gym environments to run in subprocesses - """ - self.waiting = False - self.closed = False - nenvs = len(env_fns) - self.nenvs = nenvs - self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)]) - self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn))) - for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)] - for p in self.ps: - p.daemon = True # if the main process crashes, we should not cause things to hang - p.start() - for remote in self.work_remotes: - remote.close() - - self.remotes[0].send(('get_spaces', None)) - observation_space, action_space = self.remotes[0].recv() - VecEnv.__init__(self, len(env_fns), observation_space, action_space) - - def step_async(self, actions): - for remote, action in zip(self.remotes, actions): - remote.send(('step', action)) - self.waiting = True - - def step_wait(self): - results = [remote.recv() for remote in self.remotes] - self.waiting = False - obs, rews, dones, infos = zip(*results) - return np.stack(obs), np.stack(rews), np.stack(dones), infos - - def reset(self): - for remote in self.remotes: - remote.send(('reset', None)) - return np.stack([remote.recv() for remote in self.remotes]) - - def reset_task(self): - for remote in self.remotes: - remote.send(('reset_task', None)) - return np.stack([remote.recv() for remote in self.remotes]) - - def close(self): - if self.closed: - return - if self.waiting: - for remote in 
self.remotes: - remote.recv() - for remote in self.remotes: - remote.send(('close', None)) - for p in self.ps: - p.join() - self.closed = True - - def __len__(self): - return self.nenvs diff --git a/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_ma_rewards.npy b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_ma_rewards.npy new file mode 100644 index 0000000..57f4174 Binary files /dev/null and b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_ma_rewards.npy differ diff --git a/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards.npy b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards.npy new file mode 100644 index 0000000..bdb3fce Binary files /dev/null and b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards.npy differ diff --git a/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards_curve.png b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards_curve.png new file mode 100644 index 0000000..5f1cf9a Binary files /dev/null and b/codes/A2C/outputs/CartPole-v0/20210503-224814/results/train_rewards_curve.png differ diff --git a/codes/A2C/task0_train.py b/codes/A2C/task0_train.py new file mode 100644 index 0000000..69f6976 --- /dev/null +++ b/codes/A2C/task0_train.py @@ -0,0 +1,120 @@ +import sys,os +curr_path = os.path.dirname(__file__) +parent_path = os.path.dirname(curr_path) +sys.path.append(parent_path) # add current terminal path to sys.path + + +import gym +import numpy as np +import torch +import torch.optim as optim +import datetime +from common.multiprocessing_env import SubprocVecEnv +from A2C.model import ActorCritic +from common.utils import save_results, make_dir +from common.plot import plot_rewards + +curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time +class A2CConfig: + def __init__(self) -> None: + self.algo='A2C' + self.env= 'CartPole-v0' + self.result_path = curr_path+"/outputs/" 
+self.env+'/'+curr_time+'/results/' # path to save results + self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models + self.n_envs = 8 + self.gamma = 0.99 + self.hidden_size = 256 + self.lr = 1e-3 # learning rate + self.max_frames = 30000 + self.n_steps = 5 + self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +def make_envs(env_name): + def _thunk(): + env = gym.make(env_name) + env.seed(2) + return env + return _thunk +def test_env(env,model,vis=False): + state = env.reset() + if vis: env.render() + done = False + total_reward = 0 + while not done: + state = torch.FloatTensor(state).unsqueeze(0).to(cfg.device) + dist, _ = model(state) + next_state, reward, done, _ = env.step(dist.sample().cpu().numpy()[0]) + state = next_state + if vis: env.render() + total_reward += reward + return total_reward +def compute_returns(next_value, rewards, masks, gamma=0.99): + R = next_value + returns = [] + for step in reversed(range(len(rewards))): + R = rewards[step] + gamma * R * masks[step] + returns.insert(0, R) + return returns + + +def train(cfg,envs): + env = gym.make(cfg.env) # a single env + env.seed(10) + state_dim = envs.observation_space.shape[0] + action_dim = envs.action_space.n + model = ActorCritic(state_dim, action_dim, cfg.hidden_size).to(cfg.device) + optimizer = optim.Adam(model.parameters()) + frame_idx = 0 + test_rewards = [] + test_ma_rewards = [] + state = envs.reset() + while frame_idx < cfg.max_frames: + log_probs = [] + values = [] + rewards = [] + masks = [] + entropy = 0 + # rollout trajectory + for _ in range(cfg.n_steps): + state = torch.FloatTensor(state).to(cfg.device) + dist, value = model(state) + action = dist.sample() + next_state, reward, done, _ = envs.step(action.cpu().numpy()) + log_prob = dist.log_prob(action) + entropy += dist.entropy().mean() + log_probs.append(log_prob) + values.append(value) + rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(cfg.device)) + 
masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(cfg.device)) + state = next_state + frame_idx += 1 + if frame_idx % 100 == 0: + test_reward = np.mean([test_env(env,model) for _ in range(10)]) + print(f"frame_idx:{frame_idx}, test_reward:{test_reward}") + test_rewards.append(test_reward) + if test_ma_rewards: + test_ma_rewards.append(0.9*test_ma_rewards[-1]+0.1*test_reward) + else: + test_ma_rewards.append(test_reward) + # plot(frame_idx, test_rewards) + next_state = torch.FloatTensor(next_state).to(cfg.device) + _, next_value = model(next_state) + returns = compute_returns(next_value, rewards, masks) + log_probs = torch.cat(log_probs) + returns = torch.cat(returns).detach() + values = torch.cat(values) + advantage = returns - values + actor_loss = -(log_probs * advantage.detach()).mean() + critic_loss = advantage.pow(2).mean() + loss = actor_loss + 0.5 * critic_loss - 0.001 * entropy + optimizer.zero_grad() + loss.backward() + optimizer.step() + return test_rewards, test_ma_rewards +if __name__ == "__main__": + cfg = A2CConfig() + envs = [make_envs(cfg.env) for i in range(cfg.n_envs)] + envs = SubprocVecEnv(envs) # 8 env + rewards,ma_rewards = train(cfg,envs) + make_dir(cfg.result_path,cfg.model_path) + save_results(rewards,ma_rewards,tag='train',path=cfg.result_path) + plot_rewards(rewards,ma_rewards,tag="train",env=cfg.env,algo = cfg.algo,path=cfg.result_path) diff --git a/codes/A2C/utils.py b/codes/A2C/utils.py deleted file mode 100644 index b6d66c6..0000000 --- a/codes/A2C/utils.py +++ /dev/null @@ -1,32 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -''' -Author: John -Email: johnjim0816@gmail.com -Date: 2020-10-15 21:31:19 -LastEditor: John -LastEditTime: 2020-11-03 17:05:48 -Discription: -Environment: -''' -import os -import numpy as np -import datetime - -SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") -SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' -RESULT_PATH = 
os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' - - -def save_results(rewards,moving_average_rewards,ep_steps,path=RESULT_PATH): - if not os.path.exists(path): # check whether the directory exists - os.mkdir(path) - np.save(RESULT_PATH+'rewards_train.npy', rewards) - np.save(RESULT_PATH+'moving_average_rewards_train.npy', moving_average_rewards) - np.save(RESULT_PATH+'steps_train.npy',ep_steps ) - -def save_model(agent,model_path='./saved_model'): - if not os.path.exists(model_path): # check whether the directory exists - os.mkdir(model_path) - agent.save_model(model_path+'checkpoint.pth') - print('model saved!') \ No newline at end of file diff --git a/codes/DDPG/README.md b/codes/DDPG/README.md index 351615b..bbcedcc 100644 --- a/codes/DDPG/README.md +++ b/codes/DDPG/README.md @@ -1,5 +1,7 @@ # DDPG +#TODO + ## Pseudocode ![image-20210320151900695](assets/image-20210320151900695.png) \ No newline at end of file diff --git a/codes/DQN/README.md b/codes/DQN/README.md index 530d666..45612be 100644 --- a/codes/DQN/README.md +++ b/codes/DQN/README.md @@ -1,5 +1,5 @@ # DQN - +#TODO ## Overview of the principle DQN is an optimization and extension of the Q-learning algorithm. Q-learning stores value information in a finite Q-table, whereas DQN replaces the Q-table with a neural network, which makes it better suited to high-dimensional problems. For background, see [Datawhale Hung-yi Lee notes: Q-learning](https://datawhalechina.github.io/easy-rl/#/chapter6/chapter6). diff --git a/codes/DQN/agent.py b/codes/DQN/agent.py index 7890270..669295f 100644 --- a/codes/DQN/agent.py +++ b/codes/DQN/agent.py @@ -5,7 +5,7 @@ @Email: johnjim0816@gmail.com @Date: 2020-06-12 00:50:49 @LastEditor: John -LastEditTime: 2021-03-30 17:01:26 +LastEditTime: 2021-04-29 22:19:18 @Discription: @Environment: python 3.7.7 ''' @@ -39,6 +39,8 @@ class DQN: hidden_dim=cfg.hidden_dim).to(self.device) self.target_net = MLP(state_dim, action_dim, hidden_dim=cfg.hidden_dim).to(self.device) + for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()): + target_param.data.copy_(param.data) self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr) self.loss = 0 self.memory = 
ReplayBuffer(cfg.memory_capacity) @@ -48,21 +50,16 @@ class DQN: ''' self.frame_idx += 1 if random.random() > self.epsilon(self.frame_idx): - with torch.no_grad(): - # convert to a tensor first so it can be fed to the network; state elements are originally float64 - # note that state=torch.tensor(state).unsqueeze(0) is equivalent to state=torch.tensor([state]) - state = torch.tensor( - [state], device=self.device, dtype=torch.float32) - # e.g. tensor([[-0.0798, -0.0079]], grad_fn=) - q_value = self.policy_net(state) - # tensor.max(1) returns the maximum of each row and its index, - # e.g. torch.return_types.max(values=tensor([10.3587]),indices=tensor([0])) - # so tensor.max(1)[1] returns the index of the maximum, i.e. the action - action = q_value.max(1)[1].item() + action = self.predict(state) else: action = random.randrange(self.action_dim) return action - + def predict(self,state): + with torch.no_grad(): + state = torch.tensor([state], device=self.device, dtype=torch.float32) + q_values = self.policy_net(state) + action = q_values.max(1)[1].item() + return action def update(self): if len(self.memory) < self.batch_size: @@ -109,3 +106,5 @@ class DQN: def load(self, path): self.target_net.load_state_dict(torch.load(path+'dqn_checkpoint.pth')) + for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()): + param.data.copy_(target_param.data) diff --git a/codes/DQN/outputs/CartPole-v0/20210418-143542/models/dqn_checkpoint.pth b/codes/DQN/outputs/CartPole-v0/20210418-143542/models/dqn_checkpoint.pth deleted file mode 100644 index 3bc041d..0000000 Binary files a/codes/DQN/outputs/CartPole-v0/20210418-143542/models/dqn_checkpoint.pth and /dev/null differ diff --git a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/ma_rewards_train.npy b/codes/DQN/outputs/CartPole-v0/20210418-143542/results/ma_rewards_train.npy deleted file mode 100644 index 152ad7a..0000000 Binary files a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/ma_rewards_train.npy and /dev/null differ diff --git a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_curve_train.png 
b/codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_curve_train.png deleted file mode 100644 index ad42573..0000000 Binary files a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_curve_train.png and /dev/null differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/models/dqn_checkpoint.pth b/codes/DQN/outputs/CartPole-v0/20210429-222132/models/dqn_checkpoint.pth new file mode 100644 index 0000000..2b2200e Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/models/dqn_checkpoint.pth differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_ma_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_ma_rewards.npy new file mode 100644 index 0000000..e25eb51 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_ma_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards.npy new file mode 100644 index 0000000..2fc0e4e Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards_curve.png b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards_curve.png new file mode 100644 index 0000000..295fdac Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/eval_rewards_curve.png differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_ma_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_ma_rewards.npy new file mode 100644 index 0000000..f71e613 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_ma_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards.npy new file mode 
100644 index 0000000..fa9ffc3 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards_curve.png b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards_curve.png new file mode 100644 index 0000000..a6857d3 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222132/results/train_rewards_curve.png differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/models/dqn_checkpoint.pth b/codes/DQN/outputs/CartPole-v0/20210429-222344/models/dqn_checkpoint.pth new file mode 100644 index 0000000..0b192d4 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222344/models/dqn_checkpoint.pth differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_ma_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_ma_rewards.npy new file mode 100644 index 0000000..f91ed3c Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_ma_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards.npy new file mode 100644 index 0000000..51e06c2 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards_curve.png b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards_curve.png new file mode 100644 index 0000000..0327b47 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/eval_rewards_curve.png differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_ma_rewards.npy b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_ma_rewards.npy new file mode 100644 index 0000000..fadc523 Binary files /dev/null and 
b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_ma_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_train.npy b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards.npy similarity index 60% rename from codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_train.npy rename to codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards.npy index 58fb2b8..61cf9fc 100644 Binary files a/codes/DQN/outputs/CartPole-v0/20210418-143542/results/rewards_train.npy and b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards.npy differ diff --git a/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards_curve.png b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards_curve.png new file mode 100644 index 0000000..b9667f1 Binary files /dev/null and b/codes/DQN/outputs/CartPole-v0/20210429-222344/results/train_rewards_curve.png differ diff --git a/codes/DQN/main.py b/codes/DQN/task0_train.py similarity index 62% rename from codes/DQN/main.py rename to codes/DQN/task0_train.py index cd22aad..fc13983 100644 --- a/codes/DQN/main.py +++ b/codes/DQN/task0_train.py @@ -5,7 +5,7 @@ @Email: johnjim0816@gmail.com @Date: 2020-06-12 00:48:57 @LastEditor: John -LastEditTime: 2021-04-29 02:02:12 +LastEditTime: 2021-04-29 22:23:38 @Discription: @Environment: python 3.7.7 ''' @@ -36,21 +36,28 @@ class DQNConfig: '/'+curr_time+'/results/' # path to save results self.model_path = curr_path+"/outputs/" + self.env + \ '/'+curr_time+'/models/' # path to save models + self.train_eps = 300 # number of training episodes + self.eval_eps = 50 # number of episodes for evaluating self.gamma = 0.95 - self.epsilon_start = 1 # initial epsilon of the e-greedy policy + self.epsilon_start = 0.90 # initial epsilon of the e-greedy policy self.epsilon_end = 0.01 self.epsilon_decay = 500 self.lr = 0.0001 # learning rate - self.memory_capacity = 10000 # replay memory capacity - self.batch_size = 32 - self.train_eps = 300 # number of training episodes + 
self.memory_capacity = 100000 # replay memory capacity + self.batch_size = 64 self.target_update = 2 # update frequency of the target net - self.eval_eps = 20 # number of test episodes self.device = torch.device( "cuda" if torch.cuda.is_available() else "cpu") # check for GPU self.hidden_dim = 256 # hidden-layer dimension of the network - +def env_agent_config(cfg,seed=1): + env = gym.make(cfg.env) + env.seed(seed) + state_dim = env.observation_space.shape[0] + action_dim = env.action_space.n + agent = DQN(state_dim,action_dim,cfg) + return env,agent + def train(cfg, env, agent): print('Start to train !') print(f'Env:{cfg.env}, Algorithm:{cfg.algo}, Device:{cfg.device}') @@ -60,13 +67,15 @@ def train(cfg, env, agent): state = env.reset() done = False ep_reward = 0 - while not done: + while True: action = agent.choose_action(state) next_state, reward, done, _ = env.step(action) ep_reward += reward agent.memory.push(state, action, reward, next_state, done) state = next_state agent.update() + if done: + break if i_episode % cfg.target_update == 0: agent.target_net.load_state_dict(agent.policy_net.state_dict()) print('Episode:{}/{}, Reward:{}'.format(i_episode+1, cfg.train_eps, ep_reward)) @@ -79,17 +88,39 @@ def train(cfg, env, agent): print('Complete training!') return rewards, ma_rewards +def eval(cfg,env,agent): + rewards = [] # rewards of all episodes + ma_rewards = [] # moving-average rewards + for i_ep in range(cfg.eval_eps): + ep_reward = 0 # reward of the current episode + state = env.reset() # reset the environment, i.e. start a new episode + while True: + action = agent.predict(state) # choose an action according to the algorithm + next_state, reward, done, _ = env.step(action) # take one step in the environment + state = next_state # update the current state + ep_reward += reward + if done: + break + rewards.append(ep_reward) + if ma_rewards: + ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1) + else: + ma_rewards.append(ep_reward) + print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}") + return rewards,ma_rewards if __name__ == "__main__": cfg = DQNConfig() - env = gym.make(cfg.env) - env.seed(1) - state_dim = 
env.observation_space.shape[0] - action_dim = env.action_space.n - agent = DQN(state_dim, action_dim, cfg) + env,agent = env_agent_config(cfg,seed=1) rewards, ma_rewards = train(cfg, env, agent) make_dir(cfg.result_path, cfg.model_path) agent.save(path=cfg.model_path) save_results(rewards, ma_rewards, tag='train', path=cfg.result_path) plot_rewards(rewards, ma_rewards, tag="train", algo=cfg.algo, path=cfg.result_path) + + env,agent = env_agent_config(cfg,seed=10) + agent.load(path=cfg.model_path) + rewards,ma_rewards = eval(cfg,env,agent) + save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path) + plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path) diff --git a/codes/QLearning/README.md b/codes/QLearning/README.md new file mode 100644 index 0000000..cc8ef5e --- /dev/null +++ b/codes/QLearning/README.md @@ -0,0 +1,3 @@ +# Q-learning + +#TODO diff --git a/codes/QLearning/agent.py b/codes/QLearning/agent.py index 2d2cb97..3a4eadb 100644 --- a/codes/QLearning/agent.py +++ b/codes/QLearning/agent.py @@ -5,8 +5,8 @@ Author: John Email: johnjim0816@gmail.com Date: 2020-09-11 23:03:00 LastEditor: John -LastEditTime: 2021-03-26 16:51:01 -Discription: +LastEditTime: 2021-04-29 16:59:41 +Discription: use defaultdict to define Q table Environment: ''' import numpy as np @@ -15,7 +15,7 @@ import torch from collections import defaultdict class QLearning(object): - def __init__(self, + def __init__(self,state_dim, action_dim,cfg): self.action_dim = action_dim # dimension of acgtion self.lr = cfg.lr # learning rate @@ -26,17 +26,20 @@ class QLearning(object): self.epsilon_end = cfg.epsilon_end self.epsilon_decay = cfg.epsilon_decay self.Q_table = defaultdict(lambda: np.zeros(action_dim)) # A nested dictionary that maps state -> (action -> action-value) + def choose_action(self, state): self.sample_count += 1 self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \ math.exp(-1. 
* self.sample_count / self.epsilon_decay) # e-greedy policy if np.random.uniform(0, 1) > self.epsilon: - action = np.argmax(self.Q_table[str(state)]) + action = self.predict(state) else: action = np.random.choice(self.action_dim) return action - + def predict(self,state): + action = np.argmax(self.Q_table[str(state)]) + return action def update(self, state, action, reward, next_state, done): Q_predict = self.Q_table[str(state)][action] if done: diff --git a/codes/QLearning/agent1.py b/codes/QLearning/agent1.py new file mode 100644 index 0000000..4f025ac --- /dev/null +++ b/codes/QLearning/agent1.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python +# coding=utf-8 +''' +Author: John +Email: johnjim0816@gmail.com +Date: 2020-09-11 23:03:00 +LastEditor: John +LastEditTime: 2021-04-29 17:02:00 +Discription: +Environment: +''' +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import numpy as np +import math +#!/usr/bin/env python +# coding=utf-8 +''' +Author: John +Email: johnjim0816@gmail.com +Date: 2020-09-11 23:03:00 +LastEditor: John +LastEditTime: 2021-04-29 16:45:33 +Discription: use np array to define Q table +Environment: +''' +import numpy as np +import math + +class QLearning(object): + def __init__(self, + state_dim,action_dim,cfg): + self.action_dim = action_dim # dimension of action + self.lr = cfg.lr # learning rate + self.gamma = cfg.gamma + self.epsilon = 0 + self.sample_count = 0 + self.epsilon_start = cfg.epsilon_start + self.epsilon_end = cfg.epsilon_end + self.epsilon_decay = cfg.epsilon_decay + self.Q_table = np.zeros((state_dim, action_dim)) # Q table + + def choose_action(self, state): + self.sample_count += 1 + self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \ + math.exp(-1. * self.sample_count / self.epsilon_decay) + if np.random.uniform(0, 1) > self.epsilon: # draw a value in [0, 1); if it exceeds epsilon act greedily, otherwise act randomly + action = self.predict(state) + else: + action = np.random.choice(self.action_dim) # explore: pick a random action with some probability + return action + + def predict(self, state): + '''Given an observation, return a greedy action (ties broken at random); used when testing the model + ''' + Q_list = self.Q_table[state, :] + Q_max = np.max(Q_list) + action_list = np.where(Q_list == Q_max)[0] + action = np.random.choice(action_list) # Q_max may correspond to several actions; pick one of them at random + return action + + def update(self, state, action, reward, next_state, done): + Q_predict = self.Q_table[state, action] + if done: + Q_target = reward # no next state + else: + Q_target = reward + self.gamma * np.max( + self.Q_table[next_state, :]) # Q-learning target + self.Q_table[state, action] += self.lr * (Q_target - Q_predict) # update Q + def save(self,path): + np.save(path+"Q_table.npy", self.Q_table) + def load(self, path): + self.Q_table = np.load(path+"Q_table.npy") + + diff --git a/codes/QLearning/main.ipynb b/codes/QLearning/main.ipynb deleted file mode 100644 index 91d2a6b..0000000 --- a/codes/QLearning/main.ipynb +++ 
/dev/null @@ -1,152 +0,0 @@ -{ - "metadata": { - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10-final" - }, - "orig_nbformat": 2, - "kernelspec": { - "name": "python3", - "display_name": "Python 3.7.10 64-bit ('py37': conda)", - "metadata": { - "interpreter": { - "hash": "fbea1422c2cf61ed9c0cfc03f38f71cc9083cc288606edc4170b5309b352ce27" - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 2, - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import sys\n", - "from pathlib import Path\n", - "curr_path = str(Path().absolute())\n", - "parent_path = str(Path().absolute().parent)\n", - "sys.path.append(parent_path) # add current terminal path to sys.path\n", - "\n", - "import gym\n", - "\n", - "from envs.gridworld_env import CliffWalkingWapper, FrozenLakeWapper\n", - "from QLearning.agent import QLearning\n", - "from common.plot import plot_rewards\n", - "from common.utils import save_results" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "class QlearningConfig:\n", - " '''Training-related parameters'''\n", - " def __init__(self):\n", - " self.train_eps = 200 # number of training episodes\n", - " self.gamma = 0.9 # reward discount factor\n", - " self.epsilon_start = 0.99 # initial epsilon of the e-greedy policy\n", - " self.epsilon_end = 0.01 # final epsilon of the e-greedy policy\n", - " self.epsilon_decay = 200 # decay rate of epsilon in the e-greedy policy\n", - " self.lr = 0.1 # learning rate" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "def train(cfg,env,agent):\n", - " rewards = [] \n", - " ma_rewards = [] # moving average reward\n", - " steps = [] # steps of all episodes\n", - " for i_episode in range(cfg.train_eps):\n", - " ep_reward = 0 # reward of the current episode\n", - " ep_steps 
= 0 # number of steps taken in the current episode\n", - " state = env.reset() # reset the environment, i.e. start a new episode\n", - " while True:\n", - " action = agent.choose_action(state) # choose an action according to the algorithm\n", - " next_state, reward, done, _ = env.step(action) # take one step in the environment\n", - " agent.update(state, action, reward, next_state, done) # Q-learning update\n", - " state = next_state # update the current state\n", - " ep_reward += reward\n", - " ep_steps += 1 # count steps\n", - " if done:\n", - " break\n", - " steps.append(ep_steps)\n", - " rewards.append(ep_reward)\n", - " if ma_rewards:\n", - " ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)\n", - " else:\n", - " ma_rewards.append(ep_reward)\n", - " if (i_episode+1)%10==0:\n", - " print(\"Episode:{}/{}: reward:{:.1f}\".format(i_episode+1, cfg.train_eps,ep_reward))\n", - " return rewards,ma_rewards" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Episode:10/200: reward:-82.0\n", - "Episode:20/200: reward:-59.0\n", - "Episode:30/200: reward:-50.0\n", - "Episode:40/200: reward:-32.0\n", - "Episode:50/200: reward:-102.0\n", - "Episode:60/200: reward:-151.0\n", - "Episode:70/200: reward:-34.0\n", - "Episode:80/200: reward:-71.0\n", - "Episode:90/200: reward:-34.0\n", - "Episode:100/200: reward:-26.0\n", - "Episode:110/200: reward:-32.0\n", - "Episode:120/200: reward:-48.0\n", - "Episode:130/200: reward:-25.0\n", - "Episode:140/200: reward:-31.0\n", - "Episode:150/200: reward:-38.0\n", - "Episode:160/200: reward:-47.0\n", - "Episode:170/200: reward:-29.0\n", - "Episode:180/200: reward:-36.0\n", - "Episode:190/200: reward:-21.0\n", - "Episode:200/200: reward:-34.0\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": "
", - "image/svg+xml": "[SVG figure elided: Matplotlib v3.3.4 (https://matplotlib.org/) output created 2021-03-31T18:50:18.442345]", - "image/png": "\n" - }, - "metadata": {} - } - ], - "source": [ - "cfg = QlearningConfig()\n", - "env = gym.make(\"CliffWalking-v0\") # 0 up, 1 right, 2 down, 3 left\n", - "env = CliffWalkingWapper(env)\n", - "action_dim = env.action_space.n\n", - "agent = QLearning(action_dim,cfg)\n", - "rewards,ma_rewards = train(cfg,env,agent)\n", - "plot_rewards(rewards,ma_rewards,tag=\"train\",algo = \"On-Policy First-Visit MC Control\",save=False)" - ] - } - ] -} \ No newline at end of file diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/models/Qleaning_model.pkl 
b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/models/Qleaning_model.pkl new file mode 100644 index 0000000..cd4f55b Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/models/Qleaning_model.pkl differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_ma_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_ma_rewards.npy new file mode 100644 index 0000000..a67d064 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_ma_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards.npy new file mode 100644 index 0000000..6de67e1 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards_curve.png b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards_curve.png new file mode 100644 index 0000000..776685a Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/eval_rewards_curve.png differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_ma_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_ma_rewards.npy new file mode 100644 index 0000000..c18ceaf Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_ma_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards.npy new file mode 100644 index 0000000..fb83837 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards.npy differ diff --git 
a/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards_curve.png b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards_curve.png new file mode 100644 index 0000000..d737d7f Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-165825/results/train_rewards_curve.png differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/models/Qleaning_model.pkl b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/models/Qleaning_model.pkl new file mode 100644 index 0000000..5d963e7 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/models/Qleaning_model.pkl differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_ma_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_ma_rewards.npy new file mode 100644 index 0000000..a67d064 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_ma_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards.npy new file mode 100644 index 0000000..6de67e1 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards_curve.png b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards_curve.png new file mode 100644 index 0000000..2992927 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/eval_rewards_curve.png differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_ma_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_ma_rewards.npy new file mode 100644 index 0000000..13d68d5 Binary files /dev/null and 
b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_ma_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards.npy b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards.npy new file mode 100644 index 0000000..c504109 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards.npy differ diff --git a/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards_curve.png b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards_curve.png new file mode 100644 index 0000000..9f0ed62 Binary files /dev/null and b/codes/QLearning/outputs/CliffWalking-v0/20210429-170453/results/train_rewards_curve.png differ diff --git a/codes/QLearning/results/20210326-171621/ma_rewards_train.npy b/codes/QLearning/results/20210326-171621/ma_rewards_train.npy deleted file mode 100644 index 0f842f2..0000000 Binary files a/codes/QLearning/results/20210326-171621/ma_rewards_train.npy and /dev/null differ diff --git a/codes/QLearning/results/20210326-171621/rewards_curve_train.png b/codes/QLearning/results/20210326-171621/rewards_curve_train.png deleted file mode 100644 index 985b8c7..0000000 Binary files a/codes/QLearning/results/20210326-171621/rewards_curve_train.png and /dev/null differ diff --git a/codes/QLearning/results/20210326-171621/rewards_train.npy b/codes/QLearning/results/20210326-171621/rewards_train.npy deleted file mode 100644 index ed8f524..0000000 Binary files a/codes/QLearning/results/20210326-171621/rewards_train.npy and /dev/null differ diff --git a/codes/QLearning/saved_model/20210326-171621/Qleaning_model.pkl b/codes/QLearning/saved_model/20210326-171621/Qleaning_model.pkl deleted file mode 100644 index 47e7279..0000000 Binary files a/codes/QLearning/saved_model/20210326-171621/Qleaning_model.pkl and /dev/null differ diff --git a/codes/QLearning/task0_eval.py 
b/codes/QLearning/task0_eval.py
new file mode 100644
index 0000000..851cfe6
--- /dev/null
+++ b/codes/QLearning/task0_eval.py
@@ -0,0 +1,84 @@
+#!/usr/bin/env python
+# coding=utf-8
+'''
+Author: John
+Email: johnjim0816@gmail.com
+Date: 2020-09-11 23:03:00
+LastEditor: John
+LastEditTime: 2021-04-29 17:01:43
+Discription:
+Environment:
+'''
+import sys,os
+curr_path = os.path.dirname(__file__)
+parent_path=os.path.dirname(curr_path)
+sys.path.append(parent_path) # add current terminal path to sys.path
+
+import gym
+import datetime
+
+from envs.gridworld_env import CliffWalkingWapper
+from QLearning.agent import QLearning
+from common.plot import plot_rewards
+from common.utils import save_results
+
+curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time
+
+class QlearningConfig:
+    '''Training-related parameters'''
+    def __init__(self):
+        self.algo = 'Qlearning'
+        self.env = 'CliffWalking-v0' # 0 up, 1 right, 2 down, 3 left
+        self.result_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
+        self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
+        self.train_eps = 300 # number of training episodes
+        self.eval_eps = 30
+        self.gamma = 0.9 # reward discount factor
+        self.epsilon_start = 0.95 # initial epsilon of the e-greedy policy
+        self.epsilon_end = 0.01 # final epsilon of the e-greedy policy
+        self.epsilon_decay = 200 # decay rate of epsilon in the e-greedy policy
+        self.lr = 0.1 # learning rate
+
+def env_agent_config(cfg,seed=1):
+    env = gym.make(cfg.env)
+    env = CliffWalkingWapper(env)
+    env.seed(seed)
+    state_dim = env.observation_space.n
+    action_dim = env.action_space.n
+    agent = QLearning(state_dim,action_dim,cfg)
+    return env,agent
+
+def eval(cfg,env,agent):
+    # env = gym.make("FrozenLake-v0", is_slippery=False) # 0 left, 1 down, 2 right, 3 up
+    # env = FrozenLakeWapper(env)
+    rewards = [] # rewards of all episodes
+    ma_rewards = [] # moving-average reward
+    for i_ep in range(cfg.eval_eps):
+        ep_reward = 0 # reward accumulated in the current episode
+        state = env.reset() # reset the environment, i.e. start a new episode
+        while True:
+            action = agent.predict(state) # choose an action according to the algorithm
+            next_state, reward, done, _ = env.step(action) # take one step in the environment
+            state = next_state # store the latest observation
+            ep_reward += reward
+            if done:
+                break
+        rewards.append(ep_reward)
+        if ma_rewards:
+            ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
+        else:
+            ma_rewards.append(ep_reward)
+        print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
+    return rewards,ma_rewards
+
+if __name__ == "__main__":
+    cfg = QlearningConfig()
+    env,agent = env_agent_config(cfg,seed=15)
+    cfg.model_path = './'+'QLearning/outputs/CliffWalking-v0/20210429-165825/models'+'/'
+    cfg.result_path = './'+'QLearning/outputs/CliffWalking-v0/20210429-165825/results'+'/'
+    agent.load(path=cfg.model_path)
+    rewards,ma_rewards = eval(cfg,env,agent)
+    save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
+    plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)
+
+
diff --git a/codes/QLearning/task0_train.ipynb b/codes/QLearning/task0_train.ipynb
new file mode 100644
index 0000000..3862439
--- /dev/null
+++ b/codes/QLearning/task0_train.ipynb
@@ -0,0 +1,230 @@
+{
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.10"
+  },
+  "orig_nbformat": 2,
+  "kernelspec": {
+   "name": "python3710jvsc74a57bd0fbea1422c2cf61ed9c0cfc03f38f71cc9083cc288606edc4170b5309b352ce27",
+   "display_name": "Python 3.7.10 64-bit ('py37': conda)"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2,
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "from pathlib import Path\n",
+    "curr_path = str(Path().absolute())\n",
+    "parent_path = str(Path().absolute().parent)\n",
+    "sys.path.append(parent_path) # add current terminal 
path to sys.path\n",
+    "\n",
+    "import gym\n",
+    "import datetime\n",
+    "\n",
+    "from envs.gridworld_env import CliffWalkingWapper\n",
+    "from QLearning.agent import QLearning\n",
+    "from common.plot import plot_rewards\n",
+    "from common.utils import save_results,make_dir\n",
+    "curr_time = datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\") # obtain current time"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class QlearningConfig:\n",
+    "    '''Training-related parameters'''\n",
+    "    def __init__(self):\n",
+    "        self.algo = 'Qlearning'\n",
+    "        self.env = 'CliffWalking-v0' # 0 up, 1 right, 2 down, 3 left\n",
+    "        self.result_path = curr_path+\"/outputs/\" +self.env+'/'+curr_time+'/results/' # path to save results\n",
+    "        self.model_path = curr_path+\"/outputs/\" +self.env+'/'+curr_time+'/models/' # path to save models\n",
+    "        self.train_eps = 300 # number of training episodes\n",
+    "        self.eval_eps = 30\n",
+    "        self.gamma = 0.9 # reward discount factor\n",
+    "        self.epsilon_start = 0.95 # initial epsilon of the e-greedy policy\n",
+    "        self.epsilon_end = 0.01 # final epsilon of the e-greedy policy\n",
+    "        self.epsilon_decay = 200 # decay rate of epsilon in the e-greedy policy\n",
+    "        self.lr = 0.1 # learning rate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def env_agent_config(cfg,seed=1):\n",
+    "    env = gym.make(cfg.env) \n",
+    "    env = CliffWalkingWapper(env)\n",
+    "    env.seed(seed)\n",
+    "    state_dim = env.observation_space.n\n",
+    "    action_dim = env.action_space.n\n",
+    "    agent = QLearning(state_dim,action_dim,cfg)\n",
+    "    return env,agent"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train(cfg,env,agent):\n",
+    "    rewards = [] \n",
+    "    ma_rewards = [] # moving average reward\n",
+    "    for i_ep in range(cfg.train_eps):\n",
+    "        ep_reward = 0 # reward accumulated in the current episode\n",
+    "        state = env.reset() # reset the environment, i.e. start a new episode\n",
+    "        while True:\n",
+    "            action = agent.choose_action(state) # choose an action according to the algorithm\n",
+    "            next_state, reward, done, _ = env.step(action) # take one step in the environment\n",
+    "            agent.update(state, action, reward, next_state, done) # Q-learning update\n",
+    "            state = next_state # store the latest observation\n",
+    "            ep_reward += reward\n",
+    "            if done:\n",
+    "                break\n",
+    "        rewards.append(ep_reward)\n",
+    "        if ma_rewards:\n",
+    "            ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)\n",
+    "        else:\n",
+    "            ma_rewards.append(ep_reward)\n",
+    "        if (i_ep+1)%10==0:\n",
+    "            print(\"Episode:{}/{}: reward:{:.1f}\".format(i_ep+1, cfg.train_eps,ep_reward))\n",
+    "    return rewards,ma_rewards"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def eval(cfg,env,agent):\n",
+    "    # env = gym.make(\"FrozenLake-v0\", is_slippery=False) # 0 left, 1 down, 2 right, 3 up\n",
+    "    # env = FrozenLakeWapper(env)\n",
+    "    rewards = [] # rewards of all episodes\n",
+    "    ma_rewards = [] # moving-average reward\n",
+    "    for i_ep in range(cfg.eval_eps):\n",
+    "        ep_reward = 0 # reward accumulated in the current episode\n",
+    "        state = env.reset() # reset the environment, i.e. start a new episode\n",
+    "        while True:\n",
+    "            action = agent.predict(state) # choose an action according to the algorithm\n",
+    "            next_state, reward, done, _ = env.step(action) # take one step in the environment\n",
+    "            state = next_state # store the latest observation\n",
+    "            ep_reward += reward\n",
+    "            if done:\n",
+    "                break\n",
+    "        rewards.append(ep_reward)\n",
+    "        if ma_rewards:\n",
+    "            ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)\n",
+    "        else:\n",
+    "            ma_rewards.append(ep_reward)\n",
+    "        if (i_ep+1)%10==0:\n",
+    "            print(f\"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}\")\n",
+    "    return rewards,ma_rewards"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "Episode:10/300: reward:-158.0\n",
+      "Episode:20/300: reward:-131.0\n",
+      "Episode:30/300: reward:-37.0\n",
+      "Episode:40/300: reward:-93.0\n",
+      "Episode:50/300: reward:-47.0\n",
+      "Episode:60/300: 
reward:-67.0\n", + "Episode:70/300: reward:-56.0\n", + "Episode:80/300: reward:-44.0\n", + "Episode:90/300: reward:-41.0\n", + "Episode:100/300: reward:-61.0\n", + "Episode:110/300: reward:-52.0\n", + "Episode:120/300: reward:-14.0\n", + "Episode:130/300: reward:-44.0\n", + "Episode:140/300: reward:-31.0\n", + "Episode:150/300: reward:-17.0\n", + "Episode:160/300: reward:-35.0\n", + "Episode:170/300: reward:-34.0\n", + "Episode:180/300: reward:-16.0\n", + "Episode:190/300: reward:-20.0\n", + "Episode:200/300: reward:-25.0\n", + "Episode:210/300: reward:-13.0\n", + "Episode:220/300: reward:-16.0\n", + "Episode:230/300: reward:-20.0\n", + "Episode:240/300: reward:-27.0\n", + "Episode:250/300: reward:-17.0\n", + "Episode:260/300: reward:-14.0\n", + "Episode:270/300: reward:-15.0\n", + "Episode:280/300: reward:-20.0\n", + "Episode:290/300: reward:-13.0\n", + "Episode:300/300: reward:-13.0\n", + "results saved!\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": "
", + "image/svg+xml": "\n\n\n\n \n \n \n \n 2021-04-29T17:04:54.671110\n image/svg+xml\n \n \n Matplotlib v3.3.4, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", + "image/png": "\n" + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Episode:10/30, reward:-13.0\nEpisode:20/30, reward:-13.0\nEpisode:30/30, reward:-13.0\nresults saved!\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": "
", + "image/svg+xml": "\n\n\n\n \n \n \n \n 2021-04-29T17:04:55.053953\n image/svg+xml\n \n \n Matplotlib v3.3.4, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "cfg = QlearningConfig()\n", + "env,agent = env_agent_config(cfg,seed=1)\n", + "rewards,ma_rewards = train(cfg,env,agent)\n", + "make_dir(cfg.result_path,cfg.model_path)\n", + "agent.save(path=cfg.model_path)\n", + "save_results(rewards,ma_rewards,tag='train',path=cfg.result_path)\n", + "plot_rewards(rewards,ma_rewards,tag=\"train\",env=cfg.env,algo = cfg.algo,path=cfg.result_path)\n", + "\n", + "env,agent = env_agent_config(cfg,seed=10)\n", + "agent.load(path=cfg.model_path)\n", + "rewards,ma_rewards = eval(cfg,env,agent)\n", + 
"save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)\n", + "plot_rewards(rewards,ma_rewards,tag=\"eval\",env=cfg.env,algo = cfg.algo,path=cfg.result_path)" + ] + } + ] +} \ No newline at end of file diff --git a/codes/QLearning/main.py b/codes/QLearning/task0_train.py similarity index 56% rename from codes/QLearning/main.py rename to codes/QLearning/task0_train.py index 0892bee..73fedae 100644 --- a/codes/QLearning/main.py +++ b/codes/QLearning/task0_train.py @@ -5,11 +5,10 @@ Author: John Email: johnjim0816@gmail.com Date: 2020-09-11 23:03:00 LastEditor: John -LastEditTime: 2021-03-31 18:14:59 +LastEditTime: 2021-04-29 17:01:08 Discription: Environment: ''' - import sys,os curr_path = os.path.dirname(__file__) parent_path=os.path.dirname(curr_path) @@ -18,40 +17,41 @@ sys.path.append(parent_path) # add current terminal path to sys.path import gym import datetime -from envs.gridworld_env import CliffWalkingWapper, FrozenLakeWapper +from envs.gridworld_env import CliffWalkingWapper from QLearning.agent import QLearning from common.plot import plot_rewards -from common.utils import save_results - -SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time -SAVED_MODEL_PATH = curr_path+"/saved_model/"+SEQUENCE+'/' # path to save model -if not os.path.exists(curr_path+"/saved_model/"): - os.mkdir(curr_path+"/saved_model/") -if not os.path.exists(SAVED_MODEL_PATH): - os.mkdir(SAVED_MODEL_PATH) -RESULT_PATH = curr_path+"/results/"+SEQUENCE+'/' # path to save rewards -if not os.path.exists(curr_path+"/results/"): - os.mkdir(curr_path+"/results/") -if not os.path.exists(RESULT_PATH): - os.mkdir(RESULT_PATH) +from common.utils import save_results,make_dir +curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # obtain current time class QlearningConfig: '''训练相关参数''' def __init__(self): - self.train_eps = 200 # 训练的episode数目 + self.algo = 'Qlearning' + self.env = 'CliffWalking-v0' # 0 up, 1 right, 2 down, 3 left + self.result_path 
= curr_path+"/outputs/" +self.env+'/'+curr_time+'/results/' # path to save results
+        self.model_path = curr_path+"/outputs/" +self.env+'/'+curr_time+'/models/' # path to save models
+        self.train_eps = 300 # number of training episodes
+        self.eval_eps = 30
         self.gamma = 0.9 # reward discount factor
-        self.epsilon_start = 0.99 # initial epsilon of the e-greedy policy
+        self.epsilon_start = 0.95 # initial epsilon of the e-greedy policy
         self.epsilon_end = 0.01 # final epsilon of the e-greedy policy
         self.epsilon_decay = 200 # decay rate of epsilon in the e-greedy policy
         self.lr = 0.1 # learning rate
+
+def env_agent_config(cfg,seed=1):
+    env = gym.make(cfg.env)
+    env = CliffWalkingWapper(env)
+    env.seed(seed)
+    state_dim = env.observation_space.n
+    action_dim = env.action_space.n
+    agent = QLearning(state_dim,action_dim,cfg)
+    return env,agent
 
 def train(cfg,env,agent):
     rewards = []
     ma_rewards = [] # moving average reward
-    steps = [] # steps of all episodes
-    for i_episode in range(cfg.train_eps):
+    for i_ep in range(cfg.train_eps):
         ep_reward = 0 # reward accumulated in the current episode
-        ep_steps = 0 # number of steps taken in the current episode
         state = env.reset() # reset the environment, i.e. start a new episode
         while True:
             action = agent.choose_action(state) # choose an action according to the algorithm
@@ -59,55 +59,52 @@ def train(cfg,env,agent):
             agent.update(state, action, reward, next_state, done) # Q-learning update
             state = next_state # store the latest observation
             ep_reward += reward
-            ep_steps += 1 # count steps
             if done:
                 break
-        steps.append(ep_steps)
         rewards.append(ep_reward)
         if ma_rewards:
             ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
         else:
             ma_rewards.append(ep_reward)
-        print("Episode:{}/{}: reward:{:.1f}".format(i_episode+1, cfg.train_eps,ep_reward))
+        print("Episode:{}/{}: reward:{:.1f}".format(i_ep+1, cfg.train_eps,ep_reward))
     return rewards,ma_rewards
-
+
 def eval(cfg,env,agent):
     # env = gym.make("FrozenLake-v0", is_slippery=False) # 0 left, 1 down, 2 right, 3 up
     # env = FrozenLakeWapper(env)
     rewards = [] # rewards of all episodes
     ma_rewards = [] # moving-average reward
-    steps = [] # steps of all episodes
-    for i_episode in range(cfg.train_eps):
+    for i_ep in range(cfg.eval_eps):
         ep_reward = 0 # reward accumulated in the current episode
-        ep_steps = 0 # number of steps taken in the current episode
         state = env.reset() # reset the environment, i.e. start a new episode
         while True:
-            action = agent.choose_action(state) # choose an action according to the algorithm
+            action = agent.predict(state) # choose an action according to the algorithm
             next_state, reward, done, _ = env.step(action) # take one step in the environment
             state = next_state # store the latest observation
             ep_reward += reward
-            ep_steps += 1 # count steps
             if done:
                 break
-        steps.append(ep_steps)
         rewards.append(ep_reward)
-        # compute the moving-average reward
         if ma_rewards:
-            ma_rewards.append(rewards[-1]*0.9+ep_reward*0.1)
+            ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
         else:
             ma_rewards.append(ep_reward)
-        print("Episode:{}/{}: reward:{:.1f}".format(i_episode+1, cfg.train_eps,ep_reward))
+        print(f"Episode:{i_ep+1}/{cfg.eval_eps}, reward:{ep_reward:.1f}")
     return rewards,ma_rewards
 
 if __name__ == "__main__":
     cfg = QlearningConfig()
-    env = gym.make("CliffWalking-v0") # 0 up, 1 right, 2 down, 3 left
-    env = CliffWalkingWapper(env)
-    action_dim = env.action_space.n
-    agent = QLearning(action_dim,cfg)
+    env,agent = env_agent_config(cfg,seed=1)
     rewards,ma_rewards = train(cfg,env,agent)
-    agent.save(path=SAVED_MODEL_PATH)
-    save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
-    plot_rewards(rewards,ma_rewards,tag="train",algo = "On-Policy First-Visit MC Control",path=RESULT_PATH)
+    make_dir(cfg.result_path,cfg.model_path)
+    agent.save(path=cfg.model_path)
+    save_results(rewards,ma_rewards,tag='train',path=cfg.result_path)
+    plot_rewards(rewards,ma_rewards,tag="train",env=cfg.env,algo = cfg.algo,path=cfg.result_path)
+
+    env,agent = env_agent_config(cfg,seed=10)
+    agent.load(path=cfg.model_path)
+    rewards,ma_rewards = eval(cfg,env,agent)
+    save_results(rewards,ma_rewards,tag='eval',path=cfg.result_path)
+    plot_rewards(rewards,ma_rewards,tag="eval",env=cfg.env,algo = cfg.algo,path=cfg.result_path)
 
 
diff --git a/codes/README.md b/codes/README.md
index 2f51e2a..c392ed3 100644
--- a/codes/README.md
+++ b/codes/README.md
@@ -27,26 +27,25 @@ python 3.7, pytorch 1.6.0-1.7.1, gym 
0.17.0-0.18.0
 
 ## Algorithm progress
 
-| Algorithm | Related papers and materials | Environment | Notes |
-| :--------------------------------------: | :----------------------------------------------------------: | ------------------------------------- | :--------------------------------: |
-| [On-Policy First-Visit MC](./MonteCarlo) | | [Racetrack](./envs/racetrack_env.md) | |
-| [Q-Learning](./QLearning) | | [CliffWalking-v0](./envs/gym_info.md) | |
-| [Sarsa](./Sarsa) | | [Racetrack](./envs/racetrack_env.md) | |
-| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), [Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
-| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | Uses a CNN instead of the fully connected network used in DQN |
-| [DoubleDQN](./DoubleDQN) | | [CartPole-v0](./envs/gym_info.md) | |
-| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
-| [PolicyGradient](./PolicyGradient) | | [CartPole-v0](./envs/gym_info.md) | |
-| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
-| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
-| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
-| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
-| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | HalfCheetah-v2 | |
-
+| Algorithm | Related papers and materials | Environment | Notes |
+| :--------------------------------------: | :----------------------------------------------------------: | ----------------------------------------- | :--------------------------------: |
+| [On-Policy First-Visit MC](./MonteCarlo) | [medium blog](https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a) | [Racetrack](./envs/racetrack_env.md) | |
+| [Q-Learning](./QLearning) | [towardsdatascience blog](https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56), [q learning paper](https://ieeexplore.ieee.org/document/8836506) | [CliffWalking-v0](./envs/gym_info.md) | |
+| [Sarsa](./Sarsa) | [geeksforgeeks blog](https://www.geeksforgeeks.org/sarsa-reinforcement-learning/) | [Racetrack](./envs/racetrack_env.md) | |
+| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), [Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
+| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | Uses a CNN instead of the fully connected network used in DQN |
+| [DoubleDQN](./DoubleDQN) | [DoubleDQN Paper](https://arxiv.org/abs/1509.06461) | [CartPole-v0](./envs/gym_info.md) | |
+| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
+| [PolicyGradient](./PolicyGradient) | [Lil'log](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) | [CartPole-v0](./envs/gym_info.md) | |
+| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
+| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
+| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
+| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
+| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | [HalfCheetah-v2](./envs/mujoco_info.md) | |
 
 
 ## Refs
 
 [RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2)
 
-[RL-Adventure](https://github.com/higgsfield/RL-Adventure)
+[RL-Adventure](https://github.com/higgsfield/RL-Adventure)
\ No newline at end of file
diff --git a/codes/README_en.md b/codes/README_en.md
index 95a6455..5e9a30c 100644
--- a/codes/README_en.md
+++ b/codes/README_en.md
@@ -30,25 +30,26 @@ similar to file with ```eval```, which means to evaluate the agent.
 
 ## Schedule
 
-| Name | Related materials | Used Envs | Notes |
-| :--------------------------------------: | :----------------------------------------------------------: | ------------------------------------- | :---: |
-| [On-Policy First-Visit MC](./MonteCarlo) | | [Racetrack](./envs/racetrack_env.md) | |
-| [Q-Learning](./QLearning) | | [CliffWalking-v0](./envs/gym_info.md) | |
-| [Sarsa](./Sarsa) | | [Racetrack](./envs/racetrack_env.md) | |
-| [DQN](./DQN) | [DQN-paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), [Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
-| [DQN-cnn](./DQN_cnn) | [DQN-paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | |
-| [DoubleDQN](./DoubleDQN) | | [CartPole-v0](./envs/gym_info.md) | |
-| [Hierarchical DQN](HierarchicalDQN) | [Hierarchical DQN](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
-| [PolicyGradient](./PolicyGradient) | | [CartPole-v0](./envs/gym_info.md) | |
-| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
-| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | | |
-| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
-| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
-| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | HalfCheetah-v2 | |
+| Name | Related materials | Used Envs | Notes |
+| :--------------------------------------: | :----------------------------------------------------------: | ----------------------------------------- | :---: |
+| [On-Policy First-Visit MC](./MonteCarlo) | [medium blog](https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a) | [Racetrack](./envs/racetrack_env.md) | |
+| [Q-Learning](./QLearning) | [towardsdatascience blog](https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56), [q learning paper](https://ieeexplore.ieee.org/document/8836506) | [CliffWalking-v0](./envs/gym_info.md) | |
+| [Sarsa](./Sarsa) | [geeksforgeeks blog](https://www.geeksforgeeks.org/sarsa-reinforcement-learning/) | [Racetrack](./envs/racetrack_env.md) | |
+| [DQN](./DQN) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), [Nature DQN Paper](https://www.nature.com/articles/nature14236) | [CartPole-v0](./envs/gym_info.md) | |
+| [DQN-cnn](./DQN_cnn) | [DQN Paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | [CartPole-v0](./envs/gym_info.md) | Uses a CNN instead of the fully connected network used in DQN |
+| [DoubleDQN](./DoubleDQN) | [DoubleDQN Paper](https://arxiv.org/abs/1509.06461) | [CartPole-v0](./envs/gym_info.md) | |
+| [Hierarchical DQN](HierarchicalDQN) | [H-DQN Paper](https://arxiv.org/abs/1604.06057) | [CartPole-v0](./envs/gym_info.md) | |
+| [PolicyGradient](./PolicyGradient) | [Lil'log](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) | [CartPole-v0](./envs/gym_info.md) | |
+| [A2C](./A2C) | [A3C Paper](https://arxiv.org/abs/1602.01783) | [CartPole-v0](./envs/gym_info.md) | |
+| [SAC](./SAC) | [SAC Paper](https://arxiv.org/abs/1801.01290) | [Pendulum-v0](./envs/gym_info.md) | |
+| [PPO](./PPO) | [PPO paper](https://arxiv.org/abs/1707.06347) | [CartPole-v0](./envs/gym_info.md) | |
+| [DDPG](./DDPG) | [DDPG Paper](https://arxiv.org/abs/1509.02971) | [Pendulum-v0](./envs/gym_info.md) | |
+| [TD3](./TD3) | [TD3 Paper](https://arxiv.org/abs/1802.09477) | [HalfCheetah-v2](./envs/mujoco_info.md) | |
+
 
 
 ## Refs
 
 [RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2)
 
-[RL-Adventure](https://github.com/higgsfield/RL-Adventure)
+[RL-Adventure](https://github.com/higgsfield/RL-Adventure) \ No newline at end of file diff --git a/codes/common/plot.py b/codes/common/plot.py index 8bf1689..92e4d96 100644 --- a/codes/common/plot.py +++ b/codes/common/plot.py @@ -5,7 +5,7 @@ Author: John Email: johnjim0816@gmail.com Date: 2020-10-07 20:57:11 LastEditor: John -LastEditTime: 2021-04-28 10:13:21 +LastEditTime: 2021-04-29 15:41:48 Discription: Environment: ''' @@ -19,7 +19,7 @@ def plot_rewards(rewards,ma_rewards,tag="train",env='CartPole-v0',algo = "DQN",s plt.plot(ma_rewards,label='ma rewards') plt.legend() if save: - plt.savefig(path+"rewards_curve_{}".format(tag)) + plt.savefig(path+"{}_rewards_curve".format(tag)) plt.show() # def plot_rewards(dic,tag="train",env='CartPole-v0',algo = "DQN",save=True,path='./'): # sns.set() diff --git a/codes/common/utils.py b/codes/common/utils.py index d397c89..5d51eea 100644 --- a/codes/common/utils.py +++ b/codes/common/utils.py @@ -5,7 +5,7 @@ Author: John Email: johnjim0816@gmail.com Date: 2021-03-12 16:02:24 LastEditor: John -LastEditTime: 2021-04-13 18:34:20 +LastEditTime: 2021-04-29 15:32:38 Discription: Environment: ''' @@ -18,8 +18,8 @@ from pathlib import Path def save_results(rewards,ma_rewards,tag='train',path='./results'): '''保存reward等结果 ''' - np.save(path+'rewards_'+tag+'.npy', rewards) - np.save(path+'ma_rewards_'+tag+'.npy', ma_rewards) + np.save(path+'{}_rewards.npy'.format(tag), rewards) + np.save(path+'{}_ma_rewards.npy'.format(tag), ma_rewards) print('results saved!') def make_dir(*paths): diff --git a/codes/envs/assets/image-20210429150622353.png b/codes/envs/assets/image-20210429150622353.png new file mode 100644 index 0000000..1216b4c Binary files /dev/null and b/codes/envs/assets/image-20210429150622353.png differ diff --git a/codes/envs/assets/image-20210429150630806.png b/codes/envs/assets/image-20210429150630806.png new file mode 100644 index 0000000..45107d5 Binary files /dev/null and b/codes/envs/assets/image-20210429150630806.png 
differ
diff --git a/codes/envs/mujoco_info.md b/codes/envs/mujoco_info.md
new file mode 100644
index 0000000..aaa8cbb
--- /dev/null
+++ b/codes/envs/mujoco_info.md
@@ -0,0 +1,42 @@
+# MuJoCo
+
+MuJoCo (Multi-Joint dynamics with Contact) is a physics simulator that can be used for research such as robot control optimization. For installation, see [Installing MuJoCo and mujoco_py on Mac](https://blog.csdn.net/JohnJim0/article/details/115656392?spm=1001.2014.3001.5501).
+
+
+
+## HalfCheetah-v2
+
+
+
+This environment is built on the MuJoCo simulation engine. Its goal is to make a two-legged "cheetah" run as fast as possible (the image below comes from searching for HalfCheetah-v2; see https://gym.openai.com/envs/HalfCheetah-v2/).
+
+image-20210429150630806
+
+Action space: Box(6,). Each leg has three joints to control, six joints in total, and each joint's action range is [-1, 1].
+
+State space: Box(17,). It contains various state variables, each with range ![img](assets/9cd6ae68c9aad008ede4139da358ec26.svg), and mainly describes the cheetah's own posture and related information.
+
+Reward definition: the reward at each step depends on the cheetah's speed during that step and on the cost of its actions. The code that defines the reward is shown below.
+
+```python
+def step(self, action):
+    xposbefore = self.sim.data.qpos[0]
+    self.do_simulation(action, self.frame_skip)
+    xposafter = self.sim.data.qpos[0]
+    ob = self._get_obs()
+    reward_ctrl = - 0.1 * np.square(action).sum()
+    reward_run = (xposafter - xposbefore)/self.dt
+    # =========== reward ===========
+    reward = reward_ctrl + reward_run
+    # =========== reward ===========
+    done = False
+    return ob, reward, done, dict(reward_run=reward_run, reward_ctrl=reward_ctrl)
+```
+
+In the code above `done` is always `False`, so the environment itself never ends an episode when the cheetah loses its balance and falls over; an episode ends only when the time limit applied by gym is reached.
+
+However, this environment has some issues. After searching, I still could not find an upper bound on the reward of one episode, and in experiments a trained agent can run off the edge of the platform:
+
+image-20210429150622353
+
+In addition, training time was limited, so the reward kept rising slowly throughout training; I suspect this may be a gym bug.
\ No newline at end of file
diff --git a/codes/envs/snake/agent.py b/codes/envs/snake/agent.py
index e514dc3..b32de9d 100644
--- a/codes/envs/snake/agent.py
+++ b/codes/envs/snake/agent.py
@@ -78,7 +78,6 @@ class Agent:
         :param points: float, the current points from environment
         :param dead: boolean, if the snake is dead
         :return: the index of action. 0,1,2,3 indicates up,down,left,right separately
-        TODO: write your function here.
         Return the index of action the snake needs to take, according to the state and points known from environment.
         Tips: you need to discretize the state to the state space defined on the webpage first.
         (Note that [adjoining_wall_x=0, adjoining_wall_y=0] is also the case when snake runs out of the 480x480 board)