Update algorithm templates

2022-11-06 12:15:36 +08:00
parent 466a17707f
commit dc78698262
256 changed files with 17282 additions and 10229 deletions
@@ -1,4 +1,4 @@
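-## 0、Preface
+## 0. Preface
 This project is for learning basic RL algorithms. It is aimed mainly at RL beginners and at non-specialists who need to combine RL with their own work, and it strives for **detailed comments** and a **clear structure**.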
@@ -6,7 +6,7 @@
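 Future development plans include, but are not limited to, multi-agent algorithms, a reinforcement learning Python package, and a graphical programming platform for reinforcement learning.
-## 1、Project Description
+## 1. Project Description
 The project mainly consists of the following parts:
 * [Jupyter Notebook](./notebooks/): algorithms written as notebooks, with fairly detailed hands-on guidance; recommended for beginners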
@@ -18,7 +18,7 @@
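 * ```[algorithm_name].py```: the script that holds the algorithm, e.g. ```dqn.py```; each algorithm has some basic modules, such as ```Replay Buffer``` and ```MLP``` (multilayer perceptron);
 * ```task.py```: the script that holds the task, mainly the ```argparse```-based parameters and the training and testing functions; the training function ```train``` is designed to follow the pseudocode, so it is a good starting point for reading the code;
 * ```utils.py```: utility functions for saving results and plotting; in real projects or research, we recommend using ```Tensorboard``` to save results and then libraries such as ```matplotlib``` and ```seaborn``` for further plotting.
-## 2、Algorithm List
+## 2. Algorithm List
 Note: clicking a name jumps to the corresponding algorithm under [codes](./codes/); for other versions, please browse the repository yourself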
@@ -26,26 +26,27 @@
 | :-------------------------------------: | :----------------------------------------------------------: | :--: |
 | [Policy Gradient](codes/PolicyGradient) | [Policy Gradient paper](https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) |      |
 |                 DQN-CNN                 |                                                              | TBD  |
 |      [DoubleDQN](codes/DoubleDQN)       |     [Double DQN Paper](https://arxiv.org/abs/1509.06461)     |      |
 |          [SoftQ](codes/SoftQ)           |  [Soft Q-learning paper](https://arxiv.org/abs/1702.08165)   |      |
 |            [SAC](codes/SAC)             |      [SAC paper](https://arxiv.org/pdf/1812.05905.pdf)       |      |
 |        [SAC-Discrete](codes/SAC)        |  [SAC-Discrete paper](https://arxiv.org/pdf/1910.07207.pdf)  |      |
 |                  SAC-S                  |       [SAC-S paper](https://arxiv.org/abs/1801.01290)        |      |
 |                  DSAC                   | [DSAC paper](https://paperswithcode.com/paper/addressing-value-estimation-errors-in) | TBD  |
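-## 3、Algorithm Environments
+## 3. Algorithm Environments
 For a description of the algorithm environments, see [env](./codes/envs/README.md)
-## 4、Runtime Environment
+## 4. Runtime Environment
-Main dependencies: Python 3.7, PyTorch 1.10.0, Gym 0.21.0.
+Main dependencies: Python 3.7, PyTorch 1.10.0, Gym 0.25.2.
-### 4.1、Create a Conda Environment
+### 4.1. Create a Conda Environment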
 ```bash
 conda create -n easyrl python=3.7
 conda activate easyrl # activate the environment
 ```
-### 4.2、Install Torch
+### 4.2. Install Torch
 Install the CPU version:
 ```bash
@@ -63,30 +64,49 @@ conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit
 ```bash
 pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu113
 ```
-### 4.3、Install Other Dependencies
+### 4.3. Verify the CUDA Torch Installation
 Run in the project root:
 ```bash
 pip install -r requirements.txt
 ```
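 ### 4.4、Verify the CUDA Torch Installation
 If you installed the CPU version of Torch, skip this step. Otherwise run the following Python script; if it prints True, the CUDA version was installed successfully: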
 ```python
 import torch
 print(torch.cuda.is_available())
 ```
 ### 4.4. Install Gym
-## 5、Usage
+```bash
 pip install gym==0.25.2
 ```
 To use the Atari environments, additionally install:
-For [codes](./codes/):
+```bash
-* Run the ```main.py``` script
+pip install gym[atari,accept-rom-license]==0.25.2
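-* Run the corresponding Bash script under [scripts](codes\scripts), e.g. ```sh codes/scripts/DQN_task0.sh```. We recommend creating a conda environment named "easyrl"; otherwise the sh scripts must be edited accordingly. On Windows, we suggest installing Git (do not change the default install path, or VS Code may not show Git Bash) and using the Git Bash terminal rather than PowerShell or cmd!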
+```
 ### 4.5. Install Other Dependencies
 Run in the project root:
 ```bash
 pip install -r requirements.txt
 ```
 ## 5. Usage
 For [codes](./codes/), `cd` into the directory of the algorithm you want, e.g. `DQN`:
 ```bash
 python task0.py
 ```
 Or load a configuration file:
 ```bash
 python task0.py --yaml configs/CartPole-v1_DQN_Train.yaml
 ```
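 For [Jupyter Notebook](./notebooks/):
 * Just run the corresponding ipynb file
-## 6、Friendly Notes
+## 6. Friendly Notes
 We recommend VS Code for working on the project; for a quick start, see the [VSCode getting-started guide](https://blog.csdn.net/JohnJim0/article/details/126366454)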
@@ -38,13 +38,14 @@
 \clearpage
 \section{Spare Template}
 \begin{algorithm}[H] % [H] fixes the position
-    \floatname{algorithm}{{Algorithm}} 
+    \floatname{algorithm}{{Algorithm}\footnotemark[1]} 
 	\renewcommand{\thealgorithm}{} % remove the algorithm number
 	\caption{} 
 	\begin{algorithmic}[1] % [1] shows step numbers
 		\STATE Test
 	\end{algorithmic}
 \end{algorithm}
 \footnotetext[1]{Footnote}
 \clearpage
 \section{Q-learning Algorithm}
 \begin{algorithm}[H] % [H] fixes the position
@@ -55,7 +56,7 @@
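 		\STATE Initialize the Q-table $Q(s,a)$ arbitrarily, except that $Q(s_{terminal},\cdot)=0$, i.e. the Q-value of a terminal state is 0
 		\FOR {episode = $1,M$}
 			\STATE Reset the environment and obtain the initial state $s_1$
-			\FOR {time step = $1,t$}
+			\FOR {time step = $1,T$}
 				\STATE Sample action $a_t$ according to the $\varepsilon$-greedy policy
 				\STATE The environment returns reward $r_t$ and next state $s_{t+1}$ in response to $a_t$
 				\STATE {\bfseries Update the policy:}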
@@ -134,7 +135,7 @@
 		\STATE Initialize the policy parameters $\boldsymbol{\theta} \in \mathbb{R}^{d^{\prime}}$ (e.g., to $\mathbf{0}$)
 		\FOR {episode = $1,M$}
 			\STATE Sample the transitions of one (or several) episodes by following the policy $\pi(\cdot \mid \cdot, \boldsymbol{\theta})$
-			\FOR {time step = $1,t$}
+			\FOR {time step $t = 0,1,2,\cdots,T-1$}
 				\STATE Compute the return $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k}$
 				\STATE Update the policy $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\alpha \gamma^{t} G \nabla \ln \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)$
 			\ENDFOR
@@ -164,6 +165,65 @@
 \end{algorithm}
 \footnotetext[1]{Given how the TD error is defined, it is more convenient to compute the advantage backwards, from $t+1$ down to $1$}
 \clearpage
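 \section{PPO-Clip Algorithm}
 \begin{algorithm}[H] % [H] fixes the position
    \floatname{algorithm}{{PPO-Clip Algorithm}\footnotemark[1]\footnotemark[2]} 
 	\renewcommand{\thealgorithm}{} % remove the algorithm number
 	\caption{} 
 	\begin{algorithmic}[1] % [1] shows step numbers
 		\STATE Initialize the policy network (Actor) parameters $\theta$ and the value network (Critic) parameters $\phi$
 		\STATE Initialize the clip parameter $\epsilon$
 		\STATE Initialize the number of epochs $K$
 		\STATE Initialize the replay buffer $D$
 		\STATE Initialize the total time-step counter $c=0$
 		\FOR {episode = $1,2,\cdots,M$}
 			\STATE Reset the environment and obtain the initial state $s_1$
 			\FOR {time step $t = 1,2,\cdots,T$}
 				\STATE Increment the total time-step counter $c \leftarrow c+1$
 				\STATE Select $a_t$ according to the policy $\pi_{\theta}$
 				\STATE The environment returns reward $r_t$ and next state $s_{t+1}$ in response to $a_t$
 				\STATE Store $(s_t,a_t,r_t,s_{t+1})$ in the replay buffer $D$
 				\IF{$c$ is divisible by $C$\footnotemark[3]}
 					\FOR {$k= 1,2,\cdots,K$}
 						\STATE Test
 					\ENDFOR
 					\STATE Clear the replay buffer $D$
 				\ENDIF
 			\ENDFOR
 		\ENDFOR
 	\end{algorithmic}
 \end{algorithm}
 \footnotetext[1]{Proximal Policy Optimization Algorithms}
 \footnotetext[2]{https://spinningup.openai.com/en/latest/algorithms/ppo.html}
 \footnotetext[3]{\bfseries i.e. the policy is updated every $C$ time steps}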
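 % Editorial sketch, not part of the original template: the inner loop over $k$ is still a placeholder.
 % For reference only, the clipped surrogate objective from the paper cited in footnote 1 is written out below.
 % The symbol $\rho_t$ is our own notation for the probability ratio (the paper writes $r_t(\theta)$, which would clash with the reward $r_t$ used in the pseudocode),
 % and the numbers in the worked example are illustrative assumptions.
 \noindent Clipped surrogate objective, with $\hat{A}_t$ an advantage estimate:
 \[
 L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(\rho_t(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right],\qquad
 \rho_t(\theta)=\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}
 \]
 \noindent Worked example: with $\epsilon=0.2$, $\rho_t=1.5$ and $\hat{A}_t=2$, the unclipped term is $1.5\times 2=3$ and the clipped term is $1.2\times 2=2.4$, so the sample contributes $\min(3,2.4)=2.4$.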
 \clearpage
 \section{DDPG Algorithm}
 \begin{algorithm}[H] % [H] fixes the position
    \floatname{algorithm}{{DDPG Algorithm}\footnotemark[1]} 
 	\renewcommand{\thealgorithm}{} % remove the algorithm number
 	\caption{} 
 	\begin{algorithmic}[1] % [1] shows step numbers
 		\STATE Initialize the critic network $Q\left(s, a \mid \theta^Q\right)$ and the actor network $\mu(s|\theta^{\mu})$ with parameters $\theta^Q$ and $\theta^{\mu}$
 		\STATE Initialize the corresponding target network parameters: $\theta^{Q^{\prime}} \leftarrow \theta^Q, \theta^{\mu^{\prime}} \leftarrow \theta^\mu$
 		\STATE Initialize the replay buffer $R$
 		\FOR {episode = $1,M$}
 			\STATE Reset the environment and obtain the initial state $s_1$
 			\FOR {time step $t = 1,T$}
 				\STATE Select action $a_t=\mu\left(s_t \mid \theta^\mu\right)+\mathcal{N}_t$, where $\mathcal{N}_t$ is the exploration noise
 				\STATE The environment returns reward $r_t$ and next state $s_{t+1}$ in response to $a_t$
 				\STATE Store the transition $(s_t,a_t,r_t,s_{t+1})$ in the replay buffer $R$
 				\STATE Update the environment state $s_t \leftarrow s_{t+1}$
 				\STATE {\bfseries Update the policy:}
 				\STATE Sample a random minibatch of $(s_i,a_i,r_i,s_{i+1})$ from $R$
 				\STATE Compute $y_i=r_i+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)$
 				\STATE Update the critic parameters by minimizing the loss $L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2$
 				\STATE Update the actor parameters: $\left.\left.\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s_i}$
 				\STATE Soft-update the target networks: $\theta^{Q^{\prime}} \leftarrow \tau \theta^Q+(1-\tau) \theta^{Q^{\prime}}$,
 				$\theta^{\mu^{\prime}} \leftarrow \tau \theta^\mu+(1-\tau) \theta^{\mu^{\prime}}$
 			\ENDFOR
 		\ENDFOR
 	\end{algorithmic}
 \end{algorithm}
 \footnotetext[1]{Continuous control with deep reinforcement learning}
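 % Editorial sketch, not part of the original template: a worked instance of the soft target update in the last step of the pseudocode.
 % The values $\tau=0.01$, $\theta^{Q}=1$ and $\theta^{Q^{\prime}}=0$ are illustrative assumptions for a single scalar weight.
 \noindent Worked example of the soft update for one scalar weight, with $\tau=0.01$, $\theta^{Q}=1$ held fixed, and $\theta^{Q^{\prime}}=0$ initially:
 \[
 \theta^{Q^{\prime}} \leftarrow 0.01\times 1+0.99\times 0=0.01,\qquad
 \theta^{Q^{\prime}} \leftarrow 0.01\times 1+0.99\times 0.01=0.0199
 \]
 \noindent In general, after $n$ updates from this starting point $\theta^{Q^{\prime}}=\left(1-(1-\tau)^n\right)\theta^{Q}$, so a small $\tau$ makes the target network trail the online network slowly.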
 \clearpage
 \section{SoftQ Algorithm}
 \begin{algorithm}[H]
@@ -0,0 +1,7 @@
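 ## Script Descriptions
 * `task0.py`: discrete-action task
 * `task1.py`: discrete-action task; the only difference from `task0.py` is that the Actor uses tanh instead of relu as its activation, which works better on `CartPole-v1`
 * `task2.py`: continuous-action task, #TODO needs debugging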
@@ -0,0 +1,24 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  load_checkpoint: true
  load_path: Train_CartPole-v1_A2C_20221030-211435
  max_steps: 200
  mode: test
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  actor_hidden_dim: 256
  actor_lr: 0.0003
  batch_size: 64
  buffer_size: 100000
  critic_hidden_dim: 256
  critic_lr: 0.001
  gamma: 0.99
  hidden_dim: 256
  target_update: 4
@@ -0,0 +1,23 @@
 2022-10-30 21:25:53 - r - INFO: - n_states: 4, n_actions: 2
 2022-10-30 21:25:55 - r - INFO: - Start testing!
 2022-10-30 21:25:55 - r - INFO: - Env: CartPole-v1, Algorithm: A2C, Device: cuda
 2022-10-30 21:25:56 - r - INFO: - Episode: 1/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 2/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 3/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 4/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 5/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 6/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 7/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 8/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 9/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:56 - r - INFO: - Episode: 10/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 11/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 12/20, Reward: 190.0, Step: 190
 2022-10-30 21:25:57 - r - INFO: - Episode: 13/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 14/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 15/20, Reward: 96.0, Step: 96
 2022-10-30 21:25:57 - r - INFO: - Episode: 16/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 17/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 18/20, Reward: 200.0, Step: 200
 2022-10-30 21:25:57 - r - INFO: - Episode: 19/20, Reward: 112.0, Step: 112
 2022-10-30 21:25:57 - r - INFO: - Episode: 20/20, Reward: 200.0, Step: 200
@@ -0,0 +1,21 @@
 episodes,rewards,steps
 0,200.0,200
 1,200.0,200
 2,200.0,200
 3,200.0,200
 4,200.0,200
 5,200.0,200
 6,200.0,200
 7,200.0,200
 8,200.0,200
 9,200.0,200
 10,200.0,200
 11,190.0,190
 12,200.0,200
 13,200.0,200
 14,96.0,96
 15,200.0,200
 16,200.0,200
 17,200.0,200
 18,112.0,112
 19,200.0,200
@@ -0,0 +1,25 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  eval_per_episode: 5
  load_checkpoint: true
  load_path: Train_CartPole-v1_A2C_20221031-232138
  max_steps: 200
  mode: test
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  actor_hidden_dim: 256
  actor_lr: 0.0003
  batch_size: 64
  buffer_size: 100000
  critic_hidden_dim: 256
  critic_lr: 0.001
  gamma: 0.99
  hidden_dim: 256
  target_update: 4
@@ -0,0 +1,28 @@
 2022-10-31 23:33:16 - r - INFO: - n_states: 4, n_actions: 2
 2022-10-31 23:33:16 - r - INFO: - Actor model name: ActorSoftmaxTanh
 2022-10-31 23:33:16 - r - INFO: - Critic model name: Critic
 2022-10-31 23:33:16 - r - INFO: - ACMemory memory name: PGReplay
 2022-10-31 23:33:16 - r - INFO: - agent name: A2C
 2022-10-31 23:33:17 - r - INFO: - Start testing!
 2022-10-31 23:33:17 - r - INFO: - Env: CartPole-v1, Algorithm: A2C, Device: cuda
 2022-10-31 23:33:18 - r - INFO: - Episode: 1/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:18 - r - INFO: - Episode: 2/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:18 - r - INFO: - Episode: 3/20, Reward: 186.0, Step: 186
 2022-10-31 23:33:18 - r - INFO: - Episode: 4/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:18 - r - INFO: - Episode: 5/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 6/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 7/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 8/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 9/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 10/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 11/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 12/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 13/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 14/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 15/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 16/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 17/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 18/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:19 - r - INFO: - Episode: 19/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:20 - r - INFO: - Episode: 20/20, Reward: 200.0, Step: 200
 2022-10-31 23:33:20 - r - INFO: - Finish testing!
@@ -1,21 +1,21 @@
 episodes,rewards,steps
 0,200.0,200
 1,200.0,200
-2,93.0,93
+2,186.0,186
-3,155.0,155
+3,200.0,200
-4,116.0,116
+4,200.0,200
 5,200.0,200
-6,190.0,190
+6,200.0,200
-7,176.0,176
+7,200.0,200
 8,200.0,200
 9,200.0,200
 10,200.0,200
-11,179.0,179
+11,200.0,200
 12,200.0,200
-13,185.0,185
+13,200.0,200
-14,191.0,191
+14,200.0,200
 15,200.0,200
 16,200.0,200
-17,124.0,124
+17,200.0,200
 18,200.0,200
-19,172.0,172
+19,200.0,200
@@ -0,0 +1,23 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  load_checkpoint: false
  load_path: tasks
  max_steps: 200
  mode: train
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  actor_hidden_dim: 256
  actor_lr: 0.0003
  batch_size: 64
  buffer_size: 100000
  critic_hidden_dim: 256
  critic_lr: 0.001
  gamma: 0.99
  hidden_dim: 256
@@ -0,0 +1,24 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  eval_per_episode: 5
  load_checkpoint: false
  load_path: tasks
  max_steps: 200
  mode: train
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  actor_hidden_dim: 256
  actor_lr: 0.0003
  batch_size: 64
  buffer_size: 100000
  critic_hidden_dim: 256
  critic_lr: 0.001
  gamma: 0.99
  hidden_dim: 256
@@ -1,34 +1,79 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-08-16 23:05:25
 LastEditor: JiangJi
 LastEditTime: 2022-11-01 00:33:49
 Description: 
 '''
 import torch
 import numpy as np
-
+from torch.distributions import Categorical,Normal
 class A2C:
    def __init__(self,models,memories,cfg):
-        self.n_actions = cfg['n_actions']
+        self.n_actions = cfg.n_actions
-        self.gamma = cfg['gamma']
+        self.gamma = cfg.gamma
-        self.device = torch.device(cfg['device']) 
+        self.device = torch.device(cfg.device) 
        self.continuous = cfg.continuous
        if hasattr(cfg,'action_bound'):
            self.action_bound = cfg.action_bound
        self.memory = memories['ACMemory']
        self.actor = models['Actor'].to(self.device)
        self.critic = models['Critic'].to(self.device)
-        self.actor_optim = torch.optim.Adam(self.actor.parameters(), lr=cfg['actor_lr'])
+        self.actor_optim = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
-        self.critic_optim = torch.optim.Adam(self.critic.parameters(), lr=cfg['critic_lr'])
+        self.critic_optim = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)
    def sample_action(self,state):
-        state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
+        # state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
-        dist = self.actor(state)
+        # dist = self.actor(state)
-        value = self.critic(state) # note that 'dist' need require_grad=True
+        # self.entropy = - np.sum(np.mean(dist.detach().cpu().numpy()) * np.log(dist.detach().cpu().numpy()))
-        value = value.detach().numpy().squeeze(0)[0]
+        # value = self.critic(state) # note that 'dist' need require_grad=True
-        action = np.random.choice(self.n_actions, p=dist.detach().numpy().squeeze(0)) # shape(p=(n_actions,1)
+        # self.value = value.detach().cpu().numpy().squeeze(0)[0]
-        return action,value,dist 
+        # action = np.random.choice(self.n_actions, p=dist.detach().cpu().numpy().squeeze(0)) # shape(p=(n_actions,1)
        # self.log_prob = torch.log(dist.squeeze(0)[action])
        if self.continuous:
            state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
            mu, sigma = self.actor(state)
            dist = Normal(self.action_bound * mu.view(1,), sigma.view(1,))
            action = dist.sample()
            value = self.critic(state)
            # self.entropy = - np.sum(np.mean(dist.detach().cpu().numpy()) * np.log(dist.detach().cpu().numpy()))
            self.value = value.detach().cpu().numpy().squeeze(0)[0] # detach() to avoid gradient
            self.log_prob = dist.log_prob(action).squeeze(dim=0) # Tensor([0.])
            self.entropy = dist.entropy().cpu().detach().numpy().squeeze(0) # detach() to avoid gradient
            return action.cpu().detach().numpy()
        else:
            state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
            probs = self.actor(state)
            dist = Categorical(probs)
            action = dist.sample() # Tensor([0])
            value = self.critic(state)
            self.value = value.detach().cpu().numpy().squeeze(0)[0] # detach() to avoid gradient
            self.log_prob = dist.log_prob(action).squeeze(dim=0) # Tensor([0.])
            self.entropy = dist.entropy().cpu().detach().numpy().squeeze(0) # detach() to avoid gradient
            return action.cpu().numpy().item()
    @torch.no_grad()
    def predict_action(self,state):
-        state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
+        if self.continuous:
-        dist = self.actor(state)
+            state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
-        value = self.critic(state) # note that 'dist' need require_grad=True
+            mu, sigma = self.actor(state)
-        value = value.detach().numpy().squeeze(0)[0]
+            dist = Normal(self.action_bound * mu.view(1,), sigma.view(1,))
-        action = np.random.choice(self.n_actions, p=dist.detach().numpy().squeeze(0)) # shape(p=(n_actions,1)
+            action = dist.sample()
-        return action,value,dist 
+            return action.cpu().detach().numpy()
        else:
            state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
            dist = self.actor(state)
            # value = self.critic(state) # note that 'dist' need require_grad=True
            # value = value.detach().cpu().numpy().squeeze(0)[0]
             action = np.random.choice(self.n_actions, p=dist.detach().cpu().numpy().squeeze(0)) # p has shape (n_actions,)
            return action
    def update(self,next_state,entropy):
        value_pool,log_prob_pool,reward_pool = self.memory.sample()
        value_pool = torch.tensor(value_pool, device=self.device)
        log_prob_pool = torch.stack(log_prob_pool)
        next_state = torch.tensor(next_state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
        next_value = self.critic(next_state)
        returns = np.zeros_like(reward_pool)
@@ -36,9 +81,7 @@ class A2C:
             next_value = reward_pool[t] + self.gamma * next_value # G(s_{t},a_{t}) = r_{t+1} + gamma * V(s_{t+1})
            returns[t] = next_value
        returns = torch.tensor(returns, device=self.device)
-        value_pool = torch.tensor(value_pool, device=self.device)
         advantages = returns - value_pool
-        log_prob_pool = torch.stack(log_prob_pool)
        actor_loss = (-log_prob_pool * advantages).mean()
        critic_loss = 0.5 * advantages.pow(2).mean()
        tot_loss = actor_loss + critic_loss + 0.001 * entropy
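The hunk boundary above hides the loop header of the return computation in `update`. As a minimal sketch, assuming the loop walks the episode backwards and seeds the recursion with the critic's estimate for the state after the last step, the same recursion can be written in plain Python. The function names `bootstrapped_returns` and `advantages` are illustrative and do not appear in the repository.

```python
def bootstrapped_returns(rewards, last_value, gamma=0.99):
    """Discounted returns computed backwards, seeded with a value estimate."""
    returns = [0.0] * len(rewards)
    running = float(last_value)  # estimate of V for the state after the final step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns


def advantages(returns, values):
    """Advantage of each step: return minus the critic's value estimate."""
    return [g - v for g, v in zip(returns, values)]


rets = bootstrapped_returns([1.0, 1.0, 1.0], last_value=0.0, gamma=0.5)
print(rets)                                # [1.75, 1.5, 1.0]
print(advantages(rets, [1.0, 1.0, 1.0]))   # [0.75, 0.5, 0.0]
```

One detail worth checking in the original: `np.zeros_like(reward_pool)` inherits the dtype of the rewards, so an environment that returns integer rewards would have its discounted returns truncated on assignment. The sketch sidesteps this by always storing floats.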
@@ -1,14 +1,24 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-09-19 14:48:16
 LastEditor: JiangJi
 LastEditTime: 2022-10-30 01:21:50
 Description: #TODO: template to be updated
 '''
 import torch
 import numpy as np
 class A2C_2:
    def __init__(self,models,memories,cfg):
-        self.n_actions = cfg['n_actions']
+        self.n_actions = cfg.n_actions
-        self.gamma = cfg['gamma']
+        self.gamma = cfg.gamma
-        self.device = torch.device(cfg['device']) 
+        self.device = torch.device(cfg.device) 
        self.memory = memories['ACMemory']
        self.ac_net = models['ActorCritic'].to(self.device)
-        self.ac_optimizer = torch.optim.Adam(self.ac_net.parameters(), lr=cfg['lr'])
+        self.ac_optimizer = torch.optim.Adam(self.ac_net.parameters(), lr = cfg.lr)
    def sample_action(self,state):
        state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
        value, dist = self.ac_net(state) # note that 'dist' need require_grad=True
@@ -0,0 +1,21 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  mode: test
  load_checkpoint: true
  load_path: Train_CartPole-v1_A2C_20221031-232138
  max_steps: 200
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  continuous: false
  batch_size: 64
  buffer_size: 100000
  gamma: 0.99
  actor_lr: 0.0003
  critic_lr: 0.001
  target_update: 4
@@ -0,0 +1,19 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: CartPole-v1
  mode: train
  load_checkpoint: false
  load_path: Train_CartPole-v1_DQN_20221026-054757
  max_steps: 200
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 600
 algo_cfg:
  continuous: false
  batch_size: 64
  buffer_size: 100000
  gamma: 0.99
  lr: 0.001
@@ -0,0 +1,21 @@
 general_cfg:
  algo_name: A2C
  device: cuda
  env_name: Pendulum-v1
  mode: train
  eval_per_episode: 200
  load_checkpoint: false
  load_path: Train_CartPole-v1_DQN_20221026-054757
  max_steps: 200
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 20
  train_eps: 1000
 algo_cfg:
  continuous: true
  batch_size: 64
  buffer_size: 100000
  gamma: 0.99
  actor_lr: 0.0003
  critic_lr: 0.001
@@ -0,0 +1,38 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-10-30 00:53:03
 LastEditor: JiangJi
 LastEditTime: 2022-11-01 00:17:55
 Description: default parameters of A2C
 '''
 from common.config import GeneralConfig,AlgoConfig
 class GeneralConfigA2C(GeneralConfig):
    def __init__(self) -> None:
        self.env_name = "CartPole-v1" # name of environment
        self.algo_name = "A2C" # name of algorithm
        self.mode = "train" # train or test
        self.seed = 1 # random seed
        self.device = "cuda" # device to use
        self.train_eps = 1000 # number of episodes for training
        self.test_eps = 20 # number of episodes for testing
        self.max_steps = 200 # max steps for each episode
        self.load_checkpoint = False
        self.load_path = "tasks" # path to load model
        self.show_fig = False # show figure or not
        self.save_fig = True # save figure or not
 class AlgoConfigA2C(AlgoConfig):
    def __init__(self) -> None:
        self.continuous = False # continuous or discrete action space
        self.hidden_dim = 256 # hidden_dim for MLP
        self.gamma = 0.99 # discount factor
        self.actor_lr = 3e-4 # learning rate of actor
        self.critic_lr = 1e-3 # learning rate of critic
        self.actor_hidden_dim = 256 # hidden_dim for actor MLP
        self.critic_hidden_dim = 256 # hidden_dim for critic MLP
        self.buffer_size = 100000 # size of replay buffer
        self.batch_size = 64 # batch size
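The `__init__` methods above do not call `super().__init__()`, so an instance of `GeneralConfigA2C` or `AlgoConfigA2C` carries only the attributes set in this file; the launcher later combines them with its own defaults through `merge_class_attrs`, whose implementation is not part of this diff. The sketch below shows one way such a merge could work. `merge_attrs_sketch`, `GeneralDefaults`, and `GeneralA2C` are hypothetical stand-ins, not the repository's code, and the argument order is an assumption.

```python
class GeneralDefaults:
    """Hypothetical stand-in for the shared defaults."""
    def __init__(self):
        self.device = "cpu"
        self.seed = 0


class GeneralA2C:
    """Hypothetical stand-in for the algorithm-specific overrides."""
    def __init__(self):
        self.device = "cuda"
        self.train_eps = 1000


def merge_attrs_sketch(base, override):
    """Copy every instance attribute of `override` onto `base` and return `base`."""
    for name, value in vars(override).items():
        setattr(base, name, value)  # overrides win on name clashes
    return base


cfg = merge_attrs_sketch(GeneralDefaults(), GeneralA2C())
print(cfg.device, cfg.seed, cfg.train_eps)  # cuda 0 1000
```

With this shape, algorithm-specific values replace shared defaults on a name clash, while defaults that the algorithm does not mention survive unchanged.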
@@ -1,121 +0,0 @@
 import sys,os
 os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path 
 sys.path.append(parent_path)  # add path to system path
 import datetime
 import argparse
 import gym
 import torch
 import numpy as np
 from common.utils import all_seed
 from common.launcher import Launcher
 from common.memories import PGReplay
 from common.models import ActorSoftmax,Critic
 from envs.register import register_env
 from a2c import A2C
 class Main(Launcher):
    def get_args(self):
        curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")   # obtain current time
        parser = argparse.ArgumentParser(description="hyperparameters")      
        parser.add_argument('--algo_name',default='A2C',type=str,help="name of algorithm")
        parser.add_argument('--env_name',default='CartPole-v0',type=str,help="name of environment")
        parser.add_argument('--train_eps',default=1600,type=int,help="episodes of training") 
        parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing") 
        parser.add_argument('--ep_max_steps',default = 100000,type=int,help="steps per episode, much larger value can simulate infinite steps")
        parser.add_argument('--gamma',default=0.99,type=float,help="discounted factor") 
        parser.add_argument('--actor_lr',default=3e-4,type=float,help="learning rate of actor")
        parser.add_argument('--critic_lr',default=1e-3,type=float,help="learning rate of critic")
        parser.add_argument('--actor_hidden_dim',default=256,type=int,help="hidden of actor net")
        parser.add_argument('--critic_hidden_dim',default=256,type=int,help="hidden of critic net")
        parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") 
        parser.add_argument('--seed',default=10,type=int,help="seed") 
        parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")  
        parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")   
        args = parser.parse_args()   
        default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
                        'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
        }
        args = {**vars(args),**default_args}  # type(dict)                         
        return args
    def env_agent_config(self,cfg):
        ''' create env and agent
        '''  
        register_env(cfg['env_name'])
        env = gym.make(cfg['env_name']) 
        if cfg['seed'] !=0: # set random seed
            all_seed(env,seed=cfg["seed"]) 
        try: # state dimension
            n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
        except AttributeError:
            n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
        n_actions = env.action_space.n  # action dimension
        print(f"n_states: {n_states}, n_actions: {n_actions}")
         cfg.update({"n_states":n_states,"n_actions":n_actions}) # update to cfg parameters
        models = {'Actor':ActorSoftmax(cfg['n_states'],cfg['n_actions'], hidden_dim = cfg['actor_hidden_dim']),'Critic':Critic(cfg['n_states'],1,hidden_dim=cfg['critic_hidden_dim'])}
        memories = {'ACMemory':PGReplay()}
        agent = A2C(models,memories,cfg)
        return env,agent
    def train(self,cfg,env,agent):
        print("Start training!")
        print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
        rewards = []  # record rewards for all episodes
        steps = [] # record steps for all episodes
        for i_ep in range(cfg['train_eps']):
            ep_reward = 0  # reward per episode
            ep_step = 0 # step per episode
            ep_entropy = 0
            state = env.reset()  # reset and obtain initial state
            for _ in range(cfg['ep_max_steps']):
                action, value, dist = agent.sample_action(state)  # sample action
                next_state, reward, done, _ = env.step(action)  # update env and return transitions
                log_prob = torch.log(dist.squeeze(0)[action])
                entropy = -np.sum(np.mean(dist.detach().numpy()) * np.log(dist.detach().numpy()))
                agent.memory.push((value,log_prob,reward))  # save transitions
                state = next_state  # update state
                ep_reward += reward
                ep_entropy += entropy
                ep_step += 1
                if done:
                    break
            agent.update(next_state,ep_entropy)  # update agent
            rewards.append(ep_reward)
            steps.append(ep_step)
            if (i_ep+1)%10==0:
                print(f'Episode: {i_ep+1}/{cfg["train_eps"]}, Reward: {ep_reward:.2f}, Steps:{ep_step}')
        print("Finish training!")
        return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
    def test(self,cfg,env,agent):
        print("Start testing!")
        print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
        rewards = []  # record rewards for all episodes
        steps = [] # record steps for all episodes
        for i_ep in range(cfg['test_eps']):
            ep_reward = 0  # reward per episode
            ep_step = 0
            state = env.reset()  # reset and obtain initial state
            for _ in range(cfg['ep_max_steps']):
                action,_,_ = agent.predict_action(state)  # predict action
                next_state, reward, done, _ = env.step(action)  
                state = next_state 
                ep_reward += reward
                ep_step += 1
                if done:
                    break
            rewards.append(ep_reward)
            steps.append(ep_step)
            print(f"Episode: {i_ep+1}/{cfg['test_eps']}, Steps:{ep_step}, Reward: {ep_reward:.2f}")
        print("Finish testing!")
        return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
 if __name__ == "__main__":
    main = Main()
    main.run()
@@ -1,3 +1,13 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-09-19 14:48:16
 LastEditor: JiangJi
 LastEditTime: 2022-10-30 01:21:15
 Description: #TODO: template to be updated
 '''
 import sys,os
 os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
@@ -1,19 +0,0 @@
 {
    "algo_name": "A2C",
    "env_name": "CartPole-v0",
    "train_eps": 2000,
    "test_eps": 20,
    "ep_max_steps": 100000,
    "gamma": 0.99,
    "lr": 0.0003,
    "actor_hidden_dim": 256,
    "critic_hidden_dim": 256,
    "device": "cpu",
    "seed": 10,
    "show_fig": false,
    "save_fig": true,
    "result_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-135818/results/",
    "model_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-135818/models/",
    "n_states": 4,
    "n_actions": 2
 }
@@ -1 +0,0 @@
 {"algo_name": "A2C", "env_name": "CartPole-v0", "train_eps": 1600, "test_eps": 20, "ep_max_steps": 100000, "gamma": 0.99, "actor_lr": 0.0003, "critic_lr": 0.001, "actor_hidden_dim": 256, "critic_hidden_dim": 256, "device": "cpu", "seed": 10, "show_fig": false, "save_fig": true, "result_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-143327/results/", "model_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-143327/models/", "n_states": 4, "n_actions": 2}
@@ -1,21 +0,0 @@
 episodes,rewards,steps
 0,177.0,177
 1,180.0,180
 2,200.0,200
 3,200.0,200
 4,167.0,167
 5,124.0,124
 6,128.0,128
 7,200.0,200
 8,200.0,200
 9,200.0,200
 10,186.0,186
 11,187.0,187
 12,200.0,200
 13,176.0,176
 14,200.0,200
 15,200.0,200
 16,200.0,200
 17,200.0,200
 18,185.0,185
 19,180.0,180
@@ -0,0 +1,142 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-10-30 01:19:43
 LastEditor: JiangJi
 LastEditTime: 2022-11-01 01:21:06
 Description: 
 '''
 import sys,os
 os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path 
 sys.path.append(parent_path)  # add path to system path
 import gym
 from common.utils import all_seed,merge_class_attrs
 from common.launcher import Launcher
 from common.memories import PGReplay
 from common.models import ActorSoftmax,Critic
 from envs.register import register_env
 from a2c import A2C
 from config.config import GeneralConfigA2C,AlgoConfigA2C
 class Main(Launcher):
    def __init__(self) -> None:
        super().__init__()
        self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
        self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
    def env_agent_config(self,cfg,logger):
        ''' create env and agent
        '''  
        register_env(cfg.env_name)
        env = gym.make(cfg.env_name,new_step_api=True)  # create env
        if cfg.seed !=0: # set random seed
            all_seed(env,seed = cfg.seed) 
        try: # state dimension
            n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
        except AttributeError:
            n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
        n_actions = env.action_space.n  # action dimension
        logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
        # update cfg parameters
        setattr(cfg, 'n_states', n_states)
        setattr(cfg, 'n_actions', n_actions)
        models = {'Actor':ActorSoftmax(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
        memories = {'ACMemory':PGReplay()}
        agent = A2C(models,memories,cfg)
        for k,v in models.items():
            logger.info(f"{k} model name: {type(v).__name__}")
        for k,v in memories.items():
            logger.info(f"{k} memory name: {type(v).__name__}")
        logger.info(f"agent name: {type(agent).__name__}")
        return env,agent
    def train_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        ep_entropy = 0 # entropy per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.sample_action(state)  # sample action
            next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
            agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
            state = next_state  # update state
            ep_reward += reward
            ep_entropy += agent.entropy
            ep_step += 1
            if terminated:
                break
        agent.update(next_state,ep_entropy)  # update agent
        return agent,ep_reward,ep_step
    def test_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.predict_action(state)  # predict action
            next_state, reward, terminated, truncated , info = env.step(action)  
            state = next_state 
            ep_reward += reward
            ep_step += 1
            if terminated:
                break
        return agent,ep_reward,ep_step
    # def train(self,cfg,env,agent,logger):
    #     logger.info("Start training!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.train_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0 # step per episode
    #         ep_entropy = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.sample_action(state)  # sample action
    #             next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
    #             agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
    #             state = next_state  # update state
    #             ep_reward += reward
    #             ep_entropy += agent.entropy
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         agent.update(next_state,ep_entropy)  # update agent
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish training!")
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
    # def test(self,cfg,env,agent,logger):
    #     logger.info("Start testing!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.test_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.predict_action(state)  # predict action
    #             next_state, reward, terminated, truncated , info = env.step(action)  
    #             state = next_state 
    #             ep_reward += reward
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish testing!")
    #     env.close()
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
 if __name__ == "__main__":
    main = Main()
    main.run()
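Note on the episode loops in this file: with the five-value step API (`new_step_api=True`), `env.step` reports a terminal state (`terminated`) and a time-limit cut-off (`truncated`) as separate flags. The following is a minimal sketch, not the author's implementation, of an episode loop that stops on either flag. `StubEnv` and `run_episode` are hypothetical names, and the stub stands in for a real gym environment so the example runs without gym installed.

```python
class StubEnv:
    """Hypothetical stand-in for a gym env that uses the five-value step API."""

    def __init__(self, time_limit=5):
        self.time_limit = time_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy initial state

    def step(self, action):
        self.t += 1
        terminated = False                     # this stub never reaches a terminal state
        truncated = self.t >= self.time_limit  # time limit reached
        return self.t, 1.0, terminated, truncated, {}


def run_episode(env, max_steps=100000):
    """Roll out one episode, stopping on termination or truncation."""
    state = env.reset()
    ep_reward, ep_step = 0.0, 0
    for _ in range(max_steps):
        next_state, reward, terminated, truncated, info = env.step(0)
        state = next_state
        ep_reward += reward
        ep_step += 1
        if terminated or truncated:  # either flag ends the episode
            break
    return ep_reward, ep_step


ep_reward, ep_step = run_episode(StubEnv(time_limit=5))
```

Checking only `terminated` would let this stub run for all `max_steps` iterations, since it never sets that flag. Whether the time limit should end an episode is a design choice that depends on the task.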
@@ -0,0 +1,142 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-10-30 01:19:43
 LastEditor: JiangJi
 LastEditTime: 2022-11-01 01:21:12
 Description: the only difference from task0.py is that the actor here is ActorSoftmaxTanh instead of ActorSoftmax with ReLU
 '''
 import sys,os
 os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path 
 sys.path.append(parent_path)  # add path to system path
 import gym
 from common.utils import all_seed,merge_class_attrs
 from common.launcher import Launcher
 from common.memories import PGReplay
 from common.models import ActorSoftmaxTanh,Critic
 from envs.register import register_env
 from a2c import A2C
 from config.config import GeneralConfigA2C,AlgoConfigA2C
 class Main(Launcher):
    def __init__(self) -> None:
        super().__init__()
        self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
        self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
    def env_agent_config(self,cfg,logger):
        ''' create env and agent
        '''  
        register_env(cfg.env_name)
        env = gym.make(cfg.env_name,new_step_api=True)  # create env
        if cfg.seed !=0: # set random seed
            all_seed(env,seed = cfg.seed) 
        try: # state dimension
            n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
        except AttributeError:
            n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
        n_actions = env.action_space.n  # action dimension
        logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
        # update cfg parameters
        setattr(cfg, 'n_states', n_states)
        setattr(cfg, 'n_actions', n_actions)
        models = {'Actor':ActorSoftmaxTanh(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
        memories = {'ACMemory':PGReplay()}
        agent = A2C(models,memories,cfg)
        for k,v in models.items():
            logger.info(f"{k} model name: {type(v).__name__}")
        for k,v in memories.items():
            logger.info(f"{k} memory name: {type(v).__name__}")
        logger.info(f"agent name: {type(agent).__name__}")
        return env,agent
    def train_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        ep_entropy = 0 # entropy per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.sample_action(state)  # sample action
            next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
            agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
            state = next_state  # update state
            ep_reward += reward
            ep_entropy += agent.entropy
            ep_step += 1
            if terminated:
                break
        agent.update(next_state,ep_entropy)  # update agent
        return agent,ep_reward,ep_step
    def test_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.predict_action(state)  # predict action
            next_state, reward, terminated, truncated , info = env.step(action)  
            state = next_state 
            ep_reward += reward
            ep_step += 1
            if terminated:
                break
        return agent,ep_reward,ep_step
    # def train(self,cfg,env,agent,logger):
    #     logger.info("Start training!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.train_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0 # step per episode
    #         ep_entropy = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.sample_action(state)  # sample action
    #             next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
    #             agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
    #             state = next_state  # update state
    #             ep_reward += reward
    #             ep_entropy += agent.entropy
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         agent.update(next_state,ep_entropy)  # update agent
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish training!")
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
    # def test(self,cfg,env,agent,logger):
    #     logger.info("Start testing!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.test_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.predict_action(state)  # predict action
    #             next_state, reward, terminated, truncated , info = env.step(action)  
    #             state = next_state 
    #             ep_reward += reward
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish testing!")
    #     env.close()
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
 if __name__ == "__main__":
    main = Main()
    main.run()
@@ -0,0 +1,149 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-10-30 01:19:43
 LastEditor: JiangJi
 LastEditTime: 2022-11-01 00:08:22
 Description: continuous action space, using ActorNormal as the actor
 '''
 import sys,os
 os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE" # avoid "OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized."
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path 
 sys.path.append(parent_path)  # add path to system path
 import gym
 import torch
 import numpy as np
 from common.utils import all_seed,merge_class_attrs
 from common.launcher import Launcher
 from common.memories import PGReplay
 from common.models import ActorNormal,Critic
 from envs.register import register_env
 from a2c import A2C
 from config.config import GeneralConfigA2C,AlgoConfigA2C
 class Main(Launcher):
    def __init__(self) -> None:
        super().__init__()
        self.cfgs['general_cfg'] = merge_class_attrs(self.cfgs['general_cfg'],GeneralConfigA2C())
        self.cfgs['algo_cfg'] = merge_class_attrs(self.cfgs['algo_cfg'],AlgoConfigA2C())
    def env_agent_config(self,cfg,logger):
        ''' create env and agent
        '''  
        register_env(cfg.env_name)
        env = gym.make(cfg.env_name,new_step_api=True)  # create env
        if cfg.seed !=0: # set random seed
            all_seed(env,seed = cfg.seed) 
        try: # state dimension
            n_states = env.observation_space.n # print(hasattr(env.observation_space, 'n'))
        except AttributeError:
            n_states = env.observation_space.shape[0] # print(hasattr(env.observation_space, 'shape'))
        try:
            n_actions = env.action_space.n  # action dimension
        except AttributeError:
            n_actions = env.action_space.shape[0]
            logger.info(f"action bound: {abs(env.action_space.low.item())}") 
            setattr(cfg, 'action_bound', abs(env.action_space.low.item()))
        logger.info(f"n_states: {n_states}, n_actions: {n_actions}") # print info
        # update cfg parameters
        setattr(cfg, 'n_states', n_states)
        setattr(cfg, 'n_actions', n_actions)
        models = {'Actor':ActorNormal(n_states,n_actions, hidden_dim = cfg.actor_hidden_dim),'Critic':Critic(n_states,1,hidden_dim=cfg.critic_hidden_dim)}
        memories = {'ACMemory':PGReplay()}
        agent = A2C(models,memories,cfg)
        for k,v in models.items():
            logger.info(f"{k} model name: {type(v).__name__}")
        for k,v in memories.items():
            logger.info(f"{k} memory name: {type(v).__name__}")
        logger.info(f"agent name: {type(agent).__name__}")
        return env,agent
    def train_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        ep_entropy = 0 # entropy per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.sample_action(state)  # sample action
            next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
            agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
            state = next_state  # update state
            ep_reward += reward
            ep_entropy += agent.entropy
            ep_step += 1
            if terminated:
                break
        agent.update(next_state,ep_entropy)  # update agent
        return agent,ep_reward,ep_step
    def test_one_episode(self, env, agent, cfg):
        ep_reward = 0  # reward per episode
        ep_step = 0 # step per episode
        state = env.reset()  # reset and obtain initial state
        for _ in range(cfg.max_steps):
            action = agent.predict_action(state)  # predict action
            next_state, reward, terminated, truncated , info = env.step(action)  
            state = next_state 
            ep_reward += reward
            ep_step += 1
            if terminated:
                break
        return agent,ep_reward,ep_step
    # def train(self,cfg,env,agent,logger):
    #     logger.info("Start training!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.train_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0 # step per episode
    #         ep_entropy = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.sample_action(state)  # sample action
    #             next_state, reward, terminated, truncated , info = env.step(action)  # update env and return transitions
    #             agent.memory.push((agent.value,agent.log_prob,reward))  # save transitions
    #             state = next_state  # update state
    #             ep_reward += reward
    #             ep_entropy += agent.entropy
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         agent.update(next_state,ep_entropy)  # update agent
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.train_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish training!")
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
    # def test(self,cfg,env,agent,logger):
    #     logger.info("Start testing!")
    #     logger.info(f"Env: {cfg.env_name}, Algorithm: {cfg.algo_name}, Device: {cfg.device}")
    #     rewards = []  # record rewards for all episodes
    #     steps = [] # record steps for all episodes
    #     for i_ep in range(cfg.test_eps):
    #         ep_reward = 0  # reward per episode
    #         ep_step = 0
    #         state = env.reset()  # reset and obtain initial state
    #         for _ in range(cfg.max_steps):
    #             action = agent.predict_action(state)  # predict action
    #             next_state, reward, terminated, truncated , info = env.step(action)  
    #             state = next_state 
    #             ep_reward += reward
    #             ep_step += 1
    #             if terminated:
    #                 break
    #         rewards.append(ep_reward)
    #         steps.append(ep_step)
    #         logger.info(f"Episode: {i_ep+1}/{cfg.test_eps}, Reward: {ep_reward:.2f}, Steps:{ep_step}")
    #     logger.info("Finish testing!")
    #     env.close()
    #     return {'episodes':range(len(rewards)),'rewards':rewards,'steps':steps}
 if __name__ == "__main__":
    main = Main()
    main.run()
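Note on `env_agent_config`: the nested `try`/`except AttributeError` blocks read a dimension from either a discrete-style space (attribute `n`) or a box-style space (attribute `shape`). A small sketch of the same lookup as one helper follows. `space_dim` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for gym spaces, used only so the example runs without gym.

```python
from types import SimpleNamespace


def space_dim(space):
    """Return `n` for discrete-like spaces, else the first entry of `shape`."""
    if hasattr(space, "n"):
        return space.n
    return space.shape[0]


# Hypothetical stand-ins for a discrete space and a 1-D continuous space
discrete_space = SimpleNamespace(n=2)
box_space = SimpleNamespace(shape=(1,), low=[-2.0], high=[2.0])

n_actions_discrete = space_dim(discrete_space)
n_actions_box = space_dim(box_space)
action_bound = abs(box_space.low[0])  # symmetric bound, as assumed in the file above
```

Like the original code, this assumes the continuous action range is symmetric around zero, so that `abs(low)` is enough to describe it.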
@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-09 20:25:52
@LastEditor: John
-LastEditTime: 2022-06-09 19:04:44
+LastEditTime: 2022-09-27 15:43:21
@Description: 
@Environment: python 3.7.7
 '''
@@ -14,96 +14,45 @@ import numpy as np
 import torch
 import torch.nn as nn
 import torch.optim as optim
 import torch.nn.functional as F
 class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity # capacity of the replay buffer
        self.buffer = [] # buffer
        self.position = 0 
    def push(self, state, action, reward, next_state, done):
        ''' The buffer is a queue; once capacity is exceeded, the oldest transitions are overwritten
        '''
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity 
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size) # randomly sample a mini-batch of transitions
        state, action, reward, next_state, done =  zip(*batch) # unzip into states, actions, etc.
        return state, action, reward, next_state, done
    def __len__(self):
        ''' Return the number of transitions currently stored
        '''
        return len(self.buffer)
 class Actor(nn.Module):
    def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
        super(Actor, self).__init__()  
        self.linear1 = nn.Linear(n_states, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, n_actions)
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)
    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = torch.tanh(self.linear3(x))
        return x
 class Critic(nn.Module):
    def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
        super(Critic, self).__init__()
        self.linear1 = nn.Linear(n_states + n_actions, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, 1)
        # initialize the output layer with small random values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)
    def forward(self, state, action):
        # concatenate along dimension 1
        x = torch.cat([state, action], 1)
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = self.linear3(x)
        return x
 class DDPG:
    def __init__(self, n_states, n_actions, cfg):
        self.device = torch.device(cfg.device)
        self.critic = Critic(n_states, n_actions, cfg.hidden_dim).to(self.device)
        self.actor = Actor(n_states, n_actions, cfg.hidden_dim).to(self.device)
        self.target_critic = Critic(n_states, n_actions, cfg.hidden_dim).to(self.device)
        self.target_actor = Actor(n_states, n_actions, cfg.hidden_dim).to(self.device)
-        # copy parameters to the target networks
+class DDPG:
    def __init__(self, models,memories,cfg):
        self.device = torch.device(cfg['device'])
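        import copy  # local import, used to clone the target networks
        self.critic = models['critic'].to(self.device)
        self.target_critic = copy.deepcopy(models['critic']).to(self.device)  # independent copy, not an alias of self.critic
        self.actor = models['actor'].to(self.device)
        self.target_actor = copy.deepcopy(models['actor']).to(self.device)  # independent copy, not an alias of self.actor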
        # copy weights from critic to target_critic
        for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
            target_param.data.copy_(param.data)
        # copy weights from actor to target_actor
        for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
            target_param.data.copy_(param.data)
        self.critic_optimizer = optim.Adam(self.critic.parameters(),  lr=cfg['critic_lr'])
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=cfg['actor_lr'])
        self.memory = memories['memory']
        self.batch_size = cfg['batch_size']
        self.gamma = cfg['gamma']
        self.tau = cfg['tau']
-        self.critic_optimizer = optim.Adam(
-            self.critic.parameters(),  lr=cfg.critic_lr)
-        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
-        self.memory = ReplayBuffer(cfg.memory_capacity)
-        self.batch_size = cfg.batch_size
-        self.soft_tau = cfg.soft_tau # soft update coefficient
-        self.gamma = cfg.gamma
-    def choose_action(self, state):
+    def sample_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        action = self.actor(state)
        return action.detach().cpu().numpy()[0, 0]
    @torch.no_grad()
    def predict_action(self, state):
        ''' predict action
        '''
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        action = self.actor(state)
        return action.cpu().numpy()[0, 0]
    def update(self):
-        if len(self.memory) < self.batch_size: # do not update the policy until memory holds a full batch
+        if len(self.memory) < self.batch_size: # when memory size is less than batch size, return
            return
-        # randomly sample a batch of transitions from the replay memory
+        # sample a random minibatch of N transitions from R
        state, action, reward, next_state, done = self.memory.sample(self.batch_size)
-        # convert to tensors
+        # convert to tensor
        state = torch.FloatTensor(np.array(state)).to(self.device)
        next_state = torch.FloatTensor(np.array(next_state)).to(self.device)
        action = torch.FloatTensor(np.array(action)).to(self.device)
@@ -126,19 +75,22 @@ class DDPG:
        self.critic_optimizer.zero_grad()
        value_loss.backward()
        self.critic_optimizer.step()
-        # soft update of the target networks
+        # soft update
        for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
            target_param.data.copy_(
-                target_param.data * (1.0 - self.soft_tau) +
+                target_param.data * (1.0 - self.tau) +
-                param.data * self.soft_tau
+                param.data * self.tau
            )
        for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
            target_param.data.copy_(
-                target_param.data * (1.0 - self.soft_tau) +
+                target_param.data * (1.0 - self.tau) +
-                param.data * self.soft_tau
+                param.data * self.tau
            )
-    def save(self,path):
+    def save_model(self,path):
-        torch.save(self.actor.state_dict(), path+'checkpoint.pt')
+        from pathlib import Path
        # create path
        Path(path).mkdir(parents=True, exist_ok=True)
        torch.save(self.actor.state_dict(), f"{path}/actor_checkpoint.pt")
-    def load(self,path):
+    def load_model(self,path):
-        self.actor.load_state_dict(torch.load(path+'checkpoint.pt')) 
+        self.actor.load_state_dict(torch.load(f"{path}/actor_checkpoint.pt")) 
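Note on the target networks: DDPG's soft update only has an effect when each target network is a separate object from its online network. If both names point at the same module, the update blends a network with itself. The sketch below shows the idea with plain dicts in place of `nn.Module` parameters. That substitution is an assumption made to keep the example dependency-free, and `soft_update`, `online` and `target` are hypothetical names.

```python
import copy


def soft_update(target, online, tau):
    """Polyak averaging in place: target <- (1 - tau) * target + tau * online."""
    for name in target:
        target[name] = (1.0 - tau) * target[name] + tau * online[name]


online = {"w": 1.0, "b": 0.0}   # hypothetical parameters standing in for a network
target = copy.deepcopy(online)  # independent copy; `target = online` would only alias

online["w"] = 2.0               # pretend an optimizer step changed the online weights
aliased = target is online      # False: the two dicts are distinct objects
soft_update(target, online, tau=0.01)
```

With `tau=0.01`, as in the default configuration, the target moves only 1% of the way toward the online value on each update. That slow drift is what stabilizes the critic's bootstrapped targets.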
@@ -0,0 +1,152 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
 LastEditTime: 2022-09-27 15:50:12
@Description: 
@Environment: python 3.7.7
 '''
 import sys,os
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path
 sys.path.append(parent_path)  # add to system path
 import datetime
 import gym
 import torch
 import argparse
 import torch.nn as nn
 import torch.nn.functional as F
 from env import NormalizedActions,OUNoise
 from ddpg import DDPG
 from common.utils import all_seed
 from common.memories import ReplayBufferQue
 from common.launcher import Launcher
 from envs.register import register_env
 class Actor(nn.Module):
    def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
        super(Actor, self).__init__()  
        self.linear1 = nn.Linear(n_states, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, n_actions)
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)
    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = torch.tanh(self.linear3(x))
        return x
 class Critic(nn.Module):
    def __init__(self, n_states, n_actions, hidden_dim, init_w=3e-3):
        super(Critic, self).__init__()
        self.linear1 = nn.Linear(n_states + n_actions, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, 1)
        # initialize the output layer with small random values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)
    def forward(self, state, action):
        # concatenate along dimension 1
        x = torch.cat([state, action], 1)
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = self.linear3(x)
        return x
 class Main(Launcher):
    def get_args(self):
        """ hyperparameters
        """
        curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")  # obtain current time
        parser = argparse.ArgumentParser(description="hyperparameters") 
        parser.add_argument('--algo_name',default='DDPG',type=str,help="name of algorithm")
        parser.add_argument('--env_name',default='Pendulum-v1',type=str,help="name of environment")
        parser.add_argument('--train_eps',default=300,type=int,help="episodes of training")
        parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
        parser.add_argument('--max_steps',default=100000,type=int,help="max steps per episode; a very large value effectively removes the limit")
        parser.add_argument('--gamma',default=0.99,type=float,help="discount factor")
        parser.add_argument('--critic_lr',default=1e-3,type=float,help="learning rate of critic")
        parser.add_argument('--actor_lr',default=1e-4,type=float,help="learning rate of actor")
        parser.add_argument('--memory_capacity',default=8000,type=int,help="memory capacity")
        parser.add_argument('--batch_size',default=128,type=int)
        parser.add_argument('--target_update',default=2,type=int)
        parser.add_argument('--tau',default=1e-2,type=float)
        parser.add_argument('--critic_hidden_dim',default=256,type=int)
        parser.add_argument('--actor_hidden_dim',default=256,type=int)
        parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") 
        parser.add_argument('--seed',default=1,type=int,help="random seed")
        parser.add_argument('--show_fig',default=False,type=bool,help="if show figure or not")  
        parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")  
        args = parser.parse_args()   
        default_args = {'result_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/results/",
                        'model_path':f"{curr_path}/outputs/{args.env_name}/{curr_time}/models/",
        }
        args = {**vars(args),**default_args}  # type(dict)                         
        return args
    def env_agent_config(self,cfg):
        register_env(cfg['env_name'])
        env = gym.make(cfg['env_name']) 
        env = NormalizedActions(env) # wrap env to normalize actions (noise is added separately by OUNoise)
        if cfg['seed'] !=0: # set random seed
            all_seed(env,seed=cfg["seed"]) 
        n_states = env.observation_space.shape[0]
        n_actions = env.action_space.shape[0]
        print(f"n_states: {n_states}, n_actions: {n_actions}")
        cfg.update({"n_states":n_states,"n_actions":n_actions}) # update cfg parameters
        models = {"actor":Actor(n_states,n_actions,hidden_dim=cfg['actor_hidden_dim']),"critic":Critic(n_states,n_actions,hidden_dim=cfg['critic_hidden_dim'])}
        memories = {"memory":ReplayBufferQue(cfg['memory_capacity'])}
        agent = DDPG(models,memories,cfg)
        return env,agent
    def train(self,cfg, env, agent):
        print('Start training!')
        ou_noise = OUNoise(env.action_space)  # noise of action
        rewards = [] # record rewards for all episodes
        for i_ep in range(cfg['train_eps']):
            state = env.reset()
            ou_noise.reset()
            ep_reward = 0
            for i_step in range(cfg['max_steps']):
                action = agent.sample_action(state)
                action = ou_noise.get_action(action, i_step+1) 
                next_state, reward, done, _ = env.step(action)
                ep_reward += reward
                agent.memory.push((state, action, reward, next_state, done))
                agent.update()
                state = next_state
                if done:
                    break
            if (i_ep+1)%10 == 0:
                print(f"Episode:{i_ep+1}/{cfg['train_eps']}, Reward:{ep_reward:.2f}")
            rewards.append(ep_reward)
        print('Finish training!')
        return {'rewards':rewards}
    def test(self,cfg, env, agent):
        print('Start testing!')
        rewards = [] # record rewards for all episodes
        for i_ep in range(cfg['test_eps']):
            state = env.reset() 
            ep_reward = 0
            for i_step in range(cfg['max_steps']):
                action = agent.predict_action(state)
                next_state, reward, done, _ = env.step(action)
                ep_reward += reward
                state = next_state
                if done:
                    break
            rewards.append(ep_reward)
            print(f"Episode:{i_ep+1}/{cfg['test_eps']}, Reward:{ep_reward:.1f}")
        print('Finish testing!')
        return {'rewards':rewards}
 if __name__ == "__main__":
    main = Main()
    main.run()
@@ -1,18 +0,0 @@
 {
    "algo_name": "DDPG",
    "env_name": "Pendulum-v1",
    "train_eps": 300,
    "test_eps": 20,
    "gamma": 0.99,
    "critic_lr": 0.001,
    "actor_lr": 0.0001,
    "memory_capacity": 8000,
    "batch_size": 128,
    "target_update": 2,
    "soft_tau": 0.01,
    "hidden_dim": 256,
    "deivce": "cpu",
    "result_path": "C:\\Users\\24438\\Desktop\\rl-tutorials/outputs/DDPG/outputs/Pendulum-v1/20220713-225402/results//",
    "model_path": "C:\\Users\\24438\\Desktop\\rl-tutorials/outputs/DDPG/outputs/Pendulum-v1/20220713-225402/models/",
    "save_fig": true
 }
@@ -0,0 +1,25 @@
 {
    "algo_name": "DDPG",
    "env_name": "Pendulum-v1",
    "train_eps": 300,
    "test_eps": 20,
    "max_steps": 100000,
    "gamma": 0.99,
    "critic_lr": 0.001,
    "actor_lr": 0.0001,
    "memory_capacity": 8000,
    "batch_size": 128,
    "target_update": 2,
    "tau": 0.01,
    "critic_hidden_dim": 256,
    "actor_hidden_dim": 256,
    "device": "cpu",
    "seed": 1,
    "show_fig": false,
    "save_fig": true,
    "result_path": "/Users/jj/Desktop/rl-tutorials/codes/DDPG/outputs/Pendulum-v1/20220927-155053/results/",
    "model_path": "/Users/jj/Desktop/rl-tutorials/codes/DDPG/outputs/Pendulum-v1/20220927-155053/models/",
    "n_states": 3,
    "n_actions": 1,
    "training_time": 358.8142900466919
 }
@@ -0,0 +1,21 @@
 rewards
 -116.045416124376
 -126.18022935469217
 -231.46338228458293
 -246.40481094689758
 -304.69493818839186
 -124.39609191913091
 -1.060003582878406
 -114.19659653048288
 -348.9745708742037
 -116.10811133324769
 -117.20146333694844
 -118.66206784602966
 -235.17836229762355
 -356.14054913290624
 -118.38579118156366
 -351.9415915140771
 -114.50877866098972
 -124.775484599685
 -226.47062962476875
 -121.48872909193936
@@ -0,0 +1,301 @@
 rewards
 -1557.8518596631177
 -1354.7599369723537
 -1375.5732016629706
 -1493.8609739040871
 -1426.7116204537845
 -1235.7920755027762
 -1339.1647620443073
 -1544.2379906560486
 -1539.6232758780877
 -1549.5690058648204
 -1446.9193195793853
 -1520.2666688767558
 -1525.0116707122581
 -1379.136573640111
 -1532.702831768523
 -1484.7552963941637
 -1359.6699201737677
 -1349.6805649166854
 -1510.869999766432
 -1515.8398785434708
 -1447.4648656578254
 -1537.3822077872178
 -1249.6517039877456
 -1350.0302666965736
 -1529.4363372505607
 -1320.28204807604
 -1502.9248141320654
 -1545.4861772197075
 -1579.928789692619
 -1413.296070504152
 -1242.4673258663781
 -1403.8672028946078
 -1452.7199002523635
 -871.6071114009982
 -1324.1789316121412
 -1313.3348146041249
 -1059.8722927418046
 -1054.232673559123
 -973.8956270782459
 -972.9936641224186
 -972.9477399905655
 -947.0613443333731
 -737.3866328989184
 -958.6068164634295
 -739.6973395350705
 -886.8383108399455
 -775.1430379821574
 -937.3115016337417
 -700.875502951337
 -829.9396339144109
 -271.1629773396998
 -493.5460684734584
 -485.9321719313203
 -858.3735607086766
 -1145.3440084994113
 -1121.1338201339777
 -1191.5640831332366
 -1350.0425368846784
 -249.25438665107953
 -727.9051714734406
 -368.5579316240395
 -392.0611344939354
 -955.3231703741553
 -488.27956192035265
 -362.2734695759137
 -949.5440839122496
 -496.8460016912189
 -726.6871514929877
 -424.48641462866266
 -954.7075428204689
 -608.9650086409792
 -848.6059768900151
 -866.7052398755033
 -856.9846415044439
 -751.0342976129083
 -749.5118249469103
 -509.882299129811
 -506.56154097018043
 -906.0964475820368
 -1318.3941416286855
 -1422.2017011876615
 -1523.1661091894277
 -1209.2850593747999
 -1415.0972750475833
 -1533.2263827605834
 -1405.8345530072663
 -1244.3384723384913
 -1237.4704845061992
 -949.3394417935086
 -981.1855396112669
 -1241.224568444032
 -1033.118364799829
 -1017.2403725619487
 -981.9727804516916
 -853.1877724775591
 -869.0652369861646
 -1069.8265343327998
 -371.73173813891884
 -735.5887912713665
 -1262.050240428957
 -1242.985056062197
 -1191.6867713427482
 -1328.5323118458034
 -1015.5308653784714
 -895.3066515461381
 -994.1114862316568
 -761.4710321387583
 -717.6979056272868
 -782.302146467708
 -640.4913147345328
 -725.6469893076355
 -497.5346232085584
 -1027.1192149202325
 -950.0117149822681
 -956.1343737377374
 -708.9489626669097
 -964.5003064113283
 -611.9111516886613
 -612.3182791021098
 -1100.0047939174613
 -984.9262458612923
 -858.7106075590494
 -842.305917848386
 -745.9043991922597
 -741.2168858394704
 -1143.0750387284456
 -755.5257242325362
 -745.8440029056219
 -387.8717950334138
 -764.6628701051523
 -486.7967495537958
 -485.13357559164814
 -313.5415216767419
 -611.3450529954782
 -611.1570544377465
 -507.6456747676814
 -615.2032627013064
 -242.37988821149764
 -603.85498620892
 -352.2672241055367
 -155.99874664988383
 -615.4003063516313
 -384.9811293551548
 -498.80727354456315
 -407.6898591217813
 -1213.6383844696395
 -1122.2425748913884
 -592.4819308883913
 -478.2046833075051
 -891.0254788311132
 -482.40204115385
 -339.34676196677407
 -582.9985110154428
 -213.38243627478826
 -928.8434951613825
 -1545.5433749195483
 -1179.5016285049896
 -1211.9549773601925
 -1396.8082561792166
 -1318.073128824395
 -597.3837225413702
 -564.7793352410449
 -723.744223659601
 -653.0145534050461
 -847.6138123247009
 -385.62784320332867
 -245.25250602651928
 -117.55094416757835
 -864.0064774069044
 -124.30221387458867
 -244.4014050243669
 -1148.861754008653
 -914.4047868424254
 -765.9394408203351
 -124.05114610943177
 -605.7641303826842
 -616.3595829453579
 -375.5024692962698
 -253.51874076866997
 -240.08405245866714
 -503.96565579077225
 -606.7646526173963
 -502.6512112729435
 -746.404013238678
 -718.8658110051653
 -125.65808359856703
 -247.62256797883364
 -363.69852213666803
 -249.21801061415547
 -491.7724416523124
 -235.37050442527357
 -609.6026403583944
 -236.05731608228092
 -381.19853850450454
 -298.7683201867404
 -127.64145601534942
 -233.4300138495176
 -129.11243486763516
 -390.0092951263507
 -1000.7729892969854
 -249.60445310459787
 -253.02347910759622
 -129.04269174391223
 -360.6321251486308
 -377.26297602576534
 -124.98466986009481
 -245.47913567739212
 -127.0885254550411
 -118.11013006825459
 -128.8682755001942
 -497.3015586531096
 -340.77352433313484
 -514.4945799737978
 -503.24077308842783
 -627.9068157464455
 -511.39396524392146
 -763.8866112068075
 -741.7885082408757
 -617.4945380476306
 -950.3176437519387
 -643.4791402436576
 -511.9377874351982
 -573.6219349516633
 -564.1297823875693
 -242.06399233336583
 -496.4020380325518
 -360.56387982880364
 -495.4590728336022
 -503.7263345016764
 -122.47964616802327
 -254.16543926263168
 -614.5335268729743
 -234.3718017676852
 -301.27514663062874
 -387.64758894986204
 -368.74492411716415
 -364.43559131093593
 -160.6845848115533
 -504.1948947975429
 -246.51676032967683
 -251.5732500220603
 -600.1463819723879
 -247.17476928471288
 -381.924164337607
 -377.4773226068174
 -378.511830774651
 -126.69199895843033
 -365.0506645811703
 -130.45052114802874
 -374.37400288581813
 -502.37678159638887
 -374.43552658473055
 -241.157211525502
 -388.9597456642503
 -249.4412385534861
 -114.71395078439846
 -864.6882327286056
 -626.8144095971478
 -732.9226896140248
 -368.24767905020394
 -369.7425524469132
 -398.07832598184626
 -906.7113918582257
 -252.2343258180765
 -370.4258473086036
 -736.0203154396909
 -609.4605173515027
 -661.1255920773486
 -489.9605291008584
 -364.1671188501402
 -644.4029089587781
 -477.9510457677364
 -128.78294672880136
 -373.74382001694886
 -380.69931133982936
 -372.60275628381805
 -743.0410655515724
 -597.558847789258
 -387.94245652694394
 -725.3939448944484
 -409.1301313430852
 -491.8442467896486
 -123.0638156839621
 -377.9292326597324
 -489.27209762667974
 -255.63227821371257
 -379.5885382060625
 -370.2312967024669
 -250.94061817008688
 -131.2125308195906
 -600.3312016651868
 -130.84444772735733
 -312.6287688438562
 -382.4144610039701
 -259.03558003697265
 -224.92206667096863
 -376.81390821359685
 -382.39993489751646
 -380.25599578593636
 -610.1016672243638
@@ -1,133 +0,0 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-06-11 20:58:21
@LastEditor: John
 LastEditTime: 2022-07-21 21:51:34
@Description: 
@Environment: python 3.7.7
 '''
 import sys,os
 curr_path = os.path.dirname(os.path.abspath(__file__))  # current path
 parent_path = os.path.dirname(curr_path)  # parent path
 sys.path.append(parent_path)  # add to system path
 import datetime
 import gym
 import torch
 import argparse
 from env import NormalizedActions,OUNoise
 from ddpg import DDPG
 from common.utils import save_results,make_dir
 from common.utils import plot_rewards,save_args
 def get_args():
    """ Hyperparameters
    """
    curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")  # Obtain current time
    parser = argparse.ArgumentParser(description="hyperparameters")      
    parser.add_argument('--algo_name',default='DDPG',type=str,help="name of algorithm")
    parser.add_argument('--env_name',default='Pendulum-v1',type=str,help="name of environment")
    parser.add_argument('--train_eps',default=300,type=int,help="episodes of training")
    parser.add_argument('--test_eps',default=20,type=int,help="episodes of testing")
    parser.add_argument('--gamma',default=0.99,type=float,help="discounted factor")
    parser.add_argument('--critic_lr',default=1e-3,type=float,help="learning rate of critic")
    parser.add_argument('--actor_lr',default=1e-4,type=float,help="learning rate of actor")
    parser.add_argument('--memory_capacity',default=8000,type=int,help="memory capacity")
    parser.add_argument('--batch_size',default=128,type=int)
    parser.add_argument('--target_update',default=2,type=int)
    parser.add_argument('--soft_tau',default=1e-2,type=float)
    parser.add_argument('--hidden_dim',default=256,type=int)
    parser.add_argument('--device',default='cpu',type=str,help="cpu or cuda") 
    parser.add_argument('--result_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
            '/' + curr_time + '/results/' )
    parser.add_argument('--model_path',default=curr_path + "/outputs/" + parser.parse_args().env_name + \
            '/' + curr_time + '/models/' ) # path to save models
    parser.add_argument('--save_fig',default=True,type=bool,help="if save figure or not")   
    args = parser.parse_args()                           
    return args
 def env_agent_config(cfg,seed=1):
    env = NormalizedActions(gym.make(cfg.env_name)) # wrap env to normalize actions
    env.seed(seed) # random seed
    n_states = env.observation_space.shape[0]
    n_actions = env.action_space.shape[0]
    agent = DDPG(n_states,n_actions,cfg)
    return env,agent
 def train(cfg, env, agent):
    print('Start training!')
    print(f'Env:{cfg.env_name}, Algorithm:{cfg.algo_name}, Device:{cfg.device}')
    ou_noise = OUNoise(env.action_space)  # noise of action
    rewards = [] # record rewards for all episodes
    ma_rewards = []  # record moving-average rewards for all episodes
    for i_ep in range(cfg.train_eps):
        state = env.reset()
        ou_noise.reset()
        done = False
        ep_reward = 0
        i_step = 0
        while not done:
            i_step += 1
            action = agent.choose_action(state)
            action = ou_noise.get_action(action, i_step) 
            next_state, reward, done, _ = env.step(action)
            ep_reward += reward
            agent.memory.push(state, action, reward, next_state, done)
            agent.update()
            state = next_state
        if (i_ep+1)%10 == 0:
            print(f'Episode:{i_ep+1}/{cfg.train_eps}, Reward:{ep_reward:.2f}')
        rewards.append(ep_reward)
        if ma_rewards:
            ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
        else:
            ma_rewards.append(ep_reward)
    print('Finish training!')
    return {'rewards':rewards,'ma_rewards':ma_rewards}
 def test(cfg, env, agent):
    print('Start testing')
    print(f'Env:{cfg.env_name}, Algorithm:{cfg.algo_name}, Device:{cfg.device}')
    rewards = [] # record rewards for all episodes
    ma_rewards = []  # record moving-average rewards for all episodes
    for i_ep in range(cfg.test_eps):
        state = env.reset() 
        done = False
        ep_reward = 0
        i_step = 0
        while not done:
            i_step += 1
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            ep_reward += reward
            state = next_state
        rewards.append(ep_reward)
        if ma_rewards:
            ma_rewards.append(0.9*ma_rewards[-1]+0.1*ep_reward)
        else:
            ma_rewards.append(ep_reward)
        print(f"Episode:{i_ep+1}/{cfg.test_eps}, Reward:{ep_reward:.1f}")
    print('Finish testing!')
    return {'rewards':rewards,'ma_rewards':ma_rewards}
 if __name__ == "__main__":
    cfg = get_args()
    # training
    env,agent = env_agent_config(cfg,seed=1)
    res_dic = train(cfg, env, agent)
    make_dir(cfg.result_path, cfg.model_path)
    save_args(cfg)
    agent.save(path=cfg.model_path)
    save_results(res_dic, tag='train',
                 path=cfg.result_path)  
    plot_rewards(res_dic['rewards'], res_dic['ma_rewards'], cfg, tag="train")  
    # testing
    env,agent = env_agent_config(cfg,seed=10)
    agent.load(path=cfg.model_path)
    res_dic = test(cfg,env,agent)
    save_results(res_dic, tag='test',
                 path=cfg.result_path)  
    plot_rewards(res_dic['rewards'], res_dic['ma_rewards'], cfg, tag="test")   
@@ -0,0 +1,25 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  eval_per_episode: 5
  load_checkpoint: true
  load_path: Train_CartPole-v1_DQN_20221031-001201
  max_steps: 200
  mode: test
  save_fig: true
  seed: 0
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 64
  buffer_size: 100000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  hidden_dim: 256
  lr: 0.0001
  target_update: 4
@@ -0,0 +1,14 @@
 2022-10-31 00:13:43 - r - INFO: - n_states: 4, n_actions: 2
 2022-10-31 00:13:44 - r - INFO: - Start testing!
 2022-10-31 00:13:44 - r - INFO: - Env: CartPole-v1, Algorithm: DQN, Device: cuda
 2022-10-31 00:13:45 - r - INFO: - Episode: 1/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 2/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 3/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 4/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 5/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 6/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 7/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 8/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 9/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Episode: 10/10, Reward: 200.0, Step: 200
 2022-10-31 00:13:45 - r - INFO: - Finish testing!
@@ -0,0 +1,11 @@
 episodes,rewards,steps
 0,200.0,200
 1,200.0,200
 2,200.0,200
 3,200.0,200
 4,200.0,200
 5,200.0,200
 6,200.0,200
 7,200.0,200
 8,200.0,200
 9,200.0,200
@@ -0,0 +1,23 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: Acrobot-v1
  load_checkpoint: false
  load_path: Train_CartPole-v1_DQN_20221026-054757
  max_steps: 100000
  mode: train
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 128
  buffer_size: 200000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  hidden_dim: 256
  lr: 0.002
  target_update: 4
@@ -0,0 +1,104 @@
 2022-10-26 09:46:45 - r - INFO: - n_states: 6, n_actions: 3
 2022-10-26 09:46:48 - r - INFO: - Start training!
 2022-10-26 09:46:48 - r - INFO: - Env: Acrobot-v1, Algorithm: DQN, Device: cuda
 2022-10-26 09:46:50 - r - INFO: - Episode: 1/100, Reward: -861.00: Epislon: 0.178
 2022-10-26 09:46:50 - r - INFO: - Episode: 2/100, Reward: -252.00: Epislon: 0.111
 2022-10-26 09:46:50 - r - INFO: - Episode: 3/100, Reward: -196.00: Epislon: 0.078
 2022-10-26 09:46:51 - r - INFO: - Episode: 4/100, Reward: -390.00: Epislon: 0.041
 2022-10-26 09:46:52 - r - INFO: - Episode: 5/100, Reward: -371.00: Epislon: 0.025
 2022-10-26 09:46:52 - r - INFO: - Episode: 6/100, Reward: -237.00: Epislon: 0.019
 2022-10-26 09:46:52 - r - INFO: - Episode: 7/100, Reward: -227.00: Epislon: 0.016
 2022-10-26 09:46:53 - r - INFO: - Episode: 8/100, Reward: -228.00: Epislon: 0.014
 2022-10-26 09:46:53 - r - INFO: - Episode: 9/100, Reward: -305.00: Epislon: 0.012
 2022-10-26 09:46:54 - r - INFO: - Episode: 10/100, Reward: -234.00: Epislon: 0.011
 2022-10-26 09:46:54 - r - INFO: - Episode: 11/100, Reward: -204.00: Epislon: 0.011
 2022-10-26 09:46:55 - r - INFO: - Episode: 12/100, Reward: -277.00: Epislon: 0.010
 2022-10-26 09:46:55 - r - INFO: - Episode: 13/100, Reward: -148.00: Epislon: 0.010
 2022-10-26 09:46:56 - r - INFO: - Episode: 14/100, Reward: -372.00: Epislon: 0.010
 2022-10-26 09:46:56 - r - INFO: - Episode: 15/100, Reward: -273.00: Epislon: 0.010
 2022-10-26 09:46:56 - r - INFO: - Episode: 16/100, Reward: -105.00: Epislon: 0.010
 2022-10-26 09:46:56 - r - INFO: - Episode: 17/100, Reward: -79.00: Epislon: 0.010
 2022-10-26 09:46:57 - r - INFO: - Episode: 18/100, Reward: -112.00: Epislon: 0.010
 2022-10-26 09:46:57 - r - INFO: - Episode: 19/100, Reward: -276.00: Epislon: 0.010
 2022-10-26 09:46:57 - r - INFO: - Episode: 20/100, Reward: -148.00: Epislon: 0.010
 2022-10-26 09:46:58 - r - INFO: - Episode: 21/100, Reward: -201.00: Epislon: 0.010
 2022-10-26 09:46:58 - r - INFO: - Episode: 22/100, Reward: -173.00: Epislon: 0.010
 2022-10-26 09:46:58 - r - INFO: - Episode: 23/100, Reward: -226.00: Epislon: 0.010
 2022-10-26 09:46:59 - r - INFO: - Episode: 24/100, Reward: -154.00: Epislon: 0.010
 2022-10-26 09:46:59 - r - INFO: - Episode: 25/100, Reward: -269.00: Epislon: 0.010
 2022-10-26 09:46:59 - r - INFO: - Episode: 26/100, Reward: -191.00: Epislon: 0.010
 2022-10-26 09:47:00 - r - INFO: - Episode: 27/100, Reward: -177.00: Epislon: 0.010
 2022-10-26 09:47:00 - r - INFO: - Episode: 28/100, Reward: -209.00: Epislon: 0.010
 2022-10-26 09:47:00 - r - INFO: - Episode: 29/100, Reward: -116.00: Epislon: 0.010
 2022-10-26 09:47:00 - r - INFO: - Episode: 30/100, Reward: -117.00: Epislon: 0.010
 2022-10-26 09:47:01 - r - INFO: - Episode: 31/100, Reward: -121.00: Epislon: 0.010
 2022-10-26 09:47:01 - r - INFO: - Episode: 32/100, Reward: -208.00: Epislon: 0.010
 2022-10-26 09:47:01 - r - INFO: - Episode: 33/100, Reward: -147.00: Epislon: 0.010
 2022-10-26 09:47:02 - r - INFO: - Episode: 34/100, Reward: -104.00: Epislon: 0.010
 2022-10-26 09:47:02 - r - INFO: - Episode: 35/100, Reward: -161.00: Epislon: 0.010
 2022-10-26 09:47:02 - r - INFO: - Episode: 36/100, Reward: -144.00: Epislon: 0.010
 2022-10-26 09:47:02 - r - INFO: - Episode: 37/100, Reward: -131.00: Epislon: 0.010
 2022-10-26 09:47:03 - r - INFO: - Episode: 38/100, Reward: -226.00: Epislon: 0.010
 2022-10-26 09:47:03 - r - INFO: - Episode: 39/100, Reward: -117.00: Epislon: 0.010
 2022-10-26 09:47:03 - r - INFO: - Episode: 40/100, Reward: -344.00: Epislon: 0.010
 2022-10-26 09:47:04 - r - INFO: - Episode: 41/100, Reward: -123.00: Epislon: 0.010
 2022-10-26 09:47:04 - r - INFO: - Episode: 42/100, Reward: -232.00: Epislon: 0.010
 2022-10-26 09:47:04 - r - INFO: - Episode: 43/100, Reward: -190.00: Epislon: 0.010
 2022-10-26 09:47:05 - r - INFO: - Episode: 44/100, Reward: -176.00: Epislon: 0.010
 2022-10-26 09:47:05 - r - INFO: - Episode: 45/100, Reward: -139.00: Epislon: 0.010
 2022-10-26 09:47:06 - r - INFO: - Episode: 46/100, Reward: -410.00: Epislon: 0.010
 2022-10-26 09:47:06 - r - INFO: - Episode: 47/100, Reward: -115.00: Epislon: 0.010
 2022-10-26 09:47:06 - r - INFO: - Episode: 48/100, Reward: -118.00: Epislon: 0.010
 2022-10-26 09:47:06 - r - INFO: - Episode: 49/100, Reward: -113.00: Epislon: 0.010
 2022-10-26 09:47:07 - r - INFO: - Episode: 50/100, Reward: -355.00: Epislon: 0.010
 2022-10-26 09:47:07 - r - INFO: - Episode: 51/100, Reward: -110.00: Epislon: 0.010
 2022-10-26 09:47:07 - r - INFO: - Episode: 52/100, Reward: -148.00: Epislon: 0.010
 2022-10-26 09:47:08 - r - INFO: - Episode: 53/100, Reward: -135.00: Epislon: 0.010
 2022-10-26 09:47:08 - r - INFO: - Episode: 54/100, Reward: -220.00: Epislon: 0.010
 2022-10-26 09:47:08 - r - INFO: - Episode: 55/100, Reward: -157.00: Epislon: 0.010
 2022-10-26 09:47:09 - r - INFO: - Episode: 56/100, Reward: -130.00: Epislon: 0.010
 2022-10-26 09:47:09 - r - INFO: - Episode: 57/100, Reward: -150.00: Epislon: 0.010
 2022-10-26 09:47:09 - r - INFO: - Episode: 58/100, Reward: -254.00: Epislon: 0.010
 2022-10-26 09:47:10 - r - INFO: - Episode: 59/100, Reward: -148.00: Epislon: 0.010
 2022-10-26 09:47:10 - r - INFO: - Episode: 60/100, Reward: -108.00: Epislon: 0.010
 2022-10-26 09:47:10 - r - INFO: - Episode: 61/100, Reward: -152.00: Epislon: 0.010
 2022-10-26 09:47:10 - r - INFO: - Episode: 62/100, Reward: -107.00: Epislon: 0.010
 2022-10-26 09:47:10 - r - INFO: - Episode: 63/100, Reward: -110.00: Epislon: 0.010
 2022-10-26 09:47:11 - r - INFO: - Episode: 64/100, Reward: -266.00: Epislon: 0.010
 2022-10-26 09:47:11 - r - INFO: - Episode: 65/100, Reward: -344.00: Epislon: 0.010
 2022-10-26 09:47:12 - r - INFO: - Episode: 66/100, Reward: -93.00: Epislon: 0.010
 2022-10-26 09:47:12 - r - INFO: - Episode: 67/100, Reward: -113.00: Epislon: 0.010
 2022-10-26 09:47:12 - r - INFO: - Episode: 68/100, Reward: -191.00: Epislon: 0.010
 2022-10-26 09:47:12 - r - INFO: - Episode: 69/100, Reward: -102.00: Epislon: 0.010
 2022-10-26 09:47:13 - r - INFO: - Episode: 70/100, Reward: -187.00: Epislon: 0.010
 2022-10-26 09:47:13 - r - INFO: - Episode: 71/100, Reward: -158.00: Epislon: 0.010
 2022-10-26 09:47:13 - r - INFO: - Episode: 72/100, Reward: -166.00: Epislon: 0.010
 2022-10-26 09:47:14 - r - INFO: - Episode: 73/100, Reward: -202.00: Epislon: 0.010
 2022-10-26 09:47:14 - r - INFO: - Episode: 74/100, Reward: -179.00: Epislon: 0.010
 2022-10-26 09:47:14 - r - INFO: - Episode: 75/100, Reward: -150.00: Epislon: 0.010
 2022-10-26 09:47:14 - r - INFO: - Episode: 76/100, Reward: -170.00: Epislon: 0.010
 2022-10-26 09:47:15 - r - INFO: - Episode: 77/100, Reward: -149.00: Epislon: 0.010
 2022-10-26 09:47:15 - r - INFO: - Episode: 78/100, Reward: -119.00: Epislon: 0.010
 2022-10-26 09:47:15 - r - INFO: - Episode: 79/100, Reward: -115.00: Epislon: 0.010
 2022-10-26 09:47:15 - r - INFO: - Episode: 80/100, Reward: -97.00: Epislon: 0.010
 2022-10-26 09:47:16 - r - INFO: - Episode: 81/100, Reward: -153.00: Epislon: 0.010
 2022-10-26 09:47:16 - r - INFO: - Episode: 82/100, Reward: -97.00: Epislon: 0.010
 2022-10-26 09:47:16 - r - INFO: - Episode: 83/100, Reward: -211.00: Epislon: 0.010
 2022-10-26 09:47:16 - r - INFO: - Episode: 84/100, Reward: -195.00: Epislon: 0.010
 2022-10-26 09:47:17 - r - INFO: - Episode: 85/100, Reward: -125.00: Epislon: 0.010
 2022-10-26 09:47:17 - r - INFO: - Episode: 86/100, Reward: -155.00: Epislon: 0.010
 2022-10-26 09:47:17 - r - INFO: - Episode: 87/100, Reward: -151.00: Epislon: 0.010
 2022-10-26 09:47:18 - r - INFO: - Episode: 88/100, Reward: -194.00: Epislon: 0.010
 2022-10-26 09:47:18 - r - INFO: - Episode: 89/100, Reward: -188.00: Epislon: 0.010
 2022-10-26 09:47:18 - r - INFO: - Episode: 90/100, Reward: -195.00: Epislon: 0.010
 2022-10-26 09:47:19 - r - INFO: - Episode: 91/100, Reward: -141.00: Epislon: 0.010
 2022-10-26 09:47:19 - r - INFO: - Episode: 92/100, Reward: -132.00: Epislon: 0.010
 2022-10-26 09:47:19 - r - INFO: - Episode: 93/100, Reward: -127.00: Epislon: 0.010
 2022-10-26 09:47:19 - r - INFO: - Episode: 94/100, Reward: -195.00: Epislon: 0.010
 2022-10-26 09:47:20 - r - INFO: - Episode: 95/100, Reward: -152.00: Epislon: 0.010
 2022-10-26 09:47:20 - r - INFO: - Episode: 96/100, Reward: -145.00: Epislon: 0.010
 2022-10-26 09:47:20 - r - INFO: - Episode: 97/100, Reward: -123.00: Epislon: 0.010
 2022-10-26 09:47:20 - r - INFO: - Episode: 98/100, Reward: -176.00: Epislon: 0.010
 2022-10-26 09:47:21 - r - INFO: - Episode: 99/100, Reward: -180.00: Epislon: 0.010
 2022-10-26 09:47:21 - r - INFO: - Episode: 100/100, Reward: -124.00: Epislon: 0.010
 2022-10-26 09:47:21 - r - INFO: - Finish training!
@@ -0,0 +1,101 @@
 episodes,rewards,steps
 0,-861.0,862
 1,-252.0,253
 2,-196.0,197
 3,-390.0,391
 4,-371.0,372
 5,-237.0,238
 6,-227.0,228
 7,-228.0,229
 8,-305.0,306
 9,-234.0,235
 10,-204.0,205
 11,-277.0,278
 12,-148.0,149
 13,-372.0,373
 14,-273.0,274
 15,-105.0,106
 16,-79.0,80
 17,-112.0,113
 18,-276.0,277
 19,-148.0,149
 20,-201.0,202
 21,-173.0,174
 22,-226.0,227
 23,-154.0,155
 24,-269.0,270
 25,-191.0,192
 26,-177.0,178
 27,-209.0,210
 28,-116.0,117
 29,-117.0,118
 30,-121.0,122
 31,-208.0,209
 32,-147.0,148
 33,-104.0,105
 34,-161.0,162
 35,-144.0,145
 36,-131.0,132
 37,-226.0,227
 38,-117.0,118
 39,-344.0,345
 40,-123.0,124
 41,-232.0,233
 42,-190.0,191
 43,-176.0,177
 44,-139.0,140
 45,-410.0,411
 46,-115.0,116
 47,-118.0,119
 48,-113.0,114
 49,-355.0,356
 50,-110.0,111
 51,-148.0,149
 52,-135.0,136
 53,-220.0,221
 54,-157.0,158
 55,-130.0,131
 56,-150.0,151
 57,-254.0,255
 58,-148.0,149
 59,-108.0,109
 60,-152.0,153
 61,-107.0,108
 62,-110.0,111
 63,-266.0,267
 64,-344.0,345
 65,-93.0,94
 66,-113.0,114
 67,-191.0,192
 68,-102.0,103
 69,-187.0,188
 70,-158.0,159
 71,-166.0,167
 72,-202.0,203
 73,-179.0,180
 74,-150.0,151
 75,-170.0,171
 76,-149.0,150
 77,-119.0,120
 78,-115.0,116
 79,-97.0,98
 80,-153.0,154
 81,-97.0,98
 82,-211.0,212
 83,-195.0,196
 84,-125.0,126
 85,-155.0,156
 86,-151.0,152
 87,-194.0,195
 88,-188.0,189
 89,-195.0,196
 90,-141.0,142
 91,-132.0,133
 92,-127.0,128
 93,-195.0,196
 94,-152.0,153
 95,-145.0,146
 96,-123.0,124
 97,-176.0,177
 98,-180.0,181
 99,-124.0,125
@@ -0,0 +1,25 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: CartPole-v1
  eval_eps: 10
  eval_per_episode: 5
  load_checkpoint: false
  load_path: tasks
  max_steps: 200
  mode: train
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 64
  buffer_size: 100000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  hidden_dim: 256
  lr: 0.0001
  target_update: 800
@@ -0,0 +1,116 @@
 2022-10-31 00:12:01 - r - INFO: - n_states: 4, n_actions: 2
 2022-10-31 00:12:01 - r - INFO: - Start training!
 2022-10-31 00:12:01 - r - INFO: - Env: CartPole-v1, Algorithm: DQN, Device: cuda
 2022-10-31 00:12:04 - r - INFO: - Episode: 1/100, Reward: 18.0, Step: 18
 2022-10-31 00:12:04 - r - INFO: - Episode: 2/100, Reward: 35.0, Step: 35
 2022-10-31 00:12:04 - r - INFO: - Episode: 3/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:04 - r - INFO: - Episode: 4/100, Reward: 32.0, Step: 32
 2022-10-31 00:12:04 - r - INFO: - Episode: 5/100, Reward: 16.0, Step: 16
 2022-10-31 00:12:04 - r - INFO: - Current episode 5 has the best eval reward: 15.30
 2022-10-31 00:12:04 - r - INFO: - Episode: 6/100, Reward: 12.0, Step: 12
 2022-10-31 00:12:04 - r - INFO: - Episode: 7/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:04 - r - INFO: - Episode: 8/100, Reward: 15.0, Step: 15
 2022-10-31 00:12:04 - r - INFO: - Episode: 9/100, Reward: 11.0, Step: 11
 2022-10-31 00:12:04 - r - INFO: - Episode: 10/100, Reward: 15.0, Step: 15
 2022-10-31 00:12:04 - r - INFO: - Episode: 11/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:04 - r - INFO: - Episode: 12/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:04 - r - INFO: - Episode: 13/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:04 - r - INFO: - Episode: 14/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:04 - r - INFO: - Episode: 15/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:04 - r - INFO: - Episode: 16/100, Reward: 24.0, Step: 24
 2022-10-31 00:12:04 - r - INFO: - Episode: 17/100, Reward: 8.0, Step: 8
 2022-10-31 00:12:04 - r - INFO: - Episode: 18/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:04 - r - INFO: - Episode: 19/100, Reward: 11.0, Step: 11
 2022-10-31 00:12:04 - r - INFO: - Episode: 20/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:04 - r - INFO: - Episode: 21/100, Reward: 12.0, Step: 12
 2022-10-31 00:12:04 - r - INFO: - Episode: 22/100, Reward: 11.0, Step: 11
 2022-10-31 00:12:04 - r - INFO: - Episode: 23/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:04 - r - INFO: - Episode: 24/100, Reward: 21.0, Step: 21
 2022-10-31 00:12:05 - r - INFO: - Episode: 25/100, Reward: 14.0, Step: 14
 2022-10-31 00:12:05 - r - INFO: - Episode: 26/100, Reward: 12.0, Step: 12
 2022-10-31 00:12:05 - r - INFO: - Episode: 27/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:05 - r - INFO: - Episode: 28/100, Reward: 11.0, Step: 11
 2022-10-31 00:12:05 - r - INFO: - Episode: 29/100, Reward: 12.0, Step: 12
 2022-10-31 00:12:05 - r - INFO: - Episode: 30/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:05 - r - INFO: - Episode: 31/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:05 - r - INFO: - Episode: 32/100, Reward: 13.0, Step: 13
 2022-10-31 00:12:05 - r - INFO: - Episode: 33/100, Reward: 18.0, Step: 18
 2022-10-31 00:12:05 - r - INFO: - Episode: 34/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:05 - r - INFO: - Episode: 35/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:05 - r - INFO: - Episode: 36/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:05 - r - INFO: - Episode: 37/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:05 - r - INFO: - Episode: 38/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:05 - r - INFO: - Episode: 39/100, Reward: 10.0, Step: 10
 2022-10-31 00:12:05 - r - INFO: - Episode: 40/100, Reward: 8.0, Step: 8
 2022-10-31 00:12:06 - r - INFO: - Episode: 41/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:06 - r - INFO: - Episode: 42/100, Reward: 9.0, Step: 9
 2022-10-31 00:12:06 - r - INFO: - Episode: 43/100, Reward: 20.0, Step: 20
 2022-10-31 00:12:06 - r - INFO: - Episode: 44/100, Reward: 16.0, Step: 16
 2022-10-31 00:12:06 - r - INFO: - Episode: 45/100, Reward: 17.0, Step: 17
 2022-10-31 00:12:06 - r - INFO: - Current episode 45 has the best eval reward: 17.50
 2022-10-31 00:12:06 - r - INFO: - Episode: 46/100, Reward: 17.0, Step: 17
 2022-10-31 00:12:06 - r - INFO: - Episode: 47/100, Reward: 17.0, Step: 17
 2022-10-31 00:12:06 - r - INFO: - Episode: 48/100, Reward: 18.0, Step: 18
 2022-10-31 00:12:06 - r - INFO: - Episode: 49/100, Reward: 25.0, Step: 25
 2022-10-31 00:12:06 - r - INFO: - Episode: 50/100, Reward: 31.0, Step: 31
 2022-10-31 00:12:06 - r - INFO: - Current episode 50 has the best eval reward: 24.80
 2022-10-31 00:12:06 - r - INFO: - Episode: 51/100, Reward: 22.0, Step: 22
 2022-10-31 00:12:06 - r - INFO: - Episode: 52/100, Reward: 39.0, Step: 39
 2022-10-31 00:12:06 - r - INFO: - Episode: 53/100, Reward: 36.0, Step: 36
 2022-10-31 00:12:06 - r - INFO: - Episode: 54/100, Reward: 26.0, Step: 26
 2022-10-31 00:12:07 - r - INFO: - Episode: 55/100, Reward: 33.0, Step: 33
 2022-10-31 00:12:07 - r - INFO: - Current episode 55 has the best eval reward: 38.70
 2022-10-31 00:12:07 - r - INFO: - Episode: 56/100, Reward: 56.0, Step: 56
 2022-10-31 00:12:07 - r - INFO: - Episode: 57/100, Reward: 112.0, Step: 112
 2022-10-31 00:12:07 - r - INFO: - Episode: 58/100, Reward: 101.0, Step: 101
 2022-10-31 00:12:08 - r - INFO: - Episode: 59/100, Reward: 69.0, Step: 69
 2022-10-31 00:12:08 - r - INFO: - Episode: 60/100, Reward: 75.0, Step: 75
 2022-10-31 00:12:08 - r - INFO: - Episode: 61/100, Reward: 182.0, Step: 182
 2022-10-31 00:12:09 - r - INFO: - Episode: 62/100, Reward: 52.0, Step: 52
 2022-10-31 00:12:09 - r - INFO: - Episode: 63/100, Reward: 67.0, Step: 67
 2022-10-31 00:12:09 - r - INFO: - Episode: 64/100, Reward: 53.0, Step: 53
 2022-10-31 00:12:09 - r - INFO: - Episode: 65/100, Reward: 119.0, Step: 119
 2022-10-31 00:12:10 - r - INFO: - Current episode 65 has the best eval reward: 171.90
 2022-10-31 00:12:10 - r - INFO: - Episode: 66/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:10 - r - INFO: - Episode: 67/100, Reward: 74.0, Step: 74
 2022-10-31 00:12:11 - r - INFO: - Episode: 68/100, Reward: 138.0, Step: 138
 2022-10-31 00:12:11 - r - INFO: - Episode: 69/100, Reward: 149.0, Step: 149
 2022-10-31 00:12:12 - r - INFO: - Episode: 70/100, Reward: 144.0, Step: 144
 2022-10-31 00:12:12 - r - INFO: - Current episode 70 has the best eval reward: 173.70
 2022-10-31 00:12:13 - r - INFO: - Episode: 71/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:13 - r - INFO: - Episode: 72/100, Reward: 198.0, Step: 198
 2022-10-31 00:12:14 - r - INFO: - Episode: 73/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:14 - r - INFO: - Episode: 74/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:15 - r - INFO: - Episode: 75/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:16 - r - INFO: - Current episode 75 has the best eval reward: 200.00
 2022-10-31 00:12:16 - r - INFO: - Episode: 76/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:17 - r - INFO: - Episode: 77/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:17 - r - INFO: - Episode: 78/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:18 - r - INFO: - Episode: 79/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:19 - r - INFO: - Episode: 80/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:19 - r - INFO: - Current episode 80 has the best eval reward: 200.00
 2022-10-31 00:12:20 - r - INFO: - Episode: 81/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:20 - r - INFO: - Episode: 82/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:21 - r - INFO: - Episode: 83/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:21 - r - INFO: - Episode: 84/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:22 - r - INFO: - Episode: 85/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:23 - r - INFO: - Current episode 85 has the best eval reward: 200.00
 2022-10-31 00:12:23 - r - INFO: - Episode: 86/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:24 - r - INFO: - Episode: 87/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:25 - r - INFO: - Episode: 88/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:25 - r - INFO: - Episode: 89/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:26 - r - INFO: - Episode: 90/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:27 - r - INFO: - Current episode 90 has the best eval reward: 200.00
 2022-10-31 00:12:27 - r - INFO: - Episode: 91/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:28 - r - INFO: - Episode: 92/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:28 - r - INFO: - Episode: 93/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:29 - r - INFO: - Episode: 94/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:29 - r - INFO: - Episode: 95/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:30 - r - INFO: - Current episode 95 has the best eval reward: 200.00
 2022-10-31 00:12:31 - r - INFO: - Episode: 96/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:31 - r - INFO: - Episode: 97/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:32 - r - INFO: - Episode: 98/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:32 - r - INFO: - Episode: 99/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:33 - r - INFO: - Episode: 100/100, Reward: 200.0, Step: 200
 2022-10-31 00:12:33 - r - INFO: - Current episode 100 has the best eval reward: 200.00
 2022-10-31 00:12:33 - r - INFO: - Finish training!
@@ -0,0 +1,101 @@
 episodes,rewards,steps
 0,18.0,18
 1,35.0,35
 2,13.0,13
 3,32.0,32
 4,16.0,16
 5,12.0,12
 6,13.0,13
 7,15.0,15
 8,11.0,11
 9,15.0,15
 10,9.0,9
 11,13.0,13
 12,13.0,13
 13,10.0,10
 14,9.0,9
 15,24.0,24
 16,8.0,8
 17,10.0,10
 18,11.0,11
 19,13.0,13
 20,12.0,12
 21,11.0,11
 22,9.0,9
 23,21.0,21
 24,14.0,14
 25,12.0,12
 26,9.0,9
 27,11.0,11
 28,12.0,12
 29,13.0,13
 30,10.0,10
 31,13.0,13
 32,18.0,18
 33,9.0,9
 34,10.0,10
 35,9.0,9
 36,10.0,10
 37,10.0,10
 38,10.0,10
 39,8.0,8
 40,9.0,9
 41,9.0,9
 42,20.0,20
 43,16.0,16
 44,17.0,17
 45,17.0,17
 46,17.0,17
 47,18.0,18
 48,25.0,25
 49,31.0,31
 50,22.0,22
 51,39.0,39
 52,36.0,36
 53,26.0,26
 54,33.0,33
 55,56.0,56
 56,112.0,112
 57,101.0,101
 58,69.0,69
 59,75.0,75
 60,182.0,182
 61,52.0,52
 62,67.0,67
 63,53.0,53
 64,119.0,119
 65,200.0,200
 66,74.0,74
 67,138.0,138
 68,149.0,149
 69,144.0,144
 70,200.0,200
 71,198.0,198
 72,200.0,200
 73,200.0,200
 74,200.0,200
 75,200.0,200
 76,200.0,200
 77,200.0,200
 78,200.0,200
 79,200.0,200
 80,200.0,200
 81,200.0,200
 82,200.0,200
 83,200.0,200
 84,200.0,200
 85,200.0,200
 86,200.0,200
 87,200.0,200
 88,200.0,200
 89,200.0,200
 90,200.0,200
 91,200.0,200
 92,200.0,200
 93,200.0,200
 94,200.0,200
 95,200.0,200
 96,200.0,200
 97,200.0,200
 98,200.0,200
 99,200.0,200
@@ -0,0 +1,22 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: Acrobot-v1
  mode: test
  load_checkpoint: true
  load_path: Train_Acrobot-v1_DQN_20221026-094645
  max_steps: 100000
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 128
  buffer_size: 200000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  lr: 0.002 
  target_update: 4
@@ -0,0 +1,22 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: Acrobot-v1
  mode: train
  load_checkpoint: false
  load_path: Train_CartPole-v1_DQN_20221026-054757
  max_steps: 100000
  save_fig: true
  seed: 1
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 128
  buffer_size: 200000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  lr: 0.002 
  target_update: 4
@@ -0,0 +1,22 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: CartPole-v1
  mode: test
  load_checkpoint: true
  load_path: Train_CartPole-v1_DQN_20221031-001201
  max_steps: 200
  save_fig: true
  seed: 0
  show_fig: false
  test_eps: 10
  train_eps: 100
 algo_cfg:
  batch_size: 64
  buffer_size: 100000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  lr: 0.0001
  target_update: 4
@@ -0,0 +1,22 @@
 general_cfg:
  algo_name: DQN
  device: cuda
  env_name: CartPole-v1
  mode: train
  load_checkpoint: false
  load_path: Train_CartPole-v1_DQN_20221026-054757
  max_steps: 200
  save_fig: true
  seed: 0
  show_fig: false
  test_eps: 10
  train_eps: 200
 algo_cfg:
  batch_size: 64
  buffer_size: 100000
  epsilon_decay: 500
  epsilon_end: 0.01
  epsilon_start: 0.95
  gamma: 0.95
  lr: 0.0001
  target_update: 4
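Note on the config layout: the YAML files above split settings into `general_cfg` and `algo_cfg` sections, and the `dqn.py` hunk in this commit switches from `cfg['gamma']` to `cfg.gamma`. A minimal sketch of how such sections could be merged into one attribute-style object is shown here. The names `default_cfg` and `merge_cfg` are hypothetical and are not taken from the repository, and the overrides are written as a plain dict instead of being parsed from YAML.

```python
from types import SimpleNamespace


def default_cfg() -> SimpleNamespace:
    # Hypothetical defaults, mirroring a few fields from the YAML files above.
    return SimpleNamespace(env_name="CartPole-v1", mode="train", max_steps=200,
                           gamma=0.95, lr=0.0001, batch_size=64)


def merge_cfg(defaults: SimpleNamespace, overrides: dict) -> SimpleNamespace:
    """Return a copy of `defaults` with both config sections applied on top."""
    merged = SimpleNamespace(**vars(defaults))
    for section in ("general_cfg", "algo_cfg"):
        for key, value in overrides.get(section, {}).items():
            setattr(merged, key, value)
    return merged


# Overrides in the same two-section shape as the YAML files.
overrides = {
    "general_cfg": {"env_name": "Acrobot-v1", "max_steps": 100000},
    "algo_cfg": {"lr": 0.002, "batch_size": 128},
}
cfg = merge_cfg(default_cfg(), overrides)
print(cfg.env_name, cfg.lr, cfg.gamma)
```

Fields missing from the overrides keep their default values, so `cfg.gamma` stays at 0.95 here.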
@@ -0,0 +1,38 @@
 #!/usr/bin/env python
 # coding=utf-8
 '''
 Author: JiangJi
 Email: johnjim0816@gmail.com
 Date: 2022-10-30 00:37:33
 LastEditor: JiangJi
 LastEditTime: 2022-10-31 00:11:57
 Description: default parameters of DQN
 '''
 from common.config import GeneralConfig,AlgoConfig
 class GeneralConfigDQN(GeneralConfig):
    def __init__(self) -> None:
        self.env_name = "CartPole-v1" # name of environment
        self.algo_name = "DQN" # name of algorithm
        self.mode = "train" # train or test
        self.seed = 1 # random seed
        self.device = "cuda" # device to use
        self.train_eps = 100 # number of episodes for training
        self.test_eps = 10 # number of episodes for testing
        self.max_steps = 200 # max steps for each episode
        self.load_checkpoint = False
        self.load_path = "tasks" # path to load model
        self.show_fig = False # show figure or not
        self.save_fig = True # save figure or not
 class AlgoConfigDQN(AlgoConfig):
    def __init__(self) -> None:
        # setting epsilon_start equal to epsilon_end gives a fixed epsilon
        self.epsilon_start = 0.95 # epsilon start value
        self.epsilon_end = 0.01 # epsilon end value
        self.epsilon_decay = 500 # epsilon decay constant in samples (larger means slower decay)
        self.hidden_dim = 256 # hidden_dim for MLP
        self.gamma = 0.95 # discount factor
        self.lr = 0.0001 # learning rate
        self.buffer_size = 100000 # size of replay buffer
        self.batch_size = 64 # batch size
        self.target_update = 800 # target network update interval, in steps
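Note on the epsilon parameters: `dqn.py` computes epsilon as `epsilon_end + (epsilon_start - epsilon_end) * exp(-sample_count / epsilon_decay)`. The sketch below evaluates that schedule with the default values from this config file. The helper name `epsilon_at` is an assumption made for illustration and is not a function in the repository.

```python
import math


def epsilon_at(sample_count: int,
               epsilon_start: float = 0.95,
               epsilon_end: float = 0.01,
               epsilon_decay: float = 500) -> float:
    """Exponentially decayed epsilon after `sample_count` sampled actions."""
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-sample_count / epsilon_decay)


# Epsilon starts at epsilon_start and approaches epsilon_end as samples accumulate.
for n in (0, 500, 2000, 10000):
    print(n, round(epsilon_at(n), 4))
```

With `epsilon_decay = 500`, the gap between the current epsilon and `epsilon_end` shrinks by a factor of e every 500 sampled actions, so exploration is mostly over after a few thousand steps.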
@@ -5,7 +5,7 @@
@Email: johnjim0816@gmail.com
@Date: 2020-06-12 00:50:49
@LastEditor: John
-LastEditTime: 2022-08-29 23:30:08
+LastEditTime: 2022-10-31 00:07:19
@Description: 
@Environment: python 3.7.7
 '''
@@ -22,27 +22,28 @@ import numpy as np
 class DQN:
    def __init__(self,model,memory,cfg):
-        self.n_actions = cfg['n_actions']  
+        self.n_actions = cfg.n_actions  
-        self.device = torch.device(cfg['device']) 
+        self.device = torch.device(cfg.device) 
-        self.gamma = cfg['gamma']  
+        self.gamma = cfg.gamma  
        ## e-greedy parameters
        self.sample_count = 0  # sample count for epsilon decay
-        self.epsilon = cfg['epsilon_start']
+        self.epsilon = cfg.epsilon_start
        self.sample_count = 0  
-        self.epsilon_start = cfg['epsilon_start']
+        self.epsilon_start = cfg.epsilon_start
-        self.epsilon_end = cfg['epsilon_end']
+        self.epsilon_end = cfg.epsilon_end
-        self.epsilon_decay = cfg['epsilon_decay']
+        self.epsilon_decay = cfg.epsilon_decay
-        self.batch_size = cfg['batch_size']
+        self.batch_size = cfg.batch_size
        self.target_update = cfg.target_update
        self.policy_net = model.to(self.device)
        self.target_net = model.to(self.device) # NOTE: nn.Module.to() returns the same module, so target_net aliases policy_net here
        ## copy parameters from policy net to target net
        for target_param, param in zip(self.target_net.parameters(),self.policy_net.parameters()): 
            target_param.data.copy_(param.data)
        # self.target_net.load_state_dict(self.policy_net.state_dict()) # or use this to copy parameters
-        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg['lr']) 
+        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=cfg.lr) 
        self.memory = memory 
        self.update_flag = False 
-
+        
    def sample_action(self, state):
        ''' sample action with e-greedy policy
        '''
@@ -58,6 +59,21 @@ class DQN:
        else:
            action = random.randrange(self.n_actions)
        return action
    # @torch.no_grad()
    # def sample_action(self, state):
    #     ''' sample action with e-greedy policy
    #     '''
    #     self.sample_count += 1
    #     # epsilon must decay(linear,exponential and etc.) for balancing exploration and exploitation
    #     self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
    #         math.exp(-1. * self.sample_count / self.epsilon_decay) 
    #     if random.random() > self.epsilon:
    #         state = torch.tensor(state, device=self.device, dtype=torch.float32).unsqueeze(dim=0)
    #         q_values = self.policy_net(state)
    #         action = q_values.max(1)[1].item() # choose action corresponding to the maximum q value
    #     else:
    #         action = random.randrange(self.n_actions)
    #     return action
    def predict_action(self,state):
        ''' predict action
        '''
@@ -99,14 +115,16 @@ class DQN:
        for param in self.policy_net.parameters():  
            param.grad.data.clamp_(-1, 1)
        self.optimizer.step() 
        if self.sample_count % self.target_update == 0: # target net update, target_update is "C" in the pseudocode
            self.target_net.load_state_dict(self.policy_net.state_dict())   
-    def save_model(self, path):
+    def save_model(self, fpath):
        from pathlib import Path
        # create path
-        Path(path).mkdir(parents=True, exist_ok=True)
+        Path(fpath).mkdir(parents=True, exist_ok=True)
-        torch.save(self.target_net.state_dict(), f"{path}/checkpoint.pt")
+        torch.save(self.target_net.state_dict(), f"{fpath}/checkpoint.pt")
-    def load_model(self, path):
+    def load_model(self, fpath):
-        self.target_net.load_state_dict(torch.load(f"{path}/checkpoint.pt"))
+        self.target_net.load_state_dict(torch.load(f"{fpath}/checkpoint.pt"))
        for target_param, param in zip(self.target_net.parameters(), self.policy_net.parameters()):
            param.data.copy_(target_param.data)
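Review note on `__init__`: `policy_net` and `target_net` are both assigned from `model.to(self.device)`. Because `nn.Module.to()` returns the module it was called on, the two names would refer to one network, and the periodic `load_state_dict` copy would have no effect. The sketch below illustrates the aliasing without PyTorch. `TinyModel` is a hypothetical stand-in whose `to` mimics that return-self behavior, and `copy.deepcopy` is one possible remedy, not necessarily the fix the author intends.

```python
import copy


class TinyModel:
    """Hypothetical stand-in for a network; `to` returns self, like nn.Module.to()."""

    def __init__(self) -> None:
        self.weights = [0.0, 0.0]

    def to(self, device: str) -> "TinyModel":
        return self


model = TinyModel()
policy_net = model.to("cpu")
aliased_target = model.to("cpu")                   # same object as policy_net
separate_target = copy.deepcopy(model).to("cpu")   # independent copy of the weights

policy_net.weights[0] = 1.0  # simulate one optimizer step on the policy network

print(aliased_target is policy_net, aliased_target.weights)
print(separate_target is policy_net, separate_target.weights)
```

Only the deep-copied target keeps its old weights after the simulated update, which is the property a DQN target network relies on between synchronizations.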
@@ -1 +0,0 @@
 {"algo_name": "DQN", "env_name": "Acrobot-v1", "train_eps": 100, "test_eps": 20, "gamma": 0.95, "epsilon_start": 0.95, "epsilon_end": 0.01, "epsilon_decay": 1500, "lr": 0.002, "memory_capacity": 200000, "batch_size": 128, "target_update": 4, "hidden_dim": 256, "device": "cuda", "seed": 10, "show_fig": false, "save_fig": true, "result_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/Acrobot-v1/20220824-124401/results", "model_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/Acrobot-v1/20220824-124401/models", "n_states": 6, "n_actions": 3}
@@ -1,21 +0,0 @@
 episodes,rewards
 0,-79.0
 1,-113.0
 2,-81.0
 3,-132.0
 4,-110.0
 5,-114.0
 6,-80.0
 7,-101.0
 8,-78.0
 9,-91.0
 10,-107.0
 11,-87.0
 12,-105.0
 13,-91.0
 14,-128.0
 15,-132.0
 16,-119.0
 17,-77.0
 18,-89.0
 19,-134.0
@@ -1,101 +0,0 @@
 episodes,rewards
 0,-500.0
 1,-500.0
 2,-500.0
 3,-370.0
 4,-449.0
 5,-500.0
 6,-312.0
 7,-374.0
 8,-180.0
 9,-154.0
 10,-137.0
 11,-185.0
 12,-135.0
 13,-302.0
 14,-146.0
 15,-137.0
 16,-119.0
 17,-149.0
 18,-217.0
 19,-191.0
 20,-157.0
 21,-166.0
 22,-138.0
 23,-135.0
 24,-182.0
 25,-130.0
 26,-175.0
 27,-222.0
 28,-133.0
 29,-108.0
 30,-250.0
 31,-119.0
 32,-135.0
 33,-148.0
 34,-194.0
 35,-194.0
 36,-186.0
 37,-131.0
 38,-185.0
 39,-79.0
 40,-129.0
 41,-271.0
 42,-117.0
 43,-159.0
 44,-156.0
 45,-117.0
 46,-158.0
 47,-153.0
 48,-119.0
 49,-164.0
 50,-134.0
 51,-231.0
 52,-117.0
 53,-119.0
 54,-136.0
 55,-173.0
 56,-202.0
 57,-133.0
 58,-142.0
 59,-169.0
 60,-137.0
 61,-123.0
 62,-205.0
 63,-107.0
 64,-194.0
 65,-150.0
 66,-143.0
 67,-218.0
 68,-145.0
 69,-90.0
 70,-107.0
 71,-169.0
 72,-125.0
 73,-142.0
 74,-145.0
 75,-94.0
 76,-150.0
 77,-134.0
 78,-159.0
 79,-137.0
 80,-146.0
 81,-191.0
 82,-242.0
 83,-117.0
 84,-92.0
 85,-193.0
 86,-239.0
 87,-173.0
 88,-140.0
 89,-157.0
 90,-133.0
 91,-148.0
 92,-87.0
 93,-398.0
 94,-98.0
 95,-121.0
 96,-102.0
 97,-120.0
 98,-195.0
 99,-219.0
@@ -1,21 +0,0 @@
 {
    "algo_name": "DQN",
    "env_name": "CartPole-v0",
    "train_eps": 200,
    "test_eps": 20,
    "gamma": 0.95,
    "epsilon_start": 0.95,
    "epsilon_end": 0.01,
    "epsilon_decay": 500,
    "lr": 0.0001,
    "memory_capacity": 100000,
    "batch_size": 64,
    "target_update": 4,
    "hidden_dim": 256,
    "device": "cpu",
    "seed": 10,
    "result_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/results",
    "model_path": "C:\\Users\\jiangji\\Desktop\\rl-tutorials\\codes\\DQN/outputs/CartPole-v0/20220823-173936/models",
    "show_fig": false,
    "save_fig": true
 }
@@ -1,201 +0,0 @@
 episodes,rewards
 0,38.0
 1,16.0
 2,37.0
 3,15.0
 4,22.0
 5,34.0
 6,20.0
 7,12.0
 8,16.0
 9,14.0
 10,13.0
 11,21.0
 12,14.0
 13,12.0
 14,17.0
 15,12.0
 16,10.0
 17,14.0
 18,10.0
 19,10.0
 20,16.0
 21,9.0
 22,14.0
 23,13.0
 24,10.0
 25,9.0
 26,12.0
 27,12.0
 28,14.0
 29,11.0
 30,9.0
 31,8.0
 32,9.0
 33,11.0
 34,12.0
 35,10.0
 36,11.0
 37,10.0
 38,10.0
 39,18.0
 40,13.0
 41,15.0
 42,10.0
 43,9.0
 44,14.0
 45,14.0
 46,23.0
 47,17.0
 48,15.0
 49,15.0
 50,20.0
 51,28.0
 52,36.0
 53,36.0
 54,23.0
 55,27.0
 56,53.0
 57,19.0
 58,35.0
 59,62.0
 60,57.0
 61,38.0
 62,61.0
 63,65.0
 64,58.0
 65,43.0
 66,67.0
 67,56.0
 68,91.0
 69,128.0
 70,71.0
 71,126.0
 72,100.0
 73,200.0
 74,200.0
 75,200.0
 76,200.0
 77,200.0
 78,200.0
 79,200.0
 80,200.0
 81,200.0
 82,200.0
 83,200.0
 84,200.0
 85,200.0
 86,200.0
 87,200.0
 88,200.0
 89,200.0
 90,200.0
 91,200.0
 92,200.0
 93,200.0
 94,200.0
 95,200.0
 96,200.0
 97,200.0
 98,200.0
 99,200.0
 100,200.0
 101,200.0
 102,200.0
 103,200.0
 104,200.0
 105,200.0
 106,200.0
 107,200.0
 108,200.0
 109,200.0
 110,200.0
 111,200.0
 112,200.0
 113,200.0
 114,200.0
 115,200.0
 116,200.0
 117,200.0
 118,200.0
 119,200.0
 120,200.0
 121,200.0
 122,200.0
 123,200.0
 124,200.0
 125,200.0
 126,200.0
 127,200.0
 128,200.0
 129,200.0
 130,200.0
 131,200.0
 132,200.0
 133,200.0
 134,200.0
 135,200.0
 136,200.0
 137,200.0
 138,200.0
 139,200.0
 140,200.0
 141,200.0
 142,200.0
 143,200.0
 144,200.0
 145,200.0
 146,200.0
 147,200.0
 148,200.0
 149,200.0
 150,200.0
 151,200.0
 152,200.0
 153,200.0
 154,200.0
 155,200.0
 156,200.0
 157,200.0
 158,200.0
 159,200.0
 160,200.0
 161,200.0
 162,200.0
 163,200.0
 164,200.0
 165,200.0
 166,200.0
 167,200.0
 168,200.0
 169,200.0
 170,200.0
 171,200.0
 172,200.0
 173,200.0
 174,200.0
 175,200.0
 176,200.0
 177,200.0
 178,200.0
 179,200.0
 180,200.0
 181,200.0
 182,200.0
 183,200.0
 184,200.0
 185,200.0
 186,200.0
 187,200.0
 188,200.0
 189,200.0
 190,200.0
 191,200.0
 192,200.0
 193,200.0
 194,200.0
 195,200.0
 196,200.0
 197,200.0
 198,200.0
 199,200.0
@@ -1 +0,0 @@
 {"algo_name": "A2C", "env_name": "CartPole-v0", "train_eps": 1600, "test_eps": 20, "ep_max_steps": 100000, "gamma": 0.99, "actor_lr": 0.0003, "critic_lr": 0.001, "actor_hidden_dim": 256, "critic_hidden_dim": 256, "device": "cpu", "seed": 10, "show_fig": false, "save_fig": true, "result_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-143327/results/", "model_path": "/Users/jj/Desktop/rl-tutorials/codes/A2C/outputs/CartPole-v0/20220829-143327/models/", "n_states": 4, "n_actions": 2}