add MonteCarlo

2021-03-12 16:21:09 +08:00
parent 47390be0cf
commit f4a37178d2
10 changed files with 196 additions and 0 deletions
--- a/codes/MonteCarlo/Figure_1.png
+++ b/codes/MonteCarlo/Figure_1.png
--- a/codes/MonteCarlo/README.md
+++ b/codes/MonteCarlo/README.md
@@ -0,0 +1,44 @@
+# *On-Policy First-Visit MC Control*
+
+## The Racetrack Environment
+We have implemented a custom environment called "Racetrack" for you to use during this piece of coursework. It is inspired by the environment described in the course textbook (Reinforcement Learning, Sutton & Barto, 2018, Exercise 5.12), but is not exactly the same.
+
+### Environment Description
+Consider driving a race car around a turn on a racetrack. In order to complete the race as quickly as possible, you would want to drive as fast as you can but, to avoid running off the track, you must slow down while turning.
+
+In our simplified racetrack environment, the agent is at one of a discrete set of grid positions. The agent also has a discrete speed in two directions, $x$ and $y$. So the state is represented as follows:
+$$(\text{position}_y, \text{position}_x, \text{velocity}_y, \text{velocity}_x)$$
+
+The agent collects a reward of -1 at each time step, an additional -10 for leaving the track (i.e., ending up on a black grid square in the figure below), and an additional +10 for reaching the finish line (any of the red grid squares). The agent starts each episode in a randomly selected grid-square on the starting line (green grid squares) with a speed of zero in both directions. At each time step, the agent can change its speed in both directions. Each speed can be changed by +1, -1 or 0, giving a total of nine actions. For example, the agent may change its speed in the $x$ direction by -1 and its speed in the $y$ direction by +1. The agent's speed cannot be greater than +10 or less than -10 in either direction.
+
+![track_big](assets/track_big.png)
+
+
+The agent's next state is determined by its current grid square, its current speed in two directions, and the changes it makes to its speed in the two directions. This environment is stochastic. When the agent tries to change its speed, no change occurs (in either direction) with probability 0.2. In other words, 20% of the time, the agent's action is ignored and the car's speed remains the same in both directions.
+
+If the agent leaves the track, it is returned to a random start grid-square and has its speed set to zero in both directions; the episode continues. An episode ends only when the agent transitions to a goal grid-square.
+
+
+
+### Environment Implementation
+We have implemented the above environment in the `racetrack_env.py` file, for you to use in this coursework. Please use this implementation instead of writing your own, and please do not modify the environment.
+
+We provide a `RacetrackEnv` class for your agents to interact with. The class has the following methods:
+- **`reset()`** - this method initialises the environment, chooses a random starting state, and returns it. This method should be called before the start of every episode.
+- **`step(action)`** - this method takes an integer action (more on this later), and executes one time-step in the environment. It returns a tuple containing the next state, the reward collected, and whether the next state is a terminal state.
+- **`render(sleep_time)`** - this method renders a matplotlib graph representing the environment. It takes an optional float parameter giving the number of seconds to display each time-step. This method is useful for testing and debugging, but should not be used during training since it is *very* slow. **Do not use this method in your final submission**.
+- **`get_actions()`** - a simple method that returns the available actions in the current state. Always returns a list containing integers in the range [0-8] (more on this later).
+
+In our code, states are represented as Python tuples - specifically a tuple of four integers. For example, if the agent is in a grid square with coordinates ($Y = 2$, $X = 3$), and is moving zero cells vertically and one cell horizontally per time-step, the state is represented as `(2, 3, 0, 1)`. Tuples of this kind will be returned by the `reset()` and `step(action)` methods.
+
+There are nine actions available to the agent in each state, as described above. However, to simplify your code, we have represented each of the nine actions as an integer in the range [0-8]. The table below shows the index of each action, along with the corresponding changes it will cause to the agent's speed in each direction.
+
+<img src="assets/action_grid.png" alt="action_grid" style="zoom:50%;" />
+
+For example, taking action 8 will increase the agent's speed in the $x$ direction, but decrease its speed in the $y$ direction.
+
+## Introduction to First-Visit MC
+
+Pseudocode:
+
+![mc_control_algo](assets/mc_control_algo.png)
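The README's speed rules (changes of -1, 0 or +1 per direction, a 0.2 chance that the action is ignored, and a speed limit of 10 in each direction) can be illustrated with a minimal sketch. `next_velocity` is a hypothetical helper, not part of `racetrack_env.py`, and clipping at the limit is an assumption about how the limit is enforced; the README only states that the limit exists.

```python
import random

MAX_SPEED = 10      # speed limit per direction, as stated in the README
IGNORE_PROB = 0.2   # chance that the chosen action has no effect, as stated in the README


def next_velocity(velocity, delta, rng=random):
    """Return the new (velocity_y, velocity_x) after one speed change.

    `velocity` and `delta` are (y, x) tuples; each component of `delta`
    is expected to be -1, 0 or +1. Hypothetical helper for illustration.
    """
    if rng.random() < IGNORE_PROB:
        # The action is ignored: speed is unchanged in both directions.
        return velocity

    def clip(v):
        # Assumption: speeds beyond the limit are clipped to the limit.
        return max(-MAX_SPEED, min(MAX_SPEED, v))

    return (clip(velocity[0] + delta[0]), clip(velocity[1] + delta[1]))
```

Passing an object with a `random()` method as `rng` (for example a seeded `random.Random`) makes the outcome reproducible.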
--- a/codes/MonteCarlo/agent.py
+++ b/codes/MonteCarlo/agent.py
@@ -0,0 +1,64 @@
+#!/usr/bin/env python
+# coding=utf-8
+'''
+Author: John
+Email: johnjim0816@gmail.com
+Date: 2021-03-12 16:14:34
+LastEditor: John
+LastEditTime: 2021-03-12 16:15:12
+Description: 
+Environment: 
+'''
+import numpy as np
+from collections import defaultdict
+import torch
+
+class FisrtVisitMC:
+    ''' On-Policy First-Visit MC Control
+    '''
+    def __init__(self,n_actions,cfg):
+        self.n_actions = n_actions
+        self.epsilon = cfg.epsilon
+        self.gamma = cfg.gamma 
+        self.Q = defaultdict(lambda: np.zeros(n_actions))
+        self.returns_sum = defaultdict(float) # sum of returns
+        self.returns_count = defaultdict(float)
+        
+    def choose_action(self,state):
+        ''' epsilon-greedy policy '''
+        best_action = np.argmax(self.Q[state])
+        # action = best_action
+        action_probs = np.ones(self.n_actions, dtype=float) * self.epsilon / self.n_actions
+        action_probs[best_action] += (1.0 - self.epsilon)
+        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
+        return action
+    def update(self,one_ep_transition):
+        # Find all (state, action) pairs visited in this episode
+        # Each state is converted to a tuple so that it can be used as a dict key
+        sa_in_episode = set([(tuple(x[0]), x[1]) for x in one_ep_transition])
+        for state, action in sa_in_episode:
+            sa_pair = (state, action)
+            # Find the first occurrence of the (state, action) pair in the episode
+            first_occurrence_idx = next(i for i,x in enumerate(one_ep_transition)
+                                        if tuple(x[0]) == state and x[1] == action)
+            # Discounted sum of all rewards from the first occurrence onwards
+            G = sum([x[2]*(self.gamma**i) for i,x in enumerate(one_ep_transition[first_occurrence_idx:])])
+            # Average the return for this (state, action) pair over all sampled episodes
+            self.returns_sum[sa_pair] += G
+            self.returns_count[sa_pair] += 1.0
+            self.Q[state][action] = self.returns_sum[sa_pair] / self.returns_count[sa_pair]
+    def save(self,path):
+        '''Save the Q-table to a file
+        '''
+        import dill
+        torch.save(
+            obj=self.Q,
+            f=path,
+            pickle_module=dill
+        )
+
+    def load(self, path):
+        '''Load the Q-table from a file
+        '''
+        import dill
+        self.Q =torch.load(f=path,pickle_module=dill)
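The `update` method above recomputes the discounted sum from scratch for every (state, action) pair. As an alternative sketch (not the author's implementation), the same first-visit returns can be collected in a single backward pass over the episode. `first_visit_returns` is a hypothetical name; it assumes the episode is a list of `(state, action, reward)` tuples, as built in `main.py`.

```python
def first_visit_returns(episode, gamma):
    """Map each (state, action) pair to the discounted return that follows
    its first occurrence in `episode`, a list of (state, action, reward)."""
    returns = {}
    G = 0.0
    # Walk the episode backwards so that G_t = r_t + gamma * G_{t+1}.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # Earlier occurrences overwrite later ones, so the value left in
        # the dict at the end belongs to the first visit.
        returns[(tuple(state), action)] = G
    return returns
```

Each returned value could then be added to `returns_sum` and averaged exactly as `update` does, at a cost linear in the episode length.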
--- a/codes/MonteCarlo/assets/action_grid.png
+++ b/codes/MonteCarlo/assets/action_grid.png
--- a/codes/MonteCarlo/assets/mc_control_algo.png
+++ b/codes/MonteCarlo/assets/mc_control_algo.png
--- a/codes/MonteCarlo/assets/track_big.png
+++ b/codes/MonteCarlo/assets/track_big.png
--- a/codes/MonteCarlo/main.py
+++ b/codes/MonteCarlo/main.py
@@ -0,0 +1,88 @@
+#!/usr/bin/env python
+# coding=utf-8
+'''
+Author: John
+Email: johnjim0816@gmail.com
+Date: 2021-03-11 14:26:44
+LastEditor: John
+LastEditTime: 2021-03-12 16:15:46
+Description: 
+Environment: 
+'''
+import sys,os
+sys.path.append(os.getcwd())
+import argparse
+import datetime
+
+from envs.racetrack_env import RacetrackEnv
+from MonteCarlo.agent import FisrtVisitMC
+from common.plot import plot_rewards
+from common.utils import save_results
+
+SEQUENCE = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") # get the current time
+SAVED_MODEL_PATH = os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"+SEQUENCE+'/' # path for saving the model
+if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/"): # check whether the folder exists
+    os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/saved_model/")
+if not os.path.exists(SAVED_MODEL_PATH): # check whether the folder exists
+    os.mkdir(SAVED_MODEL_PATH)
+RESULT_PATH = os.path.split(os.path.abspath(__file__))[0]+"/results/"+SEQUENCE+'/' # path for storing rewards
+if not os.path.exists(os.path.split(os.path.abspath(__file__))[0]+"/results/"): # check whether the folder exists
+    os.mkdir(os.path.split(os.path.abspath(__file__))[0]+"/results/")
+if not os.path.exists(RESULT_PATH): # check whether the folder exists
+    os.mkdir(RESULT_PATH)
+
+class MCConfig:
+    def __init__(self): 
+        self.epsilon = 0.15 # epsilon: probability of selecting a random action
+        self.gamma = 0.9 # gamma: discount factor
+        self.n_episodes = 300
+        self.n_steps = 2000
+
+def get_mc_args():
+    '''set parameters
+    '''
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--epsilon", default=0.15, type=float)  # epsilon: probability of selecting a random action, a float between 0 and 1
+    parser.add_argument("--gamma", default=0.9, type=float)  # gamma: discount factor
+    parser.add_argument("--n_episodes", default=150, type=int)
+    parser.add_argument("--n_steps", default=2000, type=int)
+    mc_cfg = parser.parse_args()
+    return mc_cfg
+
+
+
+def mc_train(cfg,env,agent):
+    rewards = []
+    ma_rewards = [] # moving average rewards
+    for i_episode in range(cfg.n_episodes):
+        one_ep_transition = []
+        state = env.reset()
+        ep_reward = 0
+        # while True:
+        for t in range(cfg.n_steps):
+            action = agent.choose_action(state)
+            next_state, reward, done = env.step(action)
+            ep_reward+=reward
+            one_ep_transition.append((state, action, reward))
+            state = next_state
+            if done:
+                break
+        rewards.append(ep_reward)
+        if ma_rewards:
+            ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
+        else:
+            ma_rewards.append(ep_reward)
+        agent.update(one_ep_transition)
+        if (i_episode+1)%10==0:
+            print("Episode:{}/{}: Reward:{}".format(i_episode+1, cfg.n_episodes,ep_reward))
+    return rewards,ma_rewards
+if __name__ == "__main__":
+    mc_cfg = MCConfig()
+    env = RacetrackEnv()
+    n_actions=9
+    agent = FisrtVisitMC(n_actions,mc_cfg)
+    rewards,ma_rewards= mc_train(mc_cfg,env,agent)
+    save_results(rewards,ma_rewards,tag='train',path=RESULT_PATH)
+    plot_rewards(rewards,ma_rewards,tag="train",algo = "On-Policy First-Visit MC Control",path=RESULT_PATH)
+    
+
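The `ma_rewards` list in `mc_train` is an exponentially weighted moving average of the episode rewards: the first entry is the raw reward, and each later entry keeps 90% of the previous average and adds 10% of the new reward. A minimal standalone sketch of that smoothing follows; `smooth` and its `weight` parameter are hypothetical names used only for illustration.

```python
def smooth(rewards, weight=0.9):
    """Return the exponentially weighted moving average of `rewards`.

    `weight` is the share of the previous average that is kept;
    0.9 matches the constant used in mc_train.
    """
    smoothed = []
    for reward in rewards:
        if smoothed:
            smoothed.append(smoothed[-1] * weight + reward * (1 - weight))
        else:
            # The first episode has no history, so it is used as-is.
            smoothed.append(reward)
    return smoothed
```

A higher `weight` gives a flatter curve that reacts more slowly to changes in episode reward.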
--- a/codes/MonteCarlo/results/20210312-161601/ma_rewards_train.npy
+++ b/codes/MonteCarlo/results/20210312-161601/ma_rewards_train.npy
--- a/codes/MonteCarlo/results/20210312-161601/rewards_curve_train.png
+++ b/codes/MonteCarlo/results/20210312-161601/rewards_curve_train.png
--- a/codes/MonteCarlo/results/20210312-161601/rewards_train.npy
+++ b/codes/MonteCarlo/results/20210312-161601/rewards_train.npy