initialize repository

2020-07-03 10:24:42 +08:00
parent 9b25d0a1a1
commit ed244a426d
11 changed files with 37 additions and 24 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -1 +1,18 @@
-# LeeRL-notes
+# Hung-yi Lee Reinforcement Learning Notes (LeeRL-notes)
+
+
+
+# Main Contributors
+
+- [@qiwang067](https://github.com/qiwang067)
+
+
+
+# Follow Us
+
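+<div align=center><img src="https://raw.githubusercontent.com/datawhalechina/pumpkin-book/master/res/qrcode.jpeg" width = "250" height = "270" alt="Datawhale, a learning community focused on AI. Its founding aim is to be for the learner and to grow together with learners. Several thousand people have joined the study community so far, which organizes study content across many fields, including machine learning, deep learning, data analysis, data mining, web crawling, programming, statistics, MySQL, and data competitions. Search for the official account Datawhale on WeChat to join us."></div>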
+
+
+
+
+
--- a/docs/_sidebar.md
+++ b/docs/_sidebar.md
@@ -1,12 +1,12 @@
 - Contents
- - [P1 Introduction to Machine Learning](chapter1/chapter1.md)
- - [P2 Why Learn Machine Learning](chapter2/chapter2.md)
- - [P3 Regression](chapter3/chapter3.md)
- - [P4 Regression: Demo](chapter4/chapter4.md)
- - [P5 Where Does the Error Come From?](chapter5/chapter5.md)
- - [P6 Gradient Descent](chapter6/chapter6.md)
- - [P7 Gradient Descent (Demo with AOE)](chapter7/chapter7.md)
- - [P8 Gradient Descent (Demo with Minecraft)](chapter8/chapter8.md)
+ - [P1 Policy Gradient](chapter1/chapter1.md)
+ - [P2 Proximal Policy Optimization (PPO)](chapter2/chapter2.md)
+ - [P3 Q-learning (Basic Idea)](chapter3/chapter3.md)
+ - [P4 Q-learning (Advanced Tips)](chapter4/chapter4.md)
+ - [P5 Q-learning (Continuous Action)](chapter5/chapter5.md)
+ - [P6 Actor-Critic](chapter6/chapter6.md)
+ - [P7 Sparse Reward](chapter7/chapter7.md)
+ - [P8 Imitation Learning](chapter8/chapter8.md)



--- a/docs/chapter1/chapter1.md
+++ b/docs/chapter1/chapter1.md
@@ -1,4 +1,3 @@
-[toc]
 # Policy Gradient
 ##  Policy Gradient

--- a/docs/chapter2/chapter2.md
+++ b/docs/chapter2/chapter2.md
@@ -1,4 +1,3 @@
-[toc]
 # PPO
 ## On-policy and Off-policy
 Before discussing PPO, let us first look at the difference between two training approaches: on-policy and off-policy.
@@ -84,7 +83,7 @@ $$

 We use the actor $\theta$ to sample $s_t$ and $a_t$, that is, a state-action pair, and compute the advantage of this pair, which measures how good it is. $A^{\theta}\left(s_{t}, a_{t}\right)$ is the accumulated reward minus a bias, and this term is estimated. What it estimates is whether taking action $a_t$ in state $s_t$ is good or bad. It is then multiplied by $\nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)$: if $A^{\theta}\left(s_{t}, a_{t}\right)$ is positive, the probability is increased; if it is negative, the probability is decreased.

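-Now we use importance sampling to turn on-policy into off-policy, going from $\theta$ to $\theta'$. So $s_t$ and $a_t$ are now data sampled by $\theta'$, a different actor, interacting with the environment, but the parameters being adjusted in training belong to the model $\theta$. Because $\theta'$ and $\theta$ are different models, a correction term is needed. This correction term comes from importance sampling: the probability of sampling $s_t$, $a_t$ under $\theta$ divided by the probability of sampling $s_t, $$a_t$ under $\theta'$.
+Now we use importance sampling to turn on-policy into off-policy, going from $\theta$ to $\theta'$. So $s_t$ and $a_t$ are now data sampled by $\theta'$, a different actor, interacting with the environment, but the parameters being adjusted in training belong to the model $\theta$. Because $\theta'$ and $\theta$ are different models, a correction term is needed. This correction term comes from importance sampling: the probability of sampling $s_t$, $a_t$ under $\theta$ divided by the probability of sampling $s_t$, $a_t$ under $\theta'$.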

 $$
 =E_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{P_{\theta}\left(s_{t}, a_{t}\right)}{P_{\theta^{\prime}}\left(s_{t}, a_{t}\right)} A^{\theta}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right]
--- a/docs/chapter3/chapter3.md
+++ b/docs/chapter3/chapter3.md
@@ -1,5 +1,3 @@
-[toc]
-
 # Q-learning

 ## Q-learning
--- a/docs/chapter4/chapter4.md
+++ b/docs/chapter4/chapter4.md
@@ -1,4 +1,3 @@
-[toc]
 # Tips of Q-learning
 ## Double DQN
 ![](img/4.1.png)
--- a/docs/chapter5/chapter5.md
+++ b/docs/chapter5/chapter5.md
@@ -1,5 +1,3 @@
-
-[toc]
 # Q-learning for Continuous Actions

 ![](img/5.1.png)
--- a/docs/chapter6/chapter6.md
+++ b/docs/chapter6/chapter6.md
@@ -1,5 +1,3 @@
-[toc]
-
 # Actor-Critic

 ## Actor-Critic
--- a/docs/chapter7/chapter7.md
+++ b/docs/chapter7/chapter7.md
@@ -1,4 +1,3 @@
-[toc]
 # Sparse Reward 
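 In practice, when we train an agent with reinforcement learning, most of the time the agent receives no reward at all, and training without reward is very hard for the agent. For example, suppose you want to train a robot arm, with a screw and a screwdriver on the table, to drive the screw in with the screwdriver. This is hard. Why? At the start the agent knows nothing, and the only reason it takes different actions is exploration. In Q-learning, for instance, some randomness lets it try actions it has not taken before. But randomly picking up the screwdriver and then driving the screw in, which is what earns reward 1, is practically never going to happen. So whatever the actor does, its reward is always 0; to it every action is equally bad, or equally good, and in the end it learns nothing. When the reward in the environment is very sparse, the reinforcement learning problem becomes very difficult. Humans, however, can learn from very sparse rewards: for most of our lives we simply exist without receiving any reward or penalty, yet we still take all kinds of actions. A truly capable AI should therefore be able to learn how to interact with the environment even when the reward is sparse.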

--- a/docs/chapter8/chapter8.md
+++ b/docs/chapter8/chapter8.md
@@ -1,4 +1,3 @@
-[toc]
 # Imitation Learning 
 ![](img/8.1.png)
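 Imitation learning addresses the question of what to do when we do not even have a reward. Imitation learning is also called `learning by demonstration` or `apprenticeship learning`. In imitation learning you have some expert demonstrations. The machine can still interact with the environment, but it cannot obtain any reward from it; it can only learn what is good and what is bad by watching the expert's demonstrations. In fact, in most cases we cannot get a clear reward from the environment. Board games and video games do have a very clear reward, but most tasks do not. Take a chat-bot: when a machine chats with a person, you cannot give an explicit reward for what counts as a good or a bad conversation. So for many tasks there is simply no way to specify a reward.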
--- a/docs/index.html
+++ b/docs/index.html
@@ -2,7 +2,7 @@
 <html lang="en">
 <head>
  <meta charset="UTF-8">
-  <title>Document</title>
+  <title>LeeRL-Notes</title>
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
  <meta name="description" content="Description">
  <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
@@ -13,10 +13,17 @@
  <script>
    window.$docsify = {
      name: '',
-      repo: '',
-      loadSidebar: true
+      repo: 'LeeRL-Notes',
+      loadSidebar: true,
+      subMaxLevel: 3
    }
  </script>
+  <!-- CDN files for docsify-katex -->
+  <script src="//cdn.jsdelivr.net/npm/docsify-katex@latest/dist/docsify-katex.js"></script>
+  <!-- or <script src="//cdn.jsdelivr.net/gh/upupming/docsify-katex/dist/docsify-katex.js"></script> -->
+  <link rel="stylesheet" href="//cdn.jsdelivr.net/npm/katex@latest/dist/katex.min.css">
+
+  <!-- Put them above docsify.min.js -->
  <script src="//cdn.jsdelivr.net/npm/docsify/lib/docsify.min.js"></script>
 </body>
 </html>
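
Note on the `docs/chapter2/chapter2.md` hunk: the corrected sentence describes reweighting data sampled by $\theta'$ with the ratio of probabilities under $\theta$ and $\theta'$. The sketch below is not part of this commit. It illustrates that ratio on a deliberately simplified case with a single state and three discrete actions. The names `p_theta`, `p_theta_prime`, `advantages`, and `importance_weighted_mean`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weighted_mean(values, target_probs, behavior_probs, n_samples, rng):
    """Estimate E_{a ~ target}[values[a]] from actions sampled under behavior."""
    actions = list(range(len(values)))
    sampled = rng.choices(actions, weights=behavior_probs, k=n_samples)
    total = 0.0
    for a in sampled:
        # Correction term: probability under theta over probability under theta'.
        ratio = target_probs[a] / behavior_probs[a]
        total += ratio * values[a]
    return total / n_samples


rng = random.Random(0)
advantages = [1.0, -0.5, 2.0]              # made-up advantage per action
p_theta = softmax([0.2, 0.1, 0.7])         # policy being trained (assumed values)
p_theta_prime = softmax([0.5, 0.5, 0.0])   # policy that collected the data (assumed values)

# Exact on-policy expectation, for comparison with the off-policy estimate.
exact = sum(p * a for p, a in zip(p_theta, advantages))
estimate = importance_weighted_mean(advantages, p_theta, p_theta_prime, 100_000, rng)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

With enough samples the reweighted average should land close to the exact expectation under `p_theta`, even though no action was sampled from `p_theta`. The full expression in the notes also carries the $\nabla \log p_{\theta}$ factor and a ratio over state-action pairs, both of which this sketch omits.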