Combining this with the REINFORCE principle, the pseudocode is as follows:
<img src="assets/image-20211016004808604.png" alt="image-20211016004808604" style="zoom:50%;" />
https://pytorch.org/docs/stable/distributions.html
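As a small illustration of the `torch.distributions` API linked above, a policy network's output logits can be wrapped in a `Categorical` distribution to sample an action and get its differentiable log-probability (the fixed logits here are purely illustrative):

```python
import torch
from torch.distributions import Categorical

# In practice these logits would come from the policy network's forward pass.
logits = torch.tensor([1.0, 2.0, 0.5])
dist = Categorical(logits=logits)

action = dist.sample()            # sample an action index from pi(.|s)
log_prob = dist.log_prob(action)  # log pi(a|s), differentiable w.r.t. the logits

print(action.item(), log_prob.item())
```

The `log_prob` tensors collected this way during a rollout are what the REINFORCE loss below is built from.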
The reason for the negative sign: the formula calls for gradient *ascent* on the objective, while a loss is usually minimized with stochastic gradient descent, so the sign is flipped for consistency (minimizing the negated objective is equivalent to maximizing the original one).
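A one-line sketch of this sign convention (variable names are illustrative): minimizing `-log_prob * G` with SGD performs gradient ascent on `log_prob * G`.

```python
import torch

log_prob = torch.tensor(-1.2, requires_grad=True)  # log pi(a|s) from the policy
G = 5.0                                            # return (reward-to-go)

loss = -log_prob * G  # negate so that descent on the loss is ascent on the objective
loss.backward()

# d(loss)/d(log_prob) = -G, so an SGD step *increases* log_prob, as desired.
print(log_prob.grad)  # tensor(-5.)
```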
## Implementation
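The original implementation is not reproduced here; the following is a minimal sketch of the per-episode REINFORCE loss, matching the pseudocode above. The discount factor, the return normalization, and the function name are assumptions for illustration, not the author's code:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE loss for one episode.

    log_probs: list of log pi(a_t|s_t) tensors collected during the rollout.
    rewards:   list of scalar rewards r_t.
    gamma:     discount factor (0.99 is an assumed, typical value).
    """
    # Compute discounted returns G_t backwards through the episode.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick (optional).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Negative sign: descent on this loss is ascent on sum_t G_t * log pi(a_t|s_t).
    return -(torch.stack(log_probs) * returns).sum()
```

In a training loop one would roll out an episode with the policy (sampling actions from a `Categorical` over its outputs and recording each `log_prob`), call `reinforce_loss`, then the usual `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`.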
## References
[REINFORCE and the Reparameterization Trick](https://blog.csdn.net/JohnJim0/article/details/110230703)