Update 第二章 Transformer架构.md

add pre-norm
2025-06-23 11:02:23 +08:00
parent 5f2ccc44bf
commit 98a122e323
1 changed files with 2 additions and 0 deletions
--- a/docs/chapter2/第二章
+++ b/docs/chapter2/第二章
@@ -777,6 +777,8 @@ class PositionalEncoding(nn.Module):
  <p>Figure 2.7 Transformer model structure</p>
 </div>
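 Note, however, that the figure above is taken from the original paper "Attention is all you need", where the LayerNorm layer is placed after the Attention layer (the "Post-Norm" structure), whereas in the source code released with the paper the LayerNorm layer is placed before the Attention layer (the "Pre-Norm" structure). Since current LLMs generally adopt the "Pre-Norm" structure, which tends to make the training loss more stable, this chapter uses the "Pre-Norm" structure in its implementation.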
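
 The difference between the two orderings can be illustrated with a minimal sketch. The class names `PostNormBlock` and `PreNormBlock` and the `nn.Linear` stand-in for the attention sublayer are assumptions made for illustration only; they are not the chapter's actual components.

```python
import torch
import torch.nn as nn


class PostNormBlock(nn.Module):
    """Post-Norm: y = LayerNorm(x + Sublayer(x)), as drawn in the paper's figure."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalization is applied after the residual addition.
        return self.norm(x + self.sublayer(x))


class PreNormBlock(nn.Module):
    """Pre-Norm: y = x + Sublayer(LayerNorm(x))."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalization is applied to the sublayer input only;
        # the residual path carries x through unchanged.
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
dim = 8
x = torch.randn(2, 4, dim)          # (batch, seq_len, dim)
sublayer = nn.Linear(dim, dim)      # hypothetical stand-in for attention or MLP

post = PostNormBlock(dim, sublayer)
pre = PreNormBlock(dim, sublayer)
y_post = post(x)
y_pre = pre(x)
```

 In the Pre-Norm variant the residual branch is an identity path from input to output, which is commonly given as the reason deep stacks of such blocks train more stably.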
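 As the figure shows, the output of the tokenizer mapping is first encoded by the Embedding layer and the Positional Embedding layer, then passes through the N Encoder layers and N Decoder layers covered in the previous section (N = 6 in the original Transformer), and finally goes through a linear layer and a Softmax layer to produce the final output.
 Building on the components implemented earlier, we now implement the complete Transformer model: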