docs(chapter5): 更新模型文档并添加数据处理脚本

- 更新LLaMA2模型文档，修正图片引用和编号 - 添加Attention结构示意图 - 新增数据处理脚本download_dataset.sh和deal_dataset.py - 优化文档中的代码示例说明
2025-06-18 16:26:33 +08:00
parent ada2e0c44f
commit ce535629ca
4 changed files with 47 additions and 29 deletions
@@ -1,32 +1,24 @@
-import os 
+import os
 from tqdm import tqdm
 import json
 from tqdm import tqdm
-# 设置环境变量
+# pretrain_data 为运行download_dataset.sh时，下载的pretrain_data本地路径
-os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
+pretrain_data = 'your local pretrain_data'
-
+output_pretrain_data = 'seq_monkey_datawhale.jsonl'
 # 下载预训练数据集
 os.system("modelscope download --dataset ddzhu123/seq-monkey mobvoi_seq_monkey_general_open_corpus.jsonl.tar.bz2 --local_dir your_local_dir")
 # 解压预训练数据集
 os.system("tar -xvf your_local_dir/mobvoi_seq_monkey_general_open_corpus.jsonl.tar.bz2 -C your_local_dir")
 # 下载SFT数据集
 os.system(f'huggingface-cli download --repo-type dataset --resume-download BelleGroup/train_3.5M_CN --local-dir BelleGroup')
 # sft_data 为运行download_dataset.sh时，下载的sft_data本地路径
 sft_data = 'your local sft_data'
 output_sft_data = 'BelleGroup_sft.jsonl'
 # 1 处理预训练数据
 def split_text(text, chunk_size=512):
    """将文本按指定长度切分成块"""
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
-input_file = 'mobvoi_seq_monkey_general_open_corpus.jsonl'
+with open(output_pretrain_data, 'a', encoding='utf-8') as pretrain:
-
+    with open(pretrain_data, 'r', encoding='utf-8') as f:
 with open('seq_monkey_datawhale.jsonl', 'a', encoding='utf-8') as pretrain:
    with open(input_file, 'r', encoding='utf-8') as f:
        data = f.readlines()
-        for line in tqdm(data, desc=f"Processing lines in {input_file}", leave=False):  # 添加行级别的进度条
+        for line in tqdm(data, desc=f"Processing lines in {pretrain_data}", leave=False):  # 添加行级别的进度条
            line = json.loads(line)
            text = line['text']
            chunks = split_text(text)
@@ -34,7 +26,6 @@ with open('seq_monkey_datawhale.jsonl', 'a', encoding='utf-8') as pretrain:
                pretrain.write(json.dumps({'text': chunk}, ensure_ascii=False) + '\n')
 # 2 处理SFT数据
 def convert_message(data):
    """
    将原始数据转换为标准格式
@@ -49,10 +40,10 @@ def convert_message(data):
            message.append({'role': 'assistant', 'content': item['value']})
    return message
-with open('BelleGroup_sft.jsonl', 'a', encoding='utf-8') as sft:
+with open(output_sft_data, 'a', encoding='utf-8') as sft:
-    with open('BelleGroup/train_3.5M_CN.json', 'r') as f:
+    with open(sft_data, 'r') as f:
        data = f.readlines()
        for item in tqdm(data, desc="Processing", unit="lines"):
            item = json.loads(item)
            message = convert_message(item['conversations'])
-            sft.write(json.dumps(message, ensure_ascii=False) + '\n')
+            sft.write(json.dumps(message, ensure_ascii=False) + '\n')
@@ -0,0 +1,20 @@
 #!/bin/bash
 # 设置环境变量
 export HF_ENDPOINT=https://hf-mirror.com
 # dataset dir 下载到本地目录
 dataset_dir="your local dataset dir"
 # 下载预训练数据集
 modelscope download --dataset ddzhu123/seq-monkey mobvoi_seq_monkey_general_open_corpus.jsonl.tar.bz2 --local_dir ${dataset_dir}
 # 解压预训练数据集
 tar -xvf "${dataset_dir}/mobvoi_seq_monkey_general_open_corpus.jsonl.tar.bz2" -C "${dataset_dir}"
 # 下载SFT数据集
 huggingface-cli download \
  --repo-type dataset \
  --resume-download \
  BelleGroup/train_3.5M_CN \
  --local-dir "${dataset_dir}/BelleGroup"
@@ -4,11 +4,11 @@
 Meta（原Facebook）于2023年2月发布第一款基于Transformer结构的大型语言模型LLaMA，并于同年7月发布同系列模型LLaMA2。我们在第四章已经学习了解的了LLM，记忆如何训练LLM等等。那本小节我们就来学习，如何动手写一个LLaMA2模型。
-LLaMA2 模型结构如下图5.0所示：
+LLaMA2 模型结构如下图5.1所示：
 <div align='center'>
    <img src="https://raw.githubusercontent.com/datawhalechina/happy-llm/main/docs/images/5-images/LLama2.png" alt="alt text" width="100%">
-    <p>图 5.0 LLaMA2结构</p>
+    <p>图 5.1 LLaMA2结构</p>
 </div>
 ### 5.1.1 定义超参数
@@ -51,6 +51,8 @@ class ModelConfig(PretrainedConfig):
        super().__init__(**kwargs)
 ```
 > 在以下代码中出现 `args` 时，即默认为以上 `ModelConfig` 参数配置。
 我们来看一下其中的一些超参数的含义，比如`dim`是模型维度，`n_layers`是Transformer的层数，`n_heads`是注意力机制的头数，`vocab_size`是词汇表大小，`max_seq_len`是输入的最大序列长度等等。上面的代码中也对每一个参数做了详细的注释，在后面的代码中我们会根据这些超参数来构建我们的模型。
 ### 5.1.2 构建 RMSNorm
@@ -111,6 +113,11 @@ torch.Size([1, 50, 768])
 在 LLaMA2 模型中，虽然只有 LLaMA2-70B模型使用了分组查询注意力机制（Grouped-Query Attention，GQA），但我们依然选择使用 GQA 来构建我们的 LLaMA Attention 模块，它可以提高模型的效率，并节省一些显存占用。
 <div align='center'>
    <img src="https://raw.githubusercontent.com/datawhalechina/happy-llm/main/docs/images/5-images/Attention.png" alt="alt text" width="100%">
    <p>图 5.2 LLaMA2 Attention 结构</p>
 </div>
 #### 5.1.3.1 repeat_kv
 在 LLaMA2 模型中，我们需要将键和值的维度扩展到和查询的维度一样，这样才能进行注意力计算。我们可以通过如下代码实现`repeat_kv`：
@@ -1330,11 +1337,11 @@ class PretrainDataset(Dataset):
        return torch.from_numpy(X), torch.from_numpy(Y), torch.from_numpy(loss_mask)
 ```
-在以上代码和图5.1可以看出，`Pretrain Dataset` 主要是将 `text` 通过 `tokenizer` 转换成 `input_id`，然后将 `input_id` 拆分成 `X` 和 `Y`，其中 `X` 为 `input_id` 的前 n-1 个元素，`Y` 为 `input_id` 的后 n-1 `个元素。loss_mask` 主要是用来标记哪些位置需要计算损失，哪些位置不需要计算损失。
+在以上代码和图5.3可以看出，`Pretrain Dataset` 主要是将 `text` 通过 `tokenizer` 转换成 `input_id`，然后将 `input_id` 拆分成 `X` 和 `Y`，其中 `X` 为 `input_id` 的前 n-1 个元素，`Y` 为 `input_id` 的后 n-1 `个元素。loss_mask` 主要是用来标记哪些位置需要计算损失，哪些位置不需要计算损失。
 <div align='center'>
    <img src="https://raw.githubusercontent.com/datawhalechina/happy-llm/main/docs/images/5-images/pretrain_dataset.png" alt="alt text" width="100%">
-    <p>图5.1 预训练损失函数计算</p>
+    <p>图5.3 预训练损失函数计算</p>
 </div>
 图中示例展示了当`max_length=9`时的处理过程：
@@ -1417,11 +1424,11 @@ class SFTDataset(Dataset):
        return torch.from_numpy(X), torch.from_numpy(Y), torch.from_numpy(loss_mask)
 ```
-在 SFT 阶段，这里使用的是多轮对话数据集，所以就需要区分哪些位置需要计算损失，哪些位置不需要计算损失。在上面的代码中，我使用了一个 `generate_loss_mask` 函数来生成 `loss_mask`。这个函数主要是用来生成 `loss_mask`，其中 `loss_mask` 的生成规则是：当遇到 `|<im_start|>assistant\n` 时，就开始计算损失，直到遇到 `|<im_end|>` 为止。这样就可以保证我们的模型在 SFT 阶段只计算当前轮的对话内容，如图5.2所示。
+在 SFT 阶段，这里使用的是多轮对话数据集，所以就需要区分哪些位置需要计算损失，哪些位置不需要计算损失。在上面的代码中，我使用了一个 `generate_loss_mask` 函数来生成 `loss_mask`。这个函数主要是用来生成 `loss_mask`，其中 `loss_mask` 的生成规则是：当遇到 `|<im_start|>assistant\n` 时，就开始计算损失，直到遇到 `|<im_end|>` 为止。这样就可以保证我们的模型在 SFT 阶段只计算当前轮的对话内容，如图5.4所示。
 <div align='center'>
    <img src="https://raw.githubusercontent.com/datawhalechina/happy-llm/main/docs/images/5-images/sftdataset.png" alt="alt text" width="90%">
-    <p>图5.2 SFT 损失函数计算</p>
+    <p>图5.4 SFT 损失函数计算</p>
 </div>
 可以看到，其实 SFT Dataset 和 Pretrain Dataset 的 `X` 和 `Y` 是一样的，只是在 SFT Dataset 中我们需要生成一个 `loss_mask` 来标记哪些位置需要计算损失，哪些位置不需要计算损失。 图中 `Input ids` 中的蓝色小方格就是AI的回答，所以是需要模型学习的地方。所以在 `loss_mask` 中，蓝色小方格对应的位置是黄色，其他位置是灰色。在代码 `loss_mask` 中的 1 对应的位置计算损失，0 对应的位置不计算损失。