update 5.3 transformers pretrain

KMnO4-zx
2024-10-15 11:01:59 +08:00
parent 9e6d8a3f77
commit 86c76cb09b


@@ -1,4 +1,4 @@
# 5.3 Pretraining a Small LLM
In the previous chapters, we became familiar with the architectures of various large models and with how to train a Tokenizer. In this section, we will get hands-on and train a small LLM ourselves.
@@ -289,7 +289,7 @@ class Task:
Finally, we also define a `Task` class dedicated to iterating over the dataset and producing the inputs and target outputs the model needs. This design keeps the data pipeline cleanly connected and provides standardized inputs for training. You can test the preprocessed dataset with the following code.
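In essence, such an iterator yields batches in which the target sequence is the input shifted by one token. The sketch below illustrates the idea; it is a hypothetical stand-in (names and signature are assumptions), not the repository's actual `Task` implementation:
```python
import torch

def iter_batches(token_ids, batch_size, max_seq_len, device="cpu"):
    """Hypothetical sketch of a Task-style iterator for next-token prediction."""
    max_start = len(token_ids) - max_seq_len - 1
    while True:
        # pick a random starting offset for every sequence in the batch
        starts = torch.randint(0, max_start, (batch_size,)).tolist()
        x = torch.tensor([token_ids[s:s + max_seq_len] for s in starts])          # model input
        y = torch.tensor([token_ids[s + 1:s + max_seq_len + 1] for s in starts])  # target: input shifted by one
        yield x.to(device), y.to(device)
```
Each `(x, y)` pair is exactly what a causal language model trains on: the prediction at position `t` is compared against `y[:, t]`, i.e. the token that follows `x[:, t]`.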
## 5.3.3 Training the Model
Once data preprocessing is complete, we can start training the model. The model is a decoder-only Transformer with the same structure as Llama2, implemented in PyTorch. The relevant code is in the `model.py` file and is not repeated here; the source is thoroughly commented, and we covered it in detail in earlier chapters.
@@ -480,6 +480,104 @@ class TextGenerator:
return generated_texts  # return the generated text samples
```
Finally, let's look at the model's output:
```
python sample.py --prompt "One day, Lily met a Shoggoth"
OUTPUT:
One day, Lily met a Shoggoth named Rold. She loved to play with blocks and make new shapes. But her mom said, "Time to put your blocks away, Lily." Lily did not want to stop building, but she knew she had to.
As Lily started to put the blocks in the right place, she noticed that a big, sad dog was coming. The dog barked and wagged its tail. Lily knew the dog wanted to help her. She picked up a softpheck with her hands and gave the dog a soft hug.
Lily gave the dog a little kiss and put the blocks back in the box. The dog was happy to be with itself again. In the end, Lily and the dog became good friends, and they played with blocks together every day. And they never separated the yard again. Once upon a time, there was a kind and generous man named Tom. He lived in a big house with many houses and "sacing." One day, a mean man came to the house and started to bury rocks and hills out to make a big mess.
```
## 5.3.5 Pretraining an LLM with the transformers Library
You can also use the transformers library for the pretraining process. Pretraining differs from supervised fine-tuning (SFT) in that pretraining is done on unlabeled data, while supervised training uses labeled data. In other words, during pretraining every token is part of the label, whereas during supervised training no loss is computed on the input portion.
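To make the difference concrete, here is a small illustrative sketch (the token ids are made up): in pretraining the labels are simply a copy of the input ids, while in SFT the prompt tokens are replaced with `-100`, the index that the cross-entropy loss in transformers ignores.
```python
# Illustrative only: how labels differ between pretraining and SFT (made-up token ids).
prompt_ids = [101, 2054, 2003]          # tokenized instruction / input
response_ids = [2023, 1037, 3231, 102]  # tokenized answer

# Pretraining: every token is a label, so the model learns to predict each next token.
pretrain_input  = prompt_ids + response_ids
pretrain_labels = list(pretrain_input)

# SFT: the prompt part is masked with -100 so it does not contribute to the loss.
sft_input  = prompt_ids + response_ids
sft_labels = [-100] * len(prompt_ids) + response_ids
```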
We use the Qwen2.5-0.5b model for pretraining; choosing a small model makes it easy to reproduce locally. The dataset is still the TinyStories dataset described above. The following code runs pretraining with the transformers library:
> Note: be sure to replace the model paths in the code below with your local model path.
```python
from datasets import load_dataset  # Hugging Face datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer
import glob  # glob module for handling file paths
import warnings  # used to suppress warnings
warnings.filterwarnings("ignore")  # ignore all warning messages

# Data processing function: append an end-of-text token to each story and tokenize it
def process_func(examples):
    contents = [example + tokenizer.eos_token for example in examples["story"]]  # append eos_token to every story
    return tokenizer(contents, max_length=512, truncation=True)  # tokenize, truncating to a maximum length of 512

# Load the pretrained Qwen model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/Qwen2.5-0.5b")  # replace with your local Qwen2.5-0.5b path
model = AutoModelForCausalLM.from_pretrained("path/to/Qwen2.5-0.5b")

# Find all JSON files under the dataset path
data_paths = glob.glob('/root/code/tiny-llm/data/TinyStories_all_data/')  # find all TinyStories data directories
json_files = []
for data_path in data_paths:
    json_files += glob.glob(data_path + '*.json')  # collect every JSON file in each directory

# Load the JSON files as a dataset with Hugging Face datasets
dataset = load_dataset("json", data_files=json_files, split='train')  # load the JSON data as the training split
tokenized_ds = dataset.map(process_func, batched=True, remove_columns=dataset.column_names)  # tokenize in batches and drop the original columns

# Training arguments
args = TrainingArguments(
    output_dir="/data/qwen_pretrain",  # output directory
    per_device_train_batch_size=4,     # batch size of 4 per device
    gradient_accumulation_steps=8,     # accumulate gradients over 8 steps
    logging_steps=100,                 # log every 100 steps
    num_train_epochs=1,                # train for 1 epoch
    fp16=True,                         # use FP16 for speed
    save_steps=1000,                   # save a checkpoint every 1000 steps
)

# Train with Trainer
trainer = Trainer(
    args=args,                # training arguments
    model=model,              # model
    tokenizer=tokenizer,      # tokenizer
    train_dataset=tokenized_ds,  # tokenized training dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM data collator (not an MLM task)
)
trainer.train()  # start training
```
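If you want to sanity-check what the Trainer actually sees: `DataCollatorForLanguageModeling` with `mlm=False` pads each batch and copies `input_ids` into `labels`, setting padded positions to `-100`; the one-token shift for next-token prediction happens inside the model's forward pass. A quick check, reusing the variables defined in the script above, might look like this (a sketch, not part of the training script):
```python
# Rough sanity check of one collated batch; reuses tokenizer / tokenized_ds from the script above.
from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([tokenized_ds[i] for i in range(2)])   # collate two tokenized examples
print(batch["input_ids"].shape)                         # (2, padded_seq_len)
print(batch["labels"][0][:10])                          # labels mirror input_ids; padding becomes -100
```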
Let's take a look at the result of training. You can generate text with the following code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings("ignore")

tokenizer = AutoTokenizer.from_pretrained("/data/qwen_pretrain/checkpoint-9000")  # change this path to your trained checkpoint
model = AutoModelForCausalLM.from_pretrained("/data/qwen_pretrain/checkpoint-9000", device_map="auto")  # change this path to your trained checkpoint

prompt = "One day, Lily met a Shoggoth"
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids=inputs, max_new_tokens=256, do_sample=True, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.decode(outputs[0])
print(text)
```
```
One day, Lily met a Shoggoth named Tim. Tim was very intelligent. He knew a lot of things. They became good friends. They played together every day. They had so much fun.
One day, they saw a big tree. Tim said, "I think there is a secret world at the top of that tree!" Lily was excited. She wanted to go there. Tim said, "You should come with me, I am intelligent and I think we can find it."
They went up the tree. They went very high. When they got to the top, they saw the most beautiful world ever. They played and laughed. Tim and Lily were so happy. They had a great time. They knew that they had a secret love. They were proud of their smart friend. They went back down the tree and played together every day. They lived happily ever after. The big tree had led them to a magical, special place. They learned to be never afraid of anything. And they lived happily ever after. The end.
Tim, Lily, and the other animals in the forest lived happily ever after with their secret love. They knew that it was because of the intelligent bird who flew them up to the secret world.
```
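If the generated stories look too repetitive or too random, the standard sampling knobs of `generate` can be adjusted; the values below are only illustrative:
```python
# Illustrative only: tune temperature / nucleus sampling for the same prompt.
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,   # lower values make sampling more deterministic
    top_p=0.9,         # nucleus sampling: keep the smallest token set with cumulative probability 0.9
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
```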
**References**
- [llama2.c](https://github.com/karpathy/llama2.c)