Training succeeded! Sharing some details about the environment:
- Environment:
- python==3.11.9
- transformers==4.33.0
- pytorch==2.2.0
- flash-attn==2.6.3
- ninja==1.11.1.1
- deepspeed==0.15.0
- wandb==0.17.8
- Both `modeling_chatglm.py` and `tokenization_chatglm.py` in the `/glm-4-9b` directory have been replaced.
- In `/ds_config/stage3.json`, set `"stage3_prefetch_bucket_size": 15099494`.
- In `/glm-4-9b/config.json`, set `"seq_length": 131072`.
- `general.json` (for testing purposes only):
{"messages":[{"role": "user", "content": "北京的天气怎么样?"}, {"role": "assistant", "content": "今天北京的天气晴朗,气温在25度左右。"}]}
{"messages":[{"role": "user", "content": "如何在Java中实现单例模式?"}, {"role": "assistant", "content": "你可以使用懒汉式或饿汉式来实现单例模式。以下是饿汉式的示例代码:\n```java\npublic class Singleton {\n private static final Singleton instance = new Singleton();\n private Singleton() {}\n public static Singleton getInstance() {\n return instance;\n }\n}\n```"}]}
{"messages":[{"role": "user", "content": "你知道中国的四大发明是什么吗?"}, {"role": "assistant", "content": "中国的四大发明是造纸术、指南针、火药和印刷术。"}]}
{"messages":[{"role": "user", "content": "有什么推荐的中文书籍吗?"}, {"role": "assistant", "content": "如果你喜欢历史类书籍,我推荐《资治通鉴》。如果你对文学感兴趣,可以看看《红楼梦》。"}]}
{"messages":[{"role": "user", "content": "What's the weather like in New York?"}, {"role": "assistant", "content": "The weather in New York is cloudy with a temperature around 18°C."}]}
{"messages":[{"role": "user", "content": "How can I implement a stack in Python?"}, {"role": "assistant", "content": "You can implement a stack using a list in Python. Here’s a simple example:\n```python\nstack = []\nstack.append(1)\nstack.append(2)\nprint(stack.pop()) # Outputs 2\n```"}]}
{"messages":[{"role": "user", "content": "Can you explain the concept of machine learning?"}, {"role": "assistant", "content": "Machine learning is a field of artificial intelligence where algorithms learn from data to make predictions or decisions without being explicitly programmed."}]}
{"messages":[{"role": "user", "content": "What's a good book to learn about AI?"}, {"role": "assistant", "content": "A great book to start with is 'Artificial Intelligence: A Modern Approach' by Stuart Russell and Peter Norvig. It covers a wide range of AI topics comprehensively."}]}
- Tips:
  - If you are not sure whether you successfully replaced `modeling_chatglm.py` and `tokenization_chatglm.py` in the `/glm-4-9b` directory, add a few comments of your own to those two files; if an error is raised later, check whether the code shown in the traceback contains your comments.
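Another way to check (a sketch of my own, not from the original tip): ask Python which source file actually backs the loaded classes, assuming the checkpoint is loaded from the local `glm-4-9b` directory with `trust_remote_code=True`.

```python
import inspect
from transformers import AutoTokenizer

model_dir = "glm-4-9b"  # adjust to your local path

# The tokenizer class is defined in the (replaced) tokenization_chatglm.py.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
print(inspect.getfile(type(tokenizer)))

# The same trick works for the model inside your training script:
#   print(inspect.getfile(type(model)))
# Note: with trust_remote_code=True, transformers may copy these files into its
# transformers_modules cache, so the printed path can point at a cached copy of
# the replaced file rather than at glm-4-9b/ itself.
```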
> @LYCnight Roughly how much GPU memory is needed for training?
Training GLM-4-9B with a 32K sequence length requires 8 × 80 GB GPUs. If you don't have enough GPU memory, you can try LoRA or QLoRA.
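In case it helps, here is a minimal LoRA sketch using the PEFT library (my addition, not from this thread). The hyperparameters and the `query_key_value` target module are assumptions; verify the module names against your checkpoint and wire the wrapped model into your own fine-tuning script.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model (path assumed to be the local glm-4-9b directory).
model = AutoModelForCausalLM.from_pretrained("glm-4-9b", trust_remote_code=True)

# Hypothetical LoRA hyperparameters; "query_key_value" is the fused attention
# projection name used by ChatGLM-style modeling code -- confirm it for your
# checkpoint, e.g. via [n for n, _ in model.named_modules()].
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```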
> @LYCnight Using your environment I was able to start training, but why are the files after training so huge? I'm running out of storage space...
> -rw-r--r-- 1 root root 4984147224 Sep 5 10:49 model-00001-of-00004.safetensors
> -rw-r--r-- 1 root root 4895071360 Sep 5 10:49 model-00002-of-00004.safetensors
> -rw-r--r-- 1 root root 4895071384 Sep 5 10:49 model-00003-of-00004.safetensors
> -rw-r--r-- 1 root root 4025651256 Sep 5 10:49 model-00004-of-00004.safetensors
A 9B model is simply that large when saved. You can configure training to save only the model parameters and skip the optimizer state, which makes the files a bit smaller.
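For reference, the four shards listed above total roughly 18.8 GB, consistent with about 9.4 B parameters stored at 2 bytes each (bf16/fp16), i.e. essentially just the weights. If intermediate `checkpoint-*/` directories (which additionally hold optimizer and scheduler state) are what fills the disk, one way to follow the suggestion above is to skip automatic checkpointing and save only the final weights. A minimal sketch with the Hugging Face Trainer; the output path and settings here are assumptions, not the repo's official configuration.

```python
from transformers import TrainingArguments

# Sketch: disable automatic checkpoints (which also store optimizer/scheduler
# state) and write only the final weights yourself once training is done.
args = TrainingArguments(
    output_dir="output/glm4-sft",   # hypothetical output path
    save_strategy="no",             # no checkpoint-*/ directories during training
    # ... keep your existing batch size, learning rate, deepspeed config, etc. ...
)

# After trainer.train() finishes:
#   trainer.save_model(args.output_dir)          # model shards + config only
#   tokenizer.save_pretrained(args.output_dir)   # tokenizer files alongside them
```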
> @LYCnight Roughly how much GPU memory is needed for training?
Mine ran on 8 × 80 GB GPUs, all fully utilized.
> @LYCnight Using your environment I was able to start training, but why are the files after training so huge? I'm running out of storage space...
> -rw-r--r-- 1 root root 4984147224 Sep 5 10:49 model-00001-of-00004.safetensors
> -rw-r--r-- 1 root root 4895071360 Sep 5 10:49 model-00002-of-00004.safetensors
> -rw-r--r-- 1 root root 4895071384 Sep 5 10:49 model-00003-of-00004.safetensors
> -rw-r--r-- 1 root root 4025651256 Sep 5 10:49 model-00004-of-00004.safetensors
It really is that large; let's wait for official documentation on settings to reduce the size.
Hi, I set up transformers==4.33.0 as described, but I hit the error below at runtime and don't know how to resolve it yet; everything else matches the OP's parameters:
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 342, in __init__
    self.create_accelerator_and_postprocess()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3883, in create_accelerator_and_postprocess
    self.accelerator = Accelerator(
TypeError: Accelerator.__init__() got an unexpected keyword argument 'dispatch_batches'