I'm running the pretraining script with txt files in the input directory. Why do I get this error? It seems to be parsing with json; could an argument be passed incorrectly somewhere?
Traceback (most recent call last):
  File "/home/data/LLM/llama/data/run_clm_pt_with_peft.py", line 461, in main
    processed_dataset = datasets.load_from_disk(cache_path, keep_in_memory=False)
  File "/home/data/anaconda3/lib/python3.10/site-packages/datasets/load.py", line 1886, in load_from_disk
    return DatasetDict.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, storage_options=storage_options)
  File "/home/data/anaconda3/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1308, in load_from_disk
    splits = json.load(f)["splits"]
  File "/home/data/anaconda3/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/data/anaconda3/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/data/anaconda3/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/data/anaconda3/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:
Try deleting all the data caches and running again.
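For context, that JSONDecodeError means load_from_disk found a cache directory whose dataset_dict.json is empty or truncated (typically left behind by an interrupted run), so the script reuses the broken cache instead of re-reading the txt files. A minimal cleanup sketch, where cache_path stands in for whatever directory the script actually passes to datasets.load_from_disk:

```python
import os
import shutil

# Hypothetical path; use whatever cache directory
# run_clm_pt_with_peft.py passes to datasets.load_from_disk.
cache_path = "/home/data/LLM/llama/data/cache/processed_dataset"

# DatasetDict.save_to_disk writes a dataset_dict.json with a "splits" key;
# if it is missing or empty, the cache is incomplete and should be rebuilt.
marker = os.path.join(cache_path, "dataset_dict.json")
if not (os.path.isfile(marker) and os.path.getsize(marker) > 0):
    shutil.rmtree(cache_path, ignore_errors=True)  # force re-tokenization
```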
Tried it: with deepspeed you can't use stage 3, otherwise this problem shows up later. I'd also like to ask about the txt format: if I have several unrelated articles, there are two ways to handle them: one txt per article, or merging all the articles into one big txt. Which is more reasonable, or does it not actually matter?
Thanks for the feedback.
There is no difference between the two. The length of text the model reads each time is determined by block_size: the training data is split into multiple samples of block_size tokens, and training shuffles the samples, so the order in which different samples are seen is not fixed.
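For context, this is the standard concatenate-then-chunk preprocessing used by run_clm-style scripts; because all documents are concatenated before chunking, one txt per article and one merged txt yield essentially the same training blocks. A minimal sketch (names are illustrative, not necessarily what run_clm_pt_with_peft.py uses):

```python
def group_texts(examples, block_size=512):
    """Concatenate all tokenized documents, then split into fixed-size blocks."""
    # examples["input_ids"] is a list of token-id lists, one per document.
    concatenated = sum(examples["input_ids"], [])
    # Drop the tail that does not fill a whole block.
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    # For causal LM pretraining, the labels are the inputs themselves.
    return {"input_ids": input_ids, "labels": [ids[:] for ids in input_ids]}
```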
How much GPU memory is actually needed to support pretraining? Is there a concrete number for this?
Hi @airaria! Piggybacking on this thread: what was the consideration behind using only ZeRO-2 for pretraining rather than ZeRO-3? Thanks!
@ruanshudong It should be 16 × A100 (40GB). See here and https://github.com/ymcui/Chinese-LLaMA-Alpaca/issues/1
But in practice it shouldn't actually need that much, right?
Hi @airaria! Piggybacking on this thread: what was the consideration behind using only ZeRO-2 for pretraining rather than ZeRO-3? Thanks!
We figured ZeRO-3 might be slower, and ZeRO-2 was already enough to fit the model, so we didn't try ZeRO-3.
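For reference, the ZeRO stage is selected in the DeepSpeed config; a minimal sketch of the relevant section as a Python dict mirroring the ds_config.json keys (zero_optimization and stage are real DeepSpeed options; the rest of the config is omitted here):

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: shard optimizer state and gradients, not parameters
    },
    # optimizer, scheduler, and fp16 sections omitted
}
```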
But in practice it shouldn't actually need that much, right?
We haven't pushed it to the limit. But if you shorten the length to 256, multi-GPU ZeRO-2 with 24GB cards can train.
I only have 132GB of GPU memory and want to get it running to see the results. How should I adjust the parameters? Set block_size to 256?
You'll have to experiment to find suitable parameters. You can progressively reduce block_size or per_device_train_batch_size until it runs.
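For example, something along these lines (block_size and per_device_train_batch_size are the flags named above; the launcher and the remaining values are assumptions, so keep whatever your current command uses for everything else):

torchrun --nproc_per_node=4 run_clm_pt_with_peft.py --block_size 256 --per_device_train_batch_size 1 --gradient_accumulation_steps 8 [other model/data/deepspeed arguments unchanged]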
Thanks @airaria. After adjusting the parameters it runs, but the following errors appear:
[INFO|trainer.py:1769] 2023-05-05 15:40:33,704 >> ***** Running training *****
[INFO|trainer.py:1770] 2023-05-05 15:40:33,704 >> Num examples = 88,636
[INFO|trainer.py:1771] 2023-05-05 15:40:33,704 >> Num Epochs = 1
[INFO|trainer.py:1772] 2023-05-05 15:40:33,704 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1773] 2023-05-05 15:40:33,705 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1774] 2023-05-05 15:40:33,705 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1775] 2023-05-05 15:40:33,705 >> Total optimization steps = 3,000
[INFO|trainer.py:1776] 2023-05-05 15:40:33,708 >> Number of trainable parameters = 533,397,504
0%| | 0/3000 [00:00<?, ?it/s][2023-05-05 15:40:35,733] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-05-05 15:40:36,068] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
0%| | 1/3000 [00:02<1:57:57, 2.36s/it][2023-05-05 15:40:36,362] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-05-05 15:40:36,653] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
0%|▏ | 2/3000 [00:02<1:05:44, 1.32s/it][2023-05-05 15:40:36,944] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2023-05-05 15:40:37,234] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
0%|▏ | 3/3000 [00:03<48:57, 1.02it/s][2023-05-05 15:40:37,526] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
[2023-05-05 15:40:37,823] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
0%|▎ | 4/3000 [00:04<41:13, 1.21it/s][2023-05-05 15:40:38,114] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
[2023-05-05 15:40:38,414] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, reducing to 128
[2023-05-05 15:40:38,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=10, lr=[0.0], mom=[(0.9, 0.999)]
[2023-05-05 15:40:38,415] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=27.92999445427692, CurrSamplesPerSec=27.131727961718187, MemAllocated=14.33GB, MaxMemAllocated=16.32GB
0%|▎ | 5/3000 [00:04<37:00, 1.35it/s][2023-05-05 15:40:38,706] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 128, reducing to 64
[2023-05-05 15:40:38,996] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 64, reducing to 32
0%|▍ | 6/3000 [00:05<34:16, 1.46it/s][2023-05-05 15:40:39,291] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32, reducing to 16
[2023-05-05 15:40:39,583] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16, reducing to 8
0%|▌ | 7/3000 [00:05<32:38, 1.53it/s][2023-05-05 15:40:39,876] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8, reducing to 4
[2023-05-05 15:40:40,170] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
0%|▌ | 8/3000 [00:06<31:33, 1.58it/s][2023-05-05 15:40:40,463] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
Do you know what might be causing this?
After adjusting the deepspeed parameters it finally runs normally. How do I use inference_hf.py to run inference with the resulting LoRA model? I executed:
python inference_hf.py --base_model=llama-7b --lora_model=lora-llama-7b/ --interactive
and it errors out immediately:
Traceback (most recent call last):
File "/home/data/LLM/llama/data/inference_hf.py", line 74, in
Looking into it, there seems to be no config.json file?
I printed it out: it needs adapter_config.json, but that file isn't produced when the LoRA model is saved?
Do you know what might be causing this?
The OVERFLOW warnings are normal: during fp16 training, DeepSpeed's dynamic loss scaler starts high and halves the scale on each overflow until it finds a workable value, skipping those steps along the way.
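For reference, these knobs live in the fp16 section of the DeepSpeed config; a minimal sketch as a Python dict mirroring the JSON keys (these are real DeepSpeed options, but the values shown are the common defaults, not necessarily what this repo ships):

```python
ds_fp16 = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "initial_scale_power": 16,  # initial scale = 2**16 = 65536
        "loss_scale_window": 1000,  # overflow-free steps before raising the scale
        "hysteresis": 2,            # consecutive overflows tolerated before lowering
        "min_loss_scale": 1,
    }
}
```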
I printed it out: it needs adapter_config.json, but that file isn't produced when the LoRA model is saved?
Right, that file isn't written when the model is saved; you can write one yourself, using the adapter_config.json from our released models as a template.
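One way to generate it is with PEFT itself, since LoraConfig.save_pretrained writes an adapter_config.json. A minimal sketch, assuming the hyperparameters below are placeholders that you must replace with the values actually used in training:

```python
from peft import LoraConfig

config = LoraConfig(
    base_model_name_or_path="llama-7b",    # path to your base model
    r=8,                                    # must match the rank used in training
    lora_alpha=32,                          # must match training
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # must match training
    task_type="CAUSAL_LM",
    inference_mode=True,
)
# Writes lora-llama-7b/adapter_config.json next to the saved weights.
config.save_pretrained("lora-llama-7b")
```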
During inference this error appears; what could the cause be?
Start inference with interactive mode.
Input:您是谁?
Setting pad_token_id to eos_token_id:2 for open-end generation.
Traceback (most recent call last):
  File "/home/data/LLM/llama/data/inference_hf.py", line 105, in <module>
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I get the same error with the released models after merging.
First, merge:
python ../Chinese-LLaMA-Alpaca/scripts/merge_llama_with_chinese_lora.py --base_model=llama-7b/ --lora_model=../chinese-llama-plus-lora-7b/,../chinese-alpaca-plus-lora-7b --output_type=huggingface --output_dir=merge_llama
Then run inference:
(base) upchina@upchina:/data/LLM/llama/data$ python inference_hf.py --base_model=merge_llama --interactive
load: merge_llama
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00, 3.62s/it]
Vocab of the base model: 49954
Vocab of the tokenizer: 49954
Start inference with interactive mode.
Input:who are you?
Traceback (most recent call last):
  File "/home/data/LLM/llama/data/inference_hf.py", line 105, in <module>
    generation_output = model.generate(
  File "/home/data/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/data/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1563, in generate
    return self.sample(
  File "/home/data/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2646, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Is beam size > 1? See #245
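If sampling is involved, a more conservative generation call avoids the torch.multinomial path where the traceback above fails; a minimal sketch (input_ids and the parameter values are illustrative, not the script's actual settings):

```python
# Greedy decoding never reaches torch.multinomial, so it sidesteps the
# "probability tensor contains inf/nan" failure mode shown above.
generation_output = model.generate(
    input_ids,
    do_sample=False,     # disable sampling
    num_beams=1,         # plain greedy search
    max_new_tokens=256,
)
```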
If I merge to produce an alpaca model and then do pretraining based on that model, should inference with inference_hf work normally in principle?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.