
[Question]: GPT-3 LoRA training issue

Open File-z-J opened this issue 11 months ago • 6 comments

Please describe your question

0. Environment: AI Studio, PaddlePaddle 2.6.0, installed with:

```shell
!pip3 install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
!pip install tool_helpers visualdl==2.5.3
!pip install --upgrade paddlepaddle-gpu
!pip install rouge
!pip install regex
```

The model is one continue-trained from gpt-cpm-small-cn-distill.

1. Training on AI Studio fails with the following error:

```
[2024-03-08 17:37:31,969] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: True
Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/llm/gpt-3/finetune_generation.py", line 250, in <module>
    main()
  File "/home/aistudio/PaddleNLP/llm/gpt-3/finetune_generation.py", line 134, in main
    config_class, model_class = MODEL_CLASSES[model_args.model_type]
KeyError: 'gpt-cn'
```

I need LoRA training for Chinese, and this is where it fails. This is my command:

```shell
python finetune_generation.py \
    --model_name_or_path output/gpt3_hybrid/checkpoint-158000 \
    --output_dir "outputlora/$task_name" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --fp16 \
    --fp16_opt_level "O2" \
    --scale_loss 1024 \
    --learning_rate 3e-4 \
    --max_steps 10000 \
    --save_steps 5000 \
    --weight_decay 0.01 \
    --warmup_ratio 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --dataloader_num_workers 1 \
    --sharding "stage2" \
    --eval_steps 1000 \
    --report_to "visualdl" \
    --disable_tqdm true \
    --recompute 1 \
    --gradient_accumulation_steps 2 \
    --do_train \
    --do_eval \
    --device "gpu" \
    --lora
```

Everything is copied from the documented command, and I have not modified any of the code, except for these two parameters:

```shell
--model_name_or_path output/gpt3_hybrid/checkpoint-158000 \
--output_dir "outputlora/$task_name" \
```

2. If the LoRA training runs successfully, how do I then use the result? Where can I find this documented, or is there concrete example code?
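For reference, a minimal sketch of how LoRA weights trained this way are typically loaded for inference with PaddleNLP's peft module; the paths below are placeholders, and the exact flow may differ between PaddleNLP versions:

```python
from paddlenlp.peft import LoRAModel
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: the base checkpoint and the LoRA --output_dir from training.
base_path = "output/gpt3_hybrid/checkpoint-158000"
lora_path = "outputlora/my_task/checkpoint-10000"

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(base_path)

# Wrap the base model so the trained LoRA weights are applied on top of it.
model = LoRAModel.from_pretrained(model, lora_path)
model.eval()
```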

File-z-J avatar Mar 08 '24 09:03 File-z-J

The error you are seeing can be resolved by adding a --model_type "gpt" parameter to the command.
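For example (abridged; keep the remaining flags from the original command):

```shell
python finetune_generation.py \
    --model_type "gpt" \
    --model_name_or_path output/gpt3_hybrid/checkpoint-158000 \
    --output_dir "outputlora/$task_name" \
    --lora
```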

guoshengCS avatar Mar 13 '24 05:03 guoshengCS

> The error you are seeing can be resolved by adding a --model_type "gpt" parameter to the command.

With that added, the following appeared (the worker-thread and main-thread tracebacks were interleaved in the output; separated here for readability):

```
[2024-03-13 17:41:08,647] [ DEBUG] - Number of trainable parameters = 1,327,104 (per device)
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp/transformers/tokenizer_utils_base.py:1925: UserWarning: Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
  warnings.warn(
Building prefix dict from the default dictionary ...
[2024-03-13 17:41:08,713] [ DEBUG] __init__.py:113 - Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
[2024-03-13 17:41:08,714] [ DEBUG] __init__.py:132 - Loading model from cache /tmp/jieba.cache
Loading model cost 1.149 seconds.
[2024-03-13 17:41:09,862] [ DEBUG] __init__.py:164 - Loading model cost 1.149 seconds.
Prefix dict has been built successfully.
[2024-03-13 17:41:09,862] [ DEBUG] __init__.py:166 - Prefix dict has been built successfully.
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp/transformers/tokenizer_utils_base.py:1954: UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
  warnings.warn(
[2024-03-13 17:41:09,887] [ ERROR] - Using pad_token, but it is not set yet.
Exception in thread Thread-2 (_thread_loop):
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 603, in _thread_loop
    batch = self._get_data()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 751, in _get_data
    self._reader.read_next_list()[0]
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:175)

Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/llm/gpt-3/finetune_generation.py", line 250, in <module>
    main()
  File "/home/aistudio/PaddleNLP/llm/gpt-3/finetune_generation.py", line 238, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 858, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 825, in __next__
    batch.reraise()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/worker.py", line 187, in reraise
    raise self.exc_type(msg)
ValueError: DataLoader worker(0) caught ValueError with message:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/worker.py", line 372, in _worker_loop
    batch = fetcher.fetch(indices)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/io/dataloader/fetcher.py", line 85, in fetch
    data = self.collate_fn(data)
  File "/home/aistudio/PaddleNLP/llm/gpt-3/utils.py", line 314, in __call__
    batch = self.tokenizer.pad(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2717, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2026, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
```

How should I fix this? An AI assistant suggested an answer (see screenshot), but I don't know what to change, or whether that is even the real problem. Any pointers would be appreciated.
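The final ValueError itself names the fix: give the tokenizer a pad token before the DataLoader tries to pad. A minimal sketch, assuming it is added right after the tokenizer is created in finetune_generation.py:

```python
# Right after the tokenizer is loaded in finetune_generation.py.
# GPT tokenizers often ship without a pad token; reuse EOS for padding,
# as the ValueError above suggests.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```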

File-z-J avatar Mar 13 '24 09:03 File-z-J

It then said I could do this:

```python
from paddlenlp.transformers import GPTTokenizer

# Assume the tokenizer has been loaded correctly.
tokenizer = GPTTokenizer.from_pretrained('your/model/path')

# Encode your text data.
encoded_inputs = tokenizer(texts,
                           padding='max_length',  # make all sequences the same length
                           truncation=True,       # truncate anything beyond max_length
                           max_length=512)        # maximum sequence length
```

This is what my search turned up, but it does not seem to have been implemented: https://github.com/PaddlePaddle/PaddleNLP/issues/8023
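The other option named in the ValueError is to register a dedicated pad token; note that when a genuinely new token is added, the embedding matrix has to grow to match the new vocabulary size (a sketch only, untested):

```python
# Add a dedicated [PAD] token instead of reusing EOS.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Grow the embedding table so the new token id has a row.
model.resize_token_embeddings(len(tokenizer))
```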

File-z-J avatar Mar 13 '24 09:03 File-z-J

A second problem, with generation from the model continue-trained from gpt-cpm-small-cn-distill. After training, the model's config.json contains a "dtype": "float16" entry. If I generate with predict_generation.py, I get:

```
Traceback (most recent call last):
  File "D:\AI\PaddleNLP\llm\gpt-3\predict_generation.py", line 165, in <module>
    predict()
  File "D:\AI\PaddleNLP\llm\gpt-3\predict_generation.py", line 157, in predict
    outputs = predictor.predict(texts)
  File "D:\AI\PaddleNLP\llm\gpt-3\predict_generation.py", line 133, in predict
    infer_result = self.infer(input_map)
  File "D:\AI\PaddleNLP\llm\gpt-3\predict_generation.py", line 111, in infer
    result = self.model.generate(
  File "", line 2, in generate
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\base\dygraph\base.py", line 350, in _decorate_function
    return func(*args, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\generation\utils.py", line 992, in generate
    return self.sample(
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\generation\utils.py", line 1192, in sample
    outputs = self(**model_inputs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\transformers\gpt\modeling.py", line 1432, in forward
    outputs = self.gpt(
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\transformers\gpt\modeling.py", line 1093, in forward
    outputs = self.decoder(
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\transformers\gpt\modeling.py", line 485, in forward
    outputs = mod(
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\transformers\gpt\modeling.py", line 611, in forward
    tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, use_cache, cache, output_attentions)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddlenlp\transformers\gpt\modeling.py", line 386, in forward
    out = tensor.matmul(weights, v)
  File "D:\gezhognxiaoruanj\python\3106\lib\site-packages\paddle\tensor\linalg.py", line 270, in matmul
    return _C_ops.matmul(x, y, transpose_x, transpose_y)
ValueError: (InvalidArgument) The type of data we are trying to retrieve (float16) does not match the type of data (float32) currently contained in the container.
  [Hint: Expected dtype() == phi::CppTypeToDataType<T>::Type(), but received dtype():10 != phi::CppTypeToDataType<T>::Type():15.] (at ..\paddle\phi\core\dense_tensor.cc:161)
```

**1.** If I change config.json to "dtype": "float32", it works, so this is a dtype mismatch. This never happened before; could compatibility be added later, or could you point me to where I should add what code so I can patch it myself? All my earlier models are float16, and this problem appeared out of nowhere.

**2.** If a model was trained with a particular deployment environment or hardware in mind (e.g. a GPU optimized for float16 compute), could running it in float32 undo those optimizations?
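One workaround that may let the checkpoint stay float16 is to cast the weights explicitly at load time: PaddleNLP's from_pretrained accepts a dtype argument. A sketch only, assuming the predictor loads the model roughly like this (model_path is a placeholder):

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the trained float16 checkpoint.
model_path = "output/gpt3_hybrid/checkpoint-158000"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# Ask from_pretrained to cast parameters to float16 so activations and
# weights agree inside matmul, instead of relying on config.json alone.
model = AutoModelForCausalLM.from_pretrained(model_path, dtype="float16")
model.eval()
```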

File-z-J avatar Mar 13 '24 10:03 File-z-J

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar May 13 '24 00:05 github-actions[bot]

Which versions of paddle and paddlenlp are you using?

w5688414 avatar May 13 '24 01:05 w5688414

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Jul 13 '24 00:07 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jul 28 '24 00:07 github-actions[bot]