[Bug]: Got Exception during training 'GPT3-1.3B' - TypeError: object of type 'NoneType' has no len()
Software Environment
- paddlepaddle: N/A
- paddlepaddle-gpu: 2.6.1
- paddlenlp: develop
Duplicate Issue
- [X] I have searched the existing issues
Error Description
When training "GPT3-1.3B" with llm/run_pretrain.py, an error occurs during model initialization:
init_class = architectures.pop() if len(architectures) > 0 else None
TypeError: object of type 'NoneType' has no len()
For some unknown reason, architectures is None.
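The failing line assumes config.architectures is a list, but when the config is built from a predefined model name rather than a saved config.json, the attribute can stay None. A None-safe variant would look like the sketch below (a hypothetical guard written for illustration, not the upstream code):

def pop_architecture(config):
    """None-safe version of the failing line: treat a missing/None
    `architectures` attribute as an empty list before len()/pop()."""
    architectures = list(getattr(config, "architectures", None) or [])
    return architectures.pop() if architectures else None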
Log:
[2024-08-16 06:13:15,389] [ INFO] - We are using <class 'paddlenlp.transformers.gpt.tokenizer.GPTTokenizer'> to load 'gpt3-1.3B-en'.
[2024-08-16 06:13:32,179] [ ERROR] - Using bos_token, but it is not set yet.
[2024-08-16 06:13:32,230] [ INFO] - tokenizer config file saved in /tmp/ameng/.paddlenlp/models/gpt3-1.3B-en/tokenizer_config.json
[2024-08-16 06:13:32,230] [ INFO] - Special tokens file saved in /tmp/ameng/.paddlenlp/models/gpt3-1.3B-en/special_tokens_map.json
[2024-08-16 06:13:32,233] [ INFO] - Reset vocab size to 50304 for batter amp peformance.
Final pre-training config: GPTConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "context_parallel_degree": -1,
  "eol_token_id": 198,
  "eos_token_id": 50256,
  "fused_softmax_with_triangular": false,
  "hidden_act": "gelu",
  "hidden_activation": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 2048,
  "ignore_index": 0,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 1024,
  "model_type": "gpt",
  "normalize_before": true,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_partitions": 1,
  "pad_token_id": 0,
  "paddlenlp_version": "3.0.0b0.post20240816",
  "pipeline_parallel_degree": -1,
  "scale_qk_coeff": 1.0,
  "sep_parallel_degree": -1,
  "seq_length": 1024,
  "tensor_parallel_degree": -1,
  "type_vocab_size": 1,
  "use_fast_layer_norm": false,
  "vocab_size": 50304
}
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_pretrain.py", line 595, in
Stable Reproduction Steps & Code
python -u -m paddle.distributed.launch --gpus "0,1" llm/run_pretrain.py --model_name_or_path gpt3-1.3B-en --output_dir output
For the config file, refer to https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/gpt-3/pretrain_argument.json
Understood. So is the --model_name_or_path gpt3-1.3B-en style of invocation no longer supported? <-- Q1
It looks like this used to be loadable from paddlenlp/transformers/gpt/configuration.py:118. @wawltor Q2: What I mainly want to ask is, how do I feed the predefined GPT3-1.3B configuration above into run_pretrain.py now?
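For context, this is roughly how I expected the predefined config to load by name (a minimal sketch; it assumes the gpt3-1.3B-en entry is still present in the pretrained config mapping in configuration.py):

from paddlenlp.transformers import GPTConfig, GPTForCausalLM

config = GPTConfig.from_pretrained("gpt3-1.3B-en")  # resolve the built-in preset
model = GPTForCausalLM(config)                      # build the model from it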
The main supported approach now is configuration-file based; the invocation above may leave some parameters missing. We recommend adapting the GPT example config file.
@wawltor Thanks for your reply. Unfortunately, that does not appear to be the root cause; the problem remains.
I followed your steps and created gpt3-1.3B-en.json:
{
"model_name_or_path": "gpt3-1.3B-en",
"tokenizer_name_or_path": "gpt3-1.3B-en",
"input_dir": "/workspace/dataset",
"output_dir": "output/paddlenlp_gpt3/debug/model_output",
"bf16": true,
"sequence_parallel": true,
"tensor_parallel_degree": 8,
"sharding_parallel_degree": 1,
"sharding": "stage2",
"pipeline_parallel_degree": 1,
"virtual_pp_degree": 1,
"pipeline_parallel_config": "disable_partial_send_recv",
"per_device_train_batch_size": 72,
"per_device_eval_batch_size": 72,
"gradient_accumulation_steps": 32,
"split": "949,50,1",
"max_seq_length": 2048,
"fuse_attention_qkv": true,
"use_flash_attention": true,
"fp16_opt_level": "O2",
"learning_rate": 0.00001,
"min_learning_rate": 0.000005,
"save_steps": 100000,
"weight_decay": 0.01,
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"logging_steps": 1,
"dataloader_num_workers": 1,
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_steps": 32,
"eval_steps": 100000,
"report_to": "visualdl",
"disable_tqdm": true,
"do_train": true,
"continue_training": 0,
"device": "gpu"
}
Then ran:
python3 -u -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 --ips 127.0.0.1 --log_dir output/paddle_gpt3/debug llm/run_pretrain.py ./gpt3-1.3B-en.json
Traceback (most recent call last):
  File "/home/scratch.ameng_gpu/git/2PaddleNLP_anderson/llm/run_pretrain.py", line 605, in <module>
    main()
  File "/home/scratch.ameng_gpu/git/2PaddleNLP_anderson/llm/run_pretrain.py", line 511, in main
    model = model_class.from_config(config, dtype=dtype)
  File "/home/scratch.ameng_gpu/git/2PaddleNLP_anderson/paddlenlp/transformers/auto/modeling.py", line 269, in from_config
    model_class = cls._get_model_class_from_config(None, None, config)
  File "/home/scratch.ameng_gpu/git/2PaddleNLP_anderson/paddlenlp/transformers/auto/modeling.py", line 218, in _get_model_class_from_config
    init_class = architectures.pop() if len(architectures) > 0 else None
TypeError: object of type 'NoneType' has no len()
GPT3-1.3B is a public model; could you try reproducing this on your side directly?
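In case it helps triage, the local workaround I am considering (an untested sketch; the class name "GPTForCausalLM" is my assumption) is to populate architectures before the Auto dispatch:

from paddlenlp.transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt3-1.3B-en")
if getattr(config, "architectures", None) is None:
    # Hypothetical: name the concrete class so
    # _get_model_class_from_config can resolve it.
    config.architectures = ["GPTForCausalLM"]
model = AutoModelForCausalLM.from_config(config, dtype="bfloat16")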
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.