Does BMTrain currently support CUDA 12?
Fine-tuning a model with BMTrain on CUDA 12 fails with the following error:
Traceback (most recent call last):
File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 427, in <module>
main()
File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 422, in main
tokenizer, model, optimizer, lr_scheduler, optim_manager = setup_model_and_optimizer(args)
File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 73, in setup_model_and_optimizer
model = get_model(args)
File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 39, in get_model
bmt.load(model, args.load)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/bmtrain/store.py", line 227, in load
ret = model.load_state_dict(
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
load(self, state_dict)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2009, in load
module._load_from_state_dict(
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/bmtrain/block_layer.py", line 532, in _load_from_state_dict
for name, param in self.named_parameters():
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2112, in named_parameters
gen = self._named_members(
TypeError: CheckpointBlock._named_members() got an unexpected keyword argument 'remove_duplicate'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 108098) of binary: /apps/home/worker/anaconda3/envs/CPM10/bin/python
Traceback (most recent call last):
File "/apps/home/worker/anaconda3/envs/CPM10/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Workaround:
1. In the installed PyTorch package, change the version in site-packages/torch/version.py to 12.1.
2. In the installed PyTorch package, on line 2112 of site-packages/torch/nn/modules/module.py, delete ", remove_duplicate=remove_duplicate".
Hard-coding the fix this way is indeed problematic.
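For anyone who would rather not edit files inside site-packages: the TypeError in the traceback looks like a signature mismatch rather than a CUDA problem as such. The installed PyTorch passes `remove_duplicate` to `_named_members`, and the traceback suggests that `CheckpointBlock` overrides that method without accepting the keyword. Below is a minimal sketch of patching this at runtime instead. It is only an illustration under that assumption: `tolerate_remove_duplicate` is a hypothetical helper (not part of BMTrain or PyTorch), and `OldStyleBlock` is a stand-in class so the snippet runs without either library installed.

```python
import inspect


def tolerate_remove_duplicate(cls):
    """Make cls._named_members accept (and ignore) remove_duplicate.

    Hypothetical helper for illustration; not part of BMTrain or PyTorch.
    """
    original = cls._named_members
    if "remove_duplicate" in inspect.signature(original).parameters:
        return cls  # already compatible, nothing to patch

    def patched(self, get_members_fn, prefix="", recurse=True, remove_duplicate=True):
        # The flag is dropped because the old override has no notion of it.
        return original(self, get_members_fn, prefix=prefix, recurse=recurse)

    cls._named_members = patched
    return cls


class OldStyleBlock:
    """Stand-in for a class written against the older signature."""

    def __init__(self):
        self.members = {"weight": 1, "bias": 2}

    def _named_members(self, get_members_fn, prefix="", recurse=True):
        for name, value in get_members_fn(self):
            yield (prefix + "." if prefix else "") + name, value

    def named_parameters(self, prefix="", recurse=True, remove_duplicate=True):
        # Forwards the extra keyword, as the newer caller in the traceback does.
        return self._named_members(
            lambda m: m.members.items(),
            prefix=prefix,
            recurse=recurse,
            remove_duplicate=remove_duplicate,
        )


tolerate_remove_duplicate(OldStyleBlock)
names = [name for name, _ in OldStyleBlock().named_parameters(prefix="layer")]
print(names)
```

In a real script the same helper would presumably be applied to the BMTrain class once, right after importing bmtrain and before `bmt.load` is called. I have not tested that against BMTrain itself, and installing a BMTrain release that matches the installed PyTorch version, if one is available, is the cleaner route.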