MedicalGPT
AMD: running run_pt.sh fails
Hi, when the training environment is AMD ROCm, running run_pt.sh fails with the following error:
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
Does this mean the model cannot run on the ROCm platform?
Thanks.
Contents of run_pt.sh:
HIP_VISIBLE_DEVICES=0 python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-qwen-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache
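As the error message itself suggests, setting HIP_LAUNCH_BLOCKING=1 makes HIP report the failing call synchronously, which usually yields a more accurate stack trace. A minimal debugging sketch, simply prefixing the same command as above:

# same flags as in run_pt.sh above, with synchronous HIP error reporting enabled
HIP_LAUNCH_BLOCKING=1 HIP_VISIBLE_DEVICES=0 python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
    ...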
The full error output:
2024-05-08 08:32:59.501 | INFO | __main__:main:381 - Script args: ScriptArguments(use_peft=True, target_modules='all', lora_rank=8, lora_dropout=0.05, lora_alpha=16.0, modules_to_save=None, peft_path=None, qlora=False)
2024-05-08 08:32:59.501 | INFO | __main__:main:382 - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
2024-05-08 08:33:00.792 | INFO | __main__:main:492 - train files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:00.792 | INFO | __main__:main:502 - eval files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:01.847 | INFO | __main__:main:534 - Raw datasets: DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3876
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3876
    })
})
2024-05-08 08:33:02.298 | DEBUG | __main__:main:597 - Num train_samples: 1230
2024-05-08 08:33:02.298 | DEBUG | __main__:main:598 - Tokenized training example:
2024-05-08 08:33:02.300 | DEBUG | __main__:main:599 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
2024-05-08 08:33:02.301 | DEBUG | __main__:main:611 - Num eval_samples: 10
2024-05-08 08:33:02.301 | DEBUG | __main__:main:612 - Tokenized eval example:
2024-05-08 08:33:02.303 | DEBUG | __main__:main:613 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
  File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 780, in <module>
    ...
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
It looks like torch is incompatible with AMD; I haven't tested on AMD GPUs.
You could try the free T4 on Google Colab.
Installing the ROCm build of torch gets it running successfully:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
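A quick way to confirm that the ROCm build is actually the one in use (a minimal check; on ROCm wheels torch.version.hip is set and the GPU is exposed through the usual torch.cuda API):

# prints the torch version, the HIP version it was built against, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"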
But when DeepSpeed is used, it errors out:
24%|██▍ | 50/206 [00:16<00:34, 4.48it/s]
100%|██████████| 1/1 [00:00<00:00, 44.95it/s]
/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/utils/checkpoint.py:434: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
25%|██▍ | 51/206 [00:16<00:43, 3.60it/s]
25%|██▌ | 52/206 [00:16<00:38, 4.01it/s]
26%|██▌ | 53/206 [00:16<00:35, 4.35it/s]
26%|██▌ | 54/206 [00:17<00:34, 4.37it/s]
27%|██▋ | 55/206 [00:17<00:34, 4.41it/s]
27%|██▋ | 56/206 [00:17<00:33, 4.44it/s]
28%|██▊ | 57/206 [00:17<00:31, 4.70it/s]
28%|██▊ | 58/206 [00:17<00:30, 4.91it/s]
29%|██▊ | 59/206 [00:18<00:30, 4.76it/s]
Traceback (most recent call last):
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 779, in <module>
    main()
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 740, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 3324, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2160, in step
    self._take_model_step(lr_kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2066, in _take_model_step
    self.optimizer.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1829, in step
    self._update_scale(self.overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
(the same traceback is printed, interleaved, by each of the four worker ranks)
29%|██▊ | 59/206 [00:18<00:46, 3.17it/s]
[2024-07-21 17:05:32,038] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 87859) of binary: /scratch/project_462000506/members/zihao/train_AMD_env/bin/python
Traceback (most recent call last):
File "/scratch/project_462000506/members/zihao/train_AMD_env/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
You can skip DeepSpeed for now: run on a single GPU first, then use torchrun for multi-GPU.
Yes, running multi-GPU directly with torchrun works fine; the error only appears once DeepSpeed is added.
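For context, the Exception above is raised by DeepSpeed's fp16 dynamic loss scaler once it has shrunk the scale to its minimum. Since the run already passes --bf16, one direction worth trying is to disable fp16 loss scaling in the DeepSpeed config and enable bf16 instead. A minimal sketch (untested on this ROCm setup; the ZeRO stage and the file name ds_config.json are illustrative):

{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": { "stage": 2 },
  "fp16": { "enabled": false },
  "bf16": { "enabled": true }
}

With bf16 there is no dynamic loss scaler, so this particular exception cannot be raised; whether the underlying overflow on ROCm goes away is a separate question.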