Error when fine-tuning KoAlpaca polyglot 12.8b
Hello,
I am fine-tuning the 12.8b model on 8x A100 40GB GPUs with the code at https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/run_clm.py and hit the following error. (I used the training script at https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/train.sh.)
Traceback (most recent call last):
  File "run_clm_2.py", line 636, in ...
    ...
ValueError: Passing along a device_map requires low_cpu_mem_usage=True
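From what I can tell, the error comes from the AutoModelForCausalLM.from_pretrained(...) call in run_clm.py (it shows up in the fuller traceback further down). A minimal sketch of the two shapes of that call; the checkpoint name and dtype below are illustrative, not copied from the script:

import torch
from transformers import AutoModelForCausalLM

# This shape triggers "Passing along a device_map requires low_cpu_mem_usage=True":
# a device_map is given while low_cpu_mem_usage stays at its default (False).
model = AutoModelForCausalLM.from_pretrained(
    "beomi/KoAlpaca-Polyglot-12.8B",  # illustrative checkpoint, not the exact value in run_clm.py
    torch_dtype=torch.float16,        # illustrative dtype
    device_map="auto",
)

# This shape satisfies that particular check: device_map together with low_cpu_mem_usage=True.
model = AutoModelForCausalLM.from_pretrained(
    "beomi/KoAlpaca-Polyglot-12.8B",
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)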
So I added the low_cpu_mem_usage=True option when loading the model, and then I get the error below instead.
Traceback (most recent call last):
  File "run_clm_2.py", line 636, in ...
    ... low_cpu_mem_usage=True or with passing a device_map.
I ran the code exactly as shared on GitHub, changing only the number of GPUs, and still get this error. Could you take a look and help with this part?
Could you first bring both packages up to the latest versions with
pip install -U transformers accelerate
and run it again to check whether the same error still occurs?
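After upgrading, it may also help to confirm which versions the training environment actually picks up, for example:

# Quick sanity check of the installed versions; both packages expose __version__.
import transformers
import accelerate

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)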
First of all, thank you for the quick reply.
Even after updating both packages and running it again, I still get the error.. I also tried it on other servers (16, 8, and 4 GPUs) and get the same error.
Traceback (most recent call last):
  File "/workspace/train_v1.1b/run_clm.py", line 637, in <module>
    main()
  File "/workspace/train_v1.1b/run_clm.py", line 413, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2662, in from_pretrained
    raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
ValueError: Passing along a device_map requires low_cpu_mem_usage=True
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191726 closing signal SIGTERM
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191727 closing signal SIGTERM
[2023-11-06 14:22:09,791] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191728 closing signal SIGTERM
[2023-11-06 14:22:10,456] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 191725) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: