accelerate
Training on multiple GPUs with the HF trainer
I want to fine-tune the Pythia-6.9B language model on a dataset. The training requires about 90GB of VRAM, so I need to use more than one GPU (I use 3 A100 GPUs, each with 40GB of VRAM). I am trying to do this with accelerate, so here are the changes in my code:
```python
...
accelerator = Accelerator()
device = accelerator.device
...
model = model.to(device)
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
trainer = ...
print("START TRAINING".center(20, "="))
trainer.train()
...
```
Then I run it with `accelerate launch main.py`, but I get the following error:
```
=============START TRAINING==========================START TRAINING=============
=============START TRAINING=============
.
.
.
[2024-04-26 12:39:55,085] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1998909 closing signal SIGTERM
[2024-04-26 12:39:55,086] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1998911 closing signal SIGTERM
[2024-04-26 12:39:55,465] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 1998910) of binary: /home2/pp/venvs/llm/bin/python3
Traceback (most recent call last):
File "/home2/pp/venvs/llm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
deepspeed_launcher(args)
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 695, in deepspeed_launcher
distrib_run.run(args)
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
code/model/main.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-26_12:39:55
host : ...
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 1998910)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1998910
========================================================
slurmstepd: error: Detected 1 oom_kill event in StepId=8543012.batch. Some of the step tasks have been OOM Killed.
```
Here is my default accelerate config file as well:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 8
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Could you help me figure out what the problem is?
Don't move your model to a device or create your own `Accelerator` yourself. Just use the `Trainer` natively; when the script is started with `accelerate launch`, the `Trainer` handles device placement and the distributed setup for you. This should help things.
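For reference, a minimal sketch of that approach, reusing the batch-size and gradient-accumulation settings from the question. The `EleutherAI/pythia-6.9b` checkpoint name is assumed from the model mentioned above, the output directory name is made up, and the training dataset stays elided, as it was in the original snippet:

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# No Accelerator() and no model.to(device): the Trainer places the model
# on the right device and uses the distributed backend set up by
# `accelerate launch`.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

training_args = TrainingArguments(
    output_dir="pythia-6.9b-finetuned",  # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=...,  # your tokenized dataset, elided here as in the question
)

print("START TRAINING".center(20, "="))
trainer.train()
```

Launched the same way as before with `accelerate launch main.py`, the `Trainer` picks up the DeepSpeed settings from the accelerate config and runs one process per GPU, with no manual device handling in the script.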