
Training on multiple GPUs with the HF trainer

alistvt opened this issue 1 year ago · 3 comments

I want to fine-tune the Pythia-6.9B language model on a dataset. Training requires about 90 GB of VRAM, so I need more than one GPU (I use 3 A100s, each with 40 GB of VRAM). I am trying to do this with accelerate; here are the changes in my code:

...
accelerator = Accelerator()
device = accelerator.device
...
model = model.to(device)
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
trainer = ...
print("START TRAINING".center(20, "="))
trainer.train()
...

Then I run with accelerate launch main.py.
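
For reference, a more explicit form of that launch command looks roughly like the sketch below (the flags are standard accelerate launch options and the config path is accelerate's default save location; neither is stated in the thread):

# Sketch: equivalent to `accelerate launch main.py` when the saved config
# already sets num_processes: 3; the flags just make those choices explicit.
accelerate launch \
  --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  --num_processes 3 \
  main.py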

But I get the following error:

=============START TRAINING==========================START TRAINING=============
=============START TRAINING=============

.
.
.

[2024-04-26 12:39:55,085] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1998909 closing signal SIGTERM
[2024-04-26 12:39:55,086] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1998911 closing signal SIGTERM
[2024-04-26 12:39:55,465] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 1998910) of binary: /home2/pp/venvs/llm/bin/python3
Traceback (most recent call last):
  File "/home2/pp/venvs/llm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    deepspeed_launcher(args)
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 695, in deepspeed_launcher
    distrib_run.run(args)
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/pp/venvs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
code/model/main.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-26_12:39:55
  host      : ...
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 1998910)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1998910
========================================================
slurmstepd: error: Detected 1 oom_kill event in StepId=8543012.batch. Some of the step tasks have been OOM Killed.

Here is also my default accelerate config file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 8
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
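
For comparison only (this is not the poster's setup): with zero_stage: 1, each of the three processes still holds a full fp32 copy of the 6.9B parameters plus their gradients, which alone exceeds 40 GB per GPU, and each process also materializes the full model in host RAM at startup. A ZeRO stage 3 variant of the same file, sketched below with only the keys that differ, shards those tensors across the ranks instead:

# Hedged sketch, not from the thread -- only the keys that differ from the
# config above are shown; everything else stays the same.
deepspeed_config:
  gradient_accumulation_steps: 8
  zero3_init_flag: true   # construct the model already sharded, avoiding a full copy per process
  zero_stage: 3
mixed_precision: bf16     # halves weight/activation memory relative to fp32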

Could you help me figure out what the problem is?

alistvt · Apr 29 '24 09:04

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · May 29 '24 15:05

Don't move your model or create your own Accelerator. Just use the Trainer natively. This should help things.
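
A minimal sketch of that advice (the model name and the train_dataset variable are placeholders, not taken from the thread): drop the Accelerator() and the .to(device) call, build the Trainer as usual, and let accelerate launch plus the saved DeepSpeed config handle process setup and device placement.

# Minimal sketch -- EleutherAI/pythia-6.9b and train_dataset are placeholders.
# No Accelerator(), no manual .to(device): the Trainer configures DeepSpeed and
# device placement itself when started with `accelerate launch main.py`.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "EleutherAI/pythia-6.9b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # leave it on CPU

training_args = TrainingArguments(
    output_dir="pythia-6.9b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
    tokenizer=tokenizer,
)

print("START TRAINING".center(20, "="))
trainer.train()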

muellerzr · Jun 06 '24 14:06

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jun 30 '24 15:06