
Error in defragment using A10 GPU

Open jamesrmccall opened this issue 2 years ago • 0 comments

Hi, we are trying to train the 6.9B version of Dolly on a g5.48xl instance (8 A10 GPUs) with the 13.0 ML runtime. We have made the customizations described in "Training on Other Instances" and are running into this error:

  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 574, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 410, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
[2023-04-24 16:15:14,205] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17430
[2023-04-24 16:15:14,206] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17431
[2023-04-24 16:15:16,240] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17432
[2023-04-24 16:15:16,241] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17433
[2023-04-24 16:15:16,243] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17434
[2023-04-24 16:15:16,245] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17435
[2023-04-24 16:15:16,246] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17436
[2023-04-24 16:15:16,248] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17437
[2023-04-24 16:15:16,249] [ERROR] [launch.py:434:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/bin/python', '-u', '-m', 'training.trainer', '--local_rank=7', '--input-model', 'EleutherAI/pythia-6.9b', '--deepspeed', '/Workspace/Repos//dolly_updated/config/ds_z3_bf16_config.json', '--epochs', '2', '--local-output-dir', '/local_disk0/dolly_training/dolly__2023-04-24T16:08:09', '--dbfs-output-dir', '/dbfs/dolly_training/dolly__2023-04-24T16:08:09', '--per-device-train-batch-size', '3', '--per-device-eval-batch-size', '3', '--logging-steps', '10', '--save-steps', '200', '--save-total-limit', '20', '--eval-steps', '50', '--warmup-steps', '50', '--test-size', '200', '--lr', '5e-6'] exits with return code = 1
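
For context, the failing assertion only passes when every tensor handed to defragment reports the same device. Below is a minimal sketch of a diagnostic that mirrors that check by grouping parameter names by device. The helper name group_by_device, the stand-in objects, and the parameter names are hypothetical and used only for illustration; with a real model one would presumably pass model.named_parameters() instead. It shows what the check tests, not why it fails here.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_device(named_tensors):
    """Map str(device) -> list of names, for any objects exposing a .device attribute."""
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


# Hypothetical stand-ins so the sketch runs without PyTorch installed.
fake_params = [
    ("embed.weight", SimpleNamespace(device="cuda:0")),
    ("layer0.weight", SimpleNamespace(device="cuda:0")),
    ("lm_head.weight", SimpleNamespace(device="cpu")),
]

groups = group_by_device(fake_params)
print(groups)

# The assertion in the traceback corresponds to requiring a single group here.
print("single device:", len(groups) == 1)
```

If more than one device shows up in the output, the names listed under the unexpected device are the ones worth inspecting first.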

Any ideas as to why this may be happening?

jamesrmccall, Apr 24 '23 16:04