Error in defragment using A10 GPU
Hi, we are trying to train the 6.9B version of Dolly on a g5.48xl instance (8 A10 GPUs) with the 13.0 ML runtime. We made the customizations described in "Training on Other Instances" and are running into this error:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 574, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 410, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
[2023-04-24 16:15:14,205] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17430
[2023-04-24 16:15:14,206] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17431
[2023-04-24 16:15:16,240] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17432
[2023-04-24 16:15:16,241] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17433
[2023-04-24 16:15:16,243] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17434
[2023-04-24 16:15:16,245] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17435
[2023-04-24 16:15:16,246] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17436
[2023-04-24 16:15:16,248] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17437
[2023-04-24 16:15:16,249] [ERROR] [launch.py:434:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-b1e7825b-de5a-44d3-9b1a-dfc69514fddb/bin/python', '-u', '-m', 'training.trainer', '--local_rank=7', '--input-model', 'EleutherAI/pythia-6.9b', '--deepspeed', '/Workspace/Repos/
Any ideas as to why this may be happening?
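For context, the assertion that fails only checks that every tensor passed to defragment reports the same device. Below is a minimal sketch of that condition, in case it helps narrow down which partitions end up somewhere unexpected. It uses a hypothetical FakeTensor stand-in rather than real torch tensors, and the device names ("cuda:7", "cpu") are made up for illustration; we have not confirmed which devices are actually involved in our run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the .device attribute is modeled."""
    device: str


def devices_in(tensors):
    """Count how many tensors report each device."""
    return Counter(str(t.device) for t in tensors)


def same_device(tensors):
    """Same condition as the failing assert: exactly one distinct device."""
    return len(set(t.device for t in tensors)) == 1


# Illustrative data only: one partition on the CPU, the rest on a single GPU.
mixed = [FakeTensor("cuda:7"), FakeTensor("cuda:7"), FakeTensor("cpu")]
uniform = [FakeTensor("cuda:7"), FakeTensor("cuda:7")]

print(devices_in(mixed))
print(same_device(mixed), same_device(uniform))
```

Logging the equivalent of devices_in for the real parameter partitions just before the assert should show whether some of them are on the CPU or on a different GPU.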