Running train_dolly.py with `transformers[torch]==4.28.1`, `deepspeed==0.9.1`, and V100 GPUs gives an error in defragment
I am running the code from commit 0eadcb7b0648d496d67243a7d572b413560be661 with `transformers[torch]==4.28.1` and `deepspeed==0.9.0`. This results in the following error:
```
2023-04-23 06:05:50 ERROR [__main__] main failed
Traceback (most recent call last):
  File "dolly/training/trainer.py", line 332, in <module>
    main()
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "dolly/training/trainer.py", line 324, in main
    train(**kwargs)
  File "dolly/training/trainer.py", line 280, in train
    trainer.train()
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1731, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 328, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 256, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 575, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 411, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
```
I can circumvent the error by adding this additional block to the deepspeed config:

```json
"offload_param": {
    "device": "cpu",
    "pin_memory": true
},
```
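For context, here is a minimal sketch of where that block nests inside a ZeRO stage 3 config, written as a Python dict for illustration (the repo uses a JSON file; the surrounding keys are my assumptions, not copied from the repo's exact config):

```python
# Sketch only: shows where "offload_param" belongs in the deepspeed config,
# namely inside the "zero_optimization" section. Surrounding keys are assumed,
# not copied from the repo's config file.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {        # the workaround block
            "device": "cpu",      # keep partitioned params in CPU memory
            "pin_memory": True,   # page-locked memory speeds host-to-device copies
        },
        # ...other stage-3 settings left unchanged...
    },
    # ...fp16, optimizer, and scheduler settings left unchanged...
}
```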
Keeping everything else the same but downgrading deepspeed to 0.8.3 eliminates the error, even without the offload_param block.
I have not investigated what may have changed between deepspeed 0.8.3 and 0.9.0.
I am encountering this on an AWS p3dn.24xlarge machine with 8 Tesla V100-SXM2-32GB GPUs.
If you can reproduce the same error, you may want to add instructions to the Training on Other Instances section.
Hm, that's strange. I wonder if 0.9.1 fixed it? I didn't observe this on A10s, FWIW. Offload is an interesting workaround, though it shouldn't be needed with 32GB, yes.
I see the defragment error with 0.9.1 as well. (Changed the issue title to reflect 0.9.1)
The assertion comes from this line. According to the blame, it has not changed in a year or so.
Looking further up the call stack, that defragment function is called from here.
The calling function is named _create_fp16_partitions_with_defragmentation, so it has fp16 in its name. Maybe that codepath is not taken when one uses bf16, which GPUs other than the V100 may support? I have not investigated in depth.
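As a quick way to probe that bf16 hypothesis, here is a small sketch (standard torch APIs, not dolly or deepspeed code) that checks whether the current GPU supports bf16; V100s report compute capability 7.0 and do not, while Ampere-class GPUs (8.0+) do:

```python
import torch

# Probe bf16 support on the current CUDA device. V100 (compute capability 7.0)
# lacks bf16; Ampere-class GPUs (compute capability 8.0+) support it.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```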
This is the relevant basic block:

```python
if not self.offload_param:  # partitioned params remain in GPU during training
    # move parameter partitions into a single contiguous flat buffer
    parameter_partitions: List[Tensor] = []
    for sub_group in self.fp16_groups:
        for param in sub_group:
            parameter_partitions.append(param.ds_tensor)
    device_buffer = __class__.defragment(parameter_partitions)
```
Seeing that offload_param condition in the if statement is what inspired the workaround.
Looking at the assertion `assert len(set(t.device for t in tensors)) == 1`, printing the set gives this:

```
{device(type='cuda', index=7), device(type='cpu')}
```
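To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not deepspeed's code) showing how a partition list with tensors on two devices trips exactly that assertion:

```python
import torch

# Simulate parameter partitions that ended up on different devices
# (requires a CUDA-capable GPU to run).
tensors = [
    torch.zeros(4, device="cuda:0"),  # partition still on the GPU
    torch.zeros(4, device="cpu"),     # partition that landed on the CPU
]

devices = set(t.device for t in tensors)
print(devices)  # e.g. {device(type='cuda', index=0), device(type='cpu')}

# The same check as in DeepSpeedZeroOptimizer_Stage3.defragment:
assert len(devices) == 1  # raises AssertionError for the mixed set above
```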
You can't use bf16 on the V100. Did you make the change in the README? https://github.com/databrickslabs/dolly#v100-gpus
Yes. Otherwise one gets a clear error message about bf16 not being supported.
Also, the function name in the stack is _create_fp16_partitions_with_defragmentation; it has fp16 in it.
Yeah I figured, just triple checking
I am using 8x A10 GPUs (g5.48xl) and have replicated the defragment errors both with the customization specified here (Training on Other Instances) and after downgrading to deepspeed==0.8.3. It fails while loading the shuffled indices for the dataset.
You're saying downgrading didn't help? If not, does 0.8.0 work? If it does, then I should update the requirements.txt for now.
Apologies, I ran the training with 0.8.3 overnight and it was successful. When I changed to deepspeed==0.8.3 in my requirements.txt, though, it still kept installing 0.9.1, perhaps due to wheel caching; I had to run %pip install deepspeed==0.8.3 separately, outside of the requirements.txt, to get it to work properly, but it did work.
OK, good info. Let me back off the requirements.txt to 0.8.3 for now
How about giving the workaround a try first, @jamesrmccall? That is, adding this block to the deepspeed config:

```json
"offload_param": {
    "device": "cpu",
    "pin_memory": true
},
```
That fixed the issue for me on V100 GPUs.
@ahakanbaba I did try that and it still did not work (using A10 GPUs).
OK, if deepspeed 0.8.3 seems to resolve this, then that's done: https://github.com/databrickslabs/dolly/pull/130