Running train_dolly.py with `transformers[torch]==4.28.1`, `deepspeed==0.9.1`, and V100 GPUs gives an error in defragment
I am running the code from commit 0eadcb7b0648d496d67243a7d572b413560be661 with `transformers[torch]==4.28.1` and `deepspeed==0.9.0`. This results in the following error:
```
2023-04-23 06:05:50 ERROR [__main__] main failed
Traceback (most recent call last):
  File "dolly/training/trainer.py", line 332, in <module>
    main()
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/code/venvs/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "dolly/training/trainer.py", line 324, in main
    train(**kwargs)
  File "dolly/training/trainer.py", line 280, in train
    trainer.train()
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1731, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/code/venvs/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 328, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 256, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 575, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/code/venvs/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 411, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
```
I can circumvent the error by adding this additional block to the deepspeed config:

```json
"offload_param": {
    "device": "cpu",
    "pin_memory": true
},
```
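For context, here is a minimal sketch of where that block nests inside a ZeRO stage 3 config, written as a Python dict for illustration (the repo uses a JSON file; the surrounding keys are my assumptions, not copied from the repo's exact config):

```python
# Sketch only: shows where "offload_param" belongs in the deepspeed config,
# namely inside the "zero_optimization" section. Surrounding keys are assumed,
# not copied from the repo's config file.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {        # the workaround block
            "device": "cpu",      # keep partitioned params in CPU memory
            "pin_memory": True,   # page-locked memory speeds host-to-device copies
        },
        # ...other stage-3 settings left unchanged...
    },
    # ...fp16, optimizer, and scheduler settings left unchanged...
}
```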
Keeping everything else the same but downgrading deepspeed to 0.8.3 eliminates the error, even without the offload_param block.
I have not investigated what may have changed between deepspeed 0.8.3 and 0.9.0.
I am encountering this on an AWS p3dn.24xlarge machine with 8 Tesla V100-SXM2-32GB GPUs.
If you can reproduce the same error, you may want to add instructions to the Training on Other Instances section.
Hm, that's strange. I wonder if 0.9.1 fixed it? I didn't observe this on A10s, FWIW. Offload is an interesting workaround, though it shouldn't be needed with 32GB, yes.
I see the defragment error with 0.9.1 as well. (Changed the issue title to reflect 0.9.1)
The assertion comes from this line. According to the blame, it has not changed in a year or so.
Looking further up the call stack, that defragment function is called from here.
The calling function is named _create_fp16_partitions_with_defragmentation, so it has fp16 in its name. Maybe that codepath is not taken when one uses bf16, which GPUs other than the V100 may support? I have not investigated in depth.
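As a quick way to probe that bf16 hypothesis, here is a small sketch (standard torch APIs, not dolly or deepspeed code) that checks whether the current GPU supports bf16; V100s report compute capability 7.0 and do not, while Ampere-class GPUs (8.0+) do:

```python
import torch

# Probe bf16 support on the current CUDA device. V100 (compute capability 7.0)
# lacks bf16; Ampere-class GPUs (compute capability 8.0+) support it.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```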
This is the relevant basic block:

```python
if not self.offload_param:  # partitioned params remain in GPU during training
    # move parameter partitions into a single contiguous flat buffer
    parameter_partitions: List[Tensor] = []
    for sub_group in self.fp16_groups:
        for param in sub_group:
            parameter_partitions.append(param.ds_tensor)
    device_buffer = __class__.defragment(parameter_partitions)
```
Seeing that offload_param condition in the if statement is what inspired the workaround.
Looking at the assertion `assert len(set(t.device for t in tensors)) == 1`, printing the set gives this:

```
{device(type='cuda', index=7), device(type='cpu')}
```
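To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not deepspeed's code) showing how a partition list with tensors on two devices trips exactly that assertion:

```python
import torch

# Simulate parameter partitions that ended up on different devices
# (requires a CUDA-capable GPU to run).
tensors = [
    torch.zeros(4, device="cuda:0"),  # partition still on the GPU
    torch.zeros(4, device="cpu"),     # partition that landed on the CPU
]

devices = set(t.device for t in tensors)
print(devices)  # e.g. {device(type='cuda', index=0), device(type='cpu')}

# The same check as in DeepSpeedZeroOptimizer_Stage3.defragment:
assert len(devices) == 1  # raises AssertionError for the mixed set above
```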
You can't use bf16 on the V100. Did you make the change in the README? https://github.com/databrickslabs/dolly#v100-gpus
Yes. Otherwise one gets a clear error message about bf16 not being supported.
Also, the function name in the stack is _create_fp16_partitions_with_defragmentation; it has fp16 in it.
Yeah I figured, just triple checking
I am using 8x A10 GPUs (g5.48xl) and have replicated the defragment errors both with the customization specified here (Training on Other Instances) and after downgrading to deepspeed==0.8.3. It fails while loading the shuffled indices for the dataset.
You're saying downgrading didn't help? If not, does 0.8.0 work? If it does, then I should update the requirements.txt for now.
Apologies, I ran the training with 0.8.3 overnight and it was successful. When I changed to deepspeed==0.8.3 in my requirements.txt, though, it still kept installing 0.9.1, perhaps due to wheel caching; I had to run %pip install deepspeed==0.8.3 separately, outside of the requirements.txt, to get it to work properly, but it did work.
OK, good info. Let me back off the requirements.txt to 0.8.3 for now
How about giving the workaround a try first, @jamesrmccall? That is, adding this block to the deepspeed config:

```json
"offload_param": {
    "device": "cpu",
    "pin_memory": true
},
```
That fixed the issue for me on V100 GPUs.
@ahakanbaba I did try that and it still did not work (using A10 GPUs).
OK, if deepspeed 0.8.3 seems to resolve this, then that's done: https://github.com/databrickslabs/dolly/pull/130