[BUG] DeepSpeed Ulysses zero3 compatibility
Describe the bug
Training an HF model (Llama 3.1 with PEFT) on long context with sequence_parallel_size > 1 works only up to ZeRO stage 2. If I set "stage" to 3, I get the following error:
[rank1]: File "/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1464, in partition_grads
[rank1]: grad_buffer = self.__param_id_to_grad_partition[param.ds_id].narrow(0, 0, grad_partition.numel())
[rank1]: RuntimeError: start (0) + length (8388608) exceeds dimension size (4194304).
I also had to disable this assertion when switching over from ZeRO-1 to ZeRO-3:
assert train_batch == micro_batch * grad_acc * self.world_size
So maybe there is an issue with the world_size definition when running ZeRO-3 (though even after fixing it to the correct world size and device_mesh, the same error occurs)?
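For reference, the relation I would expect to hold once sequence parallelism is factored in, as a rough sketch (the dp/sp split and the device-mesh layout here are my assumptions, not code from the PR or from DeepSpeed):

# Rough sketch: ranks inside one sequence-parallel (SP) group process shards of
# the same samples, so only the data-parallel (DP) dimension should enter the
# global-batch check that is currently done against the full world size.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

def expected_batch_relation(train_batch, micro_batch, grad_acc, sp_size):
    world_size = dist.get_world_size()
    dp_size = world_size // sp_size  # effective data-parallel size under Ulysses
    assert train_batch == micro_batch * grad_acc * dp_size
    # Illustrative 2D layout: one mesh axis for DP, one for SP.
    return init_device_mesh("cuda", (dp_size, sp_size), mesh_dim_names=("dp", "sp"))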
To Reproduce
Running the example from DeepSpeedExamples/post_training/sequence_parallelism/test_ulysses.py with:
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": True
},
"offload_param": {
"device": "cpu",
"pin_memory": True
},
"overlap_comm": True,
"contiguous_gradients": True,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": True
},
on the HF PR: https://github.com/huggingface/transformers/pull/32305
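For completeness, a minimal standalone sketch of how a config like the one above gets wired up (the "auto" values are resolved by the HF Trainer integration, so this sketch uses concrete numbers and a toy model instead of the actual test_ulysses.py setup):

# Minimal, simplified sketch (not the real test_ulysses.py): passing a ZeRO-3
# config like the one above to deepspeed.initialize with a toy model.
# Assumes 8 ranks, so gradient accumulation is inferred as 1.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Concrete sizes here; "auto" only resolves through the HF integration.
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e5,
    },
}

model = torch.nn.Linear(4096, 4096)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)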
Expected behavior
ZeRO-3 should work as stated in the official blog post.
ds_report output
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/torch']
torch version .................... 2.4.1+cu121
deepspeed install path ........... ['/root/miniconda3/envs/finetuning/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 321.31 GB
Launcher context
I am using the deepspeed launcher.
Thanks for the help! Even if this is not officially supported, I would be thankful for some pointers so I can implement something on my own. For context: we want to train a 70B model with a sequence length of 60k. 8B already works with Ulysses, but without ZeRO-3 I think 70B is impossible on a single node.
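To make the ask more concrete, my rough mental model (which may well be wrong) is that ZeRO-3 would additionally need gradients reduced over the sequence-parallel group, on top of ZeRO-3's reduce-scatter over the data-parallel group. A very rough sketch of that idea; the sp_group handle and the hook placement are my own assumptions, not anything in DeepSpeed:

# Very rough sketch (my assumption, not DeepSpeed code): average gradients across
# the sequence-parallel group, since each SP rank only saw a slice of the sequence,
# and leave the data-parallel reduce-scatter/partitioning to ZeRO-3.
import torch
import torch.distributed as dist

def add_sp_grad_allreduce_hooks(model: torch.nn.Module, sp_group: dist.ProcessGroup):
    sp_size = dist.get_world_size(group=sp_group)

    def _hook(grad: torch.Tensor) -> torch.Tensor:
        # Sum over the SP group, then normalize; whether the normalization is needed
        # depends on how the loss is averaged over sequence shards.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=sp_group)
        return grad / sp_size

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(_hook)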
@Xirid, ZeRO stage 3 is currently not supported in DeepSpeed long context parallelism (Ulysses). ZeRO-3 support is on our roadmap; contributions are welcome!
@samadejacobs Hello, I'd like to ask why Z3 is not currently supported. I modified some code and successfully ran Z3: the difference between its loss and the baseline loss is almost 4%, though the trend is similar, while on Z2 the difference is basically less than 0.1%. Looking forward to your reply, and I'd like to know when Z3 will be supported.
@glowwormX, to be clear, Z3 is supported with the Megatron-DeepSpeed client; support for the HF client is on our roadmap, with no ETA at this point. Contributions are welcome!
@samadejacobs Thank you for replying. If I want to add Z3 support for the HF client, where should I start? Which parts of the Megatron-DeepSpeed client code can I learn from?
Hi @samadejacobs, what would be needed to support Z3 with Ulysses in the HF client?