Failed to load universal_checkpoint with DeepSpeed integration
System Info
- transformers version: 4.44.2
- Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.17
- Python version: 3.8.18
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.4
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A800 80GB PCIe
Who can help?
@muellerzr
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
DeepSpeed's Universal Checkpointing feature allows a checkpoint to be loaded with a different world size. However, when using the Hugging Face Trainer, loading the converted universal checkpoint fails.
The failure seems to be due to HfTrainerDeepSpeedConfig not correctly handling the "load_universal_checkpoint": true (or "universal_checkpoint": true) argument in the DeepSpeed configuration; consequently, the load_universal_checkpoint function returns False.
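As a quick check of where the flag gets dropped, one can inspect both the Trainer-side parsed config and the live engine. This is only a sketch: HfTrainerDeepSpeedConfig, get_value(), and DeepSpeedEngine.load_universal_checkpoint() are real APIs, but the ds_config.json path and the trainer.deepspeed attribute access are assumptions for illustration.

```python
# Sketch: check whether "load_universal_checkpoint" survives the Trainer's
# DeepSpeed config handling. "ds_config.json" is the config pasted below;
# `trainer` is assumed to be an already-constructed transformers.Trainer.
from transformers.integrations.deepspeed import HfTrainerDeepSpeedConfig

hf_ds_config = HfTrainerDeepSpeedConfig("ds_config.json")
print(hf_ds_config.get_value("load_universal_checkpoint"))

# At resume time, DeepSpeed decides whether to take the universal-checkpoint
# loading path from the engine flag, which is what comes back as False here:
# print(trainer.deepspeed.load_universal_checkpoint())
```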
Related Issues:
- https://github.com/microsoft/DeepSpeed/issues/5430
- https://github.com/microsoft/DeepSpeed/issues/2921
Expected behavior
Universal checkpoint should be loaded correctly.
Here's my DeepSpeed config JSON:
{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 16,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "no_pipeline_parallel": true,
    "load_universal_checkpoint": true
}
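For context, the config above is passed to the Trainer in the usual way and the resume is attempted via resume_from_checkpoint. This is a minimal sketch rather than the actual training script (which is not included in this report); the model, dataset, output directory, and checkpoint path are placeholders.

```python
# Minimal sketch of how the config is wired in (placeholders only; the real
# training script is not part of this report).
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="output",            # placeholder
    bf16=True,
    deepspeed="ds_config.json",     # the config above, with "load_universal_checkpoint": true
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # model/dataset from the user's own script
# Resume from a checkpoint previously converted to the universal format;
# this is where the universal loading path is not taken.
trainer.train(resume_from_checkpoint="output/checkpoint-1000")          # placeholder path
```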
Another related issue: https://github.com/microsoft/DeepSpeed/issues/5405
Hello @ArthurZucker and @muellerzr. I am able to create a pull request to address the issue. I resolved it by deleting all the "rng_state" files, since they were saved with a different world size.
Before I start with the PR, I would like to ensure that NOT loading these "rng_state" files does not have any side effects.
We can skip these rng_state files and add a warning.
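For reference, a minimal sketch of that workaround, assuming the standard Trainer checkpoint layout in which per-rank RNG states are saved as rng_state_*.pth files (the checkpoint directory below is a placeholder):

```python
# Sketch of the workaround: delete the per-rank RNG state files so the Trainer
# does not try to restore them with a mismatched world size.
from pathlib import Path

ckpt_dir = Path("output/checkpoint-1000")          # placeholder checkpoint directory
for rng_file in ckpt_dir.glob("rng_state*.pth"):   # matches rng_state.pth and rng_state_<rank>.pth
    print(f"Removing {rng_file}")
    rng_file.unlink()
```

Skipping these files should only affect exact reproducibility of the random streams (data shuffling, dropout) after resuming; model weights and optimizer state are stored elsewhere in the checkpoint.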
Sure, feel free to open a PR!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.