DeepSpeed
Saving a checkpoint when training with NVMe offloading?
I am testing NVMe offloading when training a model. When I try to save a checkpoint, I am getting (full stack trace below):
NotImplementedError: ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now.
Is that correct? Is there really no checkpointing with NVMe offloading, or am I missing something in my setup/config file? If there is no checkpointing, how can I save the model?
Full stacktrace:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 132, in <module>
    sys.exit(main(args.train_entrypoint))
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 123, in main
    controller.run()
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 296, in run
    self._run()
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 338, in _run
    self._save(path)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 732, in _save
    self.trial.save(self.context, path)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 934, in save
    m.save_checkpoint(path, tag=f"model{i}")
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2842, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3106, in _save_zero_checkpoint
    zero_sd = dict(optimizer_state_dict=self.optimizer.state_dict(),
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2699, in state_dict
    raise NotImplementedError(
NotImplementedError: ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now.
Config file:
{
  "train_batch_size": 256,
  "steps_per_print": 2000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "fp16": {
    "enabled": true,
    "fp16_master_weights_and_grads": false,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 15
  },
  "wall_clock_breakdown": false,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "nvme0n1",
      "pin_memory": false
    },
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "nvme0n1",
      "pin_memory": false
    }
  }
}
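(For reference, not part of the original report: DeepSpeed's documentation and examples typically point nvme_path at a directory on a mounted NVMe filesystem, e.g. /local_nvme, rather than at a raw device name. A minimal sketch of the two offload blocks in that style, written as a Python dict; the path is an assumption for illustration only.)

# Hypothetical fragment of a ZeRO-3 config expressed as a Python dict.
# "/local_nvme" is an assumed mount point, not a value taken from this thread.
zero_offload_fragment = {
    "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",  # directory on the mounted NVMe drive
        "pin_memory": False,
    },
    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": False,
    },
}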
@aciborowska, sorry for the inconvenience. Model checkpointing for training with NVMe offloading is not yet available.
Okay. In that case, after I complete training, can I still save the model with, e.g., stage3_gather_16bit_weights_on_model_save?
Also, any plans to add checkpointing for NVMe in the near future?
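(For context on the first question: stage3_gather_16bit_weights_on_model_save is a ZeRO-3 config flag, and the consolidated fp16 weights are written with the engine's save_16bit_model method, separately from the save_checkpoint call that raises the error above. A minimal sketch, where model and ds_config are placeholders rather than values from this thread:)

import deepspeed

# Sketch only: assumes ds_config contains
#   "zero_optimization": {"stage": 3,
#                         "stage3_gather_16bit_weights_on_model_save": true, ...}
# and that `model` is the nn.Module being trained.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# ... training loop ...

# Gathers the partitioned fp16 weights and writes a single consolidated
# state_dict (on rank 0). This is independent of save_checkpoint(), which
# would also try to serialize the NVMe-offloaded optimizer state.
model_engine.save_16bit_model("final_model_dir", "pytorch_model.bin")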
Yes, we do plan to add checkpointing for NVMe. To my knowledge, you are actually the first user to request it. Can you please explain a bit about your training scenario and why CPU offloading is insufficient?
I was mostly curious about using NVMe, in terms of performance trade-offs compared to CPU offloading and with different parameters. It is very surprising to me that NVMe offloading does not support checkpointing (unlike CPU offloading), and that this is not documented or even mentioned in the tutorials/blogs, since it feels like a real limitation to me.
Is there any reason why you decided not to implement checkpointing for NVMe? Is NVMe offloading mainly intended to be an inference-related feature?
One more question. When I was testing NVMe/CPU offloading (AWS, 1x NVIDIA T4 Tensor Core GPU with 125 GB NVMe), I noticed that NVMe offloading is about 3-4 times slower than CPU offloading. Is that something that can generally be expected? Can the gap get significantly larger/smaller?
Thanks for the clarification. We have not yet implemented checkpointing for NVMe due to lack of bandwidth and interest. NVMe offloading is meant for training, fine-tuning, and inference, but we have yet to see much interest in training models at the scales that require it.
NVMe offloading performance depends on the NVMe device read/write speeds. Please see #998 and here for tips on benchmarking your system.
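(For reference, the knobs that most directly affect NVMe swap throughput live in the aio section of the DeepSpeed config. The values below are the documented defaults at the time of writing, shown as a Python dict purely for illustration; the right settings are device-specific and should come from benchmarking as suggested above.)

# Illustrative only: asynchronous I/O settings used by ZeRO-Infinity when
# reading/writing offloaded tensors to NVMe. Tune these per device.
aio_config_fragment = {
    "aio": {
        "block_size": 1048576,   # bytes per I/O request
        "queue_depth": 8,        # outstanding requests per thread
        "thread_count": 1,       # parallel I/O threads
        "single_submit": False,  # submit requests one at a time vs. batched
        "overlap_events": True,  # overlap submission with completion polling
    }
}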
Thanks!
FYI, I've run into this problem as well using DeepSpeed 0.7. The scenario is fine-tuning gpt-neox-20b on a 2x RTX A6000 machine with 128 GB of RAM. On this setup I've only been able to get DeepSpeed fine-tuning to work with both optimizer and parameter offloading to NVMe.
I've run into this problem as well with finetuning BLOOM.
Revisiting given the recent interest.
Taking a look at this now.
We are also running fine-tuning on BLOOM and need NVMe offloading due to memory constraints (apparently 2 nodes with 2 TB of memory each aren't enough). I would really appreciate snapshot support with NVMe offloading.
Good to know, I'll post an update here shortly.
@aciborowska - what model were you trying to train when you first hit this?
@StevenArzt thanks, starting work on this now.
I would find this useful as well. My use case is that I'm working on a side project to make a machine translation system for a specific low resource language. I'm experimenting with large-ish decoders, on the order of 3-7B params. I want to do this for as little money as possible so I decided to use my home machine - RTX 3080 Ti with 32GB RAM.
The training works, but only with NVMe offload. It takes about two weeks to fine-tune one of these models but I'm fine with that.
I'm happy to help with either testing or implementation.
I believe supporting this feature is super important! I am training LoRA adapters (from the PEFT library) using DeepSpeed. Everything else works like magic, except for checkpointing. The communication and gather overhead of NVMe devices becomes less of a problem when fine-tuning with LoRA, since the adapters represent only a small fraction of the parameters.
I am happy to assist with testing, benchmarking, or implementation.
I'd also greatly appreciate this feature! 🙏
In the meantime, I feel like it would be nice to have DeepSpeed raise a value error or at least give a warning at the start of training that checkpointing won't work.
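(Until such a check lands in DeepSpeed itself, a user-side guard is straightforward. The sketch below is not a DeepSpeed API; it just inspects the same config dict that gets passed to deepspeed.initialize:)

import warnings

def warn_if_nvme_checkpointing(ds_config: dict) -> None:
    """User-side guard: warn up front that save_checkpoint() will raise
    NotImplementedError when ZeRO-3 is combined with NVMe offloading."""
    zero = ds_config.get("zero_optimization", {})
    offloads = (zero.get("offload_param", {}), zero.get("offload_optimizer", {}))
    if zero.get("stage") == 3 and any(o.get("device") == "nvme" for o in offloads):
        warnings.warn(
            "ZeRO-3 with NVMe offloading does not currently support "
            "save_checkpoint(); disable NVMe offloading if you need checkpoints."
        )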
Likewise adding support here that I'd extremely appreciate this feature :)
I'll prioritize this work, thanks @dblakely and @PaulScotti for your feedback
+1. Would like this feature to be supported.
+1. We really need this feature because LLMs are getting larger and larger...
@gary-young and @chongxiaoc - work is continuing on this here, please see that for status and to test the work.
Hi essene,
I am trying to fine-tune 3-7B LLM models with ZeRO-3 by offloading completely to NVMe on a single RTX 3090 (24 GB) + 2 TB SSD, but I always hit a "kill subprocess" error before the training process starts. Could you please share your experience and your ds_config.json with me?
@0781532 - I'd recommend starting a new issue to share your error output and a simple repro case if possible.