
Saving a checkpoint when training with NVMe offloading?

Open aciborowska opened this issue 3 years ago • 20 comments

I am testing NVMe offloading when training a model. When I try to save a checkpoint, I am getting (full stack trace below):

NotImplementedError: ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now.

Is that correct, i.e., is checkpointing simply not available with NVMe offloading, or am I missing something in my setup/config file? If checkpointing is not available, how can I save the model?

Full stacktrace:

[rank=0] Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 132, in <module>
    sys.exit(main(args.train_entrypoint))
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 123, in main
    controller.run()
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 296, in run
    self._run()
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 338, in _run
    self._save(path)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 732, in _save
    self.trial.save(self.context, path)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/deepspeed/_deepspeed_trial.py", line 934, in save
    m.save_checkpoint(path, tag=f"model{i}")
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2842, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3106, in _save_zero_checkpoint
    zero_sd = dict(optimizer_state_dict=self.optimizer.state_dict(),
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2699, in state_dict
    raise NotImplementedError(
NotImplementedError: ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now.

Config file:

{
  "train_batch_size": 256,
  "steps_per_print": 2000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "fp16": {
      "enabled": true,
      "fp16_master_weights_and_grads": false,
      "loss_scale": 0,
      "loss_scale_window": 500,
      "hysteresis": 2,
      "min_loss_scale": 1,
      "initial_scale_power": 15
  },
  "wall_clock_breakdown": false,
  "zero_optimization": {
      "stage": 3,
      "contiguous_gradients": true,
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_prefetch_bucket_size": 1e7,
      "stage3_param_persistence_threshold": 1e5,
      "reduce_bucket_size": 1e7,
      "sub_group_size": 1e9,
      "offload_param": {
        "device": "nvme",
        "nvme_path": "nvme0n1",
        "pin_memory": false
      },
      "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "nvme0n1",
        "pin_memory": false
    }
  }
}
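
One thing worth double-checking in the config above: the DeepSpeed docs describe nvme_path as a filesystem path to a directory on the NVMe-backed filesystem (their example is /local_nvme), not a raw block-device name such as nvme0n1. Assuming the drive were mounted at a hypothetical /local_nvme, the offload blocks would look like:

  "offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": false
  },
  "offload_optimizer": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": false
  }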

aciborowska commented Jul 08 '22

@aciborowska, sorry for the inconvenience. Model checkpointing for training with NVMe offloading is not yet available.
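
If you need to save checkpoints right now, the usual workaround is to offload to CPU instead of NVMe, since ZeRO-3 with CPU offloading does support save_checkpoint(). A minimal sketch of the config change (same zero_optimization block as in your file, with only the offload devices swapped; memory headroom will of course differ):

  "offload_param": {
    "device": "cpu",
    "pin_memory": true
  },
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  }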

tjruwase commented Jul 08 '22

Okay. In that case, after I complete the training, can I still save the model with, e.g., stage3_gather_16bit_weights_on_model_save?
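
To spell out what I mean, a rough sketch, assuming the config flag and save_16bit_model behave as described in the ZeRO-3 docs (model and ds_config here stand in for my actual setup):

  import deepspeed

  # ds_config contains, under "zero_optimization":
  #   "stage3_gather_16bit_weights_on_model_save": true
  model_engine, optimizer, _, _ = deepspeed.initialize(
      model=model, model_parameters=model.parameters(), config=ds_config
  )

  # ... training loop ...

  # Gathers the partitioned fp16 weights onto rank 0 and writes a single state dict.
  model_engine.save_16bit_model("./final_model", "pytorch_model.bin")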

Also, any plans to add checkpointing for NVMe in the near future?

aciborowska commented Jul 08 '22

Yes, we do plan to add checkpointing for NVMe. To my knowledge, you are actually the first user to request it. Could you please explain your training scenario a bit and why CPU offloading is insufficient?

tjruwase commented Jul 09 '22

I was mostly curious about NVMe offloading in terms of performance trade-offs compared to CPU offloading and with different parameters. It is very surprising to me that NVMe offloading does not support checkpointing (unlike CPU offloading), and that this fact is not documented or even mentioned in the tutorials/blogs, since (to me) it feels like a real limitation.

Is there any reason why you decided not to implement checkpointing for NVMe? Is NVMe offloading mainly intended to be an inference-related feature?

One more question. When I was testing NVMe/CPU offloading (AWS, 1x NVIDIA T4 Tensor Core GPU with a 125 GB NVMe drive), I noticed that NVMe offloading is about 3-4 times slower than CPU offloading. Is that generally to be expected? Can the gap get significantly larger or smaller?

aciborowska commented Jul 11 '22

Thanks for the clarification. We have not yet implemented checkpointing for NVMe due to a lack of bandwidth and interest. NVMe offloading is meant for training, fine-tuning, and inference, but we have yet to see much interest in training models at the scales that require it.

NVMe offloading performance depends on the NVMe device read/write speeds. Please see #998 and here for tips on benchmarking your system.

tjruwase commented Jul 11 '22

Thanks!

aciborowska commented Jul 11 '22

> Yes, we do plan to add checkpointing for NVMe. To my knowledge, you are actually the first user to request it. Could you please explain your training scenario a bit and why CPU offloading is insufficient?

FYI I've run into this problem as well using DeepSpeed 0.7. The scenario is fine-tuning gpt-neox-20b on a 2x RTX A6000 machine with 128 GB of RAM. On this setup I've only been able to get DeepSpeed fine-tuning to work with both optimizer and parameter offloading to NVMe.

timohear commented Aug 18 '22

I've run into this problem as well with finetuning BLOOM.

zyfedward commented Dec 21 '22

Revisiting given the recent interest.

tjruwase commented Dec 22 '22

Taking a look at this now.

loadams commented Feb 06 '23

We are also running fine-tuning on BLOOM and need NVMe offloading due to memory constraints (apparently two nodes with 2 TB of memory each aren't enough). I would really appreciate checkpoint support with NVMe offloading.

StevenArzt commented Feb 16 '23

Good to know, I'll post an update here shortly.

loadams commented Feb 17 '23

@aciborowska - what model were you trying to train when you first hit this?

@StevenArzt thanks, starting work on this now.

loadams commented Feb 22 '23

I would find this useful as well. My use case is that I'm working on a side project to build a machine translation system for a specific low-resource language. I'm experimenting with large-ish decoders, on the order of 3-7B params. I want to do this for as little money as possible, so I decided to use my home machine: an RTX 3080 Ti with 32 GB of RAM.

The training works, but only with NVMe offload. It takes about two weeks to fine-tune one of these models but I'm fine with that.

I'm happy to help with either testing or implementation.

eisene commented Mar 30 '23

I believe supporting this feature is super important! I am training LoRA (from the PEFT library) using DeepSpeed. Everything else works like magic, except for checkpointing. The communication and gather overhead of NVMe devices becomes less of a problem when fine-tuning with LoRA, since the adapter represents only a small fraction of the parameters.
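
As a stopgap I've been considering saving just the adapter weights myself instead of a full DeepSpeed checkpoint. A rough sketch, assuming deepspeed.zero.GatheredParameters behaves as documented and that the adapter parameters follow PEFT's "lora_" naming (note this does not cover optimizer state):

  import deepspeed
  import torch

  def save_adapter_weights(engine, path, marker="lora_"):
      # Select the (ZeRO-3 partitioned) adapter parameters by name.
      adapter_params = [(n, p) for n, p in engine.module.named_parameters() if marker in n]
      # Temporarily gather the full values of just these parameters onto rank 0.
      with deepspeed.zero.GatheredParameters([p for _, p in adapter_params], modifier_rank=0):
          if engine.global_rank == 0:
              state = {n: p.detach().cpu().clone() for n, p in adapter_params}
              torch.save(state, path)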

I am happy to assist with testing, benchmarking, or implementation.

Entropy-xcy commented May 28 '23

I'd also greatly appreciate this feature! 🙏

In the meantime, I feel like it would be nice to have DeepSpeed raise a value error or at least give a warning at the start of training that checkpointing won't work.
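
Concretely, even a small user-side guard run before training starts would help; a sketch along these lines (the keys match the config format used earlier in this thread):

  def assert_checkpointing_supported(ds_config):
      # Fail fast if the config combines ZeRO-3 with NVMe offload, since
      # save_checkpoint() will raise NotImplementedError later in training.
      zero = ds_config.get("zero_optimization", {})
      uses_nvme = any(
          zero.get(section, {}).get("device") == "nvme"
          for section in ("offload_param", "offload_optimizer")
      )
      if zero.get("stage") == 3 and uses_nvme:
          raise ValueError(
              "ZeRO-3 does not yet support checkpointing with NVMe offloading; "
              "switch the offload device to 'cpu' or skip save_checkpoint()."
          )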

dblakely commented Aug 14 '23

Likewise adding my support here; I'd greatly appreciate this feature :)

PaulScotti commented Aug 17 '23

I'll prioritize this work. Thanks @dblakely and @PaulScotti for the feedback.

loadams commented Aug 17 '23

+1. Would like this feature to be supported.

chongxiaoc commented Sep 22 '23

+1. We really need this feature because LLMs are getting larger and larger...

haotong-yang commented Dec 30 '23

@gary-young and @chongxiaoc - work on this is continuing here; please see that thread for status and to help test it.

loadams commented Jan 02 '24

> I would find this useful as well. My use case is that I'm working on a side project to build a machine translation system for a specific low-resource language. I'm experimenting with large-ish decoders, on the order of 3-7B params. I want to do this for as little money as possible, so I decided to use my home machine: an RTX 3080 Ti with 32 GB of RAM.
>
> The training works, but only with NVMe offload. It takes about two weeks to fine-tune one of these models but I'm fine with that.
>
> I'm happy to help with either testing or implementation.

Hi eisene,

I am trying to fine-tune a 3-7B LLM using ZeRO-3 by completely offloading to NVMe on a single RTX 3090 (24 GB) GPU with a 2 TB SSD, but I always hit a "kill subprocess" error before the training process starts. Could you please share your experience and your ds_config.json with me?

0781532 commented Jan 27 '24

@0781532 - I'd recommend starting a new issue to share your error output and a simple repro case if possible.

loadams commented Jan 29 '24