
[BUG] Universal checkpoint conversion failed

hongshanli23 opened this issue 1 year ago • 10 comments

Describe the bug While converting a sharded ZeRO-3 checkpoint of a LLaVA-style multimodal model, I got the following error:

""" Traceback (most recent call last): File "/scratch/hongshal/code/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 551, in main(args) File "/scratch/hongshal/code/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 525, in main _extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir) File "/scratch/hongshal/code/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 377, in _extract_zero_shard_files_stage3 _do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers) File "/scratch/hongshal/code/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 356, in _do_parallel_work results.append(f.result()) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception RuntimeError: start (241829312) + length (176) exceeds dimension size (241829312). """

To Reproduce This will be tough to reproduce, as the checkpoint is not public.

Expected behavior A clear and concise description of what you expected to happen.

ds_report output

[2024-08-02 18:31:05,140] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-02 18:31:06,422] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  FP Quantizer is using an untested triton version (2.0.0), only 2.3.0 and 2.3.1 are known to be compatible with these kernels
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (2.0.0), only 2.3.0 and 2.3.1 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.1.0a0+32f93b1
deepspeed install path ........... ['/scratch/hongshal/code/DeepSpeed/deepspeed']
deepspeed info ................... 0.14.5+unknown, unknown, unknown
torch cuda version ............... 12.2
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.2
shared memory (/dev/shm) size .... 1.91 TB

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 8 H100 on one node
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version: 3.10.12

Launcher context

python ds_to_universal.py --input_folder /path/to/checkpoint/checkpoint-97650/global_step97650/ --output_folder /path/to/checkpoint/checkpoint-97650-universal/

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

hongshanli23 avatar Aug 02 '24 18:08 hongshanli23

@hsl89 What is your DeepSpeed configuration?

xylian86 avatar Aug 03 '24 00:08 xylian86

Hi @xylian86, here is the config:

{
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "zero_allow_untested_optimizer": true,
  "gradient_clipping": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "overlap_comm": true,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 2e8,
    "stage3_max_live_parameters": 0.7e8,
    "stage3_param_persistence_threshold": 5e6,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": 100,
    "cpu_checkpointing": false
  }
}

why would the ds config play a role here?

hongshanli23 avatar Aug 03 '24 02:08 hongshanli23

Hi @hsl89 Thank you for providing the configuration. It is still challenging to pinpoint the root cause of the issue based on the current information. I would greatly appreciate it if you could share the value of param_shapes from this line: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/checkpoint/ds_to_universal.py#L162 Also, you mentioned that the hardware setup consists of 8 H100 GPUs on a single node. Could you please confirm that the Data Parallelism (DP) degree is 8?

xylian86 avatar Aug 04 '24 13:08 xylian86

The param_shapes is a huge dictionary, so I am not sure how helpful it is. Here are the first few lines of it:

model.embed_tokens.weight torch.Size([125056, 1280])
model.layers.0.self_attn.q_proj.weight torch.Size([1280, 1280])
model.layers.0.self_attn.k_proj.weight torch.Size([1280, 1280])
model.layers.0.self_attn.v_proj.weight torch.Size([1280, 1280])
model.layers.0.self_attn.o_proj.weight torch.Size([1280, 1280])
model.layers.0.mlp.gate_proj.weight torch.Size([3456, 1280])
model.layers.0.mlp.down_proj.weight torch.Size([1280, 3456])
model.layers.0.mlp.up_proj.weight torch.Size([3456, 1280])
model.layers.0.input_layernorm.weight torch.Size([1280])
model.layers.0.post_attention_layernorm.weight torch.Size([1280])
model.layers.1.self_attn.q_proj.weight torch.Size([1280, 1280])
model.layers.1.self_attn.k_proj.weight torch.Size([1280, 1280])
model.layers.1.self_attn.v_proj.weight torch.Size([1280, 1280])
model.layers.1.self_attn.o_proj.weight torch.Size([1280, 1280])
model.layers.1.mlp.gate_proj.weight torch.Size([3456, 1280])
model.layers.1.mlp.down_proj.weight torch.Size([1280, 3456])

I can confirm that the data parallel degree is 8. There are 8 sharded model and optimizer state files:

bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt	bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt	zero_pp_rank_0_mp_rank_00_model_states.pt  zero_pp_rank_4_mp_rank_00_model_states.pt
bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt	bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt	zero_pp_rank_1_mp_rank_00_model_states.pt  zero_pp_rank_5_mp_rank_00_model_states.pt
bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt	bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt	zero_pp_rank_2_mp_rank_00_model_states.pt  zero_pp_rank_6_mp_rank_00_model_states.pt
bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt	bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt	zero_pp_rank_3_mp_rank_00_model_states.pt  zero_pp_rank_7_mp_rank_00_model_states.pt
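
(A quick way to double-check the DP degree from the checkpoint folder itself; a small sketch, assuming the usual ZeRO file naming shown above and the global_step97650 folder from the launcher command:)

import glob

# One bf16_zero_pp_rank_*_optim_states.pt file is written per data-parallel rank.
optim_files = sorted(glob.glob("global_step97650/bf16_zero_pp_rank_*_optim_states.pt"))
print(len(optim_files))  # 8 here, i.e. DP degree 8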

hongshanli23 avatar Aug 09 '24 15:08 hongshanli23

Thank you for sharing it. The error RuntimeError: start (241829312) + length (176) exceeds dimension size (241829312) indicates a size overflow, which may be caused by the calculation of the partitioned_numel (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/checkpoint/ds_to_universal.py#L162-L169).
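
To make the overflow concrete: the extraction loop accumulates a per-parameter offset into the rank's flat fp32 tensor, and the error fires when offset plus the slice length runs past the tensor's end. A rough sketch of that bookkeeping (the ceil-division partition_numel below is an assumed stand-in for _zero_partitioned_param_info, not DeepSpeed's actual code):

import math

def partition_numel(unpartitioned_numel: int, dp_degree: int) -> int:
    # Assumed ZeRO-3-style partitioning: each rank owns an equal, padded slice.
    return math.ceil(unpartitioned_numel / dp_degree)

def check_offsets(param_numels: dict, flat_numel: int, dp_degree: int) -> None:
    """Walk the parameters the way the extractor does and flag the first one
    whose slice would fall outside the rank's flat shard."""
    offset = 0
    for name, numel in param_numels.items():
        part = partition_numel(numel, dp_degree)
        if offset + part > flat_numel:
            print(f"{name}: offset {offset} + {part} exceeds flat size {flat_numel}")
            return
        offset += part
    print(f"all slices fit: final offset {offset} <= {flat_numel}")

# Toy example with DP degree 8 and a flat shard that is slightly too small.
numels = {"embed_tokens.weight": 125056 * 1280, "q_proj.weight": 1280 * 1280}
flat = sum(partition_numel(n, 8) for n in numels.values())
check_offsets(numels, flat, 8)        # all slices fit
check_offsets(numels, flat - 100, 8)  # q_proj.weight overflows, like the traceback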

It is hard to pinpoint the root cause based on the current information. I would appreciate it if you could add two print statements to the code below and share the resulting log with me. If the log is lengthy, feel free to share it via drive.

    >>> print(flat_state['fp32'].numel())
    offset = 0
    for name, shape in param_shapes.items():
        unpartitioned_numel = shape.numel()
        partitioned_numel, _ = _zero_partitioned_param_info(unpartitioned_numel, dp_degree)
        padding_free_numel = min(partitioned_numel, abs(unpartitioned_numel - dp_index * partitioned_numel))
        for state_key in flat_state.keys():
            dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
                                padding_free_numel)
        offset += partitioned_numel
        >>> print(f"dp_index={dp_index}\t{unpartitioned_numel}\t{partitioned_numel}\t{padding_free_numel}\t{offset}")

xylian86 avatar Aug 10 '24 05:08 xylian86

I am facing the same issue when I try to convert a DeepSpeed ZeRO-3 checkpoint (DP degree 16) to a universal checkpoint:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/xxx/miniforge3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "ds_to_universal.py", line 168, in extract_zero_shards_stage3
    dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
  File "ds_to_universal.py", line 196, in dump_param_fragment
    state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
RuntimeError: start (1010384896) + length (512) exceeds dimension size (1010384896).
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ds_to_universal.py", line 551, in <module>
    main(args)
  File "ds_to_universal.py", line 525, in main
    _extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
  File "ds_to_universal.py", line 377, in _extract_zero_shard_files_stage3
    _do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
  File "ds_to_universal.py", line 356, in _do_parallel_work
    results.append(f.result())
  File "/home/xxx/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/xxx/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: start (1010384896) + length (512) exceeds dimension size (1010384896).

Anhelor avatar Sep 10 '24 06:09 Anhelor

@Anhelor Thank you for reporting the issue. As I mentioned above, I would appreciate it if you could add two print statements to the code below and share the resulting log with me. If the log is lengthy, feel free to share it via drive.

    >>> print(flat_state['fp32'].numel())
    offset = 0
    for name, shape in param_shapes.items():
        unpartitioned_numel = shape.numel()
        partitioned_numel, _ = _zero_partitioned_param_info(unpartitioned_numel, dp_degree)
        padding_free_numel = min(partitioned_numel, abs(unpartitioned_numel - dp_index * partitioned_numel))
        for state_key in flat_state.keys():
            dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
                                padding_free_numel)
        offset += partitioned_numel
        >>> print(f"dp_index={dp_index}\t{unpartitioned_numel}\t{partitioned_numel}\t{padding_free_numel}\t{offset}")

xylian86 avatar Sep 10 '24 14:09 xylian86

nohup.out.txt

I added these print statements in ds_to_universal.py; the output is in the attached nohup.out.txt.

Anhelor avatar Sep 11 '24 01:09 Anhelor

RuntimeError: start (1000457984) + length (3784704) exceeds dimension size (1000457984). Same error here, any solutions?

Sander-houqi avatar Sep 26 '24 07:09 Sander-houqi

@Anhelor Thank you for sharing the log. I checked the log and confirmed that the calculation of the partitioned_numel is correct. However, it appears that the saved checkpoints have different shape information than the metadata. Could you please provide the value of param_shapes?
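
(If it helps, here is a sketch of how to dump them straight from the checkpoint; this assumes the usual ZeRO layout where each zero_pp_rank_*_mp_rank_00_model_states.pt stores a 'param_shapes' entry, so adjust the path and keys if your checkpoint differs:)

import torch

# Load one rank's model-states file and print the recorded (unpartitioned) shapes.
ckpt = torch.load("global_stepXXXX/zero_pp_rank_0_mp_rank_00_model_states.pt",
                  map_location="cpu")

param_shapes = ckpt["param_shapes"]
groups = param_shapes if isinstance(param_shapes, list) else [param_shapes]

total = 0
for group in groups:                 # each group maps param name -> torch.Size
    for name, shape in group.items():
        total += shape.numel()
        print(name, tuple(shape))
print("total unpartitioned numel:", total)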

xylian86 avatar Oct 02 '24 18:10 xylian86

I encountered a similar issue as well: [screenshot of the error]

checkpoints tree: [screenshot of the checkpoint directory]

model: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct dataset: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k

The code I used is as follows:

import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from accelerate import Accelerator

data_files = {
    "train": "./everyday-conversations-llama3.1-2k/data/train_sft-00000-of-00001.parquet",
    "test": "./everyday-conversations-llama3.1-2k/data/test_sft-00000-of-00001.parquet"
}

ds = load_dataset("parquet", data_files=data_files)
train_dataset = ds['train']
validation_dataset = ds['test']

model_name = "/mnt/Meta-Llama-3.1-8B-Instruct"  # or "./meta-llamaLlama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokenized_inputs = tokenizer(
        examples['prompt'],
        text_pair=examples['completion'],
        truncation=True,
        padding='max_length',
        max_length=512
    )
    tokenized_inputs['labels'] = tokenized_inputs['input_ids'].copy()
    return tokenized_inputs


train_dataset = train_dataset.map(tokenize_function, batched=True)
validation_dataset = validation_dataset.map(tokenize_function, batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
accelerator = Accelerator()

training_args = TrainingArguments(
    output_dir="./llama3-8b-guojia",
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_dir="./logs",
    eval_strategy="epoch",
    save_strategy="steps",
    save_total_limit=3,
    save_steps=50,
    bf16=True,
    fp16=False,

    deepspeed="ds_config.json",
    # fsdp_config="fsdp_config.json",
    # accelerator_config="megatron_config.json"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    # train_batch_size=1,
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--with_checkpoint", type=int, default=0, help="form checkpoint train")
    args = parser.parse_args()

    if args.with_checkpoint == 1:
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()



The ds_config.json used:

{ "steps_per_print": 5, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "zero_allow_untested_optimizer": true, "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "universal_checkpoint": true, "checkpoint": { "tag_validation": "Warn", "load_universal": false, "use_node_local_storage": true, "parallel_write": { "pipeline_stage": false } }, "activation_checkpointing": { "partition_activations": true, "cpu_checkpointing": false, "contiguous_memory_optimization": true, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": true } }

guojia99 avatar Oct 22 '24 10:10 guojia99

@guojia99 Thank you for sharing the detailed script and config - I've been able to reproduce the issue on my end and implement a fix. Could you please test the fix using the code in this branch: https://github.com/xylian86/DeepSpeed/tree/fix_ucp_conversion_zero3?

xylian86 avatar Oct 22 '24 19:10 xylian86

@xylian86 Thank you for your correction. I have successfully completed the conversion, and I will test later whether the converted checkpoint can be reloaded.

guojia99 avatar Oct 23 '24 02:10 guojia99

I used the Meta-Llama-3.1-8B-Instruct model to generate a DeepSpeed checkpoint, then tried using the script, but encountered an error: [screenshot of the error]

So I added the print statement shown below and found that the requested slice exceeded the tensor's length.

def extract_zero_shards_stage3(optim_files, param_shapes, dp_degree, temp_dir, dp_index):
    ...
    for name, shape in param_shapes.items():
        ... 
        for state_key in flat_state.keys():
            result = flat_state[state_key]
            print(f"===> result {type(result)}, key: {state_key}, size: {result.size()}, shape: {result.shape}")
            dump_param_fragment(temp_dir, 0, dp_index, state_key, result, name, offset,
                                padding_free_numel)
        offset += partitioned_numel

Is this a bug or is there an issue with the model generating the checkpoint?

guojia99 avatar Oct 29 '24 11:10 guojia99

@guojia99

i have successfully completed the conversion

encountered an error.

Thank you for reporting the issues. I am curious why, for the same model, the conversion results are different. Did you change the DeepSpeed config when training the model?

xylian86 avatar Oct 29 '24 14:10 xylian86

@xylian86 I have tried various model sizes, such as 1B, 7B, and 8B. However, only the 8B model runs into issues after training, converting, and then being processed by the script. I'm curious about this, so I plan to spend some time next week investigating whether there's an issue with the model's output.

guojia99 avatar Nov 08 '24 10:11 guojia99

I used the Llama-8B model to retry the above experiment, but it encountered an error: TypeError: 'NoneType' object is not subscriptable. [screenshot of the error]

The directories for the generated data and checkpoints used are as follows: [screenshot of the directory tree]

Using this ds config:

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1000000000.0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1000000000.0,
    "stage3_max_reuse_distance": 1000000000.0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "universal_checkpoint": true,
  "checkpoint": {
    "tag_validation": "Warn",
    "load_universal": true,
    "use_node_local_storage": true,
    "parallel_write": {
      "pipeline_stage": false
    }
  }
}

and train.py

import argparse
import time
import loguru

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from accelerate import Accelerator

data_files = {
    "train": "./everyday-conversations-llama3.1-2k/data/train_sft-00000-of-00001.parquet",
    "test": "./everyday-conversations-llama3.1-2k/data/test_sft-00000-of-00001.parquet"
}

ds = load_dataset("parquet", data_files=data_files)
train_dataset = ds['train']
validation_dataset = ds['test']

model_name = "/data-hpfs/Meta-Llama-3.1-8B-Instruct" # "./meta-llamaLlama-3.2-1B"  # "./meta-llamaLlama-3.2-1B" # "/mnt/Meta-Llama-3.1-8B-Instruct"  # "./meta-llamaLlama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokenized_inputs = tokenizer(
        examples['prompt'],
        text_pair=examples['completion'],
        truncation=True,
        padding='max_length',
        max_length=512
    )
    tokenized_inputs['labels'] = tokenized_inputs['input_ids'].copy()
    return tokenized_inputs


train_dataset = train_dataset.map(tokenize_function, batched=True)
validation_dataset = validation_dataset.map(tokenize_function, batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
accelerator = Accelerator()

training_args = TrainingArguments(
    output_dir="./guojia_output/llama8B",
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_dir="./logs",
    eval_strategy="epoch",
    save_strategy="steps",
    save_total_limit=10,
    save_steps=70,
    bf16=True,
    # fp16=False,
    save_safetensors=True,

    deepspeed="ds_config.json",
    # fsdp_config="fsdp_config.json",
)

# use my trainer with 'transformers.Trainer'
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--with_checkpoint", type=int, default=0, help="form checkpoint train")
    args = parser.parse_args()
    ts = time.time()
    trainer.train(resume_from_checkpoint=args.with_checkpoint == 1)
    loguru.logger.warning(f"[T] use time {time.time() - ts:.6f}s")

The reproduction steps are as follows:

  1. Use the train.py script to generate a ZeRO-3 checkpoint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --nproc_per_node=4 train.py
  2. Convert this checkpoint to a universal checkpoint using the checkpoint conversion script: python ds_to_u.py --input_folder xxx --output_folder yyy
  3. Update ds_config.json to enable load_universal, then restart training and resume from the universal checkpoint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --nproc_per_node=4 train.py --with_checkpoint=1

model: https://huggingface.co/meta-llama/Llama-3.2-8B-Instruct dataset: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k

guojia99 avatar Nov 13 '24 05:11 guojia99

I was using the older version 0.14.4 of DeepSpeed, as mentioned above. After updating to 0.15.2, I encountered the following error: [rank2]: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group. [screenshot of the error]

guojia99 avatar Nov 13 '24 09:11 guojia99

@guojia99 Thank you for reporting the issue. Could you please try this branch: https://github.com/xylian86/DeepSpeed/tree/fix_ucp_conversion_zero3?

xylian86 avatar Nov 16 '24 00:11 xylian86

Closing due to lack of activity. Please re-open as needed.

tjruwase avatar Dec 13 '24 21:12 tjruwase

Thanks @guojia99 for the fix, it works.

I was able to convert the very same ZeRO-3 checkpoint without/with universal checkpointing and it produced the same values for the state dictionary, e.g.:

# zero_to_fp32.py
print(state_dict["layers.1.ln_1.weight"].shape)
>>> torch.Size([5120])

print(state_dict["layers.1.ln_1.weight"].contiguous()[190:205])
>>> tensor([2.3908e-02, 2.2656e-02, 3.0095e-02, 2.4517e-02, 3.4772e-02, 2.7419e-02,
        2.6703e-02, 2.5958e-02, 1.3296e-04, 1.5781e-01, 2.3766e-02, 3.8483e-02,
        2.8044e-02, 3.1916e-02, 2.6807e-02], grad_fn=<SliceBackward0>)
# ds_to_universal.py
print(state_dict["layers.1.ln_1.weight"].shape)
>>> torch.Size([5120])

print(state_dict["layers.1.ln_1.weight"].contiguous()[190:205])
>>> tensor([2.3908e-02, 2.2656e-02, 3.0095e-02, 2.4517e-02, 3.4772e-02, 2.7419e-02,
        2.6703e-02, 2.5958e-02, 1.3296e-04, 1.5781e-01, 2.3766e-02, 3.8483e-02,
        2.8044e-02, 3.1916e-02, 2.6807e-02], grad_fn=<SliceBackward0>)
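
(Beyond eyeballing a slice, the two recovered state dicts can also be diffed programmatically; a small sketch, assuming both have already been materialized as plain {name: tensor} dicts:)

import torch

def compare_state_dicts(a: dict, b: dict) -> None:
    """Report any key, shape, or value mismatch between two recovered state dicts."""
    assert a.keys() == b.keys(), "key sets differ"
    for k in a:
        ta, tb = a[k].detach().float(), b[k].detach().float()
        if ta.shape != tb.shape or not torch.equal(ta, tb):
            print("mismatch:", k)
    print("comparison done")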

@tjruwase Could this branch please be merged to main?

https://github.com/xylian86/DeepSpeed/tree/fix_ucp_conversion_zero3

gugarosa avatar Apr 30 '25 15:04 gugarosa