
[bug] DeepSpeed ZeRO++ multi-node with Liger Kernel

Open • SoundProvider opened this issue 8 months ago • 2 comments

🐛 Describe the bug

DeepSpeed ZeRO++ config

  • I ran the training with Slurm
{
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": true,
        "reduce_bucket_size": "auto",

        "zero_hpz_partition_size": 8,
        "zero_quantized_weights": true,
        "zero_quantized_gradients": true,

        "contiguous_gradients": true,
        "overlap_comm":true
    },
    "bf16": {
        "enabled": true
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
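
For reference, a minimal sketch of how a config like this is typically passed to the Hugging Face Trainer, assuming the JSON above is saved as ds_zeropp.json (a placeholder filename, not the reporter's actual path):

# Minimal wiring sketch (not the reporter's script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    bf16=True,                      # matches "bf16": {"enabled": true}
    deepspeed="ds_zeropp.json",     # the ZeRO++ JSON shown above
    per_device_train_batch_size=1,  # the "auto" fields are resolved by the Trainer
)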

error message

  ...
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/liger_kernel/transformers/fused_linear_cross_entropy.py", line 38, in forward
  ...
  File "/usr/local/lib/python3.10/dist-packages/liger_kernel/ops/fused_linear_cross_entropy.py", line 77, in fused_linear_cross_entropy_forward
    logits_chunk = _input_chunk @ weight.t()  # chunk_size x V
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::Half != c10::BFloat16

Reproduce

Sorry, I can't release the code.
Training a Llama model with the Hugging Face Trainer (together with the config above) reproduces the same error; a rough sketch of such a setup is below.
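
A hedged reproduction sketch along those lines (placeholder checkpoint, dataset, and file names; not the reporter's actual setup):

# Hedged reproduction sketch: Hugging Face Trainer + Llama + Liger Kernel with the
# ZeRO++ config from the top of this issue. All names below are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# AutoLigerKernelForCausalLM patches the Llama modules with Liger kernels,
# including the fused linear cross entropy loss that appears in the traceback.
model = AutoLigerKernelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
            batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_zeropp.json",      # the ZeRO++ config shown above
    per_device_train_batch_size=1,
    max_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
# Launched across nodes with Slurm wrapping the DeepSpeed launcher (exact launch script not shared).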

Versions

Environment Report:
Operating System: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Liger Kernel version: 0.5.4
PyTorch version: 2.2.0a0+81ea7a4
CUDA version: 12.3
HIP(ROCm) version: Not available
Triton version: 3.0.0
Transformers version: 4.49.0

SoundProvider • Mar 11 '25 18:03