
[BUG] Parameter CUDA alignment issue

Open achicu opened this issue 1 year ago • 5 comments

Describe the bug
An error is raised when using a Linear module with the bfloat16 data type.

File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward │
return F.linear(input, self.weight, self.bias) │
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 512 n 4 k 512 mat1_ld 512 mat2_ld 512 result_ld 512 abcType 14 computeType 68 scaleType 0

To Reproduce
Steps to reproduce the behavior:

  1. Use a model with bfloat16.
  2. Allocate a parameter of odd size (e.g. 3).
  3. Add a Linear module (a minimal sketch of such a setup follows this list).
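
A minimal sketch of this kind of setup (illustrative only, not my exact code; `ToyModel`, the script name, and the config values are placeholders):

```python
# repro_sketch.py -- run with the DeepSpeed launcher, e.g.: deepspeed repro_sketch.py
import torch
import deepspeed


class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Odd-sized parameter first: 3 bf16 elements = 6 bytes, so anything
        # packed after it in a flat bf16 buffer starts at a non-4-byte offset.
        self.param1 = torch.nn.Parameter(torch.ones(3))
        self.linear = torch.nn.Linear(512, 512)

    def forward(self, x):
        return self.linear(x) + self.param1.sum()


model = ToyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "Adam", "params": {"lr": 0.00015}},
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    },
)

x = torch.randn(4, 512, device=engine.device, dtype=torch.bfloat16)
loss = engine(x).float().mean()  # forward hits F.linear with the bf16 weight
engine.backward(loss)
engine.step()
```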

Expected behavior
No error should be thrown.

Additional context

The issue is due to memory alignment. The failing CUDA call expects buffers to be aligned to 4 bytes, with a recommendation of 16. PyTorch tensors are aligned to 256 bytes by default, so the regular, non-DeepSpeed case works. When DeepSpeed ZeRO stage 2 is used, the parameters are flattened into one larger tensor, and no padding is inserted between them during flattening. As a result, if a tensor in the middle has an odd number of elements, every tensor after it becomes misaligned.

Note that this is not a big deal for float32, because each element is naturally aligned to 4 bytes. Bfloat16 exposes the problem: a tensor of size 1 uses only 2 bytes, so the next tensor in the flattened buffer starts at byte offset 2 and its address is no longer 4-byte aligned.
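
For illustration, a small self-contained sketch (plain PyTorch, not DeepSpeed internals; the parameter sizes are made up to mirror the scenario above) that computes the byte offsets of parameters packed back-to-back into one bf16 buffer:

```python
import torch

# Concatenating bf16 parameters back-to-back, the way ZeRO stage 2 flattening
# does conceptually, leaves later tensors at byte offsets that are not
# multiples of 4.
params = [
    torch.ones(1, dtype=torch.bfloat16),         # 1 element -> 2 bytes
    torch.ones(512, 512, dtype=torch.bfloat16),  # e.g. a Linear weight
]

flat = torch.cat([p.flatten() for p in params])

offset_elems = 0
for i, p in enumerate(params):
    byte_offset = offset_elems * flat.element_size()  # element_size() == 2 for bf16
    print(f"param {i}: byte offset {byte_offset}, "
          f"4-byte aligned: {byte_offset % 4 == 0}")
    offset_elems += p.numel()
```

With these sizes, param 0 starts at byte 0 (aligned) and param 1 starts at byte 2 (misaligned).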

achicu avatar Jan 13 '23 01:01 achicu

@achicu, thanks for reporting this issue. We will investigate. If you could please share repro code, that would be greatly appreciated.

tjruwase avatar Jan 17 '23 22:01 tjruwase

@achicu, are you still seeing this issue?

tjruwase avatar Jan 24 '23 13:01 tjruwase

@tjruwase I found a hacky workaround. Trying to create a quick example now.

achicu avatar Jan 26 '23 07:01 achicu

@tjruwase I've put together a simple example and created a gist, linked below. See the gist for a VS Code launcher snippet showing how to run it.

https://gist.github.com/achicu/884e18d6ec599d1d2db574dd056a16ad

Reproduced with PyTorch built from commit https://github.com/pytorch/pytorch/commit/ce9963e6ba0e40b8307477f5b4113733e7a30ec2

achicu avatar Jan 26 '23 09:01 achicu

Hi @achicu, I could not reproduce this issue with the example you provided (with either self.param1 = torch.nn.Parameter(torch.ones(1)) or self.param1 = torch.nn.Parameter(torch.ones(2)); both runs reach the passing point). Could you please check the example, or my configs below, to make sure it reproduces?

[2023-02-09 07:16:41,652] [INFO] [config.py:1017:print]   optimizer_legacy_fusion ...... False
[2023-02-09 07:16:41,652] [INFO] [config.py:1017:print]   optimizer_name ............... adam
[2023-02-09 07:16:41,652] [INFO] [config.py:1017:print]   optimizer_params ............. {'lr': 0.00015}
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   pld_enabled .................. False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   pld_params ................... False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   prescale_gradients ........... False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   scheduler_name ............... None
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   scheduler_params ............. None
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   sparse_attention ............. None
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   sparse_gradients_enabled ..... False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   steps_per_print .............. 10
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   train_batch_size ............. 8
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   train_micro_batch_size_per_gpu  4
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   use_node_local_storage ....... False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   wall_clock_breakdown ......... False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   world_size ................... 2
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   zero_allow_untested_optimizer  False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   zero_enabled ................. True
[2023-02-09 07:16:41,653] [INFO] [config.py:1017:print]   zero_optimization_stage ...... 2
[2023-02-09 07:16:41,653] [INFO] [config.py:1002:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00015
        }
    },
    "bf16": {
        "enabled": true
    },
    "data_types": {
        "grad_accum_dtype": "bf16"
    },
    "zero_optimization": {
        "stage": 2
    },
    "communication_data_type": "bfp16"
}

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00046181678771972656 seconds
test passes if it reaches this point
test passes if it reaches this point
[2023-02-09 07:16:43,493] [INFO] [launch.py:350:main] Process 52601 exits successfully.
[2023-02-09 07:16:44,494] [INFO] [launch.py:350:main] Process 52602 exits successfully.

My ds_report

torch version .................... 1.13.1+cu117
deepspeed info ................... 0.8.1+4af1f76a, 4af1f76a, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

ShijieZZZZ avatar Feb 09 '23 02:02 ShijieZZZZ

@achicu, feel free to re-open if you're still seeing issues here.

ShijieZZZZ avatar Feb 24 '23 18:02 ShijieZZZZ

Same error. Is there a solution to this problem?

pUmpKin-Co avatar May 13 '23 17:05 pUmpKin-Co

Met the same issue in fp16 mode.

xingchensong avatar May 14 '23 11:05 xingchensong

I made it work by setting DISABLE_ADDMM_CUDA_LT=1 before running the DeepSpeed scripts, i.e. DISABLE_ADDMM_CUDA_LT=1 deepspeed ... I think the following discussion may help: discussion.

pUmpKin-Co avatar May 15 '23 07:05 pUmpKin-Co

It seems to be related to CUDA; solved by downgrading the CUDA version from 11.7 to 11.3.

ref: https://forums.developer.nvidia.com/t/cublas-status-not-supported-for-bf16-cuda11-6-pytorch/238607

xingchensong avatar May 16 '23 09:05 xingchensong