DeepSpeed [BUG] Deepspeed hangs when setting ds_accelerator to cuda via VS Code Debugger

Describe the bug When initiating a debug session for the LLaVA training code utilizing DeepSpeed within VS Code, the execution halts indefinitely at the following log message:

[INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

This issue only arises during debugging sessions in VS Code and is not observed when the code is run normally.
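
One way to find out where the process is stuck (a sketch, not something tried in this report) is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 60-second default are assumptions made for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, stream=None):
    """Dump all thread stacks to `stream` every `timeout_s` seconds until disarmed."""
    # Assumption: `stream` is a real file with a file descriptor (sys.stderr in a terminal is).
    stream = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Cancel the pending dumps once startup has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at the top of the training script and `disarm_hang_watchdog()` once training starts would print the stack of whatever call is blocking if startup takes longer than the timeout.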

To Reproduce Steps to reproduce the behavior:

Configure the VS Code debugger for a Python environment, following the instructions in https://github.com/microsoft/DeepSpeed/issues/938#issuecomment-1544188631 to set the debugger config.
Start a debug session for the LLaVA training script configured with DeepSpeed.
Observe the process hanging at the get_accelerator log message.

Expected behavior I expected the debugger to start training, as it does when the script is run from a terminal.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.1.0a0+32f93b1
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 12.2
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.2
shared memory (/dev/shm) size .... 62.80 GB

System info (please complete the following information):

Ubuntu 22.04
One machine with 4x RTX 3090 GPUs
Python 3.10.12

Launcher context LLaVA v1.5 finetune_lora

Docker context Nvidia Pytorch Docker 23.10

Apr 05 '24 05:04 xiaobaishu0097

I ran into the same problem and solved it by modifying launch.json. Here it is:

{

// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
    {
        "name": "Python: Debug DeepSpeed",
        "type": "python",
        "request": "launch",
        "program": "/data/anaconda3/envs/mjp_nips/bin/deepspeed",
        "justMyCode": false,
        "console": "integratedTerminal",
        "args": [
            "--include","localhost:7",
            "/data/LLaVA/llava/train/train_mem.py",
            "--lora_enable=True","--lora_r=64","--lora_alpha=256","--mm_projector_lr=2e-5",
            "--deepspeed=./scripts/zero2.json" ,
            "--model_name_or_path=/data/LLaVA/llava-v1.5-7b" ,
            "--version=v1" ,
            "--data_path=/data/LLaVA/playground/data/textvqa.json" ,
            "--image_folder=./playground/data" ,
            "--vision_tower=openai/clip-vit-large-patch14-336" ,
            "--mm_projector_type=mlp2x_gelu" ,
            "--mm_vision_select_layer=-2" ,
            "--mm_use_im_start_end=False" ,
            "--mm_use_im_patch_token=False" ,
            "--image_aspect_ratio=pad" ,
            "--group_by_modality_length=True" ,
            "--fp16=True" ,
            "--output_dir=/data/LLaVA/checkpoints/result" ,
            "--num_train_epochs=1" ,
            "--per_device_train_batch_size=1" ,
            "--per_device_eval_batch_size=4" ,
            "--gradient_accumulation_steps=1" ,
            "--evaluation_strategy=no" ,
            "--save_strategy=steps" ,
            "--save_steps=50000" ,
            "--save_total_limit=1" ,
            "--learning_rate=2e-4" ,
            "--weight_decay=0." ,
            "--warmup_ratio=0.03" ,
            "--lr_scheduler_type=cosine" ,
            "--logging_steps=1" ,
            "--tf32=False" ,
            "--model_max_length=2048" ,
            "--gradient_checkpointing=True" ,
            "--dataloader_num_workers=4" ,
            "--lazy_preprocess=True" ,
            "--report_to=wandb"
        ]
    }        
]

}

Hope this helps. Good luck!
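
The `args` array above is the DeepSpeed command line split into one string per token. A minimal sketch of producing such a configuration from an existing `deepspeed ...` command follows; the function name `command_to_launch_config` is hypothetical, the launcher path is supplied by the caller, and the default `"type"` simply copies the config above (the other config in this thread uses `"debugpy"`).

```python
import json
import shlex


def command_to_launch_config(command, deepspeed_path,
                             name="Python: Debug DeepSpeed", debug_type="python"):
    """Build one launch.json configuration from a single-line `deepspeed ...` command."""
    tokens = shlex.split(command)  # shell-style splitting, honours quotes
    if not tokens or tokens[0] != "deepspeed":
        raise ValueError("expected a command that starts with 'deepspeed'")
    return {
        "name": name,
        "type": debug_type,
        "request": "launch",
        "program": deepspeed_path,  # absolute path to the deepspeed launcher script
        "justMyCode": False,
        "console": "integratedTerminal",
        "args": tokens[1:],         # everything after the launcher name
    }


if __name__ == "__main__":
    cmd = ("deepspeed --include localhost:7 /data/LLaVA/llava/train/train_mem.py "
           "--lora_enable=True --fp16=True")
    config = command_to_launch_config(cmd, "/data/anaconda3/envs/mjp_nips/bin/deepspeed")
    print(json.dumps({"version": "0.2.0", "configurations": [config]}, indent=4))
```

Generating the list this way avoids hand-copying dozens of flags and keeps the debugger's arguments identical to the ones used in the terminal.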

Apr 16 '24 07:04 ylnxxts

Thanks for your kind help. However, my problem has not been solved. Here is my debugger config:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "LLaVA 7B v1.5 finetune-lora",
            "type": "debugpy",
            "request": "launch",
            "program": "/usr/local/bin/deepspeed",
            "justMyCode": false,
            "console": "integratedTerminal",
            "args": [
                "--include", "localhost:1",
                "/workspace/LLaVA/llava/train/train_mem.py",
                "--lora_enable", "True",
                "--lora_r", "128",
                "--lora_alpha", "256",
                "--mm_projector_lr", "2e-5",
                "--deepspeed", "./scripts/zero2.json",
                "--model_name_or_path", "./checkpoints/vicuna-7b-v1.5",
                "--version", "v1",
                "--data_path", "./playground/data/LLaVA-Instruct-150K/llava_v1_5_mix665k.json",
                "--image_folder", "./playground/data/",
                "--vision_tower", "./checkpoints/clip-vit-large-patch14-336",
                "--pretrain_mm_mlp_adapter", "./checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin",
                "--mm_projector_type", "mlp2x_gelu",
                "--mm_vision_select_layer", "-2",
                "--mm_use_im_start_end", "False",
                "--mm_use_im_patch_token", "False",
                "--image_aspect_ratio", "pad",
                "--group_by_modality_length", "True",
                "--bf16", "True",
                "--output_dir", "./checkpoints/llava-v1.5-7b-lora-debug",
                "--num_train_epochs", "1",
                "--per_device_train_batch_size", "2",
                "--per_device_eval_batch_size", "4",
                "--gradient_accumulation_steps", "1",
                "--evaluation_strategy", "no",
                "--save_strategy", "steps",
                "--save_steps", "50000",
                "--save_total_limit", "1",
                "--learning_rate", "2e-4",
                "--weight_decay", "0.",
                "--warmup_ratio", "0.03",
                "--lr_scheduler_type", "cosine",
                "--logging_steps", "1",
                "--tf32", "True",
                "--model_max_length", "2048",
                "--gradient_checkpointing", "True",
                "--lazy_preprocess", "True",
                "--dataloader_num_workers", "4",
                "--report_to", "none"
            ]
        }
    ]
}

Would you mind telling me what modification you made to run the debugger?

Apr 17 '24 04:04 xiaobaishu0097

First, I removed the "--pretrain_mm_mlp_adapter" argument, because it leads to an error when loading the checkpoint into the mlp_adapter (I don't know whether this error is common; you could try removing it too).

Second, if you run LLaVA on a V100, set "--fp16=True", "--bf16=False", and "--tf32=False", and run train.py instead of train_mem.py.

Also, you can try decreasing "--lora_r" if your memory is limited. I run LLaVA-7B with 32 GB of memory, and it takes almost 26 GB.

Hope this helps!
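
The precision advice above can be sketched as a small helper. It assumes that bf16 and TF32 are only enabled on GPUs with CUDA compute capability 8.0 or higher (a V100 is 7.0, an RTX 3090 is 8.6); the function name `precision_flags` is hypothetical and not part of LLaVA or DeepSpeed.

```python
def precision_flags(compute_capability):
    """Return LLaVA-style precision flags for a (major, minor) compute capability."""
    major, _minor = compute_capability
    # Assumption: bf16 and TF32 are used only on compute capability 8.0+ (Ampere or newer).
    ampere_or_newer = major >= 8
    return [
        f"--fp16={not ampere_or_newer}",
        f"--bf16={ampere_or_newer}",
        f"--tf32={ampere_or_newer}",
    ]


if __name__ == "__main__":
    print(precision_flags((7, 0)))  # V100
    print(precision_flags((8, 6)))  # RTX 3090
```

On a machine with PyTorch installed, the tuple could come from `torch.cuda.get_device_capability()`.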

Apr 17 '24 06:04 ylnxxts

By the way, I think it is also important to use absolute paths instead of relative paths. My guess is that the debugger cannot find some of the paths.
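
A minimal sketch of that suggestion: rewrite the relative values in a launch.json-style `args` list as absolute paths. The helper name `absolutize_args` is hypothetical, and treating only values that start with `./` or `../` as paths is an assumption, chosen so that Hugging Face model IDs such as `openai/clip-vit-large-patch14-336` are left alone.

```python
import os


def absolutize_args(args, base_dir):
    """Rewrite relative path values in an args list as absolute paths under base_dir."""
    result = []
    for arg in args:
        flag, sep, value = arg.partition("=")
        if arg.startswith("--") and sep:
            # "--flag=value" form: only the value part can be a path
            if value.startswith(("./", "../")):
                value = os.path.normpath(os.path.join(base_dir, value))
            result.append(f"{flag}={value}")
        elif arg.startswith(("./", "../")):
            # bare value following a separate "--flag" token
            result.append(os.path.normpath(os.path.join(base_dir, arg)))
        else:
            result.append(arg)
    return result


if __name__ == "__main__":
    print(absolutize_args(["--deepspeed=./scripts/zero2.json", "--version=v1"],
                          "/workspace/LLaVA"))
```

An alternative that may have the same effect is setting the `cwd` attribute in launch.json to the repository root, so that relative paths resolve as they do in the terminal.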

Apr 17 '24 06:04 ylnxxts

@ylnxxts Thank you so much for the prompt help.

I have tried all of your suggestions, but none of them works inside the Docker container. However, when I switched to a conda virtual environment, the VS Code debugger seems to work fine. I suspect there is some issue with DeepSpeed working with the debugger in the NVIDIA PyTorch Docker image. I hope it will be fixed in the future.

Apr 18 '24 05:04 xiaobaishu0097
