
NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Open kushalj001 opened this issue 1 year ago • 13 comments

System Info

torch: 2.1.0.dev20230819+cu118, cuda: 11.8, GPU type: A100 80GB, #GPUs: 2

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

🐛 Describe the bug

I am working on a slightly modified RL algorithm to finetune llama 7B. I keep running into the error below and have no idea how to debug it further, as it does not leave an easy-to-understand stack trace, so I cannot tell what is triggering it.

Error logs

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2ae21a709647 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2ae21a6c58f9 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2ae21a5d3588 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x2ae1c7b5db90 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x2ae1c7b619b8 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x2ae1c7b781db in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x2ae1c7b784e8 in /cluster/project/sachan/kushal/math/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xb94cf (0x2ae17a7a94cf in /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7ea5 (0x2ae16fcf1ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x2ae17070db0d in /lib64/libc.so.6)

Fatal Python error: Aborted

Expected behavior

Any help/directions to debug this would be helpful.

kushalj001 avatar Aug 20 '23 06:08 kushalj001

Hi, I've seen this error message in different places and it seems to be a side effect rather than the actual cause of the crash. Can you elaborate a bit more on your script? Are you able to create a minimal repro for this?

mreso avatar Sep 01 '23 06:09 mreso

Hi, I am finetuning llama-2-7b with a custom RL algorithm. It basically involves generating samples from the model given a prompt and some reward shaping. I've been trying to debug this with compute-sanitizer but I run into this error:

========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

There's an extended conversation around this here and here. Let me know if this helps any further. I am trying to create a minimal repro but finding it tricky, as the error is reproducible only in a distributed setting with a specific dataset, so reproducing it would require someone else to put in considerable effort. Thanks!
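For reference, a minimal distributed skeleton along these lines (generic torch.distributed boilerplate, none of it taken from the actual training code) is roughly what such a repro would be pared down to, re-adding pieces of the real loop until the illegal memory access reappears:

# Generic skeleton for a minimal distributed repro; launch with e.g.
#   torchrun --nproc_per_node=2 repro.py
# Nothing here is specific to the failing script.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(x)        # stand-in for the collective the NCCL watchdog monitors
    torch.cuda.synchronize()  # surfaces any asynchronous CUDA error right here
    print(f"rank {rank}: ok, sum of ranks = {int(x[0].item())}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()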

kushalj001 avatar Sep 01 '23 07:09 kushalj001

@kushalj001

I'm also facing a similar error; like you said, it can be reproduced but is very tricky to reduce to a small code snippet. My project is a Llava-style decoder using llama2 7b. Did you make any progress in debugging it?

[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c58b63d87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2c58b1475f in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2c58c348a8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const + 0x7e (0x7f2c59d072ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isStarted() + 0x58 (0x7f2c59d0b458 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x303 (0x7f2c59d0eda3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2c59d0f839 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f2ca9963df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f2cae286609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2cae3c0133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Andcircle avatar Mar 07 '24 04:03 Andcircle

@Andcircle I have recently seen it on H100s but couldn't repro it on A100s yet. Can you please provide more details on your env, including the pytorch, cuda, HF, bitsandbytes, and accelerate versions?

HamidShojanazeri avatar Mar 07 '24 16:03 HamidShojanazeri

@HamidShojanazeri thanks for your response

cuda 12.2, nccl 2.19.3, torch 2.2.0, transformers 4.37.2, trl 0.7.10, accelerate 0.27.2, bitsandbytes 0.42.0

Andcircle avatar Mar 07 '24 18:03 Andcircle

Can you please give torch 2.2.1 and cuda 12.1 a try? @Andcircle

HamidShojanazeri avatar Mar 11 '24 21:03 HamidShojanazeri

@HamidShojanazeri thanks for your response. I'll try with the nvidia 12.1 base image and keep the results updated here.

Andcircle avatar Mar 12 '24 05:03 Andcircle

@HamidShojanazeri cuda 12.1.1, torch 2.2.1

I still get exactly the same error at the same step.

Any hints or guidance on how to debug this type of situation?

I tried adding TORCH_CUDA_SANITIZER=1: it is super slow, but it breaks at the same step with no extra info. I also tried CUDA_LAUNCH_BLOCKING=1: it gets stuck at 100 steps with no error at all.
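For context, here is a minimal sketch of how such debug variables are typically wired in (standard PyTorch/NCCL knobs, nothing specific to this script; they have to be set before torch initializes CUDA and the NCCL process group):

# Minimal debugging preamble (sketch): set these before importing torch so they take
# effect for CUDA initialization and for the NCCL process group.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # synchronous launches, so the Python stack trace points at the failing kernel
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d consistency checks and logging for collectives
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL-level logging to stderr

import torch  # imported after the env vars so they are picked up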

Andcircle avatar Mar 13 '24 18:03 Andcircle

@Andcircle can you please enable TORCH_DISTRIBUTED_DEBUG as well? Beyond that, I am still trying to repro it without much luck. Does this happen with cuda 11.8 as well?

Can you please share the command / repro you are running too?

HamidShojanazeri avatar Mar 20 '24 20:03 HamidShojanazeri

@HamidShojanazeri I did add this env var, but didn't get any extra info; I also used the sanitizer =)

This is the smallest code snippet with which I can reproduce it:

import os
import sys

import wandb

import torch
from accelerate import Accelerator
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
    TrainingArguments,
)

from trl import DataCollatorForCompletionOnlyLM

from PIL import Image

# make the project root importable
project_root = '/'.join(os.path.dirname(__file__).split('/')[:-1])
print(project_root)
sys.path.append(project_root)
from utils.meta_loader import write_meta, read_meta

import transformers

# bench
alpha = 16
rank = 64
batch_size = 2
length = 4096
accumulate_steps = 1
lr = 5e-5

train_dataset = load_from_disk("/mnt/localssd/dataset/llava_processed_dataset/train")
eval_dataset = load_from_disk("/mnt/localssd/dataset/llava_processed_dataset/test")    

run_name = "llava_debug"
save_dir = "/mnt/localssd/llava_debug"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # load_in_8bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
    # llm_int8_skip_modules=["multi_modal_projector"]
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    # "llava-hf/bakLlava-v1-hf",
    quantization_config=bnb_config,
    trust_remote_code=True, 
    device_map={'':torch.cuda.current_device()},
    torch_dtype=torch.float16,
    use_flash_attention_2=True
    )

target_modules = [
    "*language_model.*q_proj", 
    "*language_model.*k_proj", 
    "*language_model.*v_proj", 
    "*language_model.*o_proj", 
    "*language_model.*gate_proj", 
    "*language_model.*up_proj", 
    "*language_model.*down_proj", 
    "*language_model.*lm_head"]

modules_to_save = ["linear_1", "linear_2"]
    
peft_config = LoraConfig(
    lora_alpha=alpha,
    lora_dropout=0.1,
    r=rank,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=target_modules,
    modules_to_save=modules_to_save
)

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, peft_config)

training_arguments = TrainingArguments(
    output_dir=save_dir,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=accumulate_steps,
    optim="paged_adamw_32bit",
    save_steps=500,
    logging_steps=10,
    learning_rate=lr,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=100,
    warmup_ratio=0.03,
    # group_by_length=True,
    lr_scheduler_type="constant",
    run_name=run_name,
    evaluation_strategy="steps",
    eval_steps=200,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    # weight_decay=0.01,
    # dataloader_num_workers=NUM_PROC//2
)


model.config.use_cache = False # not use for fine tuning

def test_data_collator(datas):
    result = {}
    input_ids = [torch.Tensor(d['input_ids']) for d in datas]
    attention_mask = [torch.Tensor(d['attention_mask']) for d in datas]
    pixel_values = [torch.Tensor(d['pixel_values']) for d in datas]
    labels = [torch.Tensor(d['labels']) for d in datas]
    
    result['input_ids'] = torch.concat(input_ids).type(torch.int64)
    result['attention_mask'] = torch.concat(attention_mask).type(torch.int64)
    result['pixel_values'] = torch.concat(pixel_values)
    result['labels'] = torch.concat(labels).type(torch.int64)
    return result
    

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_arguments,
    data_collator=test_data_collator
)

trainer.train()
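As an aside (not part of the original script), one sanity check that is sometimes worth running before trainer.train(): out-of-range token ids or labels are a common cause of asynchronous "illegal memory access" errors in embedding and cross-entropy kernels. The attribute path used for the vocab size below is an assumption about the Llava config layout.

# Hypothetical sanity check, not part of the original script: verify that the collated
# ids stay inside the embedding table before handing batches to the Trainer.
vocab_size = model.config.text_config.vocab_size  # assumption: the LM vocab size lives under text_config
sample_batch = test_data_collator([train_dataset[0], train_dataset[1]])
assert sample_batch['input_ids'].min() >= 0
assert sample_batch['input_ids'].max() < vocab_size
# labels may contain -100 (the ignore index), but nothing at or above the vocab size
assert sample_batch['labels'].max() < vocab_size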

Andcircle avatar Mar 20 '24 21:03 Andcircle

I see, you are using fsdp+qlora? That was not working until very recently; it was only figured out a couple of weeks ago. The PEFT and Transformers releases with those changes just came out today; before that you needed them and accelerate installed from src. It should work better now. Also, this appears to be the HF Trainer, and we are not the maintainer of that library. But please set up a fresh env with all the newly released versions, here is their example. I am working on adding it to the recipes as well.

HamidShojanazeri avatar Mar 21 '24 15:03 HamidShojanazeri

@HamidShojanazeri thanks for your reply. This is just a demo snippet; we actually use MP + DDP. FSDP without qlora doesn't save that much memory for us, since we have a relatively long context window, so during real training we spread the model across 2 GPUs, and each node has 4 processes.

But this doesn't affect reproducing the error =)

Maybe I should give fsdp+qlora a try. What do you mean by it not working so far?

Andcircle avatar Mar 21 '24 16:03 Andcircle

@Andcircle so up until a couple of weeks ago FSDP was not composable with quantization; Jeremy and other folks made it work. Those changes are now upstreamed to Transformers, PEFT, Accelerate, and SFT. We are working on adding support in the recipes as well.
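For anyone landing here later, a rough sketch of the quantization config that the new FSDP+QLoRA path uses (treat the exact parameter name as an assumption to verify against the released example; the key addition is bnb_4bit_quant_storage, which lets FSDP flatten and shard the packed 4-bit weights):

# Sketch of an FSDP-compatible QLoRA quantization config (assumed API, verify against
# the released example). Compared to plain QLoRA, bnb_4bit_quant_storage stores the
# packed 4-bit weights in a dtype FSDP can flatten and shard.
import torch
from transformers import BitsAndBytesConfig

fsdp_qlora_bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,  # assumption: needed so FSDP can wrap the quantized layers
)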

cc : @awgu for visibility

HamidShojanazeri avatar Mar 21 '24 18:03 HamidShojanazeri