
Recent RunPod Axolotl error

Open drummerv opened this issue 9 months ago • 7 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I ran Axolotl around two days ago and it worked fine, on 8xH100 SXM using RunPod's Axolotl Jupyter template.

Current behaviour

When I ran the same config today, it gave me this error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

(The message is printed three times, interleaved, once per failing rank.)
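Not a fix, just a readability aid: with several ranks failing at once, each process prints the same traceback concurrently, so the exception name, separator, and message repeat back to back on one line. A small stdlib sketch (assuming exactly that repetition pattern) to recover the single underlying message:

```python
# Recover one error message from interleaved multi-rank output. Assumes the
# garbled line is exception-name * k + ": " * k + message * k, as seen above.

def collapse_repeats(s: str) -> str:
    """Return the shortest prefix p such that s == p * k for some k >= 1."""
    n = len(s)
    for plen in range(1, n + 1):
        if n % plen == 0 and s[:plen] * (n // plen) == s:
            return s[:plen]
    return s

msg = "CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)"
garbled = "RuntimeError" * 3 + ": " * 3 + msg * 3  # shape of the line above

# Split off the repeated name and separators, then collapse each run.
name, _, _, body = garbled.split(" ", 3)
clean = f"{collapse_repeats(name.rstrip(':'))}: {collapse_repeats(body)}"
print(clean)  # -> RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
```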

Steps to reproduce

  1. Use RunPod's Axolotl Jupyter template
  2. Use 8xH100 SXM in Secure or Community
  3. Run training
  4. Wait for it to load the model
  5. It doesn't

Config yaml

base_model: ChaoticNeutrals/Poppy_Porpoise-0.72-L3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 8192
bf16: auto
fp16:
tf32: false
flash_attention: true
special_tokens:
  bos_token: <|begin_of_text|>
  pad_token: <|end_of_text|>
  eos_token: <|end_of_text|>

# Data
datasets:
  - path: TheDrummer/siayn-v6
    type: customllama3 # src/axolotl/prompt_strategies
warmup_steps: 30

# save_safetensors: true

# WandB
wandb_project: llama-3some
wandb_entity: 

# Iterations
num_epochs: 2

# Evaluation
val_set_size: 0.0125
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: false
eval_batch_size: 1

# LoRA
output_dir: ./Llama-3some-8B-v2-Workspace
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# Optimizer
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
train_on_inputs: false
group_by_length: false
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
debug:
weight_decay: 0
fsdp:
fsdp_config:

# Checkpoints
resume_from_checkpoint:
saves_per_epoch: 2

Possible solution

Does the RunPod / Docker template pull the latest commit? If so, we can narrow the regression down to commits from the last 1 to 2 days.
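To narrow it down, the commit the image actually runs can be checked from inside the pod. A hedged sketch; the `/workspace/axolotl` path is an assumption about where the RunPod template clones the repo, so adjust it for your image:

```python
# Print the environment details useful for bisecting a template regression.
# NOTE: /workspace/axolotl is an assumed checkout location, not confirmed.
import shutil
import subprocess
import sys

def run(cmd):
    """Run a command, returning stdout (or stderr, or a 'not found' note)."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return (proc.stdout or proc.stderr).strip()

print("python        :", sys.version.split()[0])
print("axolotl commit:", run(["git", "-C", "/workspace/axolotl",
                              "log", "-1", "--format=%H %cs"]))
print("GPU state     :", run(["nvidia-smi", "--query-gpu=name,memory.free",
                              "--format=csv,noheader"]))
```

Comparing the printed commit hash against the repo's recent history would confirm whether the template tracks `main-latest` or a stale image layer.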

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

main-latest

axolotl branch-commit

main-latest

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

drummerv · May 06 '24 06:05