
Floating point exception (core dumped)

Open · XuanVuNguyen opened this issue 2 years ago · 3 comments

I was trying to run the script finetune/lora.py and I got the error:

Floating point exception (core dumped)

without any further traceback. For context, I was following the guide in the howto folder, except that I used my own training data, and the pretrained LLaMA weights were downloaded from pyllama. Thank you in advance!
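
One extra data point I can gather: since the crash is a fatal signal (SIGFPE) rather than a Python exception, enabling Python's built-in faulthandler should print a Python-level traceback before the process dies. A minimal sketch of what I plan to add at the top of finetune/lora.py:

import faulthandler

# faulthandler installs handlers for fatal signals (SIGSEGV, SIGFPE, SIGABRT,
# SIGBUS, SIGILL) and dumps the Python traceback before the process exits.
faulthandler.enable()

The same effect can be had without touching the script by running it as python -X faulthandler finetune/lora.py.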

XuanVuNguyen avatar May 18 '23 11:05 XuanVuNguyen

Can you provide more details about your hardware setup, Python environment, and the precise LLaMA config? Thanks!
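
For the environment part, PyTorch's built-in environment report usually covers everything we need; a quick sketch, assuming a recent PyTorch install:

# Prints Python, PyTorch, CUDA/cuDNN, GPU, and driver versions.
# Equivalent to running `python -m torch.utils.collect_env` from a shell.
from torch.utils import collect_env

collect_env.main()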

carmocca avatar May 19 '23 15:05 carmocca

@carmocca Here's the info I gathered:

  • Python version: 3.8.7
  • System specification:
OS: Ubuntu 18.04.5 LTS x86_64 
Host: X299X AORUS MASTER -CF 
Kernel: 5.4.0-147-generic 
Uptime: 24 days, 16 hours, 43 mins 
Packages: 2673 (dpkg), 16 (snap) 
Shell: bash 4.4.20 
Terminal: node 
CPU: Intel i9-10920X (24) @ 4.600GHz 
GPU: NVIDIA 65:00.0 NVIDIA Corporation Device 2204 
GPU: NVIDIA 17:00.0 NVIDIA Corporation Device 2204 
Memory: 40403MiB / 128532MiB 
  • llama config:
{
  "dim": 4096,
  "multiple_of": 256,
  "n_heads": 32,
  "n_layers": 32,
  "norm_eps": 1e-06,
  "vocab_size": -1
}
  • Finetuning with LoRA config:
eval_interval = 100
save_interval = 100
eval_iters = 100
log_interval = 1

# Hyperparameters
learning_rate = 3e-4
batch_size = 128
micro_batch_size = 4
gradient_accumulation_steps = batch_size // micro_batch_size
max_iters = 50000 * 3 // micro_batch_size
weight_decay = 0.0
max_seq_length = 256  # see scripts/prepare_alpaca.py
lora_r = 8
lora_alpha = 16
lora_dropout = 0.05
warmup_steps = 100
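
For reference, the derived values from that config work out as follows (plain arithmetic from the hyperparameters above, not lit-llama code):

# Values derived from the LoRA finetuning config above.
batch_size = 128
micro_batch_size = 4
gradient_accumulation_steps = batch_size // micro_batch_size  # 128 // 4 = 32
max_iters = 50000 * 3 // micro_batch_size                     # 150000 // 4 = 37500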

Please let me know if you need any further information. By the way, I was able to run the code on another machine with Python 3.8.10, and it had no problems.

XuanVuNguyen avatar May 22 '23 03:05 XuanVuNguyen

I'm a bit confused about your llama config, because it doesn't match the values in https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L18-L25
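
That file looks like Meta's original params.json rather than a lit-llama config. As a rough sketch of how the keys correspond (the field names on the right are my assumption; the linked model.py is authoritative):

# Approximate mapping from Meta's params.json keys to the nanoGPT-style
# names lit-llama's config uses. Check lit_llama/model.py for exact fields.
meta_params = {"dim": 4096, "multiple_of": 256, "n_heads": 32,
               "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}

lit_style = {
    "n_embd": meta_params["dim"],        # embedding width
    "n_head": meta_params["n_heads"],    # attention heads per block
    "n_layer": meta_params["n_layers"],  # number of transformer blocks
    # "vocab_size": -1 is a placeholder in Meta's file; the released LLaMA
    # tokenizer has 32000 tokens, which is what lit-llama expects here.
}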

But if you say another machine works, then there might be some environment or hardware issue on the first one. Do you still get the error on it? Is it consistent?

carmocca avatar May 22 '23 19:05 carmocca