Loss not matching
Hi team, I tried QLoRA on a 30B Llama model with Unsloth and found little improvement in speed or memory usage. Details below: seq_length=8192, batch_size=1, use_flash_attn=True, gradient_checkpointing=True.
With Unsloth:
0%| | 0/52 [00:00<?, ?it/s]
2%|▏ | 1/52 [00:38<32:45, 38.54s/it]
4%|▍ | 2/52 [01:15<31:25, 37.71s/it]
6%|▌ | 3/52 [01:52<30:37, 37.50s/it]
8%|▊ | 4/52 [02:30<29:54, 37.39s/it]
10%|▉ | 5/52 [03:07<29:13, 37.31s/it]
{'loss': **4.7581**, 'grad_norm': 3.063769578933716, 'learning_rate': 9.911436253643445e-05, 'epoch': 0.1, 'num_input_tokens_seen': 162198}
Without Unsloth:
0%| | 0/52 [00:00<?, ?it/s]
2%|▏ | 1/52 [00:41<35:08, 41.35s/it]
4%|▍ | 2/52 [01:21<33:59, 40.79s/it]
6%|▌ | 3/52 [02:02<33:13, 40.69s/it]
8%|▊ | 4/52 [02:42<32:30, 40.63s/it]
10%|▉ | 5/52 [03:23<31:48, 40.60s/it]
{'loss': **0.8759**, 'grad_norm': 0.32929742336273193, 'learning_rate': 9.911436253643445e-05, 'epoch': 0.1, 'num_input_tokens_seen': 162198}
1. The speed improved by only about 3 s/it, far from the acceleration ratio mentioned in the documentation.
2. nvidia-smi shows 35 GB with Unsloth vs. 39 GB without, only about 10% less memory.
3. The loss value is abnormal (4.7581 with Unsloth vs. 0.8759 without).
Here is the code:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer

# Load the 4-bit quantized base model.
model, _ = FastLanguageModel.from_pretrained(
    model_name=model_kwargs['model_id_or_path'],
    max_seq_length=8192,
    dtype=None,  # auto-detect
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    device_map='auto',
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
# Attach LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    lora_alpha=model_args.lora_alpha,
    lora_dropout=model_args.lora_dropout,
    r=model_args.lora_r,
    target_modules=model_args.lora_target_modules.split(","),
    use_gradient_checkpointing=True,
    random_state=training_args.seed,
    max_seq_length=8192,
)
trainer = SFTTrainer(
    model=model,
    ...
)
```
Is there some setting I'm missing? Looking forward to your reply.
Hey @mxjyst. Do you have a reproducible example for the non-Unsloth run? Have you tried our Colab notebooks to confirm?
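For reference, here is a hedged sketch of what an equivalent non-Unsloth QLoRA baseline might look like, using the standard transformers + peft APIs; the model id and LoRA hyperparameters are placeholders standing in for the values from your snippet:

```python
# Sketch of a plain-HF QLoRA baseline (not Unsloth's code); the model id
# and LoRA hyperparameters are placeholders for the values in the snippet above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your/model_id_or_path",  # placeholder: same model as the Unsloth run
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)  # enables gradient checkpointing, etc.
lora = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,  # placeholders: match the Unsloth run
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
# Then pass `model` to the same SFTTrainer setup as in the original script.
```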
Also, did you benchmark with Unsloth first and then HF in one script? Unsloth patches the model when it loads, so running both in the same script can skew the HF numbers.
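One way to rule that out is to run each benchmark in its own interpreter; a minimal sketch, assuming hypothetical stand-alone scripts `bench_unsloth.py` and `bench_hf.py`:

```python
# Run each benchmark in a fresh Python process so Unsloth's patches to
# transformers cannot leak into the plain-HF measurement.
# bench_unsloth.py and bench_hf.py are hypothetical stand-alone scripts.
import subprocess
import sys

for script in ("bench_unsloth.py", "bench_hf.py"):
    subprocess.run([sys.executable, script], check=True)
```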
All our benchmarking code is public for everyone to verify, i.e. see our HF blog post https://huggingface.co/blog/unsloth-trl, in which HF did third-party benchmarking. Likewise, LLaMA Factory and many others have confirmed our benchmarks.
See LLaMA Factory's research paper: https://twitter.com/danielhanchen/status/1770870732475469926, which shows our OSS is the world's fastest by a large margin.
In terms of the loss diverging, that's very abnormal. Can you reproduce it via a Colab notebook?
@mxjyst Interesting on the loss not matching. Would you be able to provide a reproducible example via Colab?