
Error in FineTuning deepseek-vl-7b-chat-8bit

Open sachinraja13 opened this issue 1 year ago • 2 comments

This is the command I'm using:

python -m mlx_vlm.lora --dataset ~/Datasets/BusinessVQA/fintabnet/val/vqa_dataset.hf --model-path ~/.cache/lm-studio/models/mlx-community/deepseek-vl-7b-chat-8bit --epochs 2 --batch-size 4 --learning-rate 5e-5

Here is the console output:

INFO:__main__:Loading model from /Users/sachinraja/.cache/lm-studio/models/mlx-community/deepseek-vl-7b-chat-8bit
INFO:__main__:Loading dataset from /Users/sachinraja/Datasets/BusinessVQA/fintabnet/val/vqa_dataset.hf
INFO:__main__:Applying chat template to the dataset
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 240574/240574 [00:07<00:00, 31075.60 examples/s]
INFO:__main__:Setting up LoRA
#trainable params: 23.424 M || all params: 6910.365696 M || trainable%: 0.339%
INFO:__main__:Setting up optimizer
INFO:__main__:Setting up trainer
INFO:__main__:Training model
  0%|                                                                                                                                                                                                                          | 0/60143 [00:10<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/sachinraja/Code/mlx-vlm/mlx_vlm/lora.py", line 177, in <module>
    main(args)
  File "/Users/sachinraja/Code/mlx-vlm/mlx_vlm/lora.py", line 97, in main
    loss = trainer.train_step(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/sachinraja/Code/mlx-vlm/mlx_vlm/trainer/trainer.py", line 265, in train_step
    loss, grads = loss_and_grad_fn(self.model, batch)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mlx/lib/python3.11/site-packages/mlx/nn/utils.py", line 35, in wrapped_value_grad_fn
    value, grad = value_grad_fn(model.trainable_parameters(), *args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mlx/lib/python3.11/site-packages/mlx/nn/utils.py", line 29, in inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/sachinraja/Code/mlx-vlm/mlx_vlm/trainer/trainer.py", line 251, in loss_fn
    nn.losses.cross_entropy(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mlx/lib/python3.11/site-packages/mlx/nn/losses.py", line 81, in cross_entropy
    raise ValueError(
ValueError: Targets shape (4, 78) does not match logits shape (1, 78, 102400).
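For reference, the shape check that raises this error amounts to the following. This is a hypothetical, simplified re-creation (using NumPy rather than MLX) of the validation in `nn.losses.cross_entropy`: the targets must match the logits with the final vocabulary axis dropped. Here the trainer assembled 4 target sequences, but the model only returned logits for a batch of 1:

```python
import numpy as np

def cross_entropy_shape_check(logits, targets):
    # Simplified illustration of the check in mlx.nn.losses.cross_entropy:
    # targets must have the same shape as logits minus the last
    # (vocabulary) axis.
    if targets.shape != logits.shape[:-1]:
        raise ValueError(
            f"Targets shape {targets.shape} does not match "
            f"logits shape {logits.shape}."
        )

# The failing case from the log above: a batch of 4 target sequences,
# but logits for a batch of 1 (vocab size 102400).
logits = np.zeros((1, 78, 102400))
targets = np.zeros((4, 78), dtype=np.int64)
try:
    cross_entropy_shape_check(logits, targets)
except ValueError as e:
    print(e)  # Targets shape (4, 78) does not match logits shape (1, 78, 102400).
```

This suggests the model's forward pass collapses the batch dimension for batches larger than one, rather than the dataset collation being at fault.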

@Blaizzy: I would greatly appreciate your help here, please.

sachinraja13 avatar Jan 27 '25 10:01 sachinraja13

Hey @sachinraja13

Please set the batch size to 1.

There is a bug with batch sizes greater than one for some models.
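Concretely, the workaround is the original command with `--batch-size` dropped to 1 (dataset and model paths as in the report above):

```shell
python -m mlx_vlm.lora \
  --dataset ~/Datasets/BusinessVQA/fintabnet/val/vqa_dataset.hf \
  --model-path ~/.cache/lm-studio/models/mlx-community/deepseek-vl-7b-chat-8bit \
  --epochs 2 \
  --batch-size 1 \
  --learning-rate 5e-5
```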

Blaizzy avatar Feb 24 '25 13:02 Blaizzy

Thank you!

sachinraja13 avatar Feb 24 '25 18:02 sachinraja13

This is being fixed in #499

Blaizzy avatar Nov 10 '25 12:11 Blaizzy