[BUG] Internal error when fine-tuning Gemma
E.g.:
mlx_lm.lora --model mlx-community/codegemma-7b-it-8bit --train --adapter-path adapters_codegemma_7B --data training_data --iters 500
Can result in:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
zsh: abort mlx_lm.lora --model mlx-community/codegemma-7b-it-8bit --train --adapter-path
The split matmul on the output, which has a very large inner dimension (256k), appears to be the culprit. @jagrit06 is looking into this.
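For readers unfamiliar with the term: a "split" (split-K) matmul partitions the shared inner dimension of the product and sums the partial results, which is how a reduction as large as 256k gets mapped onto the GPU. The sketch below illustrates the idea in NumPy; it is only a conceptual illustration of the technique, not MLX's Metal kernel, and the sizes are tiny stand-ins for the real ones.

```python
import numpy as np

def split_k_matmul(x, w, chunk=64):
    """Compute x @ w by splitting the shared inner (K) dimension into
    chunks and accumulating the partial products. Conceptually similar
    to a split-K GPU matmul, where each chunk is a separate partial
    reduction that gets summed at the end."""
    k = x.shape[1]
    acc = np.zeros((x.shape[0], w.shape[1]))
    for i in range(0, k, chunk):
        acc += x[:, i:i + chunk] @ w[i:i + chunk, :]
    return acc

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))    # stand-in for activations
w = rng.standard_normal((256, 32))   # stand-in for a large weight

full = x @ w
split = split_k_matmul(x, w, chunk=64)
assert np.allclose(full, split)      # both paths agree numerically
```

The two paths are mathematically identical; the crash reported above happens inside the GPU implementation of this decomposition, not in the math itself.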
Hi, I'm experiencing the same issue. Is there a workaround?