
Fulltune training Mistral 7B using JSON dataset instead of JSONL makes the model incoherent

Open l3utterfly opened this issue 2 years ago • 9 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training Mistral 7B with example settings should work.

Current behaviour

I'm using my own data in ShareGPT format, training using the default config in this repo (sample packing true), and deepspeed zero3 config.
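For context, a minimal sketch of what one ShareGPT-style record looks like when written as JSONL (one JSON object per line). Field names follow the common ShareGPT convention; the exact schema accepted by a given axolotl config may differ, and the file name is a placeholder:

```python
import json

# Hypothetical minimal ShareGPT-style record. The "conversations" list of
# {"from", "value"} turns follows the widely used ShareGPT convention.
record = {
    "conversations": [
        {"from": "human", "value": "Hello, who are you?"},
        {"from": "gpt", "value": "I am a helpful assistant."},
    ]
}

# JSONL layout: each record is serialized as a single line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```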

I see the loss decrease steadily from 10 -> 6 during training. I saved a checkpoint and loaded the full-tuned model into the oobabooga GUI to test it. The output is incoherent: it outputs "a", "of", "and", then stops.

I would at least expect it to be similar to the original model.

What could be the issue? Is it a training issue or an inference issue on the GUI side?

Steps to reproduce

see above

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

l3utterfly avatar Oct 19 '23 09:10 l3utterfly

Could you check that your prompt format is the same? Your loss also starts quite high.

NanoCode012 avatar Oct 20 '23 12:10 NanoCode012

My prompt is in the normal ShareGPT format. I'm not using JSONL, though; perhaps that could be the issue. I will try using JSONL and report back.

l3utterfly avatar Oct 26 '23 06:10 l3utterfly

@l3utterfly Did you ever find out anything more about your issue?

Out of curiosity, what happens if you fine-tune for a small number of epochs with learning rate 0? If the model is still incoherent, then this would point to an issue somewhere with axolotl rather than your data.

mxbi avatar Oct 30 '23 15:10 mxbi

@mxbi I figured out the issue: I re-formatted my dataset to JSONL and the training works. So it seems JSON datasets will not work, only JSONL.

l3utterfly avatar Oct 31 '23 05:10 l3utterfly
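The workaround described above can be sketched as a small converter: a JSON file holding a top-level array of records is rewritten as JSONL, one object per line. File names and the helper itself are illustrative, not part of axolotl:

```python
import json

def json_to_jsonl(src: str, dst: str) -> int:
    """Rewrite a JSON file containing a top-level array as JSONL.

    Returns the number of records written. Assumes the source file
    parses as a list of objects.
    """
    with open(src) as f:
        records = json.load(f)  # expects a top-level list
    with open(dst, "w") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

After converting, point the dataset path in the config at the new `.jsonl` file.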

Oh yeah I had the same thing

ehartford avatar Oct 31 '23 05:10 ehartford

Did not know that using JSON would cause such an issue. That sounds weird.

I will close this issue. Please re-open if the problem re-occurs.

NanoCode012 avatar Nov 02 '23 09:11 NanoCode012

Surely this should be re-opened (and likely renamed): there is a real bug here, in that JSON input (which the documentation suggests) causes a silent failure. I suspect many people are hitting the same issue and wasting time or giving up.

Do we have any example ymls where jsonl works but json does not?

mxbi avatar Nov 02 '23 09:11 mxbi

Sorry for that. The issue is reopened. Could you please provide an example config of where json does not work?

The dataset handler for json and jsonl is the same via the upstream datasets library.

NanoCode012 avatar Nov 02 '23 11:11 NanoCode012
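To illustrate why a shared loader should make the two formats interchangeable: here is a stdlib-only stand-in (an assumption, not the actual `datasets` code) that reads either on-disk layout and shows both carry identical records:

```python
import json

def read_records(path: str) -> list:
    """Toy stand-in for a loader that accepts JSON-array or JSONL files.

    This mimics the claim that both layouts yield the same records;
    it is not how the upstream datasets library is implemented.
    """
    with open(path) as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)  # whole-file JSON array
    return [json.loads(line) for line in text.splitlines()]  # JSONL
```

If both layouts round-trip to the same records, the divergence reported in this issue would have to come from somewhere else in the pipeline.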

@l3utterfly , do you perhaps still have the offending dataset to share or some sample of it for reproducing this?

NanoCode012 avatar Mar 30 '24 17:03 NanoCode012