Support for OpenAI’s fine-tuning dataset format

Open • madroidmaq opened this issue 11 months ago • 1 comment

  • [x] chat format
  • [x] prompt completion pair format
  • [x] huggingface chat-templates
  • [x] update readme

This adds support for OpenAI's fine-tuning dataset format. Both of the following formats are supported:

Chat format:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

Prompt-completion pair format:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

For more information see: https://platform.openai.com/docs/guides/fine-tuning/example-format
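As a rough illustration, records in either format could be read from a JSONL file along these lines (the file name and the normalization of prompt/completion pairs into a two-message conversation are only assumptions for the sketch, not the PR's actual code):

import json

def load_examples(path):
    # Read a JSONL file and normalize both supported record types:
    # chat records keep their "messages" list, while prompt/completion
    # pairs are wrapped into an equivalent two-message conversation.
    examples = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if "messages" in record:
                examples.append(record["messages"])
            elif "prompt" in record and "completion" in record:
                examples.append([
                    {"role": "user", "content": record["prompt"]},
                    {"role": "assistant", "content": record["completion"]},
                ])
            else:
                raise ValueError(f"Unsupported record: {record}")
    return examples

train_examples = load_examples("train.jsonl")  # hypothetical file name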

Also supports Hugging Face Chat Templates.

It also remains compatible with the existing dataset formats.
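For comparison, the existing format used by the LoRA examples is, roughly, a single "text" field per line:

{"text": "<full training example as a single string>"}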

madroidmaq avatar Mar 08 '24 17:03 madroidmaq

This may help with the questions raised in https://github.com/ml-explore/mlx-examples/issues/484.

madroidmaq avatar Mar 08 '24 17:03 madroidmaq

I think this is helpful: it would indeed cover the issue I raised that you referenced, and it also allows other tooling to build on it without reinventing the wheel every time.

I think that as long as flexible prompt templates can be passed in manually (say, as a non-default alternative to the OpenAI one in this PR), or supplied by other libraries in the future, we're in a better long-term state. It's unreasonable to try to handle every single prompt template; that makes more sense outside of this repo, at least until there are better standards for shipping a specific prompt template with it.

But I do think this PR adds value by creating a different prompt templating type (or task?).

And the OpenAI format is becoming a standard itself as one of those user/assistant prompt template types. Maybe the right approach now is to allow for the different tasks/types (i.e. ChatML, or Alpaca-like for instruct) and treat them as template types (i.e. pure completion, instruct, and user/assistant like this PR, for a total of three in this example), rather than trying to cover every single variation.

fblissjr avatar Mar 16 '24 00:03 fblissjr

@fblissjr I'm not sure if you've seen Chat Templates, but if you haven't, I strongly suggest taking a look, because I think it answers your question (whether we need to handle dataset assembly formats such as ChatML ourselves). I think this is a good decoupling: the dataset no longer cares how the text will be formatted for model training; it only provides the necessary content and leaves the assembly to the tokenizer. I believe this keeps each component to a single responsibility.
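As a rough sketch of that decoupling (the model name is just an example, and the exact wiring in this PR may differ), the tokenizer's chat template is what turns a "messages" record into the final training text:

from transformers import AutoTokenizer

# Example model; any tokenizer that ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
]

# The tokenizer, not the dataset, decides how the conversation is rendered
# into the model-specific prompt format.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)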

Of course, in my view, the dataset format could also be extended to support more functionality in the future, such as:

  • Function calling, like what CohereForAI/c4ai-command-r-v01 supports;
  • RLHF (Reinforcement Learning from Human Feedback), supporting chosen and rejected responses (see the sketch after this list).
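For instance, a preference record for the RLHF case could look something like this (the field names are only illustrative; nothing here is defined by this PR yet):

{"prompt": "What's the capital of France?", "chosen": "Paris.", "rejected": "I have no idea."}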

madroidmaq avatar Mar 18 '24 02:03 madroidmaq

Thanks @madroidmaq this PR is great! Can't wait to see it merged! 🥳

ivanfioravanti avatar Mar 20 '24 07:03 ivanfioravanti

Very nice! Thanks

alwint3r avatar Mar 20 '24 09:03 alwint3r

Thanks for the info - this helps. I agree, this PR is smart and extensible. I'm not a fan of chat templates to begin with, and I think once we get over this chat paradigm, what you said above will make even more sense.

fblissjr avatar Mar 20 '24 16:03 fblissjr

edited: checked the code, got confused with another commit. Thx!

Solido avatar Mar 24 '24 20:03 Solido

While training on the same data, but using the text format versus the completion format, the training losses are very different. With the text format the loss starts very high, while the completion format scores well but the model does not align. What can explain such differences in the training estimates?

Solido avatar Mar 26 '24 18:03 Solido