mlx-examples
Support for OpenAI’s fine-tuning dataset format
- [x] chat format
- [x] prompt completion pair format
- [x] huggingface chat-templates
- [x] update readme
Adds support for OpenAI's fine-tuning dataset formats. Both the chat format and the prompt-completion pair format are supported. The formats are as follows:
Chat format:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Prompt-completion pair format:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
For more information, see: https://platform.openai.com/docs/guides/fine-tuning/example-format
Also supports Chat Templates.
It is also compatible with existing dataset formats.
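With Hugging Face chat templates, the tokenizer assembles the training text, typically via `tokenizer.apply_chat_template(messages, tokenize=False)`. As a dependency-free illustration, a ChatML-style template renders messages roughly like the sketch below (the exact string depends on each model's template):

```python
def render_chatml(messages):
    """Render messages in the ChatML layout many chat templates use.
    Illustrative only -- real code should call tokenizer.apply_chat_template."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

msgs = [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
]
print(render_chatml(msgs))
```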
Possibly helpful for the questions in https://github.com/ml-explore/mlx-examples/issues/484
I think this is helpful because it indeed would cover the issue I raised that you referenced, but also allows for extensibility by other tooling without reinventing the wheel every time.
I think as long as flexible prompt templates can be passed in manually (say, as a non-default alternative to the OpenAI one in this PR), or by other libraries in the future, we're in a better long-term state. It's unreasonable to try to handle every single prompt template, and that concern makes more sense outside this repo - especially until we have better standards for putting a specific prompt template in the repo.
But I do think this PR adds value by creating a different prompt templating type (or task?).
And OpenAI is becoming a standard itself as one of those user/assistant prompt template types. Maybe the right approach now is allowing for the different tasks/types (i.e., ChatML, or Alpaca-like for instruct), and treating them as template types (i.e., pure completion, instruct, and user/assistant as in this PR, making for a total of 3 in this example), rather than trying to cover every single variation.
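The three template families proposed above could be sketched as an explicit enum with formatting dispatched per type. All names here are hypothetical, just to illustrate the idea:

```python
from enum import Enum

class TemplateType(Enum):
    """Three template families: raw completion, Alpaca-like instruct,
    and user/assistant chat (as in this PR). Hypothetical sketch."""
    COMPLETION = "completion"
    INSTRUCT = "instruct"
    CHAT = "chat"

def format_record(record: dict, template: TemplateType) -> str:
    """Dispatch formatting by template type (illustrative only)."""
    if template is TemplateType.COMPLETION:
        return record["prompt"] + record["completion"]
    if template is TemplateType.INSTRUCT:
        return (
            f"### Instruction:\n{record['prompt']}\n\n"
            f"### Response:\n{record['completion']}"
        )
    # CHAT: a real implementation would use the tokenizer's chat template.
    return "\n".join(f"{m['role']}: {m['content']}" for m in record["messages"])

print(format_record({"prompt": "2+2=", "completion": "4"}, TemplateType.COMPLETION))
```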
@fblissjr I'm not sure if you've seen Chat Templates, but if you haven't, I strongly suggest you take a look because I think it can answer your question (whether there is a need for real processing of dataset assembly formats similar to ChatML). I think this is a good decoupling; the dataset no longer cares about how the text will be formatted for model training, only providing the necessary content, and leaves the assembly to be handled by the tokenizer. I believe this facilitates the implementation of single responsibility.
Of course, in my view, there might also be possibilities to expand the existing dataset format to support more functionalities in the future, such as:
- Function calling, like what CohereForAI/c4ai-command-r-v01 supports;
- RLHF (Reinforcement Learning from Human Feedback), supporting chosen and rejected options.
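For the RLHF case, a preference record could carry a prompt plus chosen/rejected responses, one JSON object per line. This is a speculative extension, not part of this PR:

```python
import json

def load_preference_record(line: str):
    """Parse a hypothetical DPO/RLHF-style JSONL record with
    'prompt', 'chosen', and 'rejected' fields (speculative format)."""
    rec = json.loads(line)
    missing = {"prompt", "chosen", "rejected"} - rec.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return rec["prompt"], rec["chosen"], rec["rejected"]

line = '{"prompt": "Say hi", "chosen": "Hello!", "rejected": "Go away."}'
print(load_preference_record(line))  # ('Say hi', 'Hello!', 'Go away.')
```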
Thanks @madroidmaq this PR is great! Can't wait to see it merged! 🥳
Very nice! Thanks
Thanks for the info - this helps. I agree, this PR is smart and extensible. I'm not a fan of chat templates to begin with, and I think once we get over this chat paradigm, what you said above will make even more sense.
edited: checked the code, got confused with another commit. Thx!
While training with the same data but using the text and completion formats, the training losses are very different: the loss starts very high in one case, while with the completion format the loss looks good but the model did not align. What can explain such differences in the training loss estimates?