OpenAdapt
Design: Fine-Tuning
Feature request
We would like to implement fine-tuning.
This task involves considering the tradeoffs between various approaches to improving action completions and outcome evaluation via fine-tuning.
More generally, this also involves:
- Creating a training set
- Fine-tuning on that training set
- Comparing the results
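The comparison step could be as simple as scoring the base and fine-tuned models on the same held-out examples. A minimal sketch, assuming a model is any callable from prompt to completion and using an exact-match metric (`completion_accuracy` and the stand-in models are hypothetical, not OpenAdapt code):

```python
from typing import Callable, Iterable

def completion_accuracy(
    model: Callable[[str], str],
    examples: Iterable[tuple[str, str]],
) -> float:
    """Fraction of held-out (prompt, expected_completion) pairs the model gets exactly right."""
    examples = list(examples)
    correct = sum(1 for prompt, expected in examples if model(prompt) == expected)
    return correct / len(examples) if examples else 0.0

# Compare a base model against its fine-tuned counterpart on the same held-out set.
held_out = [("click button", "CLICK"), ("type hello", "TYPE")]
base = lambda prompt: "CLICK"                      # stand-in for the base model
tuned = lambda prompt: prompt.split()[0].upper()   # stand-in for the fine-tuned model
print(completion_accuracy(base, held_out))   # 0.5
print(completion_accuracy(tuned, held_out))  # 1.0
```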
Motivation
https://arxiv.org/abs/2406.03679
Autonomous agents that control computer interfaces to accomplish human tasks are emerging. Leveraging LLMs to power such agents has been of special interest, but unless fine-tuned on human-collected task demonstrations, performance is still relatively low.
Related
https://github.com/MLDSAI/OpenAdapt/issues/70 https://github.com/MLDSAI/OpenAdapt/issues/72 https://github.com/OpenAdaptAI/OpenAdapt/issues/415 https://github.com/OpenAdaptAI/OpenAdapt/issues/748
Bounty
A paid bounty is available. Please suggest a price range 🙏
Currently iterating on this issue through #327 by identifying failure cases while testing various event sequences. To that end, current action items include:
- Researching fine-tuning of LLMs in general
- Writing a fine-tuning pipeline for GPT-4 for Events
- Generalizing the pipeline to arbitrary LLMs, with the only model-specific parts being the API calls (HuggingFace, etc.)
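One way to generalize the pipeline is to keep data preparation shared and isolate the model-specific API calls behind a single abstract method. A minimal sketch, assuming hypothetical class names (not OpenAdapt's actual API):

```python
from abc import ABC, abstractmethod

class FineTuner(ABC):
    """Model-agnostic fine-tuning pipeline: shared data prep, model-specific submission."""

    def run(self, events: list[dict]) -> str:
        dataset = self.prepare(events)
        return self.submit(dataset)

    def prepare(self, events: list[dict]) -> list[dict]:
        # Shared step: serialize event dicts into (prompt, completion) pairs.
        return [
            {"prompt": str(e["window"]), "completion": str(e["action"])}
            for e in events
        ]

    @abstractmethod
    def submit(self, dataset: list[dict]) -> str:
        """Model-specific API call (OpenAI, HuggingFace, ...)."""

class DummyFineTuner(FineTuner):
    """Stand-in backend, so the shared pipeline can be exercised without any API."""
    def submit(self, dataset: list[dict]) -> str:
        return f"job with {len(dataset)} examples"

print(DummyFineTuner().run([{"window": {"title": "App"}, "action": {"name": "click"}}]))
# → job with 1 examples
```

Swapping in a real backend then only means implementing `submit` for that provider.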
https://medium.com/@jeremyarancio/fine-tune-an-llm-on-your-personal-data-create-a-the-lord-of-the-rings-storyteller-6826dd614fa9
Useful article; it goes over training as well as techniques like quantization and LoRA. Pretty educational for getting an idea of what fine-tuning an LLM looks like.
Some immediate action items may include:
- Working more closely with Mind2Web's codebase once they release the fine-tuning code. I suggest this because the training in the article above seems like a black box to me, i.e. it's not clear how/where the LLM is shown the right answer to a given input when generating a completion.
- Dataset of Window and Action Events. We can distill from our recordings and pool the results to create a dataset of these event dicts for training, validation, and testing.
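Distilling the recordings could look something like the sketch below: shuffle the pooled event dicts, split them into train/val/test, and write each split as JSONL (one JSON object per line, the format most fine-tuning APIs expect). Function names and split ratios are illustrative assumptions:

```python
import json
import random

def make_splits(event_dicts, train=0.8, val=0.1, seed=42):
    """Shuffle event dicts deterministically and split into train/val/test portions."""
    events = list(event_dicts)
    random.Random(seed).shuffle(events)
    n_train = int(len(events) * train)
    n_val = int(len(events) * val)
    return (
        events[:n_train],
        events[n_train:n_train + n_val],
        events[n_train + n_val:],
    )

def write_jsonl(path, events):
    """One JSON object per line."""
    with open(path, "w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

events = [{"window": {"title": f"w{i}"}, "action": {"name": "click"}} for i in range(10)]
train_set, val_set, test_set = make_splits(events)
print(len(train_set), len(val_set), len(test_set))  # 8 1 1
```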
I think we want something like:

```
python -m openadapt.finetune --recording_id <recording_id> --model <model_name>
```
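The entry point for that invocation could be a thin `argparse` wrapper. A sketch, assuming a hypothetical `openadapt.finetune` module with just these two flags:

```python
import argparse

def parse_args(argv=None):
    """CLI surface for a hypothetical `openadapt.finetune` module."""
    parser = argparse.ArgumentParser(prog="openadapt.finetune")
    parser.add_argument("--recording_id", type=int, required=True,
                        help="ID of the recording to distill into a training set")
    parser.add_argument("--model", default="davinci",
                        help="base model to fine-tune")
    return parser.parse_args(argv)

args = parse_args(["--recording_id", "7", "--model", "davinci"])
print(args.recording_id, args.model)  # 7 davinci
```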
https://platform.openai.com/docs/guides/fine-tuning If you scroll down a little, you can see that neither GPT-4 nor GPT-3.5-turbo is available for fine-tuning at the moment 😞 We could use the davinci base model, although I'm now curious which model Mind2Web does its fine-tuning on 🤔
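Fine-tuning a completions-style base model like davinci takes prompt/completion pairs in JSONL. A sketch of formatting one event dict into that shape; the separator and stop-sequence conventions follow OpenAI's legacy fine-tuning guidance (fixed separator at the end of each prompt, completion starting with a space and ending with a stop sequence), and the helper name is hypothetical:

```python
import json

SEPARATOR = "\n\n###\n\n"  # fixed boundary marker appended to each prompt
STOP = " END"              # fixed stop sequence appended to each completion

def to_finetune_record(event: dict) -> str:
    """Format one event dict as a legacy prompt/completion fine-tuning record (one JSONL line)."""
    return json.dumps({
        "prompt": json.dumps(event["window"]) + SEPARATOR,
        "completion": " " + json.dumps(event["action"]) + STOP,
    })

record = to_finetune_record(
    {"window": {"title": "Calculator"}, "action": {"name": "click", "x": 10, "y": 20}}
)
print(record)
```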
@bi-loop any interest? 🙏