[Chatllama] Add multiple sources for generating synthetic data
Description
Currently, chatllama supports synthetic data generation only from OpenAI's davinci-003, both for conversations and for scores.
To avoid large costs while generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).
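For reference, gpt-3.5-turbo is served through the chat endpoint rather than the plain completion endpoint used by davinci-003, so it cannot be a drop-in swap. A minimal sketch of the two call styles, assuming the pre-1.0 openai Python client (function names here are illustrative):

```python
import openai  # assumes openai.api_key is already configured

# Current approach: plain completion endpoint used by text-davinci-003.
def generate_with_davinci(prompt: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=512,
        temperature=0.7,
    )
    return response["choices"][0]["text"]

# Cheaper alternative: gpt-3.5-turbo uses the chat endpoint with a
# message-based input format, so it needs a separate code path.
def generate_with_chat_model(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]
```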
Furthermore, to generate more diverse data, it would be beneficial to support multiple prompt templates during generation.
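As a rough illustration of the multiple-template idea (the template strings and the choose_template helper below are hypothetical, not part of chatllama):

```python
import random

# Hypothetical pool of prompt templates; users could supply their own.
CONVERSATION_TEMPLATES = [
    "Write a conversation between a user and an AI assistant about {topic}.",
    "Generate a dialogue where a customer asks an assistant for help with {topic}.",
    "Simulate a chat in which a curious person interviews an expert on {topic}.",
]

def choose_template(topic: str) -> str:
    """Pick a random template so the generated data is more diverse."""
    template = random.choice(CONVERSATION_TEMPLATES)
    return template.format(topic=topic)
```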
TODO
- [ ] Add support for gpt-3.5-turbo, implemented outside of the LangChain models.
- [ ] Add a preview of the costs associated with the API models (i.e. n_words / 0.75 * API_cost_per_token) before proceeding with the labelling (see the cost sketch after this list).
- [ ] Modify the langchain-based script to support multiple API models and providers.
- [ ] Add support for HF models to perform the generation task (see the Flan-T5 sketch after this list).
- [ ] Allow the user to specify multiple prompt templates when generating synthetic data, customisable to the user's needs.
- [ ] Provide multiple template examples for dataset generation.
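On the cost preview item, a minimal sketch of the estimate from the formula above (the per-token prices below are placeholders, not authoritative figures):

```python
# Rough price table in USD per token; placeholder values, check the
# provider's pricing page before relying on them.
API_COST_PER_TOKEN = {
    "text-davinci-003": 0.02 / 1000,
    "gpt-3.5-turbo": 0.002 / 1000,
}

def estimate_labelling_cost(texts: list[str], model: str) -> float:
    """Approximate cost: ~1 token per 0.75 words, times the per-token price."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = n_words / 0.75
    return n_tokens * API_COST_PER_TOKEN[model]

# Example: ask the user to confirm before spending money on the API.
# cost = estimate_labelling_cost(prompts, "gpt-3.5-turbo")
# print(f"Estimated cost: ${cost:.2f} - continue? [y/N]")
```

For the HF item, generation with a local Flan-T5 checkpoint through the standard transformers API could look roughly like this (model name and generation parameters are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def generate_with_flan_t5(prompt: str, model_name: str = "google/flan-t5-large") -> str:
    """Run the generation locally instead of calling a paid API."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```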
Hi, did you add support for HF models in dataset generation? It seems only OpenAI's davinci-003 is used, at line 21 in generate_rewards.py.