
[Chatllama] Add multiple sources for generating synthetic data


Description

Currently, chatllama supports synthetic data generation only from OpenAI's davinci-003, both for conversations and for scores.

To avoid large costs when generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).

Furthermore, to generate more diverse data, it would be beneficial to support multiple prompt templates during generation.

TODO

  • [ ] Add support for gpt-3.5-turbo, separately from the LangChain-based models.
  • [ ] Add a preview of the cost of API-based labelling (roughly n_words / 0.75 * API_cost_per_token, since a token is about 0.75 words) before proceeding with the labelling (see the cost sketch after this list).
  • [ ] Modify the langchain-based script to support multiple API models and providers (a possible backend abstraction is sketched after this list).
  • [ ] Add support for HF models to perform the generation task.
  • [ ] Allow the user to specify multiple prompt templates, customisable to their needs, when generating synthetic data (see the template example after this list).
  • [ ] Provide multiple template examples for dataset generation.
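A rough sketch of the cost preview mentioned above, using the word-to-token conversion from the TODO item. The per-token prices and function name are placeholders, not chatllama code; they would need to be replaced with the provider's actual pricing.

```python
# Rough cost preview for API-based labelling (sketch, not the chatllama implementation).
# Assumes ~0.75 words per token; the per-token prices below are placeholder values.

PRICE_PER_TOKEN = {                     # hypothetical USD prices per token
    "text-davinci-003": 0.02 / 1000,
    "gpt-3.5-turbo": 0.002 / 1000,
}

def estimate_labelling_cost(texts: list[str], model: str) -> float:
    """Return an approximate API cost for sending `texts` to `model`."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = n_words / 0.75           # n_words / 0.75 ~= number of tokens
    return n_tokens * PRICE_PER_TOKEN[model]

if __name__ == "__main__":
    sample = ["An example conversation to be scored."] * 1000
    print(f"Estimated cost: ${estimate_labelling_cost(sample, 'gpt-3.5-turbo'):.4f}")
```

For the "multiple API models and providers plus local HF models" items, one possible shape is a small backend abstraction so the generation scripts never call a specific provider directly. The class and method names below are illustrative assumptions, not existing chatllama APIs.

```python
# Sketch of a provider-agnostic completion interface. Dataset-generation code calls
# `generate()` and does not care whether the backend is OpenAI, another API provider,
# or a local HF model such as Flan-T5.

from abc import ABC, abstractmethod

class CompletionBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class OpenAIChatBackend(CompletionBackend):
    """Backend for chat-style OpenAI models, e.g. gpt-3.5-turbo."""

    def __init__(self, model: str = "gpt-3.5-turbo"):
        import openai                   # openai<1.0 style client
        self.model = model
        self.client = openai

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        resp = self.client.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp["choices"][0]["message"]["content"]

class HFSeq2SeqBackend(CompletionBackend):
    """Backend for local HuggingFace seq2seq models, e.g. Flan-T5."""

    def __init__(self, model_name: str = "google/flan-t5-large"):
        from transformers import pipeline
        self.pipe = pipeline("text2text-generation", model=model_name)

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return self.pipe(prompt, max_new_tokens=max_tokens)[0]["generated_text"]
```

Finally, for the multiple-template items, a minimal sketch of how user-supplied templates could be rotated to diversify the generated data. The template strings here are made-up examples; in practice they would come from user configuration.

```python
# Sketch of rotating several prompt templates to diversify synthetic conversations.

import random

TEMPLATES = [
    "Write a conversation between a user and an assistant about {topic}.",
    "Simulate a helpful support chat where the user asks about {topic}.",
    "Generate a dialogue in which an expert explains {topic} to a beginner.",
]

def build_prompt(topic: str) -> str:
    """Pick a random template so the generated data is not all phrased the same way."""
    return random.choice(TEMPLATES).format(topic=topic)
```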

diegofiori commented on Mar 08 '23

Hi, did you add support for HF models in dataset generation? It seems that only OpenAI's davinci-003 is used, at line 21 of generate_rewards.py.
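(Not an answer from the maintainers, just a sketch of what swapping the davinci-003 call for a local HF model could look like when scoring conversations. The prompt wording and the 0-5 scale are assumptions, not the actual chatllama convention in generate_rewards.py.)

```python
# Sketch: score a conversation with a local HF model instead of the OpenAI API.

import re
from transformers import pipeline

scorer = pipeline("text2text-generation", model="google/flan-t5-large")

def score_completion(conversation: str) -> float:
    """Ask the local model for a 0-5 quality score and parse the number it returns."""
    prompt = (
        "Rate the quality of the assistant's replies in the following conversation "
        f"on a scale from 0 to 5. Answer with a single number.\n\n{conversation}"
    )
    output = scorer(prompt, max_new_tokens=8)[0]["generated_text"]
    match = re.search(r"\d+(\.\d+)?", output)
    return float(match.group()) if match else 0.0
```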

pengwei-iie commented on Mar 31 '23