
[Chatllama] Add multiple sources for generating synthetic data


Description

Currently, chatllama supports synthetic data generation only from OpenAI's davinci-003, both for conversations and for scores.

To avoid large costs when generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).

Furthermore, to generate more diverse data, it would be beneficial to support multiple prompt templates during generation.

TODO

  • [ ] Add support for gpt-3.5-turbo, separately from the LangChain-based models.
  • [ ] Add a preview of the cost of API-based labelling (roughly n_words / 0.75 * API_cost_per_token, since a token is about 0.75 words) before proceeding with the labelling (see the cost sketch after this list).
  • [ ] Modify the langchain-based script to support multiple API models and providers (a possible backend abstraction is sketched after this list).
  • [ ] Add support for HF models to perform the generation task.
  • [ ] Allow the user to specify multiple prompt templates, customisable to their needs, when generating synthetic data (see the template example after this list).
  • [ ] Provide multiple template examples for dataset generation.
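A rough sketch of the cost preview mentioned above, using the word-to-token conversion from the TODO item. The per-token prices and function name are placeholders, not chatllama code; they would need to be replaced with the provider's actual pricing.

```python
# Rough cost preview for API-based labelling (sketch, not the chatllama implementation).
# Assumes ~0.75 words per token; the per-token prices below are placeholder values.

PRICE_PER_TOKEN = {                     # hypothetical USD prices per token
    "text-davinci-003": 0.02 / 1000,
    "gpt-3.5-turbo": 0.002 / 1000,
}

def estimate_labelling_cost(texts: list[str], model: str) -> float:
    """Return an approximate API cost for sending `texts` to `model`."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = n_words / 0.75           # n_words / 0.75 ~= number of tokens
    return n_tokens * PRICE_PER_TOKEN[model]

if __name__ == "__main__":
    sample = ["An example conversation to be scored."] * 1000
    print(f"Estimated cost: ${estimate_labelling_cost(sample, 'gpt-3.5-turbo'):.4f}")
```

For the "multiple API models and providers plus local HF models" items, one possible shape is a small backend abstraction so the generation scripts never call a specific provider directly. The class and method names below are illustrative assumptions, not existing chatllama APIs.

```python
# Sketch of a provider-agnostic completion interface. Dataset-generation code calls
# `generate()` and does not care whether the backend is OpenAI, another API provider,
# or a local HF model such as Flan-T5.

from abc import ABC, abstractmethod

class CompletionBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class OpenAIChatBackend(CompletionBackend):
    """Backend for chat-style OpenAI models, e.g. gpt-3.5-turbo."""

    def __init__(self, model: str = "gpt-3.5-turbo"):
        import openai                   # openai<1.0 style client
        self.model = model
        self.client = openai

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        resp = self.client.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp["choices"][0]["message"]["content"]

class HFSeq2SeqBackend(CompletionBackend):
    """Backend for local HuggingFace seq2seq models, e.g. Flan-T5."""

    def __init__(self, model_name: str = "google/flan-t5-large"):
        from transformers import pipeline
        self.pipe = pipeline("text2text-generation", model=model_name)

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return self.pipe(prompt, max_new_tokens=max_tokens)[0]["generated_text"]
```

Finally, for the multiple-template items, a minimal sketch of how user-supplied templates could be rotated to diversify the generated data. The template strings here are made-up examples; in practice they would come from user configuration.

```python
# Sketch of rotating several prompt templates to diversify synthetic conversations.

import random

TEMPLATES = [
    "Write a conversation between a user and an assistant about {topic}.",
    "Simulate a helpful support chat where the user asks about {topic}.",
    "Generate a dialogue in which an expert explains {topic} to a beginner.",
]

def build_prompt(topic: str) -> str:
    """Pick a random template so the generated data is not all phrased the same way."""
    return random.choice(TEMPLATES).format(topic=topic)
```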

diegofiori commented on Mar 08 '23

Hi, did you add support for HF models in dataset generation? It seems that only OpenAI's davinci-003 is used, at line 21 of generate_rewards.py.
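(Not an answer from the maintainers, just a sketch of what swapping the davinci-003 call for a local HF model could look like when scoring conversations. The prompt wording and the 0-5 scale are assumptions, not the actual chatllama convention in generate_rewards.py.)

```python
# Sketch: score a conversation with a local HF model instead of the OpenAI API.

import re
from transformers import pipeline

scorer = pipeline("text2text-generation", model="google/flan-t5-large")

def score_completion(conversation: str) -> float:
    """Ask the local model for a 0-5 quality score and parse the number it returns."""
    prompt = (
        "Rate the quality of the assistant's replies in the following conversation "
        f"on a scale from 0 to 5. Answer with a single number.\n\n{conversation}"
    )
    output = scorer(prompt, max_new_tokens=8)[0]["generated_text"]
    match = re.search(r"\d+(\.\d+)?", output)
    return float(match.group()) if match else 0.0
```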

pengwei-iie commented on Mar 31 '23