[FEATURE] Support nested tree conversation data

Open psinger opened this issue 2 years ago • 0 comments

🚀 Feature

Support tree-like conversation data - i.e. chains of thought such as those the OASST data provides.

Motivation

Currently, we only support prompt/output data structures. While one can manually add previous conversations to the input dataframe, it would be very helpful to support this out of the box.

This will be especially helpful for conversational bots with history.

Proposed features & solution

A first part of the solution is to support providing such information in the input data. I am proposing the following:

  1. Allow setting a parent_column in the dataset. This column links individual conversations to each other.
  2. Potentially add two extra settings (one might be enough):
    • Probability of linking conversations together while training
    • Number of conversations to link together
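
To make the idea concrete, here is a minimal sketch of how a parent column could be used to assemble conversation history for one sample. It is not the LLM Studio implementation: the column names `id` and `parent_id`, the helper `build_history`, and the parameters `link_prob` and `max_links` are hypothetical, and applying the probability at every link (rather than once per sample) is an assumption.

```python
import random

# Hypothetical rows; "id" and "parent_id" are assumed column names.
rows = [
    {"id": "a", "parent_id": None, "prompt": "Hi", "output": "Hello!"},
    {"id": "b", "parent_id": "a", "prompt": "How are you?", "output": "Fine."},
    {"id": "c", "parent_id": "b", "prompt": "Tell me a joke", "output": "Knock knock."},
]


def build_history(rows, sample_id, link_prob=1.0, max_links=2, rng=random):
    """Return the sample preceded by up to max_links ancestors, oldest first."""
    by_id = {row["id"]: row for row in rows}
    chain = [by_id[sample_id]]
    # len(chain) - 1 is the number of ancestors linked so far.
    while len(chain) - 1 < max_links:
        parent_id = chain[-1]["parent_id"]
        if parent_id is None or parent_id not in by_id:
            break  # reached a root or a dangling reference
        if rng.random() >= link_prob:
            break  # randomly stop linking (assumed per-link behaviour)
        chain.append(by_id[parent_id])
    chain.reverse()  # oldest turn first, current sample last
    return chain


history = build_history(rows, "c", link_prob=1.0, max_links=2)
text = "".join(turn["prompt"] + turn["output"] for turn in history)
```

With `link_prob=0.0` the sample is returned on its own, which matches the current prompt/output behaviour.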

Additionally, I would suggest adding an augmentation setting:

  • Probability of linking randomly chosen conversations together.

This might help the model differentiate between unrelated and related context.
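
The augmentation could look roughly like the sketch below. The helper `add_random_context`, the parameter `random_link_prob`, and the `id` field are hypothetical names chosen for illustration; how unrelated context should actually be sampled is left open by the proposal.

```python
import random


def add_random_context(rows, sample, random_link_prob=0.1, rng=random):
    """With probability random_link_prob, prepend one unrelated row to the sample."""
    # Any row other than the sample itself is a candidate (an assumption;
    # a real version might also exclude the sample's true ancestors).
    candidates = [row for row in rows if row["id"] != sample["id"]]
    if candidates and rng.random() < random_link_prob:
        return [rng.choice(candidates), sample]
    return [sample]


# Hypothetical data for a quick check.
demo_rows = [
    {"id": "x", "prompt": "What is 2+2?", "output": "4"},
    {"id": "y", "prompt": "Name a color.", "output": "Blue"},
]
augmented = add_random_context(demo_rows, demo_rows[0], random_link_prob=1.0)
```

Here `augmented` holds the unrelated row `y` followed by the actual sample `x`.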

Potential additional tasks:

  • [ ] Update README
  • [ ] Update configs
  • [ ] Update Kaggle Dataset notebook

psinger · Apr 20 '23 11:04