Hippocorpus dataset for Open Assistant

Open MightyAlex200 opened this issue 1 year ago • 8 comments

As per @christophschuhmann's request, I'll be cleaning up and formatting the Hippocorpus dataset for training with the assistant.

Hippocorpus is a dataset of 6,854 English diary-like short stories about recalled and imagined events, collected through a crowdsourcing framework and paired with author demographics and other variables.

The plan is as follows (correct me if I get any of this wrong):

  1. Create a Python script which
    1. Converts the Hippocorpus dataset to a Parquet file
    2. Gives the Parquet file the columns "INSTRUCTION", "RESPONSE" and "SOURCE", where
      1. "INSTRUCTION" is a natural-language instruction to produce the story given the mainEvent/summary, and potentially a random sentence from the story to include
      2. "RESPONSE" is the story prepended with "Sure! Here's a story about mainEvent" or a similar prefix
      3. "SOURCE" is "Hippocorpus" and the AssignmentId
    3. Exports the modified dataset with row_group_size=100
  2. Check at least 100 samples of the exported dataset for quality and report findings
  3. Once the quality is deemed high enough, upload the Python script to the repository and the modified dataset to HF
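
A rough sketch of what steps 1 and 2 could look like, to make the plan concrete. This is not the final script. The column names `mainEvent`, `story` and `AssignmentId` come from the dataset as described above; the exact prompt wording, the 50% chance of requesting a sentence, the naive split on periods, the space-joined "SOURCE" value and the helper names (`build_row`, `sample_for_review`, `convert`) are all placeholder assumptions. The Parquet export assumes `pyarrow` is installed and the dataset is available as a CSV file.

```python
import csv
import random

SOURCE_NAME = "Hippocorpus"


def build_row(record, rng):
    """Turn one Hippocorpus record (a dict) into an INSTRUCTION/RESPONSE/SOURCE row."""
    main_event = record["mainEvent"].strip()
    story = record["story"].strip()

    # Placeholder wording; the real prompt templates are still to be decided.
    instruction = f"Write a short diary-like story about the following event: {main_event}"

    # Sometimes ask for a random sentence from the story to be included.
    # Splitting on "." is a naive assumption, not a proper sentence tokenizer.
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    if sentences and rng.random() < 0.5:
        instruction += f' Include this sentence: "{rng.choice(sentences)}."'

    response = f"Sure! Here's a story about {main_event}\n\n{story}"
    return {
        "INSTRUCTION": instruction,
        "RESPONSE": response,
        # Assumed format: dataset name and AssignmentId separated by a space.
        "SOURCE": f"{SOURCE_NAME} {record['AssignmentId']}",
    }


def sample_for_review(rows, n=100, seed=0):
    """Pick up to n rows at random for the manual quality check (step 2)."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def convert(csv_path, parquet_path, seed=0):
    """Read the CSV, build the rows and export them with row_group_size=100."""
    # Imported here so the helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    rng = random.Random(seed)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [build_row(record, rng) for record in csv.DictReader(f)]

    table = pa.Table.from_pylist(rows)
    pq.write_table(table, parquet_path, row_group_size=100)
    return rows
```

Calling `convert(...)` with the path to the downloaded CSV and an output path would write the Parquet file, and `sample_for_review(rows)` would give the 100 rows to read through by hand.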

MightyAlex200 • Jan 15 '23 03:01