Hippocorpus dataset for Open Assistant

Open MightyAlex200 opened this issue 2 years ago • 8 comments

As per @christophschuhmann's request, I'll be cleaning up and formatting the Hippocorpus dataset for training with the assistant.

Hippocorpus is a dataset of 6,854 English diary-like short stories about recalled and imagined events, collected through a crowdsourcing framework and paired with author demographics and other variables.

The plan is as follows: (correct me if I get any of this wrong)

  1. Create a Python script which
    1. Converts the Hippocorpus dataset to a Parquet file
    2. The Parquet file has columns "INSTRUCTION", "RESPONSE" and "SOURCE" where
      1. "INSTRUCTION" is a natural language instruction to produce the story given the mainEvent/summary, and potentially a random sentence from story to include
      2. "RESPONSE" is story prepended with a "Sure! Here's a story about mainEvent" or similar prefix.
      3. "SOURCE" is "Hippocorpus" and AssignmentId
    3. Exports the modified dataset with row_group_size=100
  2. Check at least 100 samples of the exported dataset for quality and report findings
  3. Once the quality is deemed high enough, upload the Python script to the repository and the modified dataset to HF
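
A minimal sketch of step 1, assuming each Hippocorpus record is available as a dict with the `mainEvent`, `story` and `AssignmentId` fields named above. The function names `build_row` and `export_parquet`, the exact instruction wording, the sentence-splitting rule and the `include_sentence_prob` parameter are all hypothetical choices for illustration, not the script from the pull request.

```python
import random


def build_row(record, rng, include_sentence_prob=0.5):
    """Map one Hippocorpus record (a dict) to INSTRUCTION/RESPONSE/SOURCE."""
    main_event = record["mainEvent"].strip()
    story = record["story"].strip()

    instruction = f"Write a short story about {main_event}."

    # Naive split on periods; an assumption made for this sketch only.
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    if sentences and rng.random() < include_sentence_prob:
        # Sometimes ask for a random sentence from the story to be included.
        instruction += f' Include the sentence: "{rng.choice(sentences)}."'

    return {
        "INSTRUCTION": instruction,
        "RESPONSE": f"Sure! Here's a story about {main_event}:\n\n{story}",
        "SOURCE": f"Hippocorpus {record['AssignmentId']}",
    }


def export_parquet(records, path, seed=0):
    """Build all rows and write them to a Parquet file."""
    # Imported here so build_row stays free of third-party dependencies.
    import pandas as pd

    rng = random.Random(seed)
    df = pd.DataFrame([build_row(r, rng) for r in records])
    # With the pyarrow engine, row_group_size is forwarded to the Parquet writer.
    df.to_parquet(path, engine="pyarrow", row_group_size=100, index=False)
    return df
```

For the quality check in step 2, something like `df.sample(n=100, random_state=0)` on the returned DataFrame would give a reproducible set of rows to read through by hand.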

MightyAlex200 avatar Jan 15 '23 03:01 MightyAlex200

Thank you! Looking forward to your contributions!

huu4ontocord avatar Jan 15 '23 03:01 huu4ontocord

If the Hippocorpus dataset isn't already in Q&A or instruction->response form, you would need to convert the text into that form.

huu4ontocord avatar Jan 15 '23 03:01 huu4ontocord

I created a draft pull request with my code: #750. Here is a sample of the current output of this script. It's worth noting that there are still quite a few mistakes, but almost all of them come from the unreliability of the original data. Is the frequency of the remaining mistakes acceptably low, or should I continue to whittle down the dataset and apply modifications to the data? output.csv

MightyAlex200 avatar Jan 16 '23 00:01 MightyAlex200

I've modified the script per Huu Nguyen's request and specifications. Here is a sample of the new output: output.csv

MightyAlex200 avatar Jan 16 '23 03:01 MightyAlex200

The results look really good!

huu4ontocord avatar Jan 16 '23 03:01 huu4ontocord

@MightyAlex200 would you be interested in doing this dataset? https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data

huu4ontocord avatar Jan 21 '23 06:01 huu4ontocord

Cool, can you convert it to Parquet files with columns "INSTRUCTION", "RESPONSE" and "SOURCE", saved with the option row_group_size=100, and upload it to HF?

christophschuhmann avatar Jan 22 '23 10:01 christophschuhmann

Checking in: has this been PR'ed? Let me know if I can help push it along.

huu4ontocord avatar Jan 27 '23 18:01 huu4ontocord

Closing old data issue.

andreaskoepf avatar Jun 14 '23 08:06 andreaskoepf