Open-Assistant Hippocorpus dataset for Open Assistant

As per @christophschuhmann's request, I'll be cleaning up and formatting the Hippocorpus dataset for training with the assistant.

Hippocorpus is a dataset of 6,854 English diary-like short stories about recalled and imagined events, collected through a crowdsourcing framework and paired with author demographics and other variables.

The plan is as follows: (correct me if I get any of this wrong)

Create a Python script which
1. Converts the Hippocorpus dataset to a Parquet file
2. Gives the Parquet file the columns "INSTRUCTION", "RESPONSE" and "SOURCE", where
  1. "INSTRUCTION" is a natural-language instruction to produce the story given the mainEvent/summary, and potentially a random sentence from the story to include
  2. "RESPONSE" is the story prepended with a "Sure! Here's a story about mainEvent" or similar prefix
  3. "SOURCE" is "Hippocorpus" and the AssignmentId
3. Exports the modified dataset with row_group_size=100

Check at least 100 samples of the exported dataset for quality and report findings.
Once the quality is deemed high enough, upload the Python script to the repository and the modified dataset to HF.
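
To make the plan concrete, here is a rough sketch of the conversion, not the final script. It assumes each Hippocorpus record is available as a dict with the keys AssignmentId, mainEvent, summary and story; the exact prompt wording and the helper names (build_row, sample_for_review, export_parquet) are placeholders I made up for illustration. The export step assumes pandas with the pyarrow engine is installed.

```python
import random


def build_row(record):
    """Map one Hippocorpus record (a dict) to the three target columns.

    The instruction and prefix templates below are illustrative assumptions,
    not a fixed format.
    """
    main_event = record["mainEvent"].strip().rstrip(".")
    instruction = (
        f"Write a short diary-like story about {main_event}. "
        f"Summary: {record['summary'].strip()}"
    )
    response = f"Sure! Here's a story about {main_event}:\n\n{record['story'].strip()}"
    source = f"Hippocorpus {record['AssignmentId']}"
    return {"INSTRUCTION": instruction, "RESPONSE": response, "SOURCE": source}


def sample_for_review(rows, k=100, seed=0):
    """Pick up to k rows (reproducibly) for the manual quality check."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def export_parquet(rows, path):
    """Write the rows to a Parquet file with row_group_size=100."""
    # Imported lazily so the helpers above work without pandas installed.
    import pandas as pd

    # Extra keyword arguments are forwarded to the pyarrow writer.
    pd.DataFrame(rows).to_parquet(
        path, engine="pyarrow", index=False, row_group_size=100
    )
```

The idea would be to call build_row on every record, run sample_for_review on the result to get the rows for the manual check, and only then call export_parquet.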

Jan 15 '23 03:01 MightyAlex200

Thank you! Looking forward to your contributions!

Jan 15 '23 03:01 huu4ontocord

If the Hippocorpus dataset isn't already in Q&A or instruction->response form, you would need to convert the text into that form.

Jan 15 '23 03:01 huu4ontocord

I created a draft pull request with my code: #750. Here is a sample of the current output of this script: output.csv. It's worth noting that there are still quite a few mistakes, but almost all of them come from the unreliability of the original data. Is the frequency of the remaining mistakes acceptably low, or should I continue to whittle down the dataset and apply modifications to the data?

Jan 16 '23 00:01 MightyAlex200

I've modified the script per Huu Nguyen's request and specifications. Here is a sample of the new output: output.csv

Jan 16 '23 03:01 MightyAlex200

The results look really good!

Jan 16 '23 03:01 huu4ontocord

@MightyAlex200 would you be interested in doing this dataset? https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data

Jan 21 '23 06:01 huu4ontocord

Cool, can you convert it to Parquet files with the columns "INSTRUCTION", "RESPONSE" and "SOURCE", saved with the option row_group_size=100, and upload it to HF?

Jan 22 '23 10:01 christophschuhmann

Checking in: has this been PR'ed? Let me know if I can help push it along.

Jan 27 '23 18:01 huu4ontocord

Closing old data issue.

Jun 14 '23 08:06 andreaskoepf
