Open-Assistant
Hippocorpus dataset for Open Assistant
As per @christophschuhmann's request, I'll be cleaning up and formatting the Hippocorpus dataset for training with the assistant.
Hippocorpus is a dataset of 6,854 English diary-like short stories about recalled and imagined events, collected through a crowdsourcing framework and paired with author demographics and other variables.
The plan is as follows (correct me if I get any of this wrong):
- Create a Python script which
  - Converts the Hippocorpus dataset to a Parquet file
  - The Parquet file has columns "INSTRUCTION", "RESPONSE" and "SOURCE" where
    - "INSTRUCTION" is a natural language instruction to produce the story given the `mainEvent`/`summary`, and potentially a random sentence from `story` to include
    - "RESPONSE" is `story` prepended with a "Sure! Here's a story about `mainEvent`" or similar prefix
    - "SOURCE" is "Hippocorpus" and `AssignmentId`
  - Exports the modified dataset with `row_group_size=100`
- Check at least 100 samples of the exported dataset for quality and report findings
- Once the quality is deemed high enough, upload the Python script to the repository and the modified dataset to HF
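
To make the plan concrete, here is a rough sketch of what the conversion could look like. It is not the final script. The column names (`mainEvent`, `summary`, `story`, `AssignmentId`) are the ones mentioned above. The instruction template, the 50% chance of asking for a sentence, the regex sentence split, the `"Hippocorpus <AssignmentId>"` source format, and all function names are placeholder assumptions that would need agreeing on.

```python
import random
import re

# Hypothetical wording; the plan only says "or similar prefix".
INSTRUCTION_TEMPLATE = "Write a diary-like short story about the following event: {event}"
RESPONSE_PREFIX = "Sure! Here's a story about {event}:\n\n"


def convert_row(row, rng):
    """Map one Hippocorpus record (a dict) to INSTRUCTION/RESPONSE/SOURCE."""
    # Fall back to the summary when mainEvent is missing or empty.
    event = (row.get("mainEvent") or row.get("summary") or "").strip()
    story = row["story"].strip()

    instruction = INSTRUCTION_TEMPLATE.format(event=event)
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story) if s]
    # Assumption: ask for a specific sentence in roughly half of the rows.
    if sentences and rng.random() < 0.5:
        instruction += f' Include this sentence: "{rng.choice(sentences)}"'

    return {
        "INSTRUCTION": instruction,
        "RESPONSE": RESPONSE_PREFIX.format(event=event) + story,
        "SOURCE": f"Hippocorpus {row['AssignmentId']}",
    }


def convert_rows(rows, seed=0):
    """Convert an iterable of records, seeded so the output is reproducible."""
    rng = random.Random(seed)
    return [convert_row(row, rng) for row in rows]


def export_parquet(records, path):
    """Write the converted records to Parquet with row_group_size=100."""
    # Imported lazily so the conversion logic works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(records)
    pq.write_table(table, path, row_group_size=100)


def sample_for_review(records, n=100, seed=0):
    """Pick up to n random records for the manual quality check."""
    return random.Random(seed).sample(records, min(n, len(records)))
```

The rows could come from `csv.DictReader` over the Hippocorpus CSV, for example. Keeping the conversion as plain functions over dicts should make it easy to spot-check a handful of rows by hand before exporting anything.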