Open-Assistant
Proposal: Use OA compatible jsonl message format for multi-turn conversations
We need a dataset file format that allows multi-turn conversations. Currently we ask people to contribute datasets as parquet files with a simple column structure: INSTRUCTION, RESPONSE, SOURCE, METADATA, see datasets/README.md.
In the Open-Assistant HF collection backend we use jsonl (or jsonl.gz) as import/export file format. We could use a thread variant of this format to store multi-turn conversations and use it as our official OA conversation dataset format. The core structure would look as follows (here shown formatted with indentation, in the jsonl files it would be encoded as one json object per line):
{
"thread": [
{
"text": "Hola, \u00bfqu\u00e9 eres?",
"role": "prompter"
},
{
"text": "Soy una inteligencia Artificial (..)",
"role": "assistant"
}
],
"source": "wikipedia",
"meta": { "value": 123 }
}
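As a minimal sketch of the "one json object per line" encoding, the following uses only Python's standard library to serialize the example above into a single jsonl line and parse it back. The helper name to_jsonl_line is made up for illustration and is not part of the OA code base.

```python
import json

# The example conversation from above as a plain Python dict.
conversation = {
    "thread": [
        {"text": "Hola, ¿qué eres?", "role": "prompter"},
        {"text": "Soy una inteligencia Artificial (..)", "role": "assistant"},
    ],
    "source": "wikipedia",
    "meta": {"value": 123},
}


def to_jsonl_line(obj: dict) -> str:
    """Encode one conversation as a single line (hypothetical helper)."""
    # With the default indent=None, json.dumps emits no raw newlines;
    # newlines inside strings are escaped as \n.
    return json.dumps(obj)


line = to_jsonl_line(conversation)
restored = json.loads(line)  # parsing the line gives back the same structure
```

With the default settings, non-ASCII characters are written as \uXXXX escapes, which matches the "\u00bf" form shown in the example.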
This format would be "compatible" with the full oasst import/export format. A full thread export from oasst looks as follows (again shown indented for readability):
{
"thread": [
{
"message_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
"text": "What are some of the pro's and con's of social media?",
"role": "prompter",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"synthetic": false,
"emojis": {
"+1": 6,
"_skip_reply": 1,
"_skip_ranking": 1
}
},
{
"message_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
"parent_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
"text": "Here are some potential pros and cons of social media: (..)",
"role": "assistant",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"rank": 0,
"synthetic": false,
"emojis": {
"+1": 6
}
},
{
"message_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
"parent_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
"text": "Why does it affect mental health?",
"role": "prompter",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"synthetic": false,
"emojis": {
"+1": 2,
"_skip_reply": 1,
"_skip_ranking": 1,
"_skip_labeling": 1
}
},
{
"message_id": "fa12350e-8899-49b8-842b-f82cd6bc8676",
"parent_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
"text": "Social media can affect mental health in many ways(..)",
"role": "assistant",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"rank": 0,
"synthetic": false,
"emojis": {
"+1": 2,
"_skip_labeling": 2
},
"labels": {
"spam": {
"value": 0.0,
"count": 3
},
"fails_task": {
"value": 0.0,
"count": 2
},
"lang_mismatch": {
"value": 0.0,
"count": 3
},
"pii": {
"value": 0.0,
"count": 2
},
"not_appropriate": {
"value": 0.0,
"count": 2
},
"hate_speech": {
"value": 0.0,
"count": 2
},
"sexual_content": {
"value": 0.0,
"count": 2
},
"quality": {
"value": 0.5,
"count": 3
},
"toxicity": {
"value": 0.0,
"count": 2
},
"humor": {
"value": 0.0,
"count": 2
},
"helpfulness": {
"value": 0.5,
"count": 2
},
"creativity": {
"value": 0.25,
"count": 2
},
"violence": {
"value": 0.0,
"count": 2
}
}
}
]
}
The additional properties shown here are optional; only the "text" field would really be mandatory (and maybe "role") for each message. The "lang" field could be added for multi-lingual datasets. Additional properties could be added in custom fields.
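A rough sketch of what a check for these minimal requirements could look like is shown below. The function name validate_conversation, the require_role switch and the set of accepted roles are assumptions for illustration (only "prompter" and "assistant" appear in the examples above); the proposal itself does not define a validator.

```python
# Assumption: only the two roles that appear in the examples are accepted.
KNOWN_ROLES = {"prompter", "assistant"}


def validate_conversation(obj: dict, require_role: bool = False) -> list:
    """Return a list of problems; an empty list means the object looks valid."""
    thread = obj.get("thread")
    if not isinstance(thread, list) or not thread:
        return ["'thread' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(thread):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        # "text" is the only field the proposal treats as mandatory.
        if not isinstance(msg.get("text"), str):
            problems.append(f"message {i}: missing or non-string 'text'")
        role = msg.get("role")
        if role is None:
            # "role" is only "maybe" mandatory, so this check is optional.
            if require_role:
                problems.append(f"message {i}: missing 'role'")
        elif role not in KNOWN_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
    return problems
```

All other keys (for example "lang", "message_id" or custom fields) are ignored by this check, so a full oasst export thread and a minimal hand-written thread would both pass.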
Handling jsonl
In many languages jsonl data can be generated and parsed easily (i.e. with only standard libraries and in very few lines). The following example of loading jsonl is used in model_training/custom_datasets/oasst_dataset.py; the main loading code for jsonl/jsonl.gz in Python is as simple as:
import gzip
import json

import pydantic

# input_file_path is a pathlib.Path; ExportMessageTree is the pydantic model from the OA code base
if input_file_path.suffix == ".gz":
    file_in = gzip.open(str(input_file_path), mode="rt", encoding="UTF-8")
else:
    file_in = input_file_path.open("r", encoding="UTF-8")

with file_in:
    # read one message tree per line
    for line in file_in:
        dict_tree = json.loads(line)
        # validate data
        tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)
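Generating the files is about as short as loading them. The following is a self-contained sketch with only the standard library that writes the same objects to a .jsonl and a .jsonl.gz file and reads them back. The helper names (open_text, write_jsonl, read_jsonl) and the sample data are invented for this illustration, and the pydantic validation step from the OA code is left out.

```python
import gzip
import json
import tempfile
from pathlib import Path


def open_text(path: Path, mode: str):
    """Open a .jsonl or .jsonl.gz file in text mode; mode is "r" or "w"."""
    if path.suffix == ".gz":
        return gzip.open(str(path), mode=mode + "t", encoding="UTF-8")
    return path.open(mode, encoding="UTF-8")


def write_jsonl(path: Path, objects) -> None:
    """Write one JSON object per line."""
    with open_text(path, "w") as file_out:
        for obj in objects:
            file_out.write(json.dumps(obj) + "\n")


def read_jsonl(path: Path) -> list:
    """Read all objects back, skipping blank lines."""
    with open_text(path, "r") as file_in:
        return [json.loads(line) for line in file_in if line.strip()]


# Made-up sample data in the proposed thread format.
sample = [
    {"thread": [{"text": "Hi", "role": "prompter"}, {"text": "Hello!", "role": "assistant"}]},
    {"thread": [{"text": "Bye", "role": "prompter"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "sample.jsonl"
    gz_path = Path(tmp) / "sample.jsonl.gz"
    write_jsonl(plain_path, sample)
    write_jsonl(gz_path, sample)
    round_trip_plain = read_jsonl(plain_path)
    round_trip_gz = read_jsonl(gz_path)
```

Both round trips should return exactly the objects that were written, so a contributor can switch between the compressed and uncompressed variant by changing only the file name.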
parquet vs. jsonl
Here are some (very subjective and not complete) cons for parquet & jsonl:
parquet:
- tabular data, makes it hard to store nested things or additional properties
- harder to process than json (requires non-standard libs)
- not specifically optimized for variable size data like text data
- has a row-group binary structure, which seems optimal for columns with similar values or for loading a subset of columns, but these features are not required or used in OA code (to my knowledge)
- the advantage of its compression feature compared to jsonl.gz (or other compressors) is questionable; parquet theoretically allows partial loading with the correct row-group size, but this is also rarely used
jsonl:
- repeats property names for each element (requires compression for efficient storage)
- strings become a bit longer due to standard encoding/escaping of special characters
- seeking in the file is natively not possible; files are normally read from beginning to end
- allows more freedom of the structure, e.g. arbitrary complex json (can make it harder to handle)
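To get a rough feel for the first two jsonl points (repeated property names and longer escaped strings), here is a small standard-library experiment. The records are synthetic and highly repetitive, so the numbers it produces say nothing about real OA data; it only illustrates the direction of the effect.

```python
import gzip
import json

# Synthetic records: the same property names repeat in every line.
records = [
    {
        "thread": [
            {"text": f"question {i}", "role": "prompter"},
            {"text": f"answer {i}", "role": "assistant"},
        ],
        "source": "example",
    }
    for i in range(1000)
]

raw = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# Escaping: by default json.dumps writes non-ASCII characters as \uXXXX.
escaped = json.dumps("¿qué?")
# ensure_ascii=False keeps the characters as they are.
unescaped = json.dumps("¿qué?", ensure_ascii=False)
print(len(escaped), len(unescaped))
```

Both encodings parse back to the same string, so the escaping only costs space and does not change the data.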
Alternative multi-turn tabular format
(In case we cannot agree on the jsonl/oa format, an alternative would be to define a tabular multi-turn conversation format that is close to the current one by adding two columns, CONVERSATION_ID and ROUND:
- CONVERSATION_ID (string)
- ROUND (int32)
- INSTRUCTION (string): Instruction text
- RESPONSE (string): Expected response to the instruction
- SOURCE (string): Original data source short name, e.g. "wikipedia"
- METADATA (JSON string, optional): Any other useful information stored in JSON
This would be similar to other datasets like the empathetic_dialogues dataset.)
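For comparison, a thread in the jsonl format could be flattened into rows of this tabular layout roughly as follows. This is only a sketch under several assumptions that the proposal does not fix: the thread strictly alternates prompter/assistant, ROUND starts at 0, the conversation id is supplied by the caller, and a trailing prompter message without a reply is dropped. The function name thread_to_rows is hypothetical.

```python
import json


def thread_to_rows(conversation: dict, conversation_id: str) -> list:
    """Flatten one thread into one row per (instruction, response) round."""
    thread = conversation["thread"]
    rows = []
    # Walk the thread two messages at a time: (prompter, assistant).
    for start in range(0, len(thread) - 1, 2):
        prompt, reply = thread[start], thread[start + 1]
        if prompt.get("role") != "prompter" or reply.get("role") != "assistant":
            raise ValueError("thread does not alternate prompter/assistant")
        rows.append(
            {
                "CONVERSATION_ID": conversation_id,
                "ROUND": start // 2,  # assumption: rounds are counted from 0
                "INSTRUCTION": prompt["text"],
                "RESPONSE": reply["text"],
                "SOURCE": conversation.get("source", ""),
                # METADATA is a JSON string in the tabular format.
                "METADATA": json.dumps(conversation.get("meta", {})),
            }
        )
    return rows


example = {
    "thread": [
        {"text": "Q1", "role": "prompter"},
        {"text": "A1", "role": "assistant"},
        {"text": "Q2", "role": "prompter"},
        {"text": "A2", "role": "assistant"},
    ],
    "source": "wikipedia",
    "meta": {"value": 123},
}
rows = thread_to_rows(example, conversation_id="conv-0")
```

Note that per-message properties such as "lang" or "labels" have no natural column here, which is the nesting limitation mentioned in the parquet list.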
There are many different dataset formats for multi-turn conversations, some examples:
- empathetic_dialogues (CSV)
- conv_ai_2 (json)
- blended_skill_talk (json)
- daily_dialog (custom zip compressed text with markers like __eou__)
- personachat (CSV)
If we could have a column for "URLs and references used" in each turn (and ideally a matching input field in the front-end), it would be useful for fine-tuning information retrieval later on. I think something very simple is enough. Example:
"urls": ["https://arxiv.org/abs/1706.03762"],
The rest can be done later when processing records into training data, as long as we have the URLs.
(If this is the wrong place for this suggestion, I apologize and will move/delete the comment.)
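To illustrate the "urls" suggestion: if each turn carried an optional "urls" list, a later processing step could gather the references of a thread with a few lines. The placement of "urls" on the individual message and the function name collect_urls are assumptions made for this sketch; nothing like this exists in the OA export format shown above.

```python
def collect_urls(conversation: dict) -> list:
    """Return all URLs referenced in a thread, in order, without duplicates."""
    seen = []
    for message in conversation.get("thread", []):
        # "urls" is optional; messages without it contribute nothing.
        for url in message.get("urls", []):
            if url not in seen:
                seen.append(url)
    return seen


# Made-up thread where only the assistant turns carry references.
example = {
    "thread": [
        {"text": "What is a transformer?", "role": "prompter"},
        {
            "text": "It is a neural network architecture (..)",
            "role": "assistant",
            "urls": ["https://arxiv.org/abs/1706.03762"],
        },
        {"text": "Who introduced it?", "role": "prompter"},
        {
            "text": "It was introduced in the paper linked before (..)",
            "role": "assistant",
            "urls": ["https://arxiv.org/abs/1706.03762"],
        },
    ]
}
urls = collect_urls(example)
```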
I like the jsonl format proposed. It seems sensible to have a consistent format for all our data, and json in general is more flexible if we do want to extend it later on. If we confirm this change we should make sure the docs are updated at the same time /cc @Vechtomov
Obviously jsonl is easier for storing and processing dialogs and especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann
I also like jsonl.
I would also add +1 to the suggested jsonl format.
btw, if we find a consensus on the format, would there be any action item from this proposal?
I'll make a PR. But I found a slightly confusing behavior: when you upload a jsonl file via Dataset("dataset.jsonl").push_to_hub(...) it is converted into parquet. Also, even if you upload the file manually or via git lfs, it will still be converted internally to parquet on your machine when you download it via load_dataset.
> it is converted into parquet. Also even if you upload the file manually or via git lfs it will still be converted internally to parquet on your machine when you download it via load_dataset
OK, we probably have to look into which ways are available to customize the loading mechanism of the Huggingface datasets library. Research could start here: https://huggingface.co/docs/datasets/dataset_script