distilabel
Add `distilabel_meta` column to the datasets to include general data
Description
This PR adds a new field, `distilabel_meta`, to store general outputs related to distilabel. Currently it holds `distilabel_id`, a UUID, and, for Tasks that failed during `format_output`, the raw output generated by the LLM.
The following examples use a sample pipeline whose LLM raises random errors in the `format_output` method, to showcase the new behaviour. Each row gets a `distilabel_meta` field that stores internal metadata:
- First case: add the raw output when `format_output` raises an error:
# Sample dataframe of the final distiset
model_name generation distilabel_meta
0 test None {'raw_output_dummy': 'output'}
1 test llm response 0.0973 None
2 test llm response 0.341 None
3 test llm response 0.1632 None
4 test None {'raw_output_dummy': 'output'}
5 test llm response 0.0733 None
6 test llm response 0.4824 None
7 test None {'raw_output_dummy': 'output'}
8 test llm response 0.4292 None
9 test llm response 0.3858 None
# Example of the first item from the dataframe
{
    "model_name": "test",
    "generation": null,
    "distilabel_meta": {
        "raw_output_dummy": "output"
    }
}
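The behaviour in this first case can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `format_with_fallback`, `parse_json_answer` and the `task_name` argument are hypothetical names, and the `raw_output_<task_name>` key is only inferred from the `raw_output_dummy` key shown above.

```python
import json
from typing import Any, Callable, Dict


def format_with_fallback(
    raw_output: str,
    format_output: Callable[[str], Any],
    task_name: str,
) -> Dict[str, Any]:
    """Hypothetical helper: format the raw LLM output, keeping it on failure."""
    try:
        generation = format_output(raw_output)
    except Exception:
        # Formatting failed: leave the generation empty and keep the raw
        # output in the metadata so it can be inspected later.
        return {
            "generation": None,
            "distilabel_meta": {f"raw_output_{task_name}": raw_output},
        }
    # Formatting succeeded: nothing needs to be stored in the metadata.
    return {"generation": generation, "distilabel_meta": None}


def parse_json_answer(raw_output: str) -> str:
    """Toy formatter (an assumption) that expects a JSON object with an 'answer' key."""
    return json.loads(raw_output)["answer"]


ok_row = format_with_fallback('{"answer": "llm response"}', parse_json_answer, "dummy")
failed_row = format_with_fallback("output", parse_json_answer, "dummy")
```

Here `failed_row` has the same shape as the first item of the dataframe above, while `ok_row` carries the formatted generation and no metadata.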
- Second case: also include a `distilabel_id` in the field by default. This could be made optional; for the moment it is always inserted.
# Sample dataframe of the final distiset
model_name generation distilabel_meta
0 test llm response 0.0517 {'distilabel_id': '2719519d-42d9-4ee9-aa87-6cf...
1 test llm response 0.08 {'distilabel_id': 'e102a87d-7369-4518-be60-886...
2 test llm response 0.1204 {'distilabel_id': '8a4ead69-6a3c-4310-afec-38f...
3 test llm response 0.3122 {'distilabel_id': '1c75b4ad-6098-4845-a1d9-fb5...
4 test None {'distilabel_id': 'aec3d247-f109-4201-9e8e-c54...
5 test llm response 0.07 {'distilabel_id': '77970466-2dee-4649-a71f-54e...
6 test llm response 0.4406 {'distilabel_id': 'f2a4ee2f-5274-4ab4-a7e5-b21...
7 test llm response 0.1297 {'distilabel_id': 'eb27548a-e41b-4ba1-9897-c60...
8 test llm response 0.4229 {'distilabel_id': 'bc2c2fe4-9422-4016-8251-a28...
9 test None {'distilabel_id': '140e50c8-c1ea-4663-9189-c08...
# Example of the first item from the dataframe
{
    "model_name": "test",
    "generation": "llm response 0.0517",
    "distilabel_meta": {
        "distilabel_id": "2719519d-42d9-4ee9-aa87-6cfe1895082e",
        "raw_output_dummy": null
    }
}
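As a rough sketch of how an identifier like this could be attached to each row, assuming a random UUID (version 4) from Python's standard `uuid` module. The helper name `add_distilabel_id` is hypothetical and the PR's actual implementation may differ.

```python
import uuid
from typing import Any, Dict


def add_distilabel_id(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the row with a `distilabel_id`."""
    # Copy the existing metadata (or start empty) so the input row is not mutated.
    meta = dict(row.get("distilabel_meta") or {})
    # Keep an id that is already present; otherwise generate a new random UUID.
    meta.setdefault("distilabel_id", str(uuid.uuid4()))
    return {**row, "distilabel_meta": meta}


row = {"model_name": "test", "generation": "llm response 0.0517", "distilabel_meta": None}
row_with_id = add_distilabel_id(row)
```

Using `setdefault` means a row that passes through several steps keeps the id it was first given, which is one possible way to make the id stable across a pipeline.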
- Third case: same as before, but with 2 LLMs whose results are combined. The open question is what to do with the `distilabel_id`. I assume we can keep just one of them.
# Sample dataframe of the final distiset
merged_model_name merged_generation distilabel_meta
0 [test, test] [llm response 0.246, llm response 0.0886] {'distilabel_id': 'ff931663-4848-4667-b79e-d40...
1 [test, test] [llm response 0.4494, llm response 0.2173] {'distilabel_id': 'a34d0ef5-6c0f-4e5b-9873-d0f...
2 [test, test] [llm response 0.305, llm response 0.4888] {'distilabel_id': 'd31f3393-4f5a-4221-b834-1cb...
3 [test, test] [llm response 0.1056, llm response 0.4021] {'distilabel_id': '5767c544-6802-42a1-adb7-7d0...
4 [test, test] [llm response 0.1908, llm response 0.4613] {'distilabel_id': '4b9ab8df-41c5-410e-b392-faa...
5 [test, test] [None, None] {'distilabel_id': 'cc594466-e0ff-4d66-aed5-e7b...
6 [test, test] [llm response 0.1095, None] {'distilabel_id': 'f824a5fd-fd38-4ffc-a4f5-c76...
7 [test, test] [llm response 0.4316, llm response 0.4952] {'distilabel_id': '0220feec-ae54-4980-8f55-c91...
8 [test, test] [None, llm response 0.2785] {'distilabel_id': '1d0f002c-c217-4d7b-a7d7-149...
9 [test, test] [llm response 0.262, None] {'distilabel_id': '5d5cd42a-72da-4a1d-9dcb-79a...
# Example of the first item from the dataframe
{
    "merged_model_name": [
        "test",
        "test"
    ],
    "merged_generation": [
        "llm response 0.246",
        "llm response 0.0886"
    ],
    "distilabel_meta": {
        "distilabel_id": "ff931663-4848-4667-b79e-d4038ca45cd7",
        "raw_output_dummy": null,
        "raw_output_dummy_2": null
    }
}
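One possible way to combine the metadata of several branches, keeping only the first `distilabel_id` as suggested above, is sketched below. `merge_distilabel_meta` is a hypothetical helper and the ids are placeholders; this is an assumption about how the merge could work, not the code in this PR.

```python
from typing import Any, Dict, Iterable, Optional


def merge_distilabel_meta(
    metas: Iterable[Optional[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Hypothetical helper: merge per-branch metadata into a single dict."""
    merged: Dict[str, Any] = {}
    for meta in metas:
        for key, value in (meta or {}).items():
            if key == "distilabel_id":
                # Keep only the id from the first branch that has one.
                merged.setdefault(key, value)
            elif merged.get(key) is None:
                # Other keys (e.g. raw outputs) are collected side by side;
                # the first non-null value for a key wins.
                merged[key] = value
    return merged


branch_a = {"distilabel_id": "id-from-branch-a", "raw_output_dummy": None}
branch_b = {"distilabel_id": "id-from-branch-b", "raw_output_dummy_2": "output"}
merged_meta = merge_distilabel_meta([branch_a, branch_b])
```

With this choice the merged row keeps the id of the first branch and one `raw_output_*` entry per task, matching the shape of the example above.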
Closes #582