Add `distilabel_meta` column to the datasets to include general data

Open plaguss opened this issue 10 months ago • 0 comments

Description

This PR adds a new `distilabel_meta` field to store general outputs related to distilabel. For now it holds a `distilabel_id` containing a UUID and, for Tasks that failed during `format_output`, the raw output generated by the LLM.

The following examples use a sample pipeline with an LLM that fails at random in the `format_output` method to showcase the new behaviour. Each row gets a `distilabel_meta` field that stores internal metadata:

First case: add the raw output when `format_output` raises an error:

# Sample dataframe of the final distiset

   model_name           generation                 distilabel_meta
0        test                 None  {'raw_output_dummy': 'output'}
1        test  llm response 0.0973                            None
2        test   llm response 0.341                            None
3        test  llm response 0.1632                            None
4        test                 None  {'raw_output_dummy': 'output'}
5        test  llm response 0.0733                            None
6        test  llm response 0.4824                            None
7        test                 None  {'raw_output_dummy': 'output'}
8        test  llm response 0.4292                            None
9        test  llm response 0.3858                            None

# Example of the first item from the dataframe
{
  "model_name": "test",
  "generation": null,
  "distilabel_meta": {
    "raw_output_dummy": "output"
  }
}
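
A minimal sketch of this first case, in plain Python rather than the actual distilabel internals. The names `process_row` and `strict_format` are hypothetical and exist only for illustration; the `raw_output_<task_name>` key pattern is inferred from the `raw_output_dummy` key shown above.

```python
import json
from typing import Any, Callable, Dict, Optional


def process_row(
    raw_output: str,
    format_output: Callable[[str], str],
    task_name: str = "dummy",
    model_name: str = "test",
) -> Dict[str, Any]:
    """Build one output row, keeping the raw LLM output only if formatting fails."""
    try:
        generation: Optional[str] = format_output(raw_output)
        meta: Optional[Dict[str, Any]] = None
    except Exception:
        # Formatting failed: leave the generation empty and preserve the raw text.
        generation = None
        meta = {f"raw_output_{task_name}": raw_output}
    return {
        "model_name": model_name,
        "generation": generation,
        "distilabel_meta": meta,
    }


def strict_format(raw: str) -> str:
    """Hypothetical formatter that rejects anything not starting with 'llm response'."""
    if not raw.startswith("llm response"):
        raise ValueError("unexpected output")
    return raw


# A failing row reproduces the shape of the JSON example above.
print(json.dumps(process_row("output", strict_format), indent=2))
```

Rows that format correctly get `distilabel_meta` set to `None`, which matches the sample dataframe for this case.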

Second case: also include a `distilabel_id` in the field by default. This could be made optional; for the moment it is always inserted.

# Sample dataframe of the final distiset

  model_name           generation                                    distilabel_meta
0       test  llm response 0.0517  {'distilabel_id': '2719519d-42d9-4ee9-aa87-6cf...
1       test    llm response 0.08  {'distilabel_id': 'e102a87d-7369-4518-be60-886...
2       test  llm response 0.1204  {'distilabel_id': '8a4ead69-6a3c-4310-afec-38f...
3       test  llm response 0.3122  {'distilabel_id': '1c75b4ad-6098-4845-a1d9-fb5...
4       test                 None  {'distilabel_id': 'aec3d247-f109-4201-9e8e-c54...
5       test    llm response 0.07  {'distilabel_id': '77970466-2dee-4649-a71f-54e...
6       test  llm response 0.4406  {'distilabel_id': 'f2a4ee2f-5274-4ab4-a7e5-b21...
7       test  llm response 0.1297  {'distilabel_id': 'eb27548a-e41b-4ba1-9897-c60...
8       test  llm response 0.4229  {'distilabel_id': 'bc2c2fe4-9422-4016-8251-a28...
9       test                 None  {'distilabel_id': '140e50c8-c1ea-4663-9189-c08...

# Example of the first item from the dataframe
{
  "model_name": "test",
  "generation": "llm response 0.0517",
  "distilabel_meta": {
    "distilabel_id": "2719519d-42d9-4ee9-aa87-6cfe1895082e",
    "raw_output_dummy": null
  }
}
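
The metadata shape for this second case could be built roughly as follows. This is a sketch under assumptions: `build_meta` and `formatting_failed` are hypothetical helpers, not part of distilabel, and the use of `uuid.uuid4()` is a guess based on the IDs shown above looking like random UUIDs.

```python
import uuid
from typing import Any, Dict, Optional


def build_meta(task_name: str, raw_output: Optional[str] = None) -> Dict[str, Any]:
    """Return a metadata dict with a fresh ID and the task's raw output slot."""
    return {
        # Assumption: a random (version 4) UUID rendered as a string.
        "distilabel_id": str(uuid.uuid4()),
        # None when format_output succeeded, the raw text when it failed.
        f"raw_output_{task_name}": raw_output,
    }


def formatting_failed(meta: Dict[str, Any]) -> bool:
    """True if any raw_output_* entry holds a preserved raw output."""
    return any(
        key.startswith("raw_output_") and value is not None
        for key, value in meta.items()
    )


ok_meta = build_meta("dummy")
failed_meta = build_meta("dummy", raw_output="output")
print(formatting_failed(ok_meta), formatting_failed(failed_meta))  # False True
```

Because the ID is always present, a check like `formatting_failed` has to look at the `raw_output_*` values rather than at whether `distilabel_meta` is empty.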

Third case: same as before, but with 2 LLMs whose results are combined. The open question is what to do with the `distilabel_id`; I assume we can keep just one of them.

# Sample dataframe of the final distiset
  merged_model_name                           merged_generation                                    distilabel_meta
0      [test, test]   [llm response 0.246, llm response 0.0886]  {'distilabel_id': 'ff931663-4848-4667-b79e-d40...
1      [test, test]  [llm response 0.4494, llm response 0.2173]  {'distilabel_id': 'a34d0ef5-6c0f-4e5b-9873-d0f...
2      [test, test]   [llm response 0.305, llm response 0.4888]  {'distilabel_id': 'd31f3393-4f5a-4221-b834-1cb...
3      [test, test]  [llm response 0.1056, llm response 0.4021]  {'distilabel_id': '5767c544-6802-42a1-adb7-7d0...
4      [test, test]  [llm response 0.1908, llm response 0.4613]  {'distilabel_id': '4b9ab8df-41c5-410e-b392-faa...
5      [test, test]                                [None, None]  {'distilabel_id': 'cc594466-e0ff-4d66-aed5-e7b...
6      [test, test]                 [llm response 0.1095, None]  {'distilabel_id': 'f824a5fd-fd38-4ffc-a4f5-c76...
7      [test, test]  [llm response 0.4316, llm response 0.4952]  {'distilabel_id': '0220feec-ae54-4980-8f55-c91...
8      [test, test]                 [None, llm response 0.2785]  {'distilabel_id': '1d0f002c-c217-4d7b-a7d7-149...
9      [test, test]                  [llm response 0.262, None]  {'distilabel_id': '5d5cd42a-72da-4a1d-9dcb-79a...

# Example of the first item from the dataframe
{
  "merged_model_name": [
    "test",
    "test"
  ],
  "merged_generation": [
    "llm response 0.246",
    "llm response 0.0886"
  ],
  "distilabel_meta": {
    "distilabel_id": "ff931663-4848-4667-b79e-d4038ca45cd7",
    "raw_output_dummy": null,
    "raw_output_dummy_2": null
  }
}
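
One way to realise the "keep just one of them" idea is to merge the per-task metadata dicts and let the first `distilabel_id` win. The sketch below is only an illustration of that option: `merge_meta` is a hypothetical helper, and the IDs `id-a` and `id-b` are placeholders rather than real values.

```python
from typing import Any, Dict, List


def merge_meta(metas: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge metadata dicts, keeping the first value seen for each key."""
    merged: Dict[str, Any] = {}
    for meta in metas:
        for key, value in meta.items():
            # setdefault ignores later duplicates, so only the first
            # distilabel_id survives while distinct raw_output_* keys are kept.
            merged.setdefault(key, value)
    return merged


first = {"distilabel_id": "id-a", "raw_output_dummy": None}
second = {"distilabel_id": "id-b", "raw_output_dummy_2": None}
print(merge_meta([first, second]))
```

The merged dict has the same keys as the JSON example above: a single `distilabel_id` plus one `raw_output_*` entry per task.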

Closes #582

Apr 25 '24 15:04 plaguss
