argilla icon indicating copy to clipboard operation
argilla copied to clipboard

[BUG-SDK] ChatField cannot be serialized by datasets

Open burtenshaw opened this issue 4 months ago • 0 comments

Describe the bug

When trying to export a dataset with a chat field using the to_datasets method, I get a serialization error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:188, in TypedSequence.__arrow_array__(self, type)
    [187](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:187)     trying_cast_to_python_objects = True
--> [188](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:188)     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    [189](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:189) # use smaller integer precisions if possible

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:368, in pyarrow.lib.array()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:42, in pyarrow.lib._sequence_to_array()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert ChatFieldValue(role='user', content='Who invented python?') with type ChatFieldValue: did not recognize Python value type when inferring an Arrow data type

During handling of the above exception, another exception occurred:

ArrowInvalid                              Traceback (most recent call last)
Cell In[26], [line 1](vscode-notebook-cell:?execution_count=26&line=1)
----> [1](vscode-notebook-cell:?execution_count=26&line=1) dataset.records(with_responses=True).to_datasets()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_dataset_records.py:131, in DatasetRecordsIterator.to_datasets(self)
    [130](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_dataset_records.py:130) def to_datasets(self) -> "HFDataset":
--> [131](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_dataset_records.py:131)     return HFDatasetsIO.to_datasets(records=list(self), dataset=self.__dataset)

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:185, in HFDatasetsIO.to_datasets(records, dataset)
    [178](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:178) """
    [179](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:179) Export the records to a Hugging Face dataset.
    [180](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:180) 
    [181](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:181) Returns:
    [182](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:182)     The dataset containing the records.
    [183](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:183) """
    [184](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:184) record_dicts = GenericIO.to_dict(records, flatten=True)
--> [185](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:185) hf_dataset = HFDataset.from_dict(record_dicts)
    [186](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:186) hf_dataset = HFDatasetsIO._uncast_argilla_attributes_to_datasets(hf_dataset, dataset.schema)
    [187](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:187) return hf_dataset

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:931, in Dataset.from_dict(cls, mapping, features, info, split)
    [929](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:929)     arrow_typed_mapping[col] = data
    [930](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:930) mapping = arrow_typed_mapping
--> [931](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:931) pa_table = InMemoryTable.from_pydict(mapping=mapping)
    [932](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:932) if info is None:
    [933](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:933)     info = DatasetInfo()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:757, in InMemoryTable.from_pydict(cls, *args, **kwargs)
    [741](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:741) @classmethod
    [742](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:742) def from_pydict(cls, *args, **kwargs):
    [743](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:743)     """
    [744](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:744)     Construct a Table from Arrow arrays or columns.
    [745](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:745) 
   (...)
    [755](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:755)         `datasets.table.Table`
    [756](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:756)     """
--> [757](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:757)     return cls(pa.Table.from_pydict(*args, **kwargs))

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:1920, in pyarrow.lib._Tabular.from_pydict()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:6136, in pyarrow.lib._from_pydict()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:398, in pyarrow.lib.asarray()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:248, in pyarrow.lib.array()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:112, in pyarrow.lib._handle_arrow_array_protocol()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:256, in TypedSequence.__arrow_array__(self, type)
    [254](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:254)     return out
    [255](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:255) elif trying_cast_to_python_objects and "Could not convert" in str(e):
--> [256](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:256)     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True, optimize_list_casting=False))
    [257](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:257)     if type is not None:
    [258](https://file+.vscode-resource.vscode-cdn.net/Users/ben/code/argilla-llama-index/docs/tutorials/~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:258)         out = cast_array_to_feature(out, type, allow_primitive_to_str=True, allow_decimal_to_str=True)

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:368, in pyarrow.lib.array()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:42, in pyarrow.lib._sequence_to_array()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert ChatFieldValue(role='user', content='Who invented python?') with type ChatFieldValue: did not recognize Python value type when inferring an Arrow data type

To Reproduce

dataset.records(with_responses=True).to_datasets()

Expected behavior

Chat fields should be converted into a dict.

Screenshots

Environment (please complete the following information):

  • OS [e.g. iOS]:
  • Browser [e.g. chrome, safari]:
  • Argilla Version [e.g. 1.0.0]:
  • ElasticSearch Version [e.g. 7.10.2]:
  • Docker Image (optional) [e.g. argilla:v1.0.0]:

Additional context

burtenshaw avatar Sep 30 '24 17:09 burtenshaw