[BUG-SDK] ChatField cannot be serialized by datasets
Describe the bug
When I try to export a dataset with a chat field using the to_datasets method, I get a serialization error:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:188, in TypedSequence.__arrow_array__(self, type)
    187 trying_cast_to_python_objects = True
--> 188 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    189 # use smaller integer precisions if possible
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:368, in pyarrow.lib.array()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:42, in pyarrow.lib._sequence_to_array()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowInvalid: Could not convert ChatFieldValue(role='user', content='Who invented python?') with type ChatFieldValue: did not recognize Python value type when inferring an Arrow data type
During handling of the above exception, another exception occurred:
ArrowInvalid Traceback (most recent call last)
Cell In[26], line 1
----> 1 dataset.records(with_responses=True).to_datasets()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_dataset_records.py:131, in DatasetRecordsIterator.to_datasets(self)
    130 def to_datasets(self) -> "HFDataset":
--> 131 return HFDatasetsIO.to_datasets(records=list(self), dataset=self.__dataset)
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/argilla/records/_io/_datasets.py:185, in HFDatasetsIO.to_datasets(records, dataset)
    178 """
    179 Export the records to a Hugging Face dataset.
    180
    181 Returns:
    182 The dataset containing the records.
    183 """
    184 record_dicts = GenericIO.to_dict(records, flatten=True)
--> 185 hf_dataset = HFDataset.from_dict(record_dicts)
    186 hf_dataset = HFDatasetsIO._uncast_argilla_attributes_to_datasets(hf_dataset, dataset.schema)
    187 return hf_dataset
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:931, in Dataset.from_dict(cls, mapping, features, info, split)
    929 arrow_typed_mapping[col] = data
    930 mapping = arrow_typed_mapping
--> 931 pa_table = InMemoryTable.from_pydict(mapping=mapping)
    932 if info is None:
    933 info = DatasetInfo()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/table.py:757, in InMemoryTable.from_pydict(cls, *args, **kwargs)
    741 @classmethod
    742 def from_pydict(cls, *args, **kwargs):
    743 """
    744 Construct a Table from Arrow arrays or columns.
    745
(...)
    755 `datasets.table.Table`
    756 """
--> 757 return cls(pa.Table.from_pydict(*args, **kwargs))
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:1920, in pyarrow.lib._Tabular.from_pydict()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:6136, in pyarrow.lib._from_pydict()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:398, in pyarrow.lib.asarray()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:248, in pyarrow.lib.array()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:112, in pyarrow.lib._handle_arrow_array_protocol()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/datasets/arrow_writer.py:256, in TypedSequence.__arrow_array__(self, type)
    254 return out
    255 elif trying_cast_to_python_objects and "Could not convert" in str(e):
--> 256 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True, optimize_list_casting=False))
    257 if type is not None:
    258 out = cast_array_to_feature(out, type, allow_primitive_to_str=True, allow_decimal_to_str=True)
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:368, in pyarrow.lib.array()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/array.pxi:42, in pyarrow.lib._sequence_to_array()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/code/argilla-llama-index/.venv/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowInvalid: Could not convert ChatFieldValue(role='user', content='Who invented python?') with type ChatFieldValue: did not recognize Python value type when inferring an Arrow data type
To Reproduce
dataset.records(with_responses=True).to_datasets()
Expected behavior
Chat field values should be converted into plain dicts (one per message, with role and content keys) so that datasets can serialize them.
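A rough sketch of the kind of conversion I have in mind, not the actual SDK code. The ChatFieldValue class below is a hypothetical stand-in (the real class is only visible through its repr in the traceback), to_plain is a made-up helper name, and the "chat" column name and the column-to-list shape of record_dicts are assumptions based on the HFDataset.from_dict call above.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any


@dataclass
class ChatFieldValue:
    """Hypothetical stand-in for the SDK's chat message object."""

    role: str
    content: str


def to_plain(value: Any) -> Any:
    """Recursively replace message objects with plain dicts (hypothetical helper)."""
    if is_dataclass(value) and not isinstance(value, type):
        return asdict(value)
    if hasattr(value, "model_dump"):
        # Assumption: the real object may be a pydantic v2-style model.
        return value.model_dump()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Assumed shape: one column per field, one list entry per record.
record_dicts = {
    "chat": [[ChatFieldValue(role="user", content="Who invented python?")]],
}

plain = to_plain(record_dicts)
print(plain)
```

After a step like this, the mapping only contains lists, dicts and strings, which is the kind of input Arrow type inference can handle.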
Screenshots
Environment (please complete the following information):
- OS [e.g. iOS]:
- Browser [e.g. chrome, safari]:
- Argilla Version [e.g. 1.0.0]:
- ElasticSearch Version [e.g. 7.10.2]:
- Docker Image (optional) [e.g. argilla:v1.0.0]:
Additional context