distilabel
[BUG] ValueError raised in write_buffer.py when pyarrow.Table.cast is called
When running my pipeline, I get this error:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 734, in _output_queue_loop
    self._process_batch(batch)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 794, in _process_batch
    self._write_buffer.add_batch(batch) # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 102, in add_batch
    self._write(step_name)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 135, in _write
    table = table.cast(new_schema)
  File "pyarrow/table.pxi", line 4547, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['listing_id', 'listing_text', 'profiles', 'instruction', 'generation', 'model_name', 'cv_sections'], ['cv_sections', 'profiles', 'listing_id', 'listing_text', 'instruction', 'generation', 'model_name']
It looks like pyarrow.Table.cast requires the fields of the table's schema and of new_schema to be in exactly the same order, and raises this error otherwise. The two lists in the message contain the same seven field names, just in a different order. There is even a GitHub issue where the pyarrow maintainers discuss whether to relax this constraint (https://github.com/apache/arrow/issues/27425).
To avoid this, I think write_buffer.py needs some logic that reorders the fields of new_schema to match the table's column order before calling table.cast.