[BUG] ValueError raised in write_buffer.py when pyarrow.Table.cast is called

Open afolabiaji opened this issue 5 months ago • 1 comments

When running my pipeline, I seem to be getting this error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 734, in _output_queue_loop
    self._process_batch(batch)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 794, in _process_batch
    self._write_buffer.add_batch(batch)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 102, in add_batch
    self._write(step_name)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 135, in _write
    table = table.cast(new_schema)
  File "pyarrow/table.pxi", line 4547, in pyarrow.lib.Table.cast

ValueError: Target schema's field names are not matching the table's field names: ['listing_id', 'listing_text', 'profiles', 'instruction', 'generation', 'model_name', 'cv_sections'], ['cv_sections', 'profiles', 'listing_id', 'listing_text', 'instruction', 'generation', 'model_name']

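It looks like the fields of the table's schema and of new_schema have to be in exactly the same order, or else pyarrow.Table.cast raises this error. In the traceback above, both lists contain the same seven field names; only their order differs (for example, 'cv_sections' is last in one list and first in the other). There is even a GitHub issue in which the pyarrow maintainers discuss whether they should relax this constraint (https://github.com/apache/arrow/issues/27425).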

I think write_buffer.py needs some logic that reorders the fields of new_schema to match the table's column order before calling table.cast(new_schema), so that this error is avoided.

afolabiaji, Aug 29 '24 14:08