
Unpinning Dask causes issue when saving image features to parquet

Open geoffreyangus opened this issue 3 years ago • 0 comments

Describe the bug When I unpin Dask (at the time of writing, this installs dask==2022.7.1), I get the following error when running `pytest tests/integration_tests/test_ray.py::test_ray_image`:

E                       ray.exceptions.RayTaskError(ValueError): ray::dask:('store-to-parquet-3342a5973f9b624adbc7ab7b6c59c3c3', 0) (pid=93553, ip=127.0.0.1)
E                         At least one of the input arguments for this task could not be computed:
E                       ray.exceptions.RayTaskError: ray::dask:('to-parquet-3342a5973f9b624adbc7ab7b6c59c3c3', 25) (pid=93553, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/util/dask/scheduler.py", line 432, in dask_task_wrapper
E                           result = func(*actual_args)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/optimization.py", line 990, in __call__
E                           return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/core.py", line 149, in get
E                           result = _execute_task(task, cache)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
E                           return func(*(_execute_task(a, cache) for a in args))
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 163, in __call__
E                           return self.engine.write_partition(
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 686, in write_partition
E                           t = cls._pandas_to_arrow_table(df, preserve_index=preserve_index, schema=schema)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 647, in _pandas_to_arrow_table
E                           raise ValueError(
E                       ValueError: Failed to convert partition to expected pyarrow schema:
E                           `ArrowTypeError("Expected bytes, got a 'numpy.ndarray' object", 'Conversion failed for column image_AA50B_vXMs1R with type object')`
E                       
E                       Expected partition schema:
E                           binary_B525D_mZFLky: bool
E                           image_AA50B_vXMs1R: string
E                       
E                       Received partition schema:
E                           binary_B525D_mZFLky: bool
E                           image_AA50B_vXMs1R: list<item: uint8>
E                             child 0, item: uint8
E                       
E                       This error *may* be resolved by passing in schema information for
E                       the mismatched column(s) using the `schema` keyword in `to_parquet`.

To Reproduce Steps to reproduce the behavior:

  1. Unpin and upgrade Dask (at the time of writing, this installs dask==2022.7.1)
  2. Run `pytest tests/integration_tests/test_ray.py::test_ray_image`

Expected behavior I expected to be able to write Parquet files from DataFrame columns that contain NumPy array objects.

Environment (please complete the following information):

  • OS: macOS 12.3.1
  • Python version: 3.9
  • Ludwig version: 0.6.dev0

Additional context This is likely caused by us reading images in as NumPy arrays and storing them in the Dask DataFrame. The newer Dask expects the image column to be a string, while the partition actually holds `list<item: uint8>` values. We will likely need to pass a PyArrow schema to `dask.dataframe.DataFrame.to_parquet` that tells it to write the image column as an array type instead of a string/bytes type.

geoffreyangus, Jul 29 '22 20:07