Ludwig does not gracefully handle empty partitions during saving

Open • geoffreyangus opened this issue 3 years ago • 1 comment

If empty DataFrame partitions remain after dataset splitting and preprocessing (when training with a Ray/Dask backend), Ray raises the following error:

E                       ray.exceptions.RayTaskError(AssertionError): ray::_get_read_tasks() (pid=10328, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/read_api.py", line 1136, in _get_read_tasks
E                           reader = ds.create_reader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 167, in create_reader
E                           return _ParquetDatasourceReader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 230, in __init__
E                           self._encoding_ratio = self._estimate_files_encoding_ratio()
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 318, in _estimate_files_encoding_ratio
E                           sample_ratios = ray.get(futures)
E                       ray.exceptions.RayTaskError(AssertionError): ray::_sample_piece() (pid=10352, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 437, in _sample_piece
E                           assert num_rows > 0 and metadata.num_rows > 0, (
E                       AssertionError: Sampled number of rows: 0 and total number of rows: 0 should be positive

To Reproduce

Run the following unit test with num_examples=20 and npartitions=10.

pytest -xsrP tests/integration_tests/test_preprocessing.py::test_dask_known_divisions
  • OS: macOS
  • OS version: 12.3.1
  • Python version: 3.9
  • Ludwig version: 0.6.dev0
  • Ray version: nightly (July 28th, 2022)
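
As a rough illustration of why the reproduction settings above (num_examples=20, npartitions=10) leave empty partitions behind, here is a pure-Python model. It is not Ludwig, Dask, or Ray code. It assumes the split filters each partition independently and keeps the partition count unchanged, as partition-wise filtering does. The 70/10/20 split ratio, the random seed, and all helper names are assumptions made for the sketch.

```python
import random
from typing import Dict, List, Set


def make_partitions(num_examples: int, npartitions: int) -> List[List[int]]:
    """Distribute row ids 0..num_examples-1 over contiguous partitions."""
    base, extra = divmod(num_examples, npartitions)
    partitions, start = [], 0
    for i in range(npartitions):
        size = base + (1 if i < extra else 0)
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions


def filter_partitions(partitions: List[List[int]], keep: Set[int]) -> List[List[int]]:
    """Filter every partition on its own; the number of partitions never changes."""
    return [[row for row in part if row in keep] for part in partitions]


def count_empty(partitions: List[List[int]]) -> int:
    """Number of partitions that hold no rows."""
    return sum(1 for part in partitions if not part)


# Settings from the reproduction steps.
NUM_EXAMPLES, NPARTITIONS = 20, 10
partitions = make_partitions(NUM_EXAMPLES, NPARTITIONS)

# Hypothetical random 70/10/20 split (ratio and seed are assumptions).
rng = random.Random(0)
row_ids = list(range(NUM_EXAMPLES))
rng.shuffle(row_ids)
splits: Dict[str, Set[int]] = {
    "train": set(row_ids[:14]),
    "validation": set(row_ids[14:16]),
    "test": set(row_ids[16:]),
}

for name, keep in splits.items():
    filtered = filter_partitions(partitions, keep)
    sizes = [len(part) for part in filtered]
    print(f"{name}: partition sizes={sizes}, empty={count_empty(filtered)}")
```

Under these assumptions a split with k rows can occupy at most k of the 10 partitions, so the 2-row validation split always leaves at least 8 partitions empty, whichever rows are chosen. Any such partition that is written out as a zero-row file would match the "total number of rows: 0" condition in the assertion above.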

geoffreyangus, Jul 28 '22 16:07

Adding a more permanent solution in this PR: #2328

geoffreyangus, Jul 28 '22 21:07