Ludwig does not gracefully handle empty partitions during saving
When training with a Ray/Dask backend, if any DataFrame partitions are empty after dataset splitting and preprocessing, Ray raises the following error:
E ray.exceptions.RayTaskError(AssertionError): ray::_get_read_tasks() (pid=10328, ip=127.0.0.1)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/read_api.py", line 1136, in _get_read_tasks
E reader = ds.create_reader(**kwargs)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 167, in create_reader
E return _ParquetDatasourceReader(**kwargs)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 230, in __init__
E self._encoding_ratio = self._estimate_files_encoding_ratio()
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 318, in _estimate_files_encoding_ratio
E sample_ratios = ray.get(futures)
E ray.exceptions.RayTaskError(AssertionError): ray::_sample_piece() (pid=10352, ip=127.0.0.1)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 437, in _sample_piece
E assert num_rows > 0 and metadata.num_rows > 0, (
E AssertionError: Sampled number of rows: 0 and total number of rows: 0 should be positive
To Reproduce
Run the following unit test with num_examples=20 and npartitions=10.
pytest -xsrP tests/integration_tests/test_preprocessing.py::test_dask_known_divisions
Environment
- OS: macOS
- Version: 12.3.1
- Python version: 3.9
- Ludwig version: 0.6.dev0
- Ray version: nightly (July 28th, 2022)
Adding a more permanent solution in this PR: #2328