                        Custom to_dask() implementation to include meta during Dask DataFrame creation
Ray's to_dask() function is called when predictions are being generated, but Ray's implementation doesn't pass any meta information to dd.from_delayed. In some instances (a small dataset with a large number of partitions, empty partitions, etc.), there isn't enough data in the first Dask partition for Dask to infer the meta correctly, or at all.
This PR updates from_ray_dataset within our DaskEngine to use a custom implementation that builds meta information from the first 100 rows of the Ray dataset and supplies it when creating the Dask DataFrame from the Ray Dataset. Dask no longer needs to infer the meta for the dataframe, which avoids inference failures in downstream transformations and gives us more downstream control.
A workaround is also included: in Ray 1.12 and 1.13, the column produced by read_binary_files was cast to type object before being passed into Dask, but in Ray nightly it is now cast to type TensorDtype (thanks for the catch @geoffreyangus). This change is also needed for https://github.com/ludwig-ai/ludwig/pull/2241 to pass the tests on Ray nightly.
Unit Test Results
5 files ±0&nbsp;&nbsp;5 suites ±0&nbsp;&nbsp;1h 55m 6s :stopwatch: +11s
2 945 tests +4&nbsp;&nbsp;2 896 :heavy_check_mark: +4&nbsp;&nbsp;49 :zzz: ±0&nbsp;&nbsp;0 :x: ±0
8 710 runs +12&nbsp;&nbsp;8 548 :heavy_check_mark: +12&nbsp;&nbsp;162 :zzz: ±0&nbsp;&nbsp;0 :x: ±0
Results for commit 8355fe2a. ± Comparison against base commit a2bce0ed.
:recycle: This comment has been updated with latest results.
Converting this to a draft PR for now, since we haven't actually run into this issue yet.
Closing this PR in favor of https://github.com/ludwig-ai/ludwig/pull/2728