                        Custom to_dask() implementation to include meta during Dask DataFrame creation
Ray's to_dask() function is called when predictions are being generated, but Ray's implementation doesn't pass any meta information to dd.from_delayed. In some instances (a small dataset with a large number of partitions, empty partitions, etc.), there isn't enough data in the first Dask partition for Dask to infer the meta correctly, or at all.
This PR updates from_ray_dataset within our DaskEngine to use a custom implementation that builds meta information from the first 100 rows of the Ray dataset and supplies it when creating the Dask DataFrame from the Ray Dataset. Dask no longer needs to infer the meta for the dataframe, which avoids inference failures in downstream transformations and gives us more downstream control.
A workaround is also included: in Ray 1.12 and 1.13, the column produced by read_binary_files was cast to type object before being passed into Dask, but in Ray nightly it is now cast to type TensorDtype (thanks for the catch @geoffreyangus). This change is also needed for https://github.com/ludwig-ai/ludwig/pull/2241 to pass the tests on Ray nightly.
Unit Test Results
5 files ±0&nbsp;&nbsp;5 suites ±0&nbsp;&nbsp;1h 55m 6s :stopwatch: +11s
2 945 tests +4&nbsp;&nbsp;2 896 :heavy_check_mark: +4&nbsp;&nbsp;49 :zzz: ±0&nbsp;&nbsp;0 :x: ±0
8 710 runs +12&nbsp;&nbsp;8 548 :heavy_check_mark: +12&nbsp;&nbsp;162 :zzz: ±0&nbsp;&nbsp;0 :x: ±0
Results for commit 8355fe2a. ± Comparison against base commit a2bce0ed.
:recycle: This comment has been updated with latest results.
Converting this to a draft PR for now, since we haven't actually run into this issue yet.
Closing this PR in favor of https://github.com/ludwig-ai/ludwig/pull/2728