modin icon indicating copy to clipboard operation
modin copied to clipboard

Use Ray design patterns and avoid anti-patterns

Open YarShev opened this issue 2 years ago • 2 comments

Having read design patterns and anti-patterns in Ray docs I think that we could use/avoid some of them in our code to gain benefit/performance. Here are some examples.

Individual issues/tasks can be created from this list in order to be investigated and resolved separately/more granularly.

I encourage everyone from @modin-project/modin-core to read Ray design patterns and anti-patterns so we could probably see more gaps in our code.

Aside: resolving the concrete issues we should be careful regarding performance of every engine because that may speed up one engine, but slow down other.

YarShev avatar Jan 05 '23 13:01 YarShev

@YarShev I was looking into Anti-pattern: Passing the same large argument by value repeatedly harms performance. In query_compiler.merge and query_compiler.join operations.

The large values used for eg in right_pandas = right.to_pandas() are passed as default arguments to the map_func. As map_func function is put() to ray object storage the arguments are also serialized and put to the object storage which I confirmed experimenting on a minimal reproducer.

Is it still required to put these arguments again to object storage and access them from the ray remote function?

arunjose696 avatar Jan 04 '24 17:01 arunjose696

We can use broadcast_apply_full_axis and pass the right dataframe in to it. Further in the flow we just pass object references of the right dataframe in to ray remote functions. Ray materializes the object references to pandas dataframes without copy as pandas dataframe supports out of band serialization. Also, we are not blocked in the main process by calling to_pandas.

YarShev avatar Jan 08 '24 13:01 YarShev