modin
modin copied to clipboard
Use Ray design patterns and avoid anti-patterns
Having read design patterns and anti-patterns in Ray docs I think that we could use/avoid some of them in our code to gain benefit/performance. Here are some examples.
- Using generators to reduce heap memory usage in remote functions.
- Anti-pattern: Calling ray.get in a loop harms parallelism. We have some places where we call ray.get() (in particular, .width() that calls ray.get() under the hood) in a loop. One of the examples is here.
- Anti-pattern: Passing the same large argument by value repeatedly harms performance. We use this anti-pattern in query_compiler.merge and query_compiler.join operations, for instance.
- Anti-pattern: Closure capturing large objects harms performance. We use this anti-pattern in ~query_compiler.corr~ and query_compiler.groupby_agg operations, for instance.
Individual issues/tasks can be created from this list in order to be investigated and resolved separately/more granularly.
I encourage everyone from @modin-project/modin-core to read Ray design patterns and anti-patterns so we could probably see more gaps in our code.
Aside: resolving the concrete issues we should be careful regarding performance of every engine because that may speed up one engine, but slow down other.
@YarShev I was looking into Anti-pattern: Passing the same large argument by value repeatedly harms performance. In query_compiler.merge and query_compiler.join operations.
The large values used for eg in right_pandas = right.to_pandas() are passed as default arguments to the map_func. As map_func function is put() to ray object storage the arguments are also serialized and put to the object storage which I confirmed experimenting on a minimal reproducer.
Is it still required to put these arguments again to object storage and access them from the ray remote function?
We can use broadcast_apply_full_axis and pass the right dataframe in to it. Further in the flow we just pass object references of the right dataframe in to ray remote functions. Ray materializes the object references to pandas dataframes without copy as pandas dataframe supports out of band serialization. Also, we are not blocked in the main process by calling to_pandas.