modin icon indicating copy to clipboard operation
modin copied to clipboard

PERF: avoid to_pandas in join

Open jbrockmendel opened this issue 2 years ago • 1 comments

I've been looking into what it will take to avoid the right = right.to_pandas() in QueryCompiler.join. My thought is to re-use as much of the logic in pandas.core.reshape.merge as we can, adapting it where necessary to take modin.pandas.DataFrame where relevant.

This would require adding to methods to DataFrame: _check_label_or_level_ambiguity, _is_label_or_level_reference, _is_level_reference, _is_label_reference, _get_label_or_level_values, _drop_labels_or_levels. Some of these are easy to port, but _get_label_or_level_values a) uses xs, which modin current defaults_to_pandas, and b) uses Series._values, which modin doesn't currently have. To get the latter we'd need to implement some kind of distributed ExtensionArray. I can do this, but it's a pretty big task. Having such a thing would make a lot of pandas-reuse feasible.

Then we'd need to adapt some of the lower-level libjoin and hashtable code (particularly in _factorize_keys). The latter would probably also get us an efficient implementation of Series.factorize.

Finally we'd need an efficient take to do something equivalent to concatenate_block_managers.

jbrockmendel avatar Aug 16 '22 21:08 jbrockmendel

On further reading, i'm wondering if we can cheaply convert to a dask dataframe and just use their join implementation

jbrockmendel avatar Aug 18 '22 17:08 jbrockmendel

Closed in https://github.com/modin-project/modin/pull/6850.

YarShev avatar Jan 11 '24 20:01 YarShev