modin
modin copied to clipboard
PERF: avoid to_pandas in join
I've been looking into what it will take to avoid the right = right.to_pandas()
in QueryCompiler.join
. My thought is to re-use as much of the logic in pandas.core.reshape.merge
as we can, adapting it where necessary to take modin.pandas.DataFrame where relevant.
This would require adding to methods to DataFrame: _check_label_or_level_ambiguity
, _is_label_or_level_reference
, _is_level_reference
, _is_label_reference
, _get_label_or_level_values
, _drop_labels_or_levels
. Some of these are easy to port, but _get_label_or_level_values
a) uses xs
, which modin current defaults_to_pandas, and b) uses Series._values
, which modin doesn't currently have. To get the latter we'd need to implement some kind of distributed ExtensionArray
. I can do this, but it's a pretty big task. Having such a thing would make a lot of pandas-reuse feasible.
Then we'd need to adapt some of the lower-level libjoin and hashtable code (particularly in _factorize_keys). The latter would probably also get us an efficient implementation of Series.factorize
.
Finally we'd need an efficient take
to do something equivalent to concatenate_block_managers
.
On further reading, i'm wondering if we can cheaply convert to a dask dataframe and just use their join implementation
Closed in https://github.com/modin-project/modin/pull/6850.