modin
Performance: .isin lookups slower than Pandas
I have a recursive function that I run on a 23 million row dataframe. The effective steps that I take are:
from distributed import Client
client = Client(n_workers=1) # or 6
import modin.pandas as pd
# load df and set index
df = pd.read_csv(r'df.csv', low_memory=False)
df = df.set_index('SOURCE_ID')
# initial parents value; it can be n-sized list of similar strings as recursion proceeds
parents = ['ZA116F0F91154DC09137D3B1E369C4E9']
# recursive step; repeat until nothing found
parents = df[df.index.isin(parents)]['RELATED_ID'].to_numpy()
I have only been able to use the Dask engine (Windows user). With plain pandas, a single lookup takes about 10 seconds. With Modin + Dask and 6 workers, it takes about 50 seconds. With n_workers set to 1, it takes about 90 seconds.
The warnings I get, if substantive, are:
UserWarning: `DataFrame.__getitem__` for empty DataFrame defaulting to pandas implementation.
To request implementation, send an email to [email protected].
UserWarning: Distributing <class 'pandas.core.frame.DataFrame'> object. This may take some time.
UserWarning: `DataFrame.to_numpy` for empty DataFrame defaulting to pandas implementation.
Is what I am doing out of scope for potential benefits?
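For readers who want to reproduce the pattern without the original CSV, the recursive lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `collect_related` function, the `seen` cycle guard, and the toy data are hypothetical and not part of the original post, and plain pandas is used so the sketch runs anywhere (swapping the import for `import modin.pandas as pd` is the intended variation).

```python
import numpy as np
import pandas as pd  # assumption: replace with `import modin.pandas as pd` to run on Modin


def collect_related(df, start_ids):
    """Follow SOURCE_ID -> RELATED_ID links until no new IDs are found."""
    seen = set(start_ids)  # cycle guard; an addition, not in the original post
    found = []
    parents = np.asarray(list(start_ids), dtype=object)
    while len(parents) > 0:
        # The recursive step from the post: membership test on the index,
        # then selection of the RELATED_ID column.
        children = df[df.index.isin(parents)]["RELATED_ID"].to_numpy()
        # Keep first-seen order and drop IDs that were already visited.
        new = [c for c in dict.fromkeys(children) if c not in seen]
        seen.update(new)
        found.extend(new)
        parents = np.asarray(new, dtype=object)
    return found


# Toy data standing in for the 23-million-row CSV (values are made up).
toy = pd.DataFrame(
    {
        "SOURCE_ID": ["A", "A", "B", "C", "D"],
        "RELATED_ID": ["B", "C", "D", "A", "E"],
    }
).set_index("SOURCE_ID")

result = collect_related(toy, ["A"])
print(result)
```

The loop form replaces explicit recursion but performs the same step each round, so the `isin` lookup is still the part that dominates on a large frame.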
Hi @afogarty85 thanks for posting!
In this case, it is likely the to_numpy call that is causing the slowdown. Would you be able to split that line and time df[df.index.isin(parents)]['RELATED_ID'] separately from the to_numpy call so we can make sure?
Thanks again for posting, it's not out of scope.
Thanks for the feedback! I split the timings, and they suggest that to_numpy() is probably not the culprit.
t1 accounts for the main isin search operation
t2 accounts for to_numpy(), some np.concatenate and data storage, and calling the function again for recursion.
t1: 50.28409809999994
t2: 0.6318219999998291
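A split like the one reported above might look like the following minimal sketch. The toy frame and variable names are assumptions for illustration, and plain pandas is used so it runs without a cluster. The actual script also included `np.concatenate`, data storage, and the recursive call in t2, which are left out here.

```python
import time

import pandas as pd  # assumption: the original used `import modin.pandas as pd`

# Toy data standing in for the real frame (values are made up).
toy = pd.DataFrame(
    {
        "SOURCE_ID": ["A", "A", "B", "C"],
        "RELATED_ID": ["B", "C", "D", "A"],
    }
).set_index("SOURCE_ID")
parents = ["A"]

# t1: only the isin lookup and column selection.
start = time.perf_counter()
selected = toy[toy.index.isin(parents)]["RELATED_ID"]
t1 = time.perf_counter() - start

# t2: only the conversion to a NumPy array.
start = time.perf_counter()
children = selected.to_numpy()
t2 = time.perf_counter() - start

print(f"t1 (isin lookup): {t1:.6f}s")
print(f"t2 (to_numpy):    {t2:.6f}s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short intervals.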
Thanks for the update @afogarty85! We will take a deeper look.
Hi @afogarty85! I'm trying to reproduce this issue - would it be possible to share the dataset with me so I can take a closer look at what might be happening? Also do you mind sharing what version of Modin you're using?
@afogarty85 we've recently merged a PR that should improve some cases of [] lookup. Would you be able to run off master and try it again? Since we don't have a way to reproduce the issue ourselves, it's hard to tell whether https://github.com/modin-project/modin/pull/4753 actually fixes the issue without your help. Thanks in advance!
Sorry for the delay. There is a notable improvement over last time, when a single iteration took roughly 50s. Each iteration now takes roughly 5s with this setup:
from distributed import Client
client = Client(n_workers=6)
import modin.pandas as pd
import numpy as np
import time
from modin.config import Engine
Engine.put("dask") # Modin will use Dask
@afogarty85 that's good to hear. Is pandas still taking about 10 seconds for each iteration? Are you satisfied with Modin's performance? Even if you are, I'm interested in how long each Modin function takes. You may be able to profile Modin's performance more accurately if you turn on benchmark mode, either by setting MODIN_BENCHMARK_MODE=TRUE or by running from modin.config import BenchmarkMode; BenchmarkMode.put(True) at the very beginning of the script.
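As a rough illustration of the suggestion above, benchmark mode could be combined with per-step timing along these lines. The `timed` helper, the `timings` dict, and the toy data are hypothetical names introduced for this sketch, and the pandas fallback is only there so the code still runs where Modin is not installed. Benchmark mode is understood to make Modin run operations synchronously, so each timed block should include the work it triggers.

```python
import os
import time
from contextlib import contextmanager

# Set the variable before importing Modin so the setting is in place from the start.
os.environ["MODIN_BENCHMARK_MODE"] = "TRUE"

try:
    import modin.pandas as pd
    from modin.config import BenchmarkMode

    BenchmarkMode.put(True)  # in-code switch, as mentioned in the comment above
except ImportError:
    import pandas as pd  # assumption: fallback so the sketch runs without Modin

timings = {}  # hypothetical container: label -> accumulated seconds


@contextmanager
def timed(label):
    """Accumulate the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


# Toy data standing in for the real frame (values are made up).
df = pd.DataFrame(
    {
        "SOURCE_ID": ["A", "A", "B", "C"],
        "RELATED_ID": ["B", "C", "D", "A"],
    }
).set_index("SOURCE_ID")
parents = ["A"]

with timed("isin"):
    mask = df.index.isin(parents)
with timed("getitem"):
    selected = df[mask]["RELATED_ID"]
with timed("to_numpy"):
    children = selected.to_numpy()

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f}s")
```

Timing each call separately, rather than the whole line, makes it easier to see whether the membership test, the `[]` selection, or the conversion accounts for most of an iteration.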
@afogarty85, did you have a chance to test the latest Modin?