
Performance issues on large datasets

Open aarondr77 opened this issue 2 years ago • 4 comments

  1. Generate two very large (but realistically sized, matching a user I just worked with) dataframes using this code:
```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0, 10000, size=(10000000, 200)))
df2 = pd.DataFrame(np.random.randint(0, 10000, size=(10000000, 200)))
```
  2. Create a mitosheet and import these dataframes through the import dataframes taskpane.
  3. Merge them together: change the lookup type, the merge keys, and the columns selected.
  4. The merge has now been loading for about 20 minutes and the new dataframe still has not been created.
  5. Watch the loading indicator -- once the first dataframe does finish being created, Mito is going to try to create it again many times over: Screenshot 2023-02-08 at 11 16 17 AM
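For benchmarking, the merge the taskpane performs can be reproduced directly in pandas. This is a hedged sketch, not Mito's actual generated code: the merge key (column `0`), `how="left"`, and the suffixes are assumptions, and the frame sizes are scaled down so the script finishes quickly; raise `ROWS` toward the 10,000,000 in the repro above to observe the slowdown.

```python
import time

import numpy as np
import pandas as pd

# Smaller stand-in for the 10M-row x 200-column frames in the repro;
# the original report uses size=(10_000_000, 200).
ROWS, COLS = 100_000, 20
df1 = pd.DataFrame(np.random.randint(0, 10_000, size=(ROWS, COLS)))
df2 = pd.DataFrame(np.random.randint(0, 10_000, size=(ROWS, COLS)))

start = time.perf_counter()
# A lookup-style merge on column 0, roughly what the taskpane builds.
merged = pd.merge(df1, df2, on=0, how="left", suffixes=("_df1", "_df2"))
elapsed = time.perf_counter() - start
print(f"merge took {elapsed:.2f}s, result shape {merged.shape}")
```

Note that with random keys the left merge fans out (each key in `df1` matches many rows in `df2`), so the result can be far larger than either input, which is part of why memory and time blow up at the reported scale.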

Some ideas:

  1. Benchmark where the merge is taking so long and see if there is anything we can do to speed it up.
  2. When the datasets are so large, move live updating task panes to useSendEditOnClick so that they don't try to create the dataframe so many times
  3. Add a warning message in the merge taskpane that this is going to take a very long time
  4. When the user imports such a large dataset, show a message that a dataset this large might cause Mito to perform slowly, because the data is too large for pandas to handle quickly. While building the script, recommend working with just the first 100k rows; then, once the script is finalized, the user can remove the .head(100000) call and run the script on the full dataset.
  5. If the user imports a large dataset, show a popup asking "do you want to take a random sample of the data to use while building this analysis?"
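Ideas 4 and 5 could share one hook at import time. A minimal sketch, assuming a hypothetical `maybe_sample` helper and a 100k-row cap (the helper name and the threshold are not from Mito's codebase):

```python
import numpy as np
import pandas as pd

SAMPLE_SIZE = 100_000  # hypothetical cap from idea 4


def maybe_sample(df: pd.DataFrame, n: int = SAMPLE_SIZE) -> pd.DataFrame:
    """Return a random n-row sample of df for use while building the
    analysis; small frames pass through untouched. The finished script
    would skip this step and run on the full dataset."""
    if len(df) <= n:
        return df
    # random_state pinned so the working sample is stable across reruns
    return df.sample(n=n, random_state=0)


# A frame over the cap gets sampled down; a small one is returned as-is.
df = pd.DataFrame(np.random.randint(0, 10_000, size=(500_000, 5)))
small = maybe_sample(df)
print(small.shape)  # (100000, 5)
```

Using `df.sample` rather than `df.head` (idea 5 vs. idea 4) avoids biasing the working set toward whatever happens to be at the top of the file, at the cost of a non-reproducible preview unless the random state is pinned.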

aarondr77 Feb 08 '23 16:02