
Performance issues on large datasets

Open aarondr77 opened this issue 2 years ago • 4 comments

  1. Generate two very large (but realistically sized, matching a user I just worked with) dataframes using this code:
```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0, 10000, size=(10000000, 200)))
df2 = pd.DataFrame(np.random.randint(0, 10000, size=(10000000, 200)))
```
  2. Create a mitosheet and import these dataframes through the import dataframes taskpane.
  3. Merge them together: change the lookup type, the merge keys, and the columns selected.
  4. The merge has now been loading for about 20 minutes and the new dataframe still has not been created.
  5. Watch the loading indicator -- once the first dataframe does finish being created, Mito is going to try to create it again many times over: Screenshot 2023-02-08 at 11 16 17 AM
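For benchmarking, the merge the taskpane performs can be reproduced directly in pandas. This is a hedged sketch, not Mito's actual generated code: the merge key (column `0`), `how="left"`, and the suffixes are assumptions, and the frame sizes are scaled down so the script finishes quickly; raise `ROWS` toward the 10,000,000 in the repro above to observe the slowdown.

```python
import time

import numpy as np
import pandas as pd

# Smaller stand-in for the 10M-row x 200-column frames in the repro;
# the original report uses size=(10_000_000, 200).
ROWS, COLS = 100_000, 20
df1 = pd.DataFrame(np.random.randint(0, 10_000, size=(ROWS, COLS)))
df2 = pd.DataFrame(np.random.randint(0, 10_000, size=(ROWS, COLS)))

start = time.perf_counter()
# A lookup-style merge on column 0, roughly what the taskpane builds.
merged = pd.merge(df1, df2, on=0, how="left", suffixes=("_df1", "_df2"))
elapsed = time.perf_counter() - start
print(f"merge took {elapsed:.2f}s, result shape {merged.shape}")
```

Note that with random keys the left merge fans out (each key in `df1` matches many rows in `df2`), so the result can be far larger than either input, which is part of why memory and time blow up at the reported scale.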

Some ideas:

  1. Benchmark where the merge is taking so long and see if there is anything we can do to speed it up.
  2. When the datasets are so large, move live updating task panes to useSendEditOnClick so that they don't try to create the dataframe so many times
  3. Add a warning message in the merge taskpane that this is going to take a very long time
  4. When the user imports such a large dataset, show a message that a dataset this large might cause Mito to perform slowly, because the data is too large for pandas to handle quickly. While building the script, recommend working with just the first 100k rows; then, once the script is finalized, the user can remove the .head(100000) call and run the script on the full dataset.
  5. If the user imports a large dataset, show a popup asking "do you want to take a random sample of the data to use while building this analysis?"
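Ideas 4 and 5 could share one hook at import time. A minimal sketch, assuming a hypothetical `maybe_sample` helper and a 100k-row cap (the helper name and the threshold are not from Mito's codebase):

```python
import numpy as np
import pandas as pd

SAMPLE_SIZE = 100_000  # hypothetical cap from idea 4


def maybe_sample(df: pd.DataFrame, n: int = SAMPLE_SIZE) -> pd.DataFrame:
    """Return a random n-row sample of df for use while building the
    analysis; small frames pass through untouched. The finished script
    would skip this step and run on the full dataset."""
    if len(df) <= n:
        return df
    # random_state pinned so the working sample is stable across reruns
    return df.sample(n=n, random_state=0)


# A frame over the cap gets sampled down; a small one is returned as-is.
df = pd.DataFrame(np.random.randint(0, 10_000, size=(500_000, 5)))
small = maybe_sample(df)
print(small.shape)  # (100000, 5)
```

Using `df.sample` rather than `df.head` (idea 5 vs. idea 4) avoids biasing the working set toward whatever happens to be at the top of the file, at the cost of a non-reproducible preview unless the random state is pinned.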

aarondr77 Feb 08 '23 16:02