[Enhancement] Improve performance/scaling by enabling an optional install of modin.

Open kenibrewer opened this issue 1 year ago • 2 comments

Certain pycytominer functions run into scaling issues on extremely large datasets. This could be addressed by supporting an optional install of modin, a drop-in replacement for pandas that uses a Ray, Dask, or unidist backend engine to automatically shard data and parallelize operations across many CPUs or instances.

When I first started using pycytominer, I didn't realize that feature selection was supposed to run on well-aggregated profiles, and I attempted to run it on a profiles dataframe with millions of rows. I let the feature_selection step run for 4 hours before killing it. To solve the "problem", I forked pycytominer and did a simple "import modin.pandas as pd" replacement in all the files. When I re-ran feature_selection, it completed in under 60 seconds once it scaled to all 16 CPUs of the instance I was using. I later realized my mistake in approach and left the branch by the wayside.
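For illustration, here is a minimal sketch of what an optional import could look like, instead of hard-coding the replacement in every file. The helper name import_first_available is hypothetical and not part of pycytominer; the only assumption about modin is that it exposes its pandas-compatible API as modin.pandas.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported.

    Hypothetical helper (not part of pycytominer): the names are tried
    in order, so the preferred backend should be listed first.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Backend not installed; try the next candidate.
            continue
    raise ImportError(f"None of these modules could be imported: {candidates}")


def get_dataframe_backend():
    """Prefer modin when it is installed, otherwise fall back to pandas."""
    return import_first_available(["modin.pandas", "pandas"])
```

A module could then write pd = get_dataframe_backend() in place of "import pandas as pd", so users who never install modin see no change in behavior.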

There were some problems with my quickly hacked-together solution (especially with annotate), and there were problems with certain functions that manually detect datatypes (e.g. if type(obj) == pd.DataFrame checks). But it may be worth systematically tackling those issues to get this functionality to work.
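One possible way to handle the datatype checks, sketched under the assumption that exact-type comparisons fail because a modin DataFrame is not the same class as a pandas DataFrame. The helper names dataframe_types and is_dataframe are hypothetical, not existing pycytominer functions.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=1)
def dataframe_types():
    """Collect the DataFrame classes of whichever backends are installed.

    Hypothetical helper: returns a tuple, possibly empty, that can be
    passed directly to isinstance().
    """
    found = []
    for name in ("pandas", "modin.pandas"):
        try:
            found.append(importlib.import_module(name).DataFrame)
        except ImportError:
            # This backend is not installed; skip it.
            continue
    return tuple(found)


def is_dataframe(obj):
    """Return True if obj is a DataFrame from any installed backend."""
    return isinstance(obj, dataframe_types())
```

Checks of the form type(obj) == pd.DataFrame could then become is_dataframe(obj), which keeps working regardless of which backend produced the object.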

kenibrewer, May 11 '23 16:05