pycytominer Update aggregate to use Dask dataframe

I'm experiencing what I believe (but am not 100% sure) are memory leaks when using pandas in aggregate.py. I think it has to do with how I'm reusing the variable population_df several times. Plus, pandas has several documented memory leak issues, and we've noticed at least one when using the recipe (see #142).

I am also running into long loading times in a separate project. Given that aggregate.py is essentially just taking the mean or median of all feature columns, it should be relatively straightforward to move to dask dataframe. This will also be a helpful switch in anticipation of pycytominer handling parquet files.
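For reference, the operation described above can be sketched in plain pandas as follows. This is a minimal illustration, not the actual code in aggregate.py: the helper name `aggregate_sketch`, its parameters, and the toy column names are all assumptions made for the example.

```python
import pandas as pd


def aggregate_sketch(population_df, strata, features, operation="median"):
    """Collapse single-cell rows to one row per stratum (hypothetical helper)."""
    if operation not in ("mean", "median"):
        raise ValueError("operation must be 'mean' or 'median'")
    # Group by the metadata columns and keep only the feature columns.
    grouped = population_df.groupby(strata)[features]
    result = grouped.mean() if operation == "mean" else grouped.median()
    # Turn the group keys back into ordinary columns.
    return result.reset_index()


# Toy single-cell table: two wells, two cells each (made-up values).
cells = pd.DataFrame(
    {
        "Metadata_Plate": ["p1", "p1", "p1", "p1"],
        "Metadata_Well": ["A01", "A01", "A02", "A02"],
        "Cells_AreaShape_Area": [10.0, 20.0, 30.0, 50.0],
        "Nuclei_Intensity_Mean": [1.0, 3.0, 5.0, 9.0],
    }
)

profiles = aggregate_sketch(
    cells,
    strata=["Metadata_Plate", "Metadata_Well"],
    features=["Cells_AreaShape_Area", "Nuclei_Intensity_Mean"],
    operation="mean",
)
print(profiles)
```

Because the whole step is a single groupby reduction, it looked like a natural candidate for a dataframe library with a pandas-like API.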

May 31 '21 22:05 gwaybio

> it should be relatively straightforward to move to dask dataframe.

This turned out not to be the case. I tested this in a toy example with two dataframes written to temp files. One of the dataframes had missing values; both had the same columns with mixed dtypes. When trying to compute the mean (simulating aggregate.py), I received this error:

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    665     def _set_axis(self, axis: int, labels: Index) -> None:
    666         labels = ensure_index(labels)
--> 667         self._mgr.set_axis(axis, labels)
    668         self._clear_item_cache()
    669 

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    218 
    219         if new_len != old_len:
--> 220             raise ValueError(
    221                 f"Length mismatch: Expected axis has {old_len} elements, new "
    222                 f"values have {new_len} elements"

This error results from the two files having different metadata columns.

I then did some digging and determined that for aggregating single cell output from CellProfiler, dask is not a straightforward solution.

Dask requires that all CSV files have uniform structure.
- This is not guaranteed for CellProfiler output. Missing values in otherwise int columns, different column order across files, and metadata with different dtypes are all relatively common.
- See dask/dask#2752. It appears that d6tstack might be a useful intermediate step, but it might involve data redundancy after organization.
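To make "uniform structure" concrete, here is a rough pandas-only sketch of a harmonization pass that could run before handing files to dask. It is an assumption-laden illustration: the helper names, the `Metadata_` prefix convention, and the choice to cast metadata to strings and features to floats are mine, not something dask or d6tstack prescribes.

```python
import io

import pandas as pd


def union_of_columns(frames):
    """Return every column seen across frames, in first-seen order."""
    columns = []
    for frame in frames:
        for column in frame.columns:
            if column not in columns:
                columns.append(column)
    return columns


def harmonize(frame, all_columns, metadata_prefix="Metadata_"):
    """Give one frame the shared column set, order, and dtypes (hypothetical helper)."""
    # reindex adds missing columns (filled with NaN) and fixes the column order.
    frame = frame.reindex(columns=all_columns)
    for column in all_columns:
        if column.startswith(metadata_prefix):
            # Metadata becomes text so that mixed int/str columns agree.
            frame[column] = frame[column].astype("string")
        else:
            # Features become floats so missing values fit in every file.
            frame[column] = pd.to_numeric(frame[column], errors="coerce").astype("float64")
    return frame


# Two toy "files" with different column order and different metadata columns.
file_a = io.StringIO("Metadata_Well,Metadata_Site,Cells_Area\nA01,1,10\nA01,2,\n")
file_b = io.StringIO("Cells_Area,Metadata_Well,Metadata_Batch\n30,A02,b1\n")

frames = [pd.read_csv(file_a), pd.read_csv(file_b)]
all_columns = union_of_columns(frames)
combined = pd.concat(
    [harmonize(frame, all_columns) for frame in frames], ignore_index=True
)
print(combined.dtypes)
```

In a real pipeline each harmonized frame would be written back to disk rather than concatenated in memory, which is where the data redundancy mentioned above comes from.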

It might still be worth adding an implementation to read files from multiple CSV locations. This is worth investigating a bit further for the pooled cell painting project.

I also still need to figure out if aggregate is causing a memory leak, and to fix the cyclical variable assignment.

Jun 01 '21 15:06 gwaybio

this appears to only be a problem for me 😂

One difference I can see between my implementation and Niranj's or Beth's is that I am using gzipped CSV files with mtime=0. Perhaps reading these kinds of files specifically is causing the issue 🤔
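For anyone trying to reproduce this, here is a small standard-library sketch of what mtime=0 changes in a gzipped file. It only inspects the gzip header and makes no claim about the cause of the problem; the helper name `gzip_bytes` and the toy CSV payload are made up for the example.

```python
import gzip
import io


def gzip_bytes(payload, mtime=None):
    """Compress payload in memory with an explicit header timestamp (hypothetical helper)."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as handle:
        handle.write(payload)
    return buffer.getvalue()


csv_payload = b"Metadata_Well,Cells_Area\nA01,10\nA02,30\n"

fixed = gzip_bytes(csv_payload, mtime=0)
stamped = gzip_bytes(csv_payload, mtime=1622563200)

# Bytes 4-7 of a gzip header hold the MTIME field as a little-endian integer.
fixed_mtime = int.from_bytes(fixed[4:8], "little")
stamped_mtime = int.from_bytes(stamped[4:8], "little")
print(fixed_mtime, stamped_mtime)

# The decompressed content is the same regardless of the timestamp.
same_content = gzip.decompress(fixed) == gzip.decompress(stamped) == csv_payload
print(same_content)
```

Since mtime=0 only changes the header timestamp and the decompressed content stays the same, comparing a run on these files against a run on files written with a normal timestamp should be a quick way to test the hypothesis.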

Jun 04 '21 17:06 gwaybio