anndata icon indicating copy to clipboard operation
anndata copied to clipboard

Dask dataframe support

Open rahulbshrestha opened this issue 2 years ago • 3 comments

This PR introduces support for Dask dataframes in anndata.

TODOs:

  • [ ] Indexing
  • [ ] Writing / Reading
  • [ ] Concatenation
  • [x] assert_equal for tests
  • [ ] adata.to_memory() / adata.copy()

Related PR (Dask array support): https://github.com/scverse/anndata/pull/813 Contributors: @rahulbshrestha @syelman

rahulbshrestha avatar Sep 28 '22 20:09 rahulbshrestha

Codecov Report

Merging #823 (af84fdd) into master (919d34c) will decrease coverage by 0.15%. The diff coverage is 57.14%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #823      +/-   ##
==========================================
- Coverage   83.49%   83.33%   -0.16%     
==========================================
  Files          34       32       -2     
  Lines        4441     4333     -108     
==========================================
- Hits         3708     3611      -97     
+ Misses        733      722      -11     
Impacted Files Coverage Δ
anndata/compat/__init__.py 85.96% <28.57%> (-2.45%) :arrow_down:
anndata/tests/helpers.py 95.12% <85.71%> (-0.34%) :arrow_down:
anndata/_core/merge.py 93.71% <0.00%> (-0.28%) :arrow_down:
anndata/__init__.py
anndata/utils.py

codecov[bot] avatar Sep 28 '22 20:09 codecov[bot]

So, I've looked into the length thing a bit. It looks like there is still no way to include info on number of rows for a dask dataframe. This is tracked multiple places in the dask repo, but this issue looks most recent: https://github.com/dask/dask/issues/5633

It's possible we can do something clever to work around this, like persisting the index of the data frame and doing length checks there. We could also not do length checks on dask dataframes until we try to compute, and error then.

@ryan-williams, any chance you have thoughts here? Is it best to just wait on dask some more?

ivirshup avatar Oct 04 '22 15:10 ivirshup

Here is a gist with some code for reading a dataframe saved in AnnData to a dask DataFrame

ivirshup avatar Oct 06 '22 13:10 ivirshup

@ivirshup I've got a branch with your gist - I can start an issue for this but so far what I see is that:

  1. calling len(df) when df is a dask dataframe loads the whole dataframe into memory
  2. the index has no is_unique attribute Both seems manageable as PR's into dask (if they're actually issues) but just figured I'd document this somewhere.

ilan-gold avatar Oct 21 '22 00:10 ilan-gold