anndata
Dask dataframe support
This PR introduces support for Dask dataframes in anndata.
TODOs:
- [ ] Indexing
- [ ] Writing / Reading
- [ ] Concatenation
- [x] assert_equal for tests
- [ ] adata.to_memory() / adata.copy()
Related PR (Dask array support): https://github.com/scverse/anndata/pull/813
Contributors: @rahulbshrestha @syelman
Codecov Report
Merging #823 (af84fdd) into master (919d34c) will decrease coverage by 0.15%. The diff coverage is 57.14%.
Additional details and impacted files
@@ Coverage Diff @@
## master #823 +/- ##
==========================================
- Coverage 83.49% 83.33% -0.16%
==========================================
Files 34 32 -2
Lines 4441 4333 -108
==========================================
- Hits 3708 3611 -97
+ Misses 733 722 -11
Impacted Files | Coverage Δ | |
---|---|---|
anndata/compat/__init__.py | 85.96% <28.57%> (-2.45%) | :arrow_down: |
anndata/tests/helpers.py | 95.12% <85.71%> (-0.34%) | :arrow_down: |
anndata/_core/merge.py | 93.71% <0.00%> (-0.28%) | :arrow_down: |
anndata/__init__.py | | |
anndata/utils.py | | |
So, I've looked into the length thing a bit. It looks like there is still no way to include info on the number of rows for a dask dataframe. This is tracked in multiple places in the dask repo, but this issue looks most recent: https://github.com/dask/dask/issues/5633
It's possible we can do something clever to work around this, like persisting the index of the data frame and doing length checks there. We could also not do length checks on dask dataframes until we try to compute, and error then.
@ryan-williams, any chance you have thoughts here? Is it best to just wait on dask some more?
Here is a gist with some code for reading a dataframe saved in AnnData to a dask DataFrame
@ivirshup I've got a branch with your gist. I can start an issue for this, but so far what I see is that:
- calling `len(df)` when `df` is a dask dataframe loads the whole dataframe into memory
- the index has no `is_unique` attribute

Both seem manageable as PRs into dask (if they're actually issues), but I just figured I'd document this somewhere.