Dask array support
A PR to get started on #693. Going to update this as we move forward. Also, please correct me if I got something wrong.
We should start looking at having a `dask.Array` in `obsm`, `varm`, `layers`, etc. The first objective is to write suitable test functions, and then maybe discuss solutions that make them pass.
Current state
So one should understand the AnnData repository overall (especially the `_core` and `_io` folders) to see the expected behavior and write tests accordingly (i.e., find the methods that should raise a `NotImplementedError` in this state).
After (or while) writing the tests, we can discuss possible solutions.
Update: I just added `dask.Array` to the defaults of `tests.helpers.gen_adata` for `obsm` and `varm`, like this:

```python
obsm = dict(
    array=np.random.random((M, 50)),
    ...
    da=da.random.random((M, 50), chunks=(min(M, 50) // 5 + 1,) * 2),
)
```
Now it gives all the errors we want to see, I think, when we run `test_readwrite` and the like. Example output:
```
FAILED anndata/tests/test_readwrite.py::test_changed_obs_var_names[h5ad] - TypeError: No method has been defined for writing <class 'dask.array.core.Array'> elements to <class 'h5py._hl.group.Group'>
ERROR anndata/tests/test_readwrite.py::test_maintain_layers - TypeError: No method has been defined for writing <class 'dask.array.core.Array'> elements to <class 'h5py._hl.group.Group'>
```
I think I will do some reading on the h5 and zarr group concepts, then try to solve these `test_readwrite` errors.
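For reference, here is a minimal sketch of the dask primitives such write methods could build on. This is not the anndata registry code, just `da.store` into a pre-created HDF5 dataset and dask's own zarr writer; file names and element paths are illustrative:

```python
import dask.array as da
import h5py

arr = da.random.random((100, 50), chunks=(50, 50))

# HDF5: create the dataset up front, then stream the chunks into it
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("obsm/da", shape=arr.shape, dtype=arr.dtype)
    da.store(arr, dset)

# Zarr: dask has a direct writer
arr.to_zarr("example.zarr", component="obsm/da")
```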
ping: @ivirshup @rahulbshrestha
Codecov Report
Merging #813 (1654b25) into master (742bbf0) will increase coverage by 0.30%. The diff coverage is 96.73%.
:exclamation: Current head 1654b25 differs from pull request most recent head 58cd74e. Consider uploading reports for the commit 58cd74e to get more accurate results.
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #813      +/-   ##
==========================================
+ Coverage   83.42%   83.72%   +0.30%
==========================================
  Files          32       32
  Lines        4319     4405      +86
==========================================
+ Hits         3603     3688      +85
- Misses        716      717       +1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| anndata/_core/merge.py | 93.65% <89.28%> (-0.07%) | :arrow_down: |
| anndata/_core/anndata.py | 83.78% <100.00%> (+0.36%) | :arrow_up: |
| anndata/_core/file_backing.py | 91.13% <100.00%> (+1.58%) | :arrow_up: |
| anndata/_core/index.py | 92.91% <100.00%> (+0.41%) | :arrow_up: |
| anndata/_core/views.py | 89.34% <100.00%> (+0.45%) | :arrow_up: |
| anndata/_io/specs/methods.py | 84.04% <100.00%> (+0.25%) | :arrow_up: |
| anndata/compat/__init__.py | 88.41% <100.00%> (ø) | |
| anndata/tests/helpers.py | 95.79% <100.00%> (+0.33%) | :arrow_up: |
I think you should also start looking at having a `dask.Array` in `obsm`, `varm`, `layers`, etc. In many cases I think those will be more straightforward.
@ivirshup @rahulbshrestha, I made some changes that enable the Zarr support. I still don't know what `encoding-type` and `encoding-version` refer to, so there is probably a cleaner way to do this.
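For anyone else wondering: these are attributes anndata's IO spec attaches to every written element, so the reader knows which read method to dispatch to. A quick way to inspect them (the element path here is just an example from the `gen_adata` change above):

```python
import h5py

with h5py.File("adata.h5ad", "r") as f:
    elem = f["obsm"]["array"]
    # every element written by anndata carries these two attributes
    print(elem.attrs["encoding-type"])     # e.g. "array"
    print(elem.attrs["encoding-version"])  # e.g. "0.2.0"
```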
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Added HDF5 and Zarr support for read and write operations as `array`. The optional-dependency issue is also solved. I will now be looking into the index operations. I also added dask arrays to `test_views.py`, and it fails on 3 tests with a `NotImplementedError`.
@ivirshup @rahulbshrestha, I added `DaskArray` to the view dispatch and added `matrix_type`. But the thing is, this doesn't cover all the tests.
Should I create more tests, or just refactor the old test functions so that they take `matrix_type` as an argument where they can? For example this function:
```python
def test_views():
    X = np.array(X_list)
    adata = ad.AnnData(X, obs=obs_dict, var=var_dict, uns=uns_dict, dtype="int32")

    assert adata[:, 0].is_view
    assert adata[:, 0].X.tolist() == np.reshape([1, 4, 7], (3, 1)).tolist()

    adata[:2, 0].X = [0, 0]
    assert adata[:, 0].X.tolist() == np.reshape([0, 0, 7], (3, 1)).tolist()

    adata_subset = adata[:2, [0, 1]]
    assert adata_subset.is_view
    # now transition to actual object
    adata_subset.obs["foo"] = range(2)
    assert not adata_subset.is_view
    assert adata_subset.obs["foo"].tolist() == list(range(2))
```
These functions often use the `gen_adata` helper; should I add `dask.array` support to it somehow?
> Should I create more tests, or just refactor the old test functions so that they take `matrix_type` as an argument where they can? For example this function:

This would be nice, but I think I'd wait on this until you're more familiar with the test suite.
That specific test is fairly old, and I would like to keep it as is. If there are multiple tests with the same logic but different datatypes, it would be great to combine them. Sometimes the logic here can be tricky.
> These functions often use the `gen_adata` helper; should I add `dask.array` support to it somehow?

I think this is broadly a good idea. However, `gen_adata` is used all over the place, and many tests would start failing. There are also places like the IO tests where the logic is different for dask arrays.
If you want to do this, I would probably make objects returned by `gen_adata` not contain dask arrays by default, so tests would have to opt in to getting a dask array.
> I would probably make objects returned by `gen_adata` not contain dask arrays by default, so tests would have to opt in to getting a dask array.

I was planning to do it that way. Then I will move on with this.
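Something like this, as a rough sketch of the opt-in idea; the `include_dask` keyword is a placeholder, not settled API, and the real helper builds a full AnnData rather than just the `obsm` dict:

```python
import numpy as np
import dask.array as da

def gen_adata(shape, include_dask=False):
    M, N = shape
    obsm = dict(array=np.random.random((M, 50)))
    if include_dask:
        # only tests that ask for it get a dask entry
        obsm["da"] = da.random.random((M, 50), chunks=(min(M, 50) // 5 + 1,) * 2)
    ...  # build and return the AnnData as before
```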
Added new cases; I'm getting errors like these:

```
FAILED anndata/tests/test_views.py::test_view_delattr[array_subset-X] - NotImplementedError: Don't yet support nd fancy indexing
...
FAILED anndata/tests/test_views.py::test_view_delattr[array_subset-varp] - NotImplementedError: Don't yet support nd fancy indexing
```
Update: I was able to reproduce it; the tests themselves seem ok.

```python
import dask.array as da

arr = da.ones((100, 100))
arr[[1, 2], [1, 4]]  # raises NotImplementedError: Don't yet support nd fancy indexing
```

This kind of indexing works for other array types.
@ivirshup, how should we handle this error? Should we treat dask arrays differently in the `views.py` file (e.g., create a `DaskArrayView` class in it and register it separately)?
@syelman, for `arr[[1, 2], [1, 4]]`, we handle this through the `_subset` single-dispatch method. The dask array could probably use pretty similar code to numpy's.
We actually do this because if `arr` were a numpy array, `arr[[1, 2], [1, 4]]` would return a 1d array with two values, i.e. you'd be grabbing the values at (1, 1) and (2, 4). But I don't think anyone likes that numpy does this, so most other types don't.
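A quick illustration of the difference:

```python
import numpy as np

arr = np.arange(25).reshape(5, 5)
arr[[1, 2], [1, 4]]          # array([ 6, 14]): pointwise, the values at (1, 1) and (2, 4)
arr[np.ix_([1, 2], [1, 4])]  # 2x2 cross product of rows [1, 2] and columns [1, 4]
```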
`_subset` is defined in `anndata/_core/index.py`.
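So a dask registration could look roughly like this. This is only a sketch: it assumes the `DaskArray` alias from `anndata/compat` added in this PR, and the condition for switching to cross-product indexing may need tuning.

```python
import numpy as np
from anndata._core.index import _subset
from anndata.compat import DaskArray

@_subset.register(DaskArray)
def _subset_dask(a, subset_idx):
    # For multi-axis integer indexing, take the cross product (as the numpy
    # path does) instead of dask's pointwise fancy-indexing semantics.
    if len(subset_idx) > 1 and all(isinstance(x, np.ndarray) for x in subset_idx):
        subset_idx = np.ix_(*subset_idx)
    return a[subset_idx]
```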
While working on concatenation support, I ran into some errors. My guess is that it has something to do with indexing in dask. I don't know whether it's a bug or not, but I found these inconsistencies when indexing:
```python
import numpy as np
import scipy.sparse as sp
import dask.array as da

idx = ([12, 12], [1, 1])
a = da.from_array(np.random.rand(50, 50))
# dask.array<array, shape=(50, 50), dtype=float64, chunksize=(50, 50), chunktype=numpy.ndarray>
b = da.from_array(sp.random(50, 50, format="csr"))
# dask.array<array, shape=(50, 50), dtype=float64, chunksize=(50, 50), chunktype=scipy.csr_matrix>

a.vindex[np.ix_(*idx)].compute()  # passes
b.vindex[np.ix_(*idx)].compute()  # fails

# the following two also give different results
a.vindex[idx].compute()
b.vindex[idx].compute()
```
I'd like to understand whether this is expected behavior before tackling the concatenation aspect, i.e., to check whether there is extra code I need to write for the concatenate method, because according to the documentation the behavior of concatenate seems to depend on the `chunktype` of the dask array.
I was going to implement a workaround inside the `_subset_dask` function that indexes depending on the `chunktype` of the dask array, but I couldn't find an easy way to access the chunk type.
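One way I know of to get at the chunk type is the array's `_meta` attribute; it's semi-private, so treat this as a best-guess workaround rather than a supported API:

```python
import numpy as np
import scipy.sparse as sp
import dask.array as da

a = da.from_array(np.random.rand(50, 50))
b = da.from_array(sp.random(50, 50, format="csr"))

# `_meta` is an empty array of the same type as the chunks
type(a._meta)         # numpy.ndarray
type(b._meta)         # scipy.sparse.csr_matrix
sp.issparse(b._meta)  # True
```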
Hi @ivirshup, I will focus on these two, while @rahulbshrestha will do the other two.
* [ ] `adata.to_memory` should load dask arrays into memory
* [ ] `adata.copy()` should not load dask arrays into memory
  * This is currently true, but should be tested
But I have some questions. I added the dask option to some of the tests in `tests/test_hdf5_backing.py` to exercise `to_memory()` and `copy()`. These work, but I think that's only because we currently read everything back as numpy arrays. Beyond that, I still haven't touched the `to_memory` dispatch function, because there doesn't seem to be a case where a dask array wouldn't already be in memory. Unless we should consider that the user can load a remote dask array when initializing the AnnData; then we would need to change the `to_memory` logic, I think.
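If we did want to handle the remote/lazy case, I imagine the dispatch addition would be small. A sketch, assuming the single-dispatch `to_memory` helper lives in `anndata/_core/file_backing.py` and using the compat `DaskArray` alias:

```python
from anndata._core.file_backing import to_memory
from anndata.compat import DaskArray

@to_memory.register(DaskArray)
def _(x):
    # materialize the lazy array, whether its chunks live on disk,
    # on a remote store, or already in RAM
    return x.compute()
```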
Hi @ivirshup,
While waiting for clarifications on my questions above, I added dask arrays to the `test_concatenate.py` functions. There are errors coming from this function in `anndata/_core/merge.py`:
https://github.com/scverse/anndata/blob/919d34cfe08faf54daa9f9caa76090739fd66262/anndata/_core/merge.py#L367-L381
This is probably because dask doesn't support the `pd.api.extensions` machinery. I am wondering if I should go with what was done for AwkwardArray or do some reading to make this call work in `_apply_to_array`.
Update: Apparently we have to compute the array before calling `take` on it. When I add the following to the method above, the error goes away:

```python
if hasattr(el, "compute"):
    el = el.compute()
```

I have no idea if this is ok or not.
Apart from this, I fixed another compatibility issue in concatenation. I traced the merge functions and found a bug: we didn't have `equal` implemented for `DaskArray`. So I implemented it like this:
```python
@equal.register(DaskArray)
def equal_dask_array(a, b) -> bool:
    import dask.array as da

    if isinstance(b, DaskArray):
        return da.equal(a, b, where=~(da.isnan(a) == da.isnan(b))).all().compute()
    return equal(b, asarray(a))
```
The problem with dask's `equal` was that it returned False when comparing `np.nan == np.nan`.
I've now added plenty of test functions for concatenation. Two of them don't pass right now, but that is probably a bug coming from `array_equal`.
> The problem with dask's `equal` was that it returned False when comparing `np.nan == np.nan`.

I think we can leave `np.nan` as not equal to `np.nan`, since that is part of its definition.
Relevant discussion: https://stackoverflow.com/questions/10034149/why-is-nan-not-equal-to-nan
> I think we can leave `np.nan` as not equal to `np.nan`, since that is part of its definition.

But in anndata we don't want this behavior. Please see https://github.com/scverse/anndata/blob/919d34cfe08faf54daa9f9caa76090739fd66262/anndata/_core/merge.py#L93-L102
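In other words, the semantics `equal` is expected to have are roughly this (a sketch for float arrays, not the actual `merge.py` implementation):

```python
import numpy as np

def nan_aware_equal(a, b) -> bool:
    # NaNs in matching positions compare as equal;
    # everything else must match exactly
    a, b = np.asarray(a), np.asarray(b)
    return bool(((a == b) | (np.isnan(a) & np.isnan(b))).all())
```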
@syelman, I have an example for outer indexing to share with you:
Here we are adding columns 5, 6, and 7 to the array with our indexer:
```python
import numpy as np, pandas as pd, dask.array as da
from anndata._core import merge

np_array = np.broadcast_to(np.arange(5), (5, 5))
dask_array = da.array(np_array)
r = merge.Reindexer(pd.Index(range(5)), pd.Index(range(8)))

r.apply(np_array, axis=1)
```

```
array([[ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan]])
```

```python
r.apply(dask_array, axis=1).compute()
```

```
array([[0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4]])
```
It looks like it's being padded with the wrong values, as the failing test would suggest.
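One possible fix, sketched with plain dask operations. This assumes an integer indexer in the pandas `get_indexer` convention, where `-1` marks labels missing from the old index (for the example above that would be `np.array([0, 1, 2, 3, 4, -1, -1, -1])`); it is not the actual `Reindexer` code.

```python
import numpy as np
import dask.array as da

def reindex_dask(el, indexer, axis=1, fill_value=np.nan):
    indexer = np.asarray(indexer)
    # take with clipped positions first, then overwrite the -1 slots
    taken = da.take(el, np.clip(indexer, 0, el.shape[axis] - 1), axis=axis)
    missing = indexer == -1
    if missing.any():
        # broadcast the 1d mask along the other axis (2d case)
        mask = np.expand_dims(missing, axis=1 - axis)
        taken = da.where(mask, fill_value, taken)
    return taken
```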
Xarray achieves the desired padding behavior, though I'm not sure how yet:
```python
import numpy as np, pandas as pd, xarray as xr, dask.array as da

dask_dataarray = xr.DataArray(
    da.array(np.arange(6).reshape(2, 3)), [("x", ["a", "b"]), ("y", [10, 20, 30])]
)
xr.concat([dask_dataarray[:, :1], dask_dataarray], dim="x").compute()
```

```
array([[ 0., nan, nan],
       [ 3., nan, nan],
       [ 0.,  1.,  2.],
       [ 3.,  4.,  5.]])
```
But it has a fairly complicated dask graph to show for it:

```python
xr.concat([dask_dataarray[:, :1], dask_dataarray], dim="x").variable.data.visualize()
```

(image: task graph for the outer index)
Hi @ivirshup,
Apart from the warning/error-on-concatenation question, which we discussed but didn't resolve, I added all the necessary features and their tests. Here is a summary of what was done last week:
Last week
- Added tests for concatenation with different types, to test the `equal` dispatch with different arguments.
- Added a dask-array option to the `gen_adata` helper functions (most of them).
About `to_memory`:
- `to_memory` now doesn't give an error when not backed.
- When `to_memory` is called, each object gets copied; lazy objects are instantiated by a `compute()` call.
- Added tests for `X`, `raw`, `uns`, `obsm`, `varm`, and `layers` to check that they are the expected type (dask array or plain array) and that their contents match (a rough sketch of the checked behavior follows below).
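For reference, the behavior these tests check is roughly this (a sketch assuming a dask-backed `X`, which this PR makes possible):

```python
import numpy as np
import dask.array as da
import anndata as ad

adata = ad.AnnData(X=da.random.random((10, 5)))
mem = adata.to_memory()

assert isinstance(adata.X, da.Array)  # the original stays lazy
assert isinstance(mem.X, np.ndarray)  # the copy is materialized
assert np.array_equal(mem.X, adata.X.compute())
```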
In general
So I need a general review. Except for the warning-on-concatenation behavior, these are completed and tested:
- `copy` method
- `to_memory` method
- read/write
- concatenation
- indexing