Dask array support
A PR to get started on #693. Going to update this as we move forward. Also, please correct me if I got something wrong.
We should start looking at having a `dask.Array` in `obsm`, `varm`, `layers`, etc. The first objective is to write suitable test functions, and then maybe discuss solutions that make them pass.
Current state
So one should understand the AnnData repository overall (especially the `_core` and `_io` folders) to see the expected behavior and write tests accordingly (i.e., find the methods that should raise a `NotImplementedError` in this state).
After (or while) writing the tests, we can discuss possible solutions.
Update: I just added `dask.Array` to the defaults of `tests.helpers.gen_adata` for `obsm` and `varm`, like this:

```python
obsm = dict(
    array=np.random.random((M, 50)),
    ...
    da=da.random.random((M, 50), chunks=(min(M, 50) // 5 + 1,) * 2),
)
```
Now it gives all the errors we want to see, I think, when we run `test_readwrite` and the like. Example output:
```
FAILED anndata/tests/test_readwrite.py::test_changed_obs_var_names[h5ad] - TypeError: No method has been defined for writing <class 'dask.array.core.Array'> elements to <class 'h5py._hl.group.Group'>
ERROR anndata/tests/test_readwrite.py::test_maintain_layers - TypeError: No method has been defined for writing <class 'dask.array.core.Array'> elements to <class 'h5py._hl.group.Group'>
```
I think I will do some reading on the h5 and zarr group concepts, then try to solve these `test_readwrite` errors.
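For reference, here is a minimal sketch of the dask primitives such write methods could build on. This is not the anndata registry code, just `da.store` into a pre-created HDF5 dataset and dask's own zarr writer; file names and element paths are illustrative:

```python
import dask.array as da
import h5py

arr = da.random.random((100, 50), chunks=(50, 50))

# HDF5: create the dataset up front, then stream the chunks into it
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("obsm/da", shape=arr.shape, dtype=arr.dtype)
    da.store(arr, dset)

# Zarr: dask has a direct writer
arr.to_zarr("example.zarr", component="obsm/da")
```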
ping: @ivirshup @rahulbshrestha
Codecov Report
Merging #813 (1654b25) into master (742bbf0) will increase coverage by 0.30%. The diff coverage is 96.73%.
:exclamation: Current head 1654b25 differs from pull request most recent head 58cd74e. Consider uploading reports for the commit 58cd74e to get more accurate results.
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #813      +/-   ##
==========================================
+ Coverage   83.42%   83.72%   +0.30%
==========================================
  Files          32       32
  Lines        4319     4405      +86
==========================================
+ Hits         3603     3688      +85
- Misses        716      717       +1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| anndata/_core/merge.py | 93.65% <89.28%> (-0.07%) | :arrow_down: |
| anndata/_core/anndata.py | 83.78% <100.00%> (+0.36%) | :arrow_up: |
| anndata/_core/file_backing.py | 91.13% <100.00%> (+1.58%) | :arrow_up: |
| anndata/_core/index.py | 92.91% <100.00%> (+0.41%) | :arrow_up: |
| anndata/_core/views.py | 89.34% <100.00%> (+0.45%) | :arrow_up: |
| anndata/_io/specs/methods.py | 84.04% <100.00%> (+0.25%) | :arrow_up: |
| anndata/compat/__init__.py | 88.41% <100.00%> (ø) | |
| anndata/tests/helpers.py | 95.79% <100.00%> (+0.33%) | :arrow_up: |
I think you should also start looking at having a `dask.Array` in `obsm`, `varm`, `layers`, etc. In many cases I think those will be more straightforward.
@ivirshup @rahulbshrestha, I made some changes that enable the Zarr support. I still don't know what `encoding-type` and `encoding-version` refer to, so there is probably a cleaner way to do this.
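For anyone else wondering: these are attributes anndata's IO spec attaches to every written element, so the reader knows which read method to dispatch to. A quick way to inspect them (the element path here is just an example from the `gen_adata` change above):

```python
import h5py

with h5py.File("adata.h5ad", "r") as f:
    elem = f["obsm"]["array"]
    # every element written by anndata carries these two attributes
    print(elem.attrs["encoding-type"])     # e.g. "array"
    print(elem.attrs["encoding-version"])  # e.g. "0.2.0"
```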
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Added HDF5 and Zarr support for read and write operations as `array`. The optional-dependency issue is also solved. I will now be looking into the index operations. I also added dask arrays to `test_views.py`, and it fails on 3 tests with a `NotImplementedError`.
@ivirshup @rahulbshrestha, I added `DaskArray` to the view dispatch and added `matrix_type`. But the thing is, this doesn't cover all the tests.
Should I create more tests, or just refactor the old test functions so that they take `matrix_type` as an argument where they can? For example this function:
```python
def test_views():
    X = np.array(X_list)
    adata = ad.AnnData(X, obs=obs_dict, var=var_dict, uns=uns_dict, dtype="int32")

    assert adata[:, 0].is_view
    assert adata[:, 0].X.tolist() == np.reshape([1, 4, 7], (3, 1)).tolist()

    adata[:2, 0].X = [0, 0]
    assert adata[:, 0].X.tolist() == np.reshape([0, 0, 7], (3, 1)).tolist()

    adata_subset = adata[:2, [0, 1]]
    assert adata_subset.is_view
    # now transition to actual object
    adata_subset.obs["foo"] = range(2)
    assert not adata_subset.is_view
    assert adata_subset.obs["foo"].tolist() == list(range(2))
```
These functions often use the `gen_adata` helper; should I add `dask.array` support to it somehow?
> Should I create more tests, or just refactor the old test functions so that they take `matrix_type` as an argument where they can? For example this function:

This would be nice, but I think I'd wait on this until you're more familiar with the test suite.
That specific test is fairly old, and I would like to keep it as is. If there are multiple tests with the same logic but different datatypes, it would be great to combine them. Sometimes the logic here can be tricky.
> These functions often use the `gen_adata` helper; should I add `dask.array` support to it somehow?

I think this is broadly a good idea. However, `gen_adata` is used all over the place, and many tests would start failing. There are also places like the IO tests where the logic is different for dask arrays.
If you want to do this, I would probably make objects returned by `gen_adata` not contain dask arrays by default, so tests would have to opt in to getting a dask array.
> I would probably make objects returned by `gen_adata` not contain dask arrays by default, so tests would have to opt in to getting a dask array.

I was planning to do it that way. Then I will move on with this.
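Something like this, as a rough sketch of the opt-in idea; the `include_dask` keyword is a placeholder, not settled API, and the real helper builds a full AnnData rather than just the `obsm` dict:

```python
import numpy as np
import dask.array as da

def gen_adata(shape, include_dask=False):
    M, N = shape
    obsm = dict(array=np.random.random((M, 50)))
    if include_dask:
        # only tests that ask for it get a dask entry
        obsm["da"] = da.random.random((M, 50), chunks=(min(M, 50) // 5 + 1,) * 2)
    ...  # build and return the AnnData as before
```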
Added new cases; I'm getting errors like these:

```
FAILED anndata/tests/test_views.py::test_view_delattr[array_subset-X] - NotImplementedError: Don't yet support nd fancy indexing
...
FAILED anndata/tests/test_views.py::test_view_delattr[array_subset-varp] - NotImplementedError: Don't yet support nd fancy indexing
```
Update: I was able to reproduce it; the tests themselves seem ok.

```python
import dask.array as da

arr = da.ones((100, 100))
arr[[1, 2], [1, 4]]  # raises NotImplementedError: Don't yet support nd fancy indexing
```

This kind of indexing works for other array types.
@ivirshup, how should we handle this error? Should we treat dask arrays differently in the `views.py` file (e.g., create a `DaskArrayView` class in it and register it separately)?
@syelman, for `arr[[1, 2], [1, 4]]`, we handle this through the `_subset` single-dispatch method. The dask array could probably use pretty similar code to numpy's.
We actually do this because if `arr` were a numpy array, `arr[[1, 2], [1, 4]]` would return a 1d array with two values, i.e. you'd be grabbing the values at (1, 1) and (2, 4). But I don't think anyone likes that numpy does this, so most other types don't.
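A quick illustration of the difference:

```python
import numpy as np

arr = np.arange(25).reshape(5, 5)
arr[[1, 2], [1, 4]]          # array([ 6, 14]): pointwise, the values at (1, 1) and (2, 4)
arr[np.ix_([1, 2], [1, 4])]  # 2x2 cross product of rows [1, 2] and columns [1, 4]
```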
`_subset` is defined in `anndata/_core/index.py`.
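So a dask registration could look roughly like this. This is only a sketch: it assumes the `DaskArray` alias from `anndata/compat` added in this PR, and the condition for switching to cross-product indexing may need tuning.

```python
import numpy as np
from anndata._core.index import _subset
from anndata.compat import DaskArray

@_subset.register(DaskArray)
def _subset_dask(a, subset_idx):
    # For multi-axis integer indexing, take the cross product (as the numpy
    # path does) instead of dask's pointwise fancy-indexing semantics.
    if len(subset_idx) > 1 and all(isinstance(x, np.ndarray) for x in subset_idx):
        subset_idx = np.ix_(*subset_idx)
    return a[subset_idx]
```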
While working on concatenation support, I ran into some errors. My guess is that it has something to do with indexing in dask. I don't know whether it's a bug or not, but I found these inconsistencies when indexing:
```python
import numpy as np
import scipy.sparse as sp
import dask.array as da

idx = ([12, 12], [1, 1])
a = da.from_array(np.random.rand(50, 50))
# dask.array<array, shape=(50, 50), dtype=float64, chunksize=(50, 50), chunktype=numpy.ndarray>
b = da.from_array(sp.random(50, 50, format="csr"))
# dask.array<array, shape=(50, 50), dtype=float64, chunksize=(50, 50), chunktype=scipy.csr_matrix>

a.vindex[np.ix_(*idx)].compute()  # passes
b.vindex[np.ix_(*idx)].compute()  # fails

# the following two also give different results
a.vindex[idx].compute()
b.vindex[idx].compute()
```
I'd like to understand whether this is expected behavior before tackling the concatenation aspect, i.e., to check whether there is extra code I need to write for the concatenate method, because according to the documentation the behavior of concatenate seems to depend on the `chunktype` of the dask array.
I was going to implement a workaround inside the `_subset_dask` function that indexes depending on the `chunktype` of the dask array, but I couldn't find an easy way to access the chunk type.
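One way I know of to get at the chunk type is the array's `_meta` attribute; it's semi-private, so treat this as a best-guess workaround rather than a supported API:

```python
import numpy as np
import scipy.sparse as sp
import dask.array as da

a = da.from_array(np.random.rand(50, 50))
b = da.from_array(sp.random(50, 50, format="csr"))

# `_meta` is an empty array of the same type as the chunks
type(a._meta)         # numpy.ndarray
type(b._meta)         # scipy.sparse.csr_matrix
sp.issparse(b._meta)  # True
```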
Hi @ivirshup, I will focus on these two, while @rahulbshrestha will do the other two.
* [ ] `adata.to_memory` should load dask arrays into memory
* [ ] `adata.copy()` should not load dask arrays into memory
  * This is currently true, but should be tested
But I have some questions. I added the dask option to some of the tests in `tests/test_hdf5_backing.py` to exercise `to_memory()` and `copy()`. These work, but I think that's only because we currently read everything back as numpy arrays. Beyond that, I still haven't touched the `to_memory` dispatch function, because there doesn't seem to be a case where a dask array wouldn't already be in memory. Unless we should consider that the user can load a remote dask array when initializing the AnnData; then we would need to change the `to_memory` logic, I think.
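If we did want to handle the remote/lazy case, I imagine the dispatch addition would be small. A sketch, assuming the single-dispatch `to_memory` helper lives in `anndata/_core/file_backing.py` and using the compat `DaskArray` alias:

```python
from anndata._core.file_backing import to_memory
from anndata.compat import DaskArray

@to_memory.register(DaskArray)
def _(x):
    # materialize the lazy array, whether its chunks live on disk,
    # on a remote store, or already in RAM
    return x.compute()
```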
Hi @ivirshup,
While waiting for clarifications on my questions above, I added dask arrays to the `test_concatenate.py` functions. There are errors coming from this function in `anndata/_core/merge.py`:
https://github.com/scverse/anndata/blob/919d34cfe08faf54daa9f9caa76090739fd66262/anndata/_core/merge.py#L367-L381
This is probably because dask doesn't support the `pd.api.extensions` machinery. I am wondering if I should go with what was done for AwkwardArray or do some reading to make this call work in `_apply_to_array`.
Update: Apparently we have to compute the array before calling `take` on it. When I add the following to the method above, the error goes away:

```python
if hasattr(el, "compute"):
    el = el.compute()
```

I have no idea if this is ok or not.
Apart from this, I fixed another compatibility issue in concatenation. I traced the merge functions and found a bug: we didn't have `equal` implemented for `DaskArray`. So I implemented it like this:
```python
@equal.register(DaskArray)
def equal_dask_array(a, b) -> bool:
    import dask.array as da

    if isinstance(b, DaskArray):
        return da.equal(a, b, where=~(da.isnan(a) == da.isnan(b))).all().compute()
    return equal(b, asarray(a))
```
The problem with dask's `equal` was that it returned False when comparing `np.nan == np.nan`.
I've now added plenty of test functions for concatenation. Two of them don't pass right now, but that is probably a bug coming from `array_equal`.
> The problem with dask's `equal` was that it returned False when comparing `np.nan == np.nan`.

I think we can leave `np.nan` as not equal to `np.nan`, since that is part of its definition.
Relevant discussion: https://stackoverflow.com/questions/10034149/why-is-nan-not-equal-to-nan
> I think we can leave `np.nan` as not equal to `np.nan`, since that is part of its definition.

But in anndata we don't want this behavior. Please see https://github.com/scverse/anndata/blob/919d34cfe08faf54daa9f9caa76090739fd66262/anndata/_core/merge.py#L93-L102
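In other words, the semantics `equal` is expected to have are roughly this (a sketch for float arrays, not the actual `merge.py` implementation):

```python
import numpy as np

def nan_aware_equal(a, b) -> bool:
    # NaNs in matching positions compare as equal;
    # everything else must match exactly
    a, b = np.asarray(a), np.asarray(b)
    return bool(((a == b) | (np.isnan(a) & np.isnan(b))).all())
```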
@syelman, I have an example for outer indexing to share with you:
Here we are adding columns 5, 6, and 7 to the array with our indexer:
```python
import numpy as np, pandas as pd, dask.array as da
from anndata._core import merge

np_array = np.broadcast_to(np.arange(5), (5, 5))
dask_array = da.array(np_array)
r = merge.Reindexer(pd.Index(range(5)), pd.Index(range(8)))

r.apply(np_array, axis=1)
```

```
array([[ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan],
       [ 0.,  1.,  2.,  3.,  4., nan, nan, nan]])
```

```python
r.apply(dask_array, axis=1).compute()
```

```
array([[0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4],
       [0, 1, 2, 3, 4, 4, 4, 4]])
```
It looks like it's being padded with the wrong values, as the failing test would suggest.
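One possible fix, sketched with plain dask operations. This assumes an integer indexer in the pandas `get_indexer` convention, where `-1` marks labels missing from the old index (for the example above that would be `np.array([0, 1, 2, 3, 4, -1, -1, -1])`); it is not the actual `Reindexer` code.

```python
import numpy as np
import dask.array as da

def reindex_dask(el, indexer, axis=1, fill_value=np.nan):
    indexer = np.asarray(indexer)
    # take with clipped positions first, then overwrite the -1 slots
    taken = da.take(el, np.clip(indexer, 0, el.shape[axis] - 1), axis=axis)
    missing = indexer == -1
    if missing.any():
        # broadcast the 1d mask along the other axis (2d case)
        mask = np.expand_dims(missing, axis=1 - axis)
        taken = da.where(mask, fill_value, taken)
    return taken
```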
Xarray achieves the desired padding behavior, though I'm not sure how yet:
```python
import numpy as np, pandas as pd, xarray as xr, dask.array as da

dask_dataarray = xr.DataArray(
    da.array(np.arange(6).reshape(2, 3)), [("x", ["a", "b"]), ("y", [10, 20, 30])]
)
xr.concat([dask_dataarray[:, :1], dask_dataarray], dim="x").compute()
```

```
array([[ 0., nan, nan],
       [ 3., nan, nan],
       [ 0.,  1.,  2.],
       [ 3.,  4.,  5.]])
```
But it has a fairly complicated dask graph to show for it:

```python
xr.concat([dask_dataarray[:, :1], dask_dataarray], dim="x").variable.data.visualize()
```

(image: task graph for the outer index)
Hi @ivirshup,
Apart from the warning/error-on-concatenation question, which we discussed but didn't resolve, I added all the necessary features and their tests. Here is a summary of what was done last week:
Last week
- Added tests for concatenation with different types, to test the `equal` dispatch with different arguments.
- Added a dask-array option to the `gen_adata` helper functions (most of them).
About `to_memory`:
- `to_memory` now doesn't give an error when not backed.
- When `to_memory` is called, each object gets copied; lazy objects are instantiated by a `compute()` call.
- Added tests for `X`, `raw`, `uns`, `obsm`, `varm`, and `layers` to check that they are the expected type (dask array or plain array) and that their contents match (a rough sketch of the checked behavior follows below).
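For reference, the behavior these tests check is roughly this (a sketch assuming a dask-backed `X`, which this PR makes possible):

```python
import numpy as np
import dask.array as da
import anndata as ad

adata = ad.AnnData(X=da.random.random((10, 5)))
mem = adata.to_memory()

assert isinstance(adata.X, da.Array)  # the original stays lazy
assert isinstance(mem.X, np.ndarray)  # the copy is materialized
assert np.array_equal(mem.X, adata.X.compute())
```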
In general
So I need a general review. Except for the warning-on-concatenation behavior, these are completed and tested:
- `copy` method
- `to_memory` method
- read/write
- concatenation
- indexing