scanpy read_10x_h5() `genome` argument appears recently broken for 10x v2 format

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of scanpy.
[x] (optional) I have confirmed this bug exists on the master branch of scanpy.

To reproduce this issue:

download the public 10x dataset here (https://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5)
run the following

import scanpy as sc

adata_human = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='hg19')
adata_mouse = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='mm10')

assert (adata_human.X != adata_mouse.X).sum() > 0, 'these count matrices are equal'

which produces the assertion error. We see that the loaded data is the same regardless of 'genome' argument. A look at the file itself shows this is not the case (notice the number of gene names, which are different for hg19 and mm10):

Versions

Also I think I can say confidently that this was working fine as of scanpy 1.8.1

anndata 0.8.0 scanpy 1.9.1

PIL 8.1.0 appnope 0.1.2 backcall 0.2.0 cached_property 1.5.2 cellbender NA cffi 1.14.5 colorcet 3.0.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 5.0.9 fontTools 4.33.3 h5py 3.2.0 igraph 0.9.10 ipykernel 5.5.5 ipython_genutils 0.2.0 ipywidgets 7.6.3 jedi 0.18.0 joblib 1.0.1 kiwisolver 1.3.1 leidenalg 0.8.10 llvmlite 0.38.0 lxml 4.8.0 matplotlib 3.5.1 matplotlib_inline NA mkl 2.3.0 mpl_toolkits NA natsort 7.1.1 numba 0.55.1 numexpr 2.7.3 numpy 1.19.2 packaging 20.9 pandas 1.2.3 param 1.12.1 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.18 psutil 5.8.0 ptyprocess 0.7.0 pycparser 2.20 pygments 2.8.0 pynndescent 0.5.6 pyparsing 2.4.7 pytz 2021.1 scipy 1.6.1 seaborn 0.11.2 session_info 1.0.0 six 1.15.0 sklearn 0.24.1 skmisc 0.1.4 sphinxcontrib NA statsmodels 0.12.2 storemagic NA tables 3.6.1 texttable 1.6.4 tornado 6.1 tqdm 4.55.1 traitlets 5.0.5 typing_extensions NA umap 0.5.3 wcwidth 0.2.5 yaml 6.0 zipp NA zmq 22.0.3

IPython 7.23.1 jupyter_client 6.1.12 jupyter_core 4.7.1 notebook 6.4.0

Python 3.7.9 (default, Aug 31 2020, 07:22:35) [Clang 10.0.0 ] Darwin-20.6.0-x86_64-i386-64bit

Apr 28 '22 21:04 sjfleming

It seems like _read_legacy_10x_h5() invokes _collect_datasets() without taking genome into account? Perhaps this was lost during a refactor? https://github.com/scverse/scanpy/blob/bd06cc3d1e0bd990f6994e54414512fa0b25fea0/scanpy/readwrite.py#L222

Apr 28 '22 21:04 sjfleming

Yes, I think line 222 should be

    _collect_datasets(dsets, f[genome])

I will make a PR, and I will try to add a couple of tests to highlight this issue.

May 02 '22 14:05 sjfleming

@ivirshup or @flying-sheep or any other active maintainers, if you get a chance, could you consider the associated PR for this issue? It'd be great to have this fixed in a future release.

Jul 27 '22 14:07 sjfleming

@sjfleming I am currently facing the same issue. I was able to load the h5 output with this input function you stated here https://lightrun.com/answers/broadinstitute-cellbender-read_10x_h5-error-in-scanpy-191

But after further analysis and I wanted to save the adata object with the write function to h5ad format, I am not able to read that saved h5ad object with scanpy again with error test.h5ad contains more than one genome. For legacy 10x h5 files you must specify the genome if more than one is present. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']

For some reason, the genome "GRCh38" is not showing up in the available genomes options. And when I tried to use command test = sc.read_10x_h5 ('test.h5ad', genome = "GRCh38), this error shows up again Could not find genome 'GRCh38' in 'test.h5ad'. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp'].

Do you know if this error is related to this pull request? and is there any fix to it so that I can save processed h5ad after cellbender and able to read it again? Thank you very much!

Sep 15 '22 00:09 karenlawwc

@karenlawwc For the test.h5ad that you’ve saved using adata.write, I think you want to load it with sc.read_h5ad() rather than sc.read_10x_h5(), since the saved test file will be in AnnData h5ad format

Sep 22 '22 22:09 sjfleming

scanpy scanpy copied to clipboard

read_10x_h5() `genome` argument appears recently broken for 10x v2 format

Versions

anndata 0.8.0 scanpy 1.9.1

IPython 7.23.1 jupyter_client 6.1.12 jupyter_core 4.7.1 notebook 6.4.0

Python 3.7.9 (default, Aug 31 2020, 07:22:35) [Clang 10.0.0 ] Darwin-20.6.0-x86_64-i386-64bit

scanpy
scanpy copied to clipboard