scanpy
scanpy copied to clipboard
read_10x_h5() `genome` argument appears recently broken for 10x v2 format
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [x] (optional) I have confirmed this bug exists on the master branch of scanpy.
To reproduce this issue:
- download the public 10x dataset here (https://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5)
- run the following
import scanpy as sc
adata_human = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='hg19')
adata_mouse = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='mm10')
assert (adata_human.X != adata_mouse.X).sum() > 0, 'these count matrices are equal'
which produces the assertion error. We see that the loaded data is the same regardless of 'genome'
argument. A look at the file itself shows this is not the case (notice the number of gene names, which are different for hg19 and mm10):
Versions
Also I think I can say confidently that this was working fine as of scanpy 1.8.1
anndata 0.8.0 scanpy 1.9.1
PIL 8.1.0 appnope 0.1.2 backcall 0.2.0 cached_property 1.5.2 cellbender NA cffi 1.14.5 colorcet 3.0.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 5.0.9 fontTools 4.33.3 h5py 3.2.0 igraph 0.9.10 ipykernel 5.5.5 ipython_genutils 0.2.0 ipywidgets 7.6.3 jedi 0.18.0 joblib 1.0.1 kiwisolver 1.3.1 leidenalg 0.8.10 llvmlite 0.38.0 lxml 4.8.0 matplotlib 3.5.1 matplotlib_inline NA mkl 2.3.0 mpl_toolkits NA natsort 7.1.1 numba 0.55.1 numexpr 2.7.3 numpy 1.19.2 packaging 20.9 pandas 1.2.3 param 1.12.1 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.18 psutil 5.8.0 ptyprocess 0.7.0 pycparser 2.20 pygments 2.8.0 pynndescent 0.5.6 pyparsing 2.4.7 pytz 2021.1 scipy 1.6.1 seaborn 0.11.2 session_info 1.0.0 six 1.15.0 sklearn 0.24.1 skmisc 0.1.4 sphinxcontrib NA statsmodels 0.12.2 storemagic NA tables 3.6.1 texttable 1.6.4 tornado 6.1 tqdm 4.55.1 traitlets 5.0.5 typing_extensions NA umap 0.5.3 wcwidth 0.2.5 yaml 6.0 zipp NA zmq 22.0.3
IPython 7.23.1 jupyter_client 6.1.12 jupyter_core 4.7.1 notebook 6.4.0
Python 3.7.9 (default, Aug 31 2020, 07:22:35) [Clang 10.0.0 ] Darwin-20.6.0-x86_64-i386-64bit
It seems like _read_legacy_10x_h5()
invokes _collect_datasets()
without taking genome
into account? Perhaps this was lost during a refactor?
https://github.com/scverse/scanpy/blob/bd06cc3d1e0bd990f6994e54414512fa0b25fea0/scanpy/readwrite.py#L222
Yes, I think line 222 should be
_collect_datasets(dsets, f[genome])
I will make a PR, and I will try to add a couple of tests to highlight this issue.
@ivirshup or @flying-sheep or any other active maintainers, if you get a chance, could you consider the associated PR for this issue? It'd be great to have this fixed in a future release.
@sjfleming I am currently facing the same issue. I was able to load the h5 output with this input function you stated here https://lightrun.com/answers/broadinstitute-cellbender-read_10x_h5-error-in-scanpy-191
But after further analysis and I wanted to save the adata object with the write function to h5ad format, I am not able to read that saved h5ad object with scanpy again with error test.h5ad contains more than one genome. For legacy 10x h5 files you must specify the genome if more than one is present. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']
For some reason, the genome "GRCh38" is not showing up in the available genomes options. And when I tried to use command test = sc.read_10x_h5 ('test.h5ad', genome = "GRCh38), this error shows up again Could not find genome 'GRCh38' in 'test.h5ad'. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp'].
Do you know if this error is related to this pull request? and is there any fix to it so that I can save processed h5ad after cellbender and able to read it again? Thank you very much!
@karenlawwc For the test.h5ad that you’ve saved using adata.write, I think you want to load it with sc.read_h5ad() rather than sc.read_10x_h5(), since the saved test file will be in AnnData h5ad format