scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

read_10x_h5() `genome` argument appears recently broken for 10x v2 format

Open sjfleming opened this issue 2 years ago • 5 comments

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of scanpy.
  • [x] (optional) I have confirmed this bug exists on the master branch of scanpy.

To reproduce this issue:

  1. download the public 10x dataset here (https://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5)
  2. run the following
import scanpy as sc

adata_human = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='hg19')
adata_mouse = sc.read_10x_h5('hgmm_12k_raw_gene_bc_matrices_h5.h5', genome='mm10')

assert (adata_human.X != adata_mouse.X).sum() > 0, 'these count matrices are equal'

which produces the assertion error. We see that the loaded data is the same regardless of 'genome' argument. A look at the file itself shows this is not the case (notice the number of gene names, which are different for hg19 and mm10):

image

Versions

Also I think I can say confidently that this was working fine as of scanpy 1.8.1


anndata 0.8.0 scanpy 1.9.1

PIL 8.1.0 appnope 0.1.2 backcall 0.2.0 cached_property 1.5.2 cellbender NA cffi 1.14.5 colorcet 3.0.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 5.0.9 fontTools 4.33.3 h5py 3.2.0 igraph 0.9.10 ipykernel 5.5.5 ipython_genutils 0.2.0 ipywidgets 7.6.3 jedi 0.18.0 joblib 1.0.1 kiwisolver 1.3.1 leidenalg 0.8.10 llvmlite 0.38.0 lxml 4.8.0 matplotlib 3.5.1 matplotlib_inline NA mkl 2.3.0 mpl_toolkits NA natsort 7.1.1 numba 0.55.1 numexpr 2.7.3 numpy 1.19.2 packaging 20.9 pandas 1.2.3 param 1.12.1 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.18 psutil 5.8.0 ptyprocess 0.7.0 pycparser 2.20 pygments 2.8.0 pynndescent 0.5.6 pyparsing 2.4.7 pytz 2021.1 scipy 1.6.1 seaborn 0.11.2 session_info 1.0.0 six 1.15.0 sklearn 0.24.1 skmisc 0.1.4 sphinxcontrib NA statsmodels 0.12.2 storemagic NA tables 3.6.1 texttable 1.6.4 tornado 6.1 tqdm 4.55.1 traitlets 5.0.5 typing_extensions NA umap 0.5.3 wcwidth 0.2.5 yaml 6.0 zipp NA zmq 22.0.3

IPython 7.23.1 jupyter_client 6.1.12 jupyter_core 4.7.1 notebook 6.4.0

Python 3.7.9 (default, Aug 31 2020, 07:22:35) [Clang 10.0.0 ] Darwin-20.6.0-x86_64-i386-64bit

sjfleming avatar Apr 28 '22 21:04 sjfleming

It seems like _read_legacy_10x_h5() invokes _collect_datasets() without taking genome into account? Perhaps this was lost during a refactor? https://github.com/scverse/scanpy/blob/bd06cc3d1e0bd990f6994e54414512fa0b25fea0/scanpy/readwrite.py#L222

sjfleming avatar Apr 28 '22 21:04 sjfleming

Yes, I think line 222 should be

    _collect_datasets(dsets, f[genome])

I will make a PR, and I will try to add a couple of tests to highlight this issue.

sjfleming avatar May 02 '22 14:05 sjfleming

@ivirshup or @flying-sheep or any other active maintainers, if you get a chance, could you consider the associated PR for this issue? It'd be great to have this fixed in a future release.

sjfleming avatar Jul 27 '22 14:07 sjfleming

@sjfleming I am currently facing the same issue. I was able to load the h5 output with this input function you stated here https://lightrun.com/answers/broadinstitute-cellbender-read_10x_h5-error-in-scanpy-191

But after further analysis and I wanted to save the adata object with the write function to h5ad format, I am not able to read that saved h5ad object with scanpy again with error test.h5ad contains more than one genome. For legacy 10x h5 files you must specify the genome if more than one is present. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']

For some reason, the genome "GRCh38" is not showing up in the available genomes options. And when I tried to use command test = sc.read_10x_h5 ('test.h5ad', genome = "GRCh38), this error shows up again Could not find genome 'GRCh38' in 'test.h5ad'. Available genomes are: ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp'].

Do you know if this error is related to this pull request? and is there any fix to it so that I can save processed h5ad after cellbender and able to read it again? Thank you very much!

karenlawwc avatar Sep 15 '22 00:09 karenlawwc

@karenlawwc For the test.h5ad that you’ve saved using adata.write, I think you want to load it with sc.read_h5ad() rather than sc.read_10x_h5(), since the saved test file will be in AnnData h5ad format

sjfleming avatar Sep 22 '22 22:09 sjfleming