scanpy
scanpy copied to clipboard
Reading 10x scRNA-seq with read_10x_mtx()
I get an error when I use sc.read_10x_mtx() to read scRNA-seq files (matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz) generated by "Illumina HiSeq 4000". The function works fine when I read files generated by other 10x platforms such as "Illumina NovaSeq 6000".
Example:
sc.read_10x_mtx("GSE145328_RAW")
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 2
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-20-26443e0aed95> in <module>
----> 1 rnaseq1 = sc.read_10x_mtx("GSE145328_RAW")
~/anaconda3/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_mtx(path, var_names, make_unique, cache, cache_compression, gex_only, prefix)
479 genefile_exists = (path / f'{prefix}genes.tsv').is_file()
480 read = _read_legacy_10x_mtx if genefile_exists else _read_v3_10x_mtx
--> 481 adata = read(
482 str(path),
483 var_names=var_names,
~/anaconda3/lib/python3.8/site-packages/scanpy/readwrite.py in _read_v3_10x_mtx(path, var_names, make_unique, cache, cache_compression, prefix)
560 else:
561 raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")
--> 562 adata.var['feature_types'] = genes[2].values
563 adata.obs_names = pd.read_csv(path / f'{prefix}barcodes.tsv.gz', header=None)[
564 0
~/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
~/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 2
Versions
scanpy==1.8.0 anndata==0.7.6 umap==0.5.1 numpy==1.19.2 scipy==1.6.3 pandas==1.2.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.1 pynndescent==0.5.2
[Paste the output of scanpy.logging.print_versions() leaving a blank line after the details tag]
Thanks for the report!
First question: Is this directly out of cell ranger? Or is that the supplementary file from this GEO entry (GSE145328)? If this is the entry from GEO, I wouldn't expect sc.read_10x_mtx
, since that doesn't look like the output of cellranger.
GSE145328 is from NCBI GEO. GSE139555_RAW.tar is another example that gives the same error. These are the output files generated by Cell Ranger and uploaded to NCBI GEO, based on the authors' notes in their paper.
Some examples from NCBI GEO that can be read without errors: GSE173231 and GSE158803.
So the problem is actually from GEO. When people submitted the files processed by Cellranger version 2, they gzip-ed the files. However when Scanpy sees .gz file it recognized the version as Cellranger version 3 by default, which is a little bit different from the version 2 format.
All you need to do is just to gunzip
the matrix.mtx.gz
, barcodes.tsv.gz
, and genes.tsv.gz
files. You can tell whether the data was processed using version 2 or 3 from the description in GEO, or by the name of the uploaded gene files: genes.tsv
(version 2, could be genes.tsv.gz
) or features.tsv.gz
(version 3)
Hello,
I am having a similar issue with read_10x_mtx
, except my error is showing Key Error: 1
and it is from 10X data we produced ourselves, do doesn't involve a GEO incompabitibility.
Error below:
>>> juno = sc.read_10x_mtx(path = "../../Data/juno_influenza_pilot/hash_t_cells/umi_count")
Traceback (most recent call last):
File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 2131, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 2140, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/scanpy/readwrite.py", line 481, in read_10x_mtx
adata = read(
File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/scanpy/readwrite.py", line 552, in _read_v3_10x_mtx
var_names = genes[1].values
File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/frame.py", line 3455, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/daniel/miniconda/envs/scvi/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 1
thanks a lot @hurleyLi, dzip -d the zipped files fix the problem for me.
i would add a pull request but i see there are 44 already to go...
pandas transparently handles gzipping. We could also simply add support for gzipped files:
def maybe_zipped_path(path: Path) -> Path:
zipped_path = Path(f'{path}.gz')
if zipped_path.is_file(): return zipped_path
return path
And then we use it:
https://github.com/scverse/scanpy/blob/a38a22ab85074f17788b8d1effa89c1373e0c978/scanpy/readwrite.py#L488
to
genefile_exists = maybe_zipped_path(path / f'{prefix}genes.tsv').is_file()
and
https://github.com/scverse/scanpy/blob/a38a22ab85074f17788b8d1effa89c1373e0c978/scanpy/readwrite.py#L525
to
genes = pd.read_csv(maybe_zipped_path(path / f'{prefix}genes.tsv'), header=None, sep='\t')
and so on.
I've figured out what was causing my error. The scanpy function to read in features.tsv.gz
expects three columns: ['gene_symbols, 'gene_ids', 'feature_types']
Where 'feature types' is a text string like 'Gene Expression' and usually repeated along the whole length of the file.
The file I was reading in was from HTO data and only had one column:
Hashtag1-GTCAACTCTTTAGCG Hashtag2-TTCCGCCTCTCTTTG Hashtag3-AAGTATCGTTTCGCA unmapped
So if others run into this same error, just add in some extra columns to the features.tsv
file so it doesn't error out when looking for the extra columns. Something like this (different features file):
RP11-34P13.7 RP11-34P13.7 Gene Expression FO538757.3 FO538757.3 Gene Expression FO538757.2 FO538757.2 Gene Expression AP006222.2 AP006222.2 Gene Expression RP4-669L17.10 RP4-669L17.10 Gene Expression
It would also be helpful if scanpy would validate the number of columns at the start. At the moment it looks like it reads in the whole .mtx
file before trying to map the feature names and producing this error, so it takes a while to fail.
Faced exactly the same problem with file "GSE185477_GSM3178784_C41_SC_raw_counts.zip" from GSE185477
The advice from @hurleyLi helped, thanks a lot! But the error is quite confusing. It would be great to read files without extra actions for the user, but is it possible to at least change the error message? E.g.
try:
....
except KeyError:
raise KeyError("Unexpected error, probably due to Cellranger version. Make sure to unarchive gzipped file coming from Cellranger v2")
I've figured out what was causing my error. The scanpy function to read in
features.tsv.gz
expects three columns:['gene_symbols, 'gene_ids', 'feature_types']
Where 'feature types' is a text string like 'Gene Expression' and usually repeated along the whole length of the file. The file I was reading in was from HTO data and only had one column:Hashtag1-GTCAACTCTTTAGCG Hashtag2-TTCCGCCTCTCTTTG Hashtag3-AAGTATCGTTTCGCA unmapped
So if others run into this same error, just add in some extra columns to the
features.tsv
file so it doesn't error out when looking for the extra columns. Something like this (different features file):RP11-34P13.7 RP11-34P13.7 Gene Expression FO538757.3 FO538757.3 Gene Expression FO538757.2 FO538757.2 Gene Expression AP006222.2 AP006222.2 Gene Expression RP4-669L17.10 RP4-669L17.10 Gene Expression
It would also be helpful if scanpy would validate the number of columns at the start. At the moment it looks like it reads in the whole
.mtx
file before trying to map the feature names and producing this error, so it takes a while to fail.
when you do add those columns it makes the number of genes to 0 while reading it!
Hello,
my issue is similar to the ones above, so I hope you can help me with that.
I tried to read in some sample data from a liver cell database (Liver Cell Atlas) with the function sc.read_10x_mtx. When I pass in the data just like that, I get a KeyError: 1. I adjusted the file to have two more columns (with numbers 1:x as gene IDs and a feature types column with "Gene Expression" in all rows) , but then I get the following Error: ValueError: Length of passed value for var_names is 31054, but this AnnData has shape: (389056, 31053). I think this is because the function treats the column names as values, but I don’t know why. When I look at the features file with pandas, it is displayed correctly.
Any suggestions? Thanks a lot in advance!
Not sure whether it is resolved, just put here another solution to read_mtx and add to anndata one by one
import pandas as pd
import scanpy as sc
adata = sc.read_mtx('./matrix.mtx')
adata_bc=pd.read_csv('./barcodes.tsv',header=None)
adata_features=pd.read_csv('./features.tsv',header=None)
adata= adata.T
adata.obs['cell_id']= adata_bc
adata.var['gene_name']= adata_features[0].tolist()
adata.var.index= adata.var['gene_name']
All right, I will try that. Thanks