SnapATAC2
SnapATAC2 copied to clipboard
the description of read_mtx is not clear
Hi Kai,
I found that the description of read_mtx is not clear.
- obs_names: File that stores the observation names.
- obs_name_file would be better, otherwise, it's hard to know it's a file.
- it should be row / column of the read_mtx? From my usage, I think it should be row names. But this is usually not the order of 10x_mtx (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_10x_mtx.html). And I think people use read_mtx usually because people wants to read 10x_mtx.
- same for var_names.
Maybe we can provide another function like scanpy to directly load mtx from a directory?
Sincerely, Songpeng
Good idea! Yes, we can add read_10x_mtx
to the package. Do you want to give it a try?
Sure. I see scanpy
is also our dependence. I can basically call scancp's function, which load data into memory, then create SnapATAC2's AnnData to store it in file. What do you think?
scanpy is not a mandatory dependency. It would be better not to rely on scanpy for this.
A function called "read_10x_mtx" was added with this commit 1a7f69269514fd9ba4b31c293ca68d0999642782 for reading mtx files generated by 10x genomics. I'm closing this issue now.
Thanks, Kai! Sorry for the late response, I am now in the job market...
Hi Kai @kaizhang ,
I would suggest we copy the behavior of getting gene symbols by default in scanpy.read_10x_mtx (https://github.com/scverse/scanpy/blob/214e05bdc54df61c520dc563ab39b7780e6d3358/scanpy/readwrite.py#L570):
genes = pd.read_csv(
path / f"{prefix}{'genes' if is_legacy else 'features'}.tsv{suffix}",
header=None,
sep="\t",
)
if var_names == "gene_symbols":
var_names_idx = pd.Index(genes[1].values)
if make_unique:
var_names_idx = anndata.utils.make_index_unique(var_names_idx)
adata.var_names = var_names_idx
adata.var["gene_ids"] = genes[0].values
elif var_names == "gene_ids":
adata.var_names = genes[0].values
adata.var["gene_symbols"] = genes[1].values
else:
raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")
I noticed that in our package, we will get ENSMUSG-like index, but most of the time, I think we need gene symbols.
Thanks! Songpeng