SnapATAC2 icon indicating copy to clipboard operation
SnapATAC2 copied to clipboard

the description of read_mtx is not clear

Open beyondpie opened this issue 1 year ago • 6 comments

Hi Kai,

I found that the description of read_mtx is not clear.

  1. obs_names: File that stores the observation names.
    • obs_name_file would be better, otherwise, it's hard to know it's a file.
    • it should be row / column of the read_mtx? From my usage, I think it should be row names. But this is usually not the order of 10x_mtx (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_10x_mtx.html). And I think people use read_mtx usually because people wants to read 10x_mtx.
  2. same for var_names.

Maybe we can provide another function like scanpy to directly load mtx from a directory?

Sincerely, Songpeng

beyondpie avatar Nov 25 '23 07:11 beyondpie

Good idea! Yes, we can add read_10x_mtx to the package. Do you want to give it a try?

kaizhang avatar Nov 25 '23 09:11 kaizhang

Sure. I see scanpy is also our dependence. I can basically call scancp's function, which load data into memory, then create SnapATAC2's AnnData to store it in file. What do you think?

beyondpie avatar Nov 25 '23 16:11 beyondpie

scanpy is not a mandatory dependency. It would be better not to rely on scanpy for this.

kaizhang avatar Nov 26 '23 10:11 kaizhang

A function called "read_10x_mtx" was added with this commit 1a7f69269514fd9ba4b31c293ca68d0999642782 for reading mtx files generated by 10x genomics. I'm closing this issue now.

kaizhang avatar Jan 04 '24 07:01 kaizhang

Thanks, Kai! Sorry for the late response, I am now in the job market...

beyondpie avatar Jan 04 '24 07:01 beyondpie

Hi Kai @kaizhang ,

I would suggest we copy the behavior of getting gene symbols by default in scanpy.read_10x_mtx (https://github.com/scverse/scanpy/blob/214e05bdc54df61c520dc563ab39b7780e6d3358/scanpy/readwrite.py#L570):

    genes = pd.read_csv(
        path / f"{prefix}{'genes' if is_legacy else 'features'}.tsv{suffix}",
        header=None,
        sep="\t",
    )
    if var_names == "gene_symbols":
        var_names_idx = pd.Index(genes[1].values)
        if make_unique:
            var_names_idx = anndata.utils.make_index_unique(var_names_idx)
        adata.var_names = var_names_idx
        adata.var["gene_ids"] = genes[0].values
    elif var_names == "gene_ids":
        adata.var_names = genes[0].values
        adata.var["gene_symbols"] = genes[1].values
    else:
        raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")

I noticed that in our package, we will get ENSMUSG-like index, but most of the time, I think we need gene symbols.

Thanks! Songpeng

beyondpie avatar Mar 28 '24 19:03 beyondpie