scanpy/anndata inconsistent loading of anndata gene IDs, specific 0.8.0

I see inconsistent behavior between anndata 0.8.0 and 0.7.8.

I created an AnnData file from a Seurat object. It is serialized to disk and read in Python using 'sc.read(anndataFile)'. With anndata 0.7.8 this works fine; the problem appears with 0.8.0. Instead of having the gene names as feature names, it reports numeric indexes (as strings). For example:

# This returns "['0' '1' '10' ... '9997' '9998' '9999']" instead of the gene names
adFeats = sc.read(anndataFile).var_names.values

When I force anndata 0.7.8, this returns strings, as it should. Has anything deliberately changed? Is there a different expectation on how gene names are serialized in the file?
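
One way to detect this symptom programmatically is to check whether every feature name is a string of digits. This is only a minimal sketch: the helper name `looks_like_positional_index` is hypothetical (it is not part of scanpy or anndata), and it would give a false positive for a dataset whose real gene IDs are purely numeric.

```python
def looks_like_positional_index(names):
    """Return True if every name is a non-empty string of digits.

    Hypothetical helper: a heuristic for spotting var_names that were
    replaced by positional indexes such as '0', '1', '10', ...
    """
    names = [str(n) for n in names]
    return len(names) > 0 and all(n.isdigit() for n in names)
```

It could be called as `looks_like_positional_index(sc.read(anndataFile).var_names)`.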

Mar 28 '22 17:03 bbimber

Thanks for the report! Could you provide an example file or the commands you used to create this?

Mar 29 '22 12:03 ivirshup

@ivirshup I'm happy to try to share just the h5ad file (which isn't huge); however, here is some R code using public data that should reproduce it. Again, at least in our hands (and on GitHub Actions), anndata 0.7.8 works, but 0.8.0 loads an object with numeric gene names rather than the string gene names.

library(SeuratData)
library(SeuratDisk)
library(anndata)

# To install packages:
#remotes::install_github("mojaveazure/seurat-disk")
#devtools::install_github('satijalab/seurat-data')

seuratToAnnData <- function (seuratObj, outFileBaseName, assayName = NULL) {
  tmpFile <- outFileBaseName
  SeuratDisk::SaveH5Seurat(seuratObj, filename = tmpFile)
  h5seurat <- paste0(tmpFile, ".h5seurat")
  SeuratDisk::Convert(source = h5seurat, dest = "h5ad", overwrite = T)
  unlink(h5seurat)
  return(paste0(outFileBaseName, ".h5ad"))
}

# Load public example data:
suppressWarnings(SeuratData::InstallData("pbmc3k", force.reinstall = F))
suppressWarnings(data("pbmc3k"))
seuratObj <- suppressWarnings(pbmc3k)

seuratObj <- Seurat::NormalizeData(seuratObj, verbose = FALSE)
annFile <- seuratToAnnData(seuratObj, outFileBaseName = 'outFile')

# Now inspect the gene/feature names:
print('Reading with python/anndata:')
ad <- anndata::read_h5ad(annFile)
print(head(ad$var_names))

print('Converting to h5seurat and reading using seurat-disk')
x <- SeuratDisk::Convert(annFile, dest = "h5seurat", overwrite = T)
x <- SeuratDisk::LoadH5Seurat(x)
print(head(sort(rownames(x$RNA))))

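The associated Python code might be something like:

import sys

import scanpy as sc

# Path to the .h5ad file, passed as the first command-line argument.
# (Named h5ad_path so it does not shadow the anndata package name.)
h5ad_path = sys.argv[1]
print(h5ad_path)

# Read the file and sort the feature names in place for easier inspection
adFeats = sc.read(h5ad_path).var_names.values
adFeats.sort()
print('feats in aData:')
print(adFeats)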

Mar 29 '22 15:03 bbimber

I think I see what's happening here. Using your example:

import anndata as ad

a = ad.read_h5ad("./outFile.h5ad")
a

AnnData object with n_obs × n_vars = 2700 × 13714
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations'
    var: '_index', 'features'

The obs_names are read correctly, while the var_names are not.

Checking how the dataframes have been encoded:

import h5py

f = h5py.File("outFile.h5ad")

dict(f["obs"].attrs)

{'_index': '_index',
 'column-order': array(['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations'],
       dtype=object),
 'encoding-type': 'dataframe',
 'encoding-version': '0.1.0'}

dict(f["var"].attrs)

{'_index': '_index', 'column-order': array(['features'], dtype=object)}

Note that the obs group carries the 'encoding-type' and 'encoding-version' attributes, while the var group does not. So, this is essentially the same issue as reported in #731. Basically, since dataframes were stored in a columnar format, anndata always annotated the HDF5 groups with an encoding type. However, we never checked for it, so it didn't matter for reading.
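
The check above can be sketched as a small helper that reports which encoding attributes a group lacks. The function names `missing_encoding_attrs` and `report_h5ad` are hypothetical, and the list of required attributes is an assumption taken from the obs output shown here, not from the anndata source.

```python
# Assumption: these are the attributes a dataframe group should carry,
# based on the attributes of the obs group shown above.
REQUIRED_ATTRS = ("encoding-type", "encoding-version")


def missing_encoding_attrs(attrs):
    """Return the encoding attributes absent from a mapping of HDF5 attributes."""
    return [key for key in REQUIRED_ATTRS if key not in attrs]


def report_h5ad(path, groups=("obs", "var")):
    """Map each group name found in the file to the encoding attributes it lacks."""
    import h5py  # imported here so the helper above has no dependencies

    with h5py.File(path, "r") as f:
        return {
            name: missing_encoding_attrs(f[name].attrs)
            for name in groups
            if name in f
        }
```

For the file in this thread, `report_h5ad("outFile.h5ad")` would be expected to list both attributes as missing for var and none for obs.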

Now that we rely on that encoding info when reading, and because anndata itself never wrote out a dataframe without it, we didn't account for "partially correct" files in this release. SeuratDisk should write this metadata, but I'll make a bug fix that reads such files while complaining loudly about it. See the conversation on the other issue for more.
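
Until the writer is fixed, one possible workaround is to add the missing attributes to a copy of the file by hand. This is a rough sketch, not the fix referred to above: the function names are hypothetical, the attribute values are copied from the obs group shown earlier, and it is an assumption that adding them is sufficient for anndata 0.8.0 to read var as a dataframe.

```python
import shutil

# Assumption: values copied from the attributes of the obs group in this file.
DATAFRAME_ENCODING = {"encoding-type": "dataframe", "encoding-version": "0.1.0"}


def add_dataframe_encoding(attrs):
    """Add any missing dataframe encoding attributes; return the keys added."""
    added = []
    for key, value in DATAFRAME_ENCODING.items():
        if key not in attrs:
            attrs[key] = value
            added.append(key)
    return added


def patch_h5ad_copy(src, dst, groups=("obs", "var")):
    """Copy src to dst, then annotate the given groups in the copy only."""
    import h5py  # imported here so the helper above has no dependencies

    shutil.copyfile(src, dst)
    with h5py.File(dst, "r+") as f:
        return {
            name: add_dataframe_encoding(f[name].attrs)
            for name in groups
            if name in f
        }
```

Working on a copy keeps the original file untouched, so the result can be compared against what anndata 0.7.8 reads from the unpatched file.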

Mar 29 '22 16:03 ivirshup

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

Jun 18 '23 02:06 github-actions[bot]

Since this is basically a duplicate of #731, let’s track whether this is still relevant there.

Jun 19 '23 08:06 flying-sheep
