ArchR icon indicating copy to clipboard operation
ArchR copied to clipboard

Mismatch between gene expression and features

Open TomVuod opened this issue 9 months ago • 2 comments

There is a bug in ArrowRead.R file which produces a mismatch between rows in the gene expression matrix and features in rowData.

matchI <- match(i, idxRows, nomatch = 0)
idxI <- which(matchI > 0)
i <- i[idxI]
j <- j[idxI]
i <- matchI[idxI]

The last line replaces the row indices with the corresponding indices of the sorted genes. This behavior has no effect as long as the genes are already sorted, meaning that the idx vector is identical to seq_along(idx). However, if the genes are unsorted, this operation alters the row positions of at least some elements in the assembled matrix, leading to inconsistencies. As a result, the new row indices no longer align with the rows in rowData, which is retrieved independently. To reproduce this bug it's sufficient to pass to addGeneExpressionMatrix a SummarizedExperiment with genes unsorted relative to their position in the genome. After saving the ArchR project and loading it from the Arrow files the gene expression doesn't match the gene names as rows where rowData(GeneExpressionMatrix)$idx != seq_along(rowData(GeneExpressionMatrix)$idx).

TomVuod avatar Mar 03 '25 20:03 TomVuod

Hi @TomVuod! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
If you are getting an error, it is likely due to something specific to your dataset, usage, or computational environment, all of which are extremely challenging to troubleshoot. As such, we require reproducible examples (preferably using the tutorial dataset) from users who want assistance. If you cannot reproduce your error, we will not be able to help. Before going through the work of making a reproducible example, search the previous Issues, Discussions, function definitions, or the ArchR manual and you will likely find the answers you are looking for. If your post does not contain a reproducible example, it is unlikely to receive a response.
In addition to a reproducible example, you must do the following things before we help you, unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Did you post your log file? If not, add it now. 3. Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.

immanuelazn avatar Mar 03 '25 20:03 immanuelazn

@immanuelazn Hi Immanuel, thanks for maintaining the ArchR package! This bug reported by @TomVuod can be easily reproduced with the tutorial dataset. It affects the function getMatrixFromArrow and plotEmbedding, resulting in row-mismatched SummarizedExperiment object and wrong gene expression plot, respectively.

To reproduce, we just need to shuffle the rows of seRNA in the multiome tutorial, i.e., od <- sample(nrow(seRNA), replace = FALSE) seRNA <- seRNA[od,]

library(ggplot2)
library(patchwork)
library(ArchR)
addArchRGenome("hg38")
addArchRThreads(12)
set.seed(1)

setwd("/home/zeemeeuw/data/Multiome/")

atacFiles <- c("pbmc_sorted_3k" = "pbmc_sorted_3k.fragments.tsv.gz", 
               "pbmc_unsorted_3k" = "pbmc_unsorted_3k.fragments.tsv.gz")
rnaFiles <- c("pbmc_sorted_3k" = "pbmc_sorted_3k.filtered_feature_bc_matrix.h5", 
              "pbmc_unsorted_3k" = "pbmc_unsorted_3k.filtered_feature_bc_matrix.h5")

ArrowFiles <- createArrowFiles(
  inputFiles = atacFiles,
  sampleNames = names(atacFiles),
  minTSS = 4,
  minFrags = 1000,
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)

projMulti1 <- ArchRProject(ArrowFiles = ArrowFiles)

# Shuffle, wrong gex
seRNA <- import10xFeatureMatrix(
  input = rnaFiles,
  names = names(rnaFiles),
  strictMatch = TRUE
)
od <- sample(nrow(seRNA), replace = FALSE)
seRNA <- seRNA[od,]

cellsToKeep <- which(getCellNames(projMulti1) %in% colnames(seRNA))

projMulti2 <- subsetArchRProject(ArchRProj = projMulti1, 
                                 cells = getCellNames(projMulti1)[cellsToKeep], 
                                 outputDirectory = "Save-ProjMulti2", force = TRUE)

projMulti2 <- addGeneExpressionMatrix(input = projMulti2, seRNA = seRNA, strictMatch = TRUE, force = TRUE)


projMulti2 <- addIterativeLSI(
  ArchRProj = projMulti2,
  useMatrix = "GeneExpressionMatrix", 
  name = "IterativeLSI", 
  iterations = 2, 
  clusterParams = list( #See Seurat::FindClusters
    resolution = c(0.2), 
    sampleCells = 10000, 
    n.start = 10
  ), 
  varFeatures = 25000, 
  dimsToUse = 1:30,
  force = TRUE
)

projMulti2 <- addUMAP(
  ArchRProj = projMulti2, 
  reducedDims = "IterativeLSI", 
  name = "UMAP", 
  nNeighbors = 30, 
  minDist = 0.5, 
  metric = "cosine",
  force = TRUE
)

mat_wrong <- getMatrixFromProject(projMulti2, useMatrix = "GeneExpressionMatrix")


# Do not shuffle, correct gex
seRNA <- import10xFeatureMatrix(
  input = rnaFiles,
  names = names(rnaFiles),
  strictMatch = TRUE
)

cellsToKeep <- which(getCellNames(projMulti1) %in% colnames(seRNA))

projMulti3 <- subsetArchRProject(ArchRProj = projMulti1, 
                                 cells = getCellNames(projMulti1)[cellsToKeep], 
                                 outputDirectory = "Save-ProjMulti3", force = TRUE)

projMulti3 <- addGeneExpressionMatrix(input = projMulti3, seRNA = seRNA, strictMatch = TRUE, force = TRUE)


projMulti3 <- addIterativeLSI(
  ArchRProj = projMulti3,
  useMatrix = "GeneExpressionMatrix", 
  name = "IterativeLSI", 
  iterations = 2, 
  clusterParams = list( #See Seurat::FindClusters
    resolution = c(0.2), 
    sampleCells = 10000, 
    n.start = 10
  ), 
  varFeatures = 25000, 
  dimsToUse = 1:30,
  force = TRUE
)

projMulti3 <- addUMAP(
  ArchRProj = projMulti3, 
  reducedDims = "IterativeLSI", 
  name = "UMAP", 
  nNeighbors = 30, 
  minDist = 0.5, 
  metric = "cosine",
  force = TRUE
)

mat_correct <- getMatrixFromProject(projMulti3, useMatrix = "GeneExpressionMatrix")


data <- data.frame(x = projMulti2@embeddings$UMAP$df[,1],
                   y = projMulti2@embeddings$UMAP$df[,2],
                   exp_wrong = assay(mat_wrong)[which(rowData(mat_wrong)$name == "TCF7"), rownames(projMulti2@cellColData)],
                   exp_correct = assay(mat_correct)[which(rowData(mat_correct)$name == "TCF7"), rownames(projMulti2@cellColData)])

p1 <- ggplot(data, aes(x = x, y = y)) +
  theme_classic() +
  geom_point(aes(color = exp_wrong), size = 0.5) +
  scale_color_gradientn(colors = colorRampPalette(c("#f0f0f0", "#F9CC29", "#EE8300", "#EE8300", "#EE3B00", "#C80038"))(256)) +
  ggtitle("Wrong TCF7 GEX")
p2 <- ggplot(data, aes(x = x, y = y)) +
  theme_classic() +
  geom_point(aes(color = exp_correct), size = 0.5) +
  scale_color_gradientn(colors = colorRampPalette(c("#f0f0f0", "#F9CC29", "#EE8300", "#EE8300", "#EE3B00", "#C80038"))(256)) +
  ggtitle("Correct TCF7 GEX")
p1 | p2

p3 <- plotEmbedding(projMulti2, embedding = "UMAP", colorBy = "GeneExpressionMatrix", name = "TCF7", plotAs = "points", pal = colorRampPalette(c("#f0f0f0", "#F9CC29", "#EE8300", "#EE8300", "#EE3B00", "#C80038"))(256))
p4 <- plotEmbedding(projMulti3, embedding = "UMAP", colorBy = "GeneExpressionMatrix", name = "TCF7", plotAs = "points", pal = colorRampPalette(c("#f0f0f0", "#F9CC29", "#EE8300", "#EE8300", "#EE3B00", "#C80038"))(256))
p3 | p4

Image Image

Also, the sparse gene expression mentioned in #1617 and #1452 was probably related to this bug.

ArchR-plotEmbedding-33311a1c4f26-Date-2025-03-11_Time-14-55-39.log

Lantianqianmu avatar Mar 11 '25 07:03 Lantianqianmu