BPCells icon indicating copy to clipboard operation
BPCells copied to clipboard

Inconsistent genes scores in scATAC analysis

Open plbngl opened this issue 6 months ago • 7 comments

Hello, thanks a lot for the useful tool! I am analysing a PBMC scATAC dataset with BPCells and Signac and I have calculated gene estimates with their respective functions: gene_score_archr and GeneActivity. While with the signac function i observe expected results (markers clustering in celltypes), with the gene_score_archr I do not. This happens both if I use signac plotting function or bpcells plotting function. Also, when I use the plot embedding function as below, I have a warning message "Cell names are mismatched between source and embedding", which is not the case.

plot_embedding( source = gs[,rownames(umap)], umap, features = c("GAPDH", "GNLY", "CD3E", "LEF1", "FCER1A", "FCGR3A", "LYZ", "CD4","CD8") )

I am struggling to understand where is the problem, genes annotation, order of cells in frag file... but they all seem correct, and I have used the same approach for a different dataset (different computer as well) -- any suggestion? Thank you!

plbngl avatar Jun 09 '25 12:06 plbngl

Hi @plbngl, thanks for the issue report! To clarify -- have you been able to run this approach successfully on another dataset, or does it always have the same problem? The only initial data problem I can think of would be if there's a mismatch between the gene annotation genome version and the fragment file genome version (e.g. hg19 vs hg38).

The code spitting out the warning message seems to be buggy (should be anyNA(mapping), rather than !all(is.na(mapping)), so I should fix that but it might not be the source of your problem.

I'll try reproducing your problem on the PBMC 3k dataset from the tutorial. If I can reproduce the problem it should be pretty fast to track down the issue and push a fix to github if needed. If you happen to be working on a small-ish public dataset and have code to share, that would save me some time trying to reproduce the issue.

bnprks avatar Jun 10 '25 07:06 bnprks

Hi, thank you for getting back to me! Yes I was able in the past to run it correctly on another system and dataset but I cannot share it or reproduce it myself because data was lost recently.

To make sure it was not data-related I have also run your tutorial with the PBMC 3k exactly as in the vignette, and added the following:

gs = gene_score_archr(
  frags,
  genes,
  chrom_sizes,
  gene_name_column = "gene_name")

plot_embedding(
  source = gs,
  umap,
  features = c("MS4A1", "GNLY", "CD3E", 
               "CD14", "FCER1A", "FCGR3A", 
               "LYZ", "CD4","CD8"),
)

which gives this plot:

Image

On the other hand, if I use the same data, but the signac function, as shown in snippets below, I have results more consistent with rna-seq.

ranges = chrom_sizes
ranges$tile_width = 5000
atac_mat = tile_matrix(fragments=frags, ranges=ranges, mode =  "fragments",explicit_tile_names = TRUE)

#### Signac processing
suppressMessages(annotations <- GetGRangesFromEnsDb(ensdb=EnsDb.Hsapiens.v86))
seqlevelsStyle(annotations)  <- 'UCSC'
genome(annotations)          <- 'hg38'
suppressWarnings(chrom_assay <- CreateChromatinAssay(counts=atac_mat, sep=c(':', '-'), 
                                                      genome='hg38', fragments="pbmc_3k_10x.fragments.tsv.gz", 
                                                      min.cells=0, min.features=0, 
                                                      annotation=annotations))


adata <-CreateSeuratObject(chrom_assay, project = 'test', assay = "ATAC",
  min.cells = 0, min.features = 0, names.field = 1, names.delim = "_")
DefaultAssay(adata) <- "ATAC"
gene.activities <- GeneActivity(adata)
adata[['GA_seurat']] <- CreateAssayObject(counts = gene.activities)
DefaultAssay(adata) <- "GA_seurat"
adata <- NormalizeData(adata)
norm_adata = adata[['GA_seurat']]@data
plot_embedding(
  source = norm_adata,
  umap,
  features = c("MS4A1", "GNLY", "CD3E", 
               "CD14", "FCER1A", "FCGR3A", 
               "LYZ", "CD4","CD8"),
)

Image

Let me know if you can reproduce the issue and if there are any updates, Thank you very much, Paola

plbngl avatar Jun 10 '25 12:06 plbngl

Hi @plbngl, thanks for providing the test script to run on the pbmc3k demo dataset! I tried re-running your example but I got different results. How recently have you re-installed BPCells? I don't think we've done any recent fixes that would affect these gene score calculations, but it's always worth a check.

Here's the plot, and I've included the full end-to-end script I used to produce it below. The UMAP layout is flipped but the gene activity distribution looks reasonable. I'm also a bit confused why the (low-high) ranges don't match up between our plots as that part should be deterministic.

I'm curious what would happen if you ran the exact same end-to-end script as I did on your machine.

Image

Click here to see the full script I used
library(BPCells)
suppressPackageStartupMessages({
  library(dplyr)
})

# Substitute your preferred working directory for data_dir
data_dir <- file.path(tempdir(), "pbmc-3k")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
setwd(data_dir)

url_base <- "https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/"
rna_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_raw_feature_bc_matrix.h5")
atac_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz")

# Increase download timeout from 60 seconds to 5 minutes
options(timeout=300)

# Only download files if we haven't downloaded already
if (!file.exists("pbmc_3k_10x.h5")) {
  download.file(rna_raw_url, "pbmc_3k_10x.h5", mode="wb")
}
if (!file.exists("pbmc_3k_10x.fragments.tsv.gz")) {
  download.file(atac_raw_url, "pbmc_3k_10x.fragments.tsv.gz", mode="wb")
}


# Check if we already ran import
if (!file.exists("pbmc_3k_rna_raw")) {
  mat_raw <- open_matrix_10x_hdf5("pbmc_3k_10x.h5", feature_type="Gene Expression") %>% 
    write_matrix_dir("pbmc_3k_rna_raw")
} else {
  mat_raw <- open_matrix_dir("pbmc_3k_rna_raw")
}

# Check if we already ran import
if (!file.exists("pbmc_3k_frags")) {
  frags_raw <- open_fragments_10x("pbmc_3k_10x.fragments.tsv.gz") %>%
      write_fragments_dir("pbmc_3k_frags")
} else {
  frags_raw <- open_fragments_dir("pbmc_3k_frags")
}

reads_per_cell <- Matrix::colSums(mat_raw)
plot_read_count_knee(reads_per_cell, cutoff = 1e3)

genes <- read_gencode_transcripts(
  "./references", 
  release="42", 
  transcript_choice="MANE_Select",
  annotation_set = "basic", 
  features="transcript" # Make sure to set this so we don't get exons as well
)

blacklist <- read_encode_blacklist("./references", genome="hg38")

chrom_sizes <- read_ucsc_chrom_sizes("./references", genome="hg38")

atac_qc <- qc_scATAC(frags_raw, genes, blacklist)

plot_tss_scatter(atac_qc, min_frags=1000, min_tss=10)

plot_fragment_length(frags_raw) + plot_tss_profile(frags_raw, genes)

pass_atac <- atac_qc %>%
    dplyr::filter(nFrags > 1000, TSSEnrichment > 10) %>%
    dplyr::pull(cellName)
pass_rna <- colnames(mat_raw)[Matrix::colSums(mat_raw) > 1e3]
keeper_cells <- intersect(pass_atac, pass_rna)

frags <- frags_raw %>% select_cells(keeper_cells)

keeper_genes <- Matrix::rowSums(mat_raw) > 3
mat <- mat_raw[keeper_genes,keeper_cells]


# Normalize by reads-per-cell
mat <- multiply_cols(mat, 1/Matrix::colSums(mat))

# Log normalization
mat <- log1p(mat * 10000) # Log normalization
stats <- matrix_stats(mat, row_stats="variance")

# To keep the example small, we'll do a very naive variable gene selection
variable_genes <- order(stats$row_stats["variance",], decreasing=TRUE) %>% 
  head(1000) %>% 
  sort()

mat_norm <- mat[variable_genes,]

mat_norm <- mat_norm %>% write_matrix_memory(compress=FALSE)
gene_means <- stats$row_stats["mean",variable_genes]
gene_vars <- stats$row_stats["variance", variable_genes]
mat_norm <- (mat_norm - gene_means) / gene_vars
svd <- BPCells::svds(mat_norm, k=50)
# Alternate option: irlba::irlba(mat_norm, nv=50)
pca <- multiply_cols(svd$v, svd$d)

set.seed(12341512)
umap <- uwot::umap(pca)


gs = gene_score_archr(
  frags,
  genes,
  chrom_sizes,
  gene_name_column = "gene_name")

plot_embedding(
  source = gs,
  umap,
  features = c("MS4A1", "GNLY", "CD3E", 
               "CD14", "FCER1A", "FCGR3A", 
               "LYZ", "CD4","CD8"),
)

bnprks avatar Jun 11 '25 21:06 bnprks

Thanks for your feedback! I tried running your code using 2 different conda environement and in one (original I used) I have the wrong result and in the other (a new one I made, different R) worked as expected and as you also see. In the old setting I also have this warning while running the gene score function, which I don't have in the new version:

Warning message: “Tiles given out of order. Reordering with order_ranges()”

Here is the sessionInfo output for the two installations, if you see any apparent reason for this discrepancy. Problematic one:

R version 4.4.3 (2025-02-28)
Platform: x86_64-conda-linux-gnu
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS/LAPACK: /home/paola.benaglio/conda_envs/renv_snatac/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Rome
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4   BPCells_0.3.0

loaded via a namespace (and not attached):
 [1] generics_0.1.4          stringi_1.8.7           lattice_0.22-7         
 [4] hms_1.1.3               digest_0.6.37           magrittr_2.0.3         
 [7] evaluate_1.0.3          grid_4.4.3              RColorBrewer_1.1-3     
[10] pbdZMQ_0.3-14           fastmap_1.2.0           jsonlite_2.0.0         
[13] Matrix_1.7-3            GenomeInfoDb_1.42.3     httr_1.4.7             
[16] UCSC.utils_1.2.0        scales_1.4.0            cli_3.6.5              
[19] rlang_1.1.6             crayon_1.5.3            XVector_0.46.0         
[22] uwot_0.2.3              bit64_4.6.0-1           base64enc_0.1-3        
[25] withr_3.0.2             repr_1.1.7              FNN_1.1.4.1            
[28] parallel_4.4.3          tools_4.4.3             tzdb_0.5.0             
[31] uuid_1.2-1              ggplot2_3.5.2           GenomeInfoDbData_1.2.13
[34] BiocGenerics_0.52.0     IRdisplay_1.1           vctrs_0.6.5            
[37] R6_2.6.1                matrixStats_1.5.0       stats4_4.4.3           
[40] lifecycle_1.0.4         zlibbioc_1.52.0         stringr_1.5.1          
[43] bit_4.6.0               S4Vectors_0.44.0        IRanges_2.40.1         
[46] vroom_1.6.5             irlba_2.3.5.1           pkgconfig_2.0.3        
[49] hexbin_1.28.5           pillar_1.10.2           gtable_0.3.6           
[52] glue_1.8.0              Rcpp_1.0.14             tibble_3.3.0           
[55] GenomicRanges_1.58.0    tidyselect_1.2.1        MatrixGenerics_1.18.1  
[58] IRkernel_1.3.2          dichromat_2.0-0.1       farver_2.1.2           
[61] patchwork_1.3.0         htmltools_0.5.8.1       labeling_0.4.3         
[64] readr_2.1.5             compiler_4.4.3         

Good one:

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS/LAPACK: /home/paola.benaglio/conda_envs/renv_snatac_az/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Rome
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4   BPCells_0.3.0

loaded via a namespace (and not attached):
 [1] generics_0.1.4          bitops_1.0-9            stringi_1.8.7          
 [4] lattice_0.22-7          hms_1.1.3               digest_0.6.37          
 [7] magrittr_2.0.3          evaluate_1.0.3          grid_4.3.3             
[10] RColorBrewer_1.1-3      pbdZMQ_0.3-14           fastmap_1.2.0          
[13] jsonlite_2.0.0          Matrix_1.6-5            GenomeInfoDb_1.38.1    
[16] scales_1.4.0            cli_3.6.5               rlang_1.1.6            
[19] crayon_1.5.3            XVector_0.42.0          uwot_0.2.3             
[22] bit64_4.6.0-1           base64enc_0.1-3         withr_3.0.2            
[25] repr_1.1.7              FNN_1.1.4.1             parallel_4.3.3         
[28] tools_4.3.3             tzdb_0.5.0              uuid_1.2-1             
[31] ggplot2_3.5.2           GenomeInfoDbData_1.2.11 BiocGenerics_0.48.1    
[34] IRdisplay_1.1           vctrs_0.6.5             R6_2.6.1               
[37] matrixStats_1.5.0       stats4_4.3.3            lifecycle_1.0.4        
[40] zlibbioc_1.48.0         stringr_1.5.1           bit_4.6.0              
[43] S4Vectors_0.40.2        IRanges_2.36.0          vroom_1.6.5            
[46] irlba_2.3.5.1           pkgconfig_2.0.3         hexbin_1.28.5          
[49] pillar_1.10.2           gtable_0.3.6            glue_1.8.0             
[52] Rcpp_1.0.14             tibble_3.3.0            GenomicRanges_1.54.1   
[55] tidyselect_1.2.1        MatrixGenerics_1.14.0   IRkernel_1.3.2         
[58] dichromat_2.0-0.1       farver_2.1.2            patchwork_1.3.0        
[61] htmltools_0.5.8.1       labeling_0.4.3          readr_2.1.5            
[64] compiler_4.3.3          RCurl_1.98-1.16        

Thanks again!

plbngl avatar Jun 12 '25 10:06 plbngl

Okay, that's helpful to know. My first instinct is to look closer at how BPCells was installed. Did you install BPCells from the r-bpcells bioconda package or using remotes::install_github("bnprks/BPCells/r") from within an R session? I normally recommend installing from github to get the latest fixes + improvements, but unfortunately the version number shown in sessionInfo() is the same until the next time we mark a version bump.

My hope is that this is a bug which perhaps existed in the 0.3.0 release but has been fixed since (e.g. maybe the bad env used an r-bpcells conda install, whereas the good one installed straight from github). It's also possible that there is some bad interaction with another library, but I think that is somewhat unlikely. I'm running on even newer library versions than your non-working env, so it doesn't really seem like a dependency update broke things.

In any case, let me know how your BPCells versions were installed and I'm curious if just running remotes::install_github("bnprks/BPCells/r") in the non-working env would fix things.

My sessionInfo()
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas.so.0.3;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.1.4   BPCells_0.3.0

loaded via a namespace (and not attached):
 [1] bit_4.6.0               Matrix_1.7-3            gtable_0.3.6
 [4] jsonlite_2.0.0          crayon_1.5.3            compiler_4.5.0
 [7] tidyselect_1.2.1        Rcpp_1.0.14             FNN_1.1.4.1
[10] stringr_1.5.1           GenomicRanges_1.60.0    parallel_4.5.0
[13] IRanges_2.42.0          scales_1.4.0            uwot_0.2.3
[16] lattice_0.22-6          hexbin_1.28.5           ggplot2_3.5.2
[19] readr_2.1.5             R6_2.6.1                XVector_0.48.0
[22] patchwork_1.3.0         labeling_0.4.3          generics_0.1.4
[25] GenomeInfoDb_1.44.0     BiocGenerics_0.54.0     tibble_3.3.0
[28] GenomeInfoDbData_1.2.14 tzdb_0.5.0              pillar_1.10.2
[31] RColorBrewer_1.1-3      rlang_1.1.6             stringi_1.8.7
[34] bit64_4.6.0-1           cli_3.6.5               withr_3.0.2
[37] magrittr_2.0.3          grid_4.5.0              irlba_2.3.5.1
[40] vroom_1.6.5             hms_1.1.3               lifecycle_1.0.4
[43] S4Vectors_0.46.0        vctrs_0.6.5             glue_1.8.0
[46] farver_2.1.2            stats4_4.5.0            httr_1.4.7
[49] tools_4.5.0             pkgconfig_2.0.3         UCSC.utils_1.4.0

bnprks avatar Jun 12 '25 20:06 bnprks

Yes that was the case: he bad env used an r-bpcells conda install, whereas the good one installed by remotes::install_github("bnprks/BPCells/r") !! I am glad this makes sense to you -

plbngl avatar Jun 13 '25 12:06 plbngl

Perfect, glad that fixed your problem! It sounds like we should look into why the conda version has this bug and maybe tag a new release so that the conda package will get updated with the latest fixes from github. Thanks for bringing this to our attention

bnprks avatar Jun 13 '25 18:06 bnprks