ArchR icon indicating copy to clipboard operation
ArchR copied to clipboard

Different group labels for the same predicted cell in integrated scRNA

Open Zepeng-Mu opened this issue 2 years ago • 4 comments

Hi, I have noticed strange behavior when adding GeneIntegrationMatrix. I have posted a Discussion several months ago but I think it might be useful to open an Issue as well.

Basically, when integrating scRNA into scATAC, sometime scRNA barcode is assigned to more than one scATAC cell, in this case I have noticed that the nameGroup from integration for the same cell can have DIFFERENT labels.

The Discussion post I created before that gave a detailed example is here: https://github.com/GreenleafLab/ArchR/discussions/1234 I'm using release_1.0.2.

Thanks!!

Zepeng-Mu avatar Mar 22 '22 14:03 Zepeng-Mu

Hi @Zepeng-Mu! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
Before we help you, you must respond to the following questions unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Can you recapitulate your error using the tutorial code and dataset? If so, provide a reproducible example. 3. Did you post your log file? If not, add it now.

rcorces avatar Mar 22 '22 14:03 rcorces

Hi @rcorces, I can confirm this behavior is reproducible using tutorial data.

library(ArchR)

inputFiles <- getTutorialData("Hematopoiesis")
inputFiles

addArchRGenome("hg19")
addArchRThreads(threads = 14) 

ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)

doubScores <- addDoubletScores(
  input = ArrowFiles,
  k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
  knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search with doublet projection.
  LSIMethod = 1
)

projHeme1 <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  outputDirectory = "HemeTutorial",
  copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage.
)

projHeme2 <- filterDoublets(projHeme1)

projHeme2 <- addIterativeLSI(
  ArchRProj = projHeme2,
  useMatrix = "TileMatrix", 
  name = "IterativeLSI", 
  iterations = 2, 
  clusterParams = list( #See Seurat::FindClusters
    resolution = c(0.2), 
    sampleCells = 10000, 
    n.start = 10
  ), 
  varFeatures = 25000, 
  dimsToUse = 1:30
)

projHeme2 <- addHarmony(
  ArchRProj = projHeme2,
  reducedDims = "IterativeLSI",
  name = "Harmony",
  groupBy = "Sample"
)

seRNA <- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")

projHeme2 <- addGeneIntegrationMatrix(
  ArchRProj = projHeme2, 
  useMatrix = "GeneScoreMatrix",
  matrixName = "GeneIntegrationMatrix",
  reducedDims = "IterativeLSI",
  seRNA = seRNA,
  addToArrow = FALSE,
  groupRNA = "BioClassification",
  nameCell = "predictedCell_Un",
  nameGroup = "predictedGroup_Un",
  nameScore = "predictedScore_Un"
)

and then we can check:

sort(table(projHeme2$predictedCell_Un), decreasing=T)[1]

gives

CD34_32_R5:CGTAGCGAGTTCGCGC-1 
                          615

So this cell is mapped to 615 cells in scATAC dataset.

seRNA$BioClassification[colnames(seRNA) == "CD34_32_R5:CGTAGCGAGTTCGCGC-1"]
[1] "02_Early.Eryth"

and in seRNA this is annotated as "02_Early.Eryth". However,

> table(projHeme2$predictedGroup_Un[projHeme2$predictedCell_Un == "CD34_32_R5:CGTAGCGAGTTCGCGC-1"])

        01_HSC 02_Early.Eryth  03_Late.Eryth    08_GMP.Neut 
           354                       255                       1                          5

Basically for scATAC cells with predictedCell_Un "CD34_32_R5:CGTAGCGAGTTCGCGC-1", the corresponding predictedGroup_Un has four different annotated groups.

Here is the log:

          /   \     |   _  \      /      ||  |  |  | |   _  \
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |
        /  /_\  \   |      /     |  |     |   __   | |      /
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

Logging With ArchR!

Start Time : 2022-03-24 10:11:07

------- ArchR Info

ArchRThreads = 14
ArchRGenome = Hg19

------- System Info

Computer OS = unix
Total Cores = 28

------- Session Info

R version 4.0.5 (2021-03-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /scratch/midway2/zepengmu/conda_envs/newbase/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] harmony_0.1.0                     Rcpp_1.0.8.3                      uwot_0.1.11                       gridExtra_2.3                     nabor_0.5.1
 [6] SeuratObject_4.0.4                Seurat_4.1.0                      BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.58.0                   rtracklayer_1.50.0
[11] Biostrings_2.58.0                 XVector_0.30.0                    ArchR_1.0.2                       magrittr_2.0.2                    rhdf5_2.34.0
[16] Matrix_1.4-0                      data.table_1.14.2                 SummarizedExperiment_1.20.0       Biobase_2.50.0                    GenomicRanges_1.42.0
[21] GenomeInfoDb_1.26.7               IRanges_2.24.1                    S4Vectors_0.28.1                  BiocGenerics_0.36.1               MatrixGenerics_1.2.1
[26] matrixStats_0.61.0                ggplot2_3.3.5

loaded via a namespace (and not attached):
  [1] plyr_1.8.6               igraph_1.2.11            lazyeval_0.2.2           splines_4.0.5            BiocParallel_1.24.1      listenv_0.8.0            scattermore_0.7
  [8] digest_0.6.29            htmltools_0.5.2          fansi_1.0.2              tensor_1.5               cluster_2.1.2            ROCR_1.0-11              globals_0.14.0
 [15] spatstat.sparse_2.0-0    colorspace_2.0-3         ggrepel_0.9.1            xfun_0.30                dplyr_1.0.8              crayon_1.5.0             RCurl_1.98-1.6
 [22] jsonlite_1.8.0           spatstat.data_2.1-0      survival_3.2-13          zoo_1.8-9                glue_1.6.2               polyclip_1.10-0          gtable_0.3.0
 [29] zlibbioc_1.36.0          leiden_0.3.9             DelayedArray_0.16.3      Rhdf5lib_1.12.1          future.apply_1.8.1       abind_1.4-5              scales_1.1.1
 [36] DBI_1.1.2                miniUI_0.1.1.1           viridisLite_0.4.0        xtable_1.8-4             reticulate_1.24          spatstat.core_2.3-1      htmlwidgets_1.5.4
 [43] httr_1.4.2               RColorBrewer_1.1-2       ellipsis_0.3.2           ica_1.0-2                pkgconfig_2.0.3          XML_3.99-0.9             farver_2.1.0
 [50] deldir_1.0-6             utf8_1.2.2               tidyselect_1.1.2         labeling_0.4.2           rlang_1.0.2              reshape2_1.4.4           later_1.3.0
 [57] munsell_0.5.0            tools_4.0.5              cli_3.2.0                generics_0.1.2           ggridges_0.5.3           stringr_1.4.0            fastmap_1.1.0
 [64] goftest_1.2-3            fitdistrplus_1.1-6       purrr_0.3.4              RANN_2.6.1               pbapply_1.5-0            future_1.24.0            nlme_3.1-155
 [71] mime_0.12                compiler_4.0.5           rstudioapi_0.13          plotly_4.10.0            png_0.1-7                spatstat.utils_2.2-0     tibble_3.1.6
 [78] stringi_1.7.6            RSpectra_0.16-0          lattice_0.20-45          vctrs_0.3.8              pillar_1.7.0             lifecycle_1.0.1          rhdf5filters_1.2.1
 [85] spatstat.geom_2.3-0      lmtest_0.9-39            RcppAnnoy_0.0.19         cowplot_1.1.1            bitops_1.0-7             irlba_2.3.5              httpuv_1.6.5
 [92] patchwork_1.1.1          R6_2.5.1                 promises_1.2.0.1         KernSmooth_2.23-20       parallelly_1.30.0        codetools_0.2-18         MASS_7.3-55
 [99] gtools_3.9.2             assertthat_0.2.1         withr_2.5.0              GenomicAlignments_1.26.0 sctransform_0.3.3        Rsamtools_2.6.0          GenomeInfoDbData_1.2.4
[106] mgcv_1.8-39              rpart_4.1-15             tidyr_1.2.0              Cairo_1.5-15             Rtsne_0.15               shiny_1.7.1              tinytex_0.37


------- Log Info

2022-03-24 10:11:07 : Running Seurat's Integration Stuart* et al 2019, 0.001 mins elapsed.

2022-03-24 10:11:07 : Input-Parameters, Class = list

Input-Parameters$: length = 1

1 function (name)
2 .Internal(args(name))


Input-Parameters$ArchRProj: length = 1

Input-Parameters$useMatrix: length = 1
[1] "GeneScoreMatrix"


Input-Parameters$matrixName: length = 1
[1] "GeneIntegrationMatrix"


Input-Parameters$reducedDims: length = 1
[1] "IterativeLSI"


Input-Parameters$seRNA: length = 20287
class: RangedSummarizedExperiment
dim: 6 35582
metadata(0):
assays(1): counts
rownames(6): FAM138A OR4F5 ... OR4F16 FAM87B
rowData names(3): gene_name gene_id exonLength
colnames(35582): CD34_32_R5:AAACCTGAGTATCGAA-1 CD34_32_R5:AAACCTGAGTCGTTTG-1 ... BMMC_10x_GREENLEAF_REP2:TTTGTTGCATGTGTCA-1 BMMC_10x_GREENLEAF_REP2:TTTGTTGCATTGAAAG-1
colData names(10): Group nUMI_pre ... BioClassification Barcode


Input-Parameters$groupATAC: length = 0
NULL


Input-Parameters$groupRNA: length = 1
[1] "BioClassification"


Input-Parameters$groupList: length = 0
NULL


Input-Parameters$sampleCellsATAC: length = 1
[1] 10000


Input-Parameters$sampleCellsRNA: length = 1
[1] 10000


Input-Parameters$embeddingATAC: length = 0
NULL


Input-Parameters$embeddingRNA: length = 0
NULL


Input-Parameters$dimsToUse: length = 30
[1] 1 2 3 4 5 6


Input-Parameters$scaleDims: length = 0
NULL


Input-Parameters$corCutOff: length = 1
[1] 0.75


Input-Parameters$plotUMAP: length = 1
[1] TRUE


Input-Parameters$nGenes: length = 1
[1] 2000


Input-Parameters$useImputation: length = 1
[1] TRUE


Input-Parameters$reduction: length = 1
[1] "cca"


Input-Parameters$addToArrow: length = 1
[1] FALSE


Input-Parameters$scaleTo: length = 1
[1] 10000


Input-Parameters$genesUse: length = 0
NULL


Input-Parameters$nameCell: length = 1
[1] "predictedCell_Un"


Input-Parameters$nameGroup: length = 1
[1] "predictedGroup_Un"


Input-Parameters$nameScore: length = 1
[1] "predictedScore_Un"


Input-Parameters$threads: length = 1
[1] 14


Input-Parameters$verbose: length = 1
[1] TRUE


Input-Parameters$force: length = 1
[1] FALSE


Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"


2022-03-24 10:11:07 : Checking ATAC Input, 0.006 mins elapsed.
2022-03-24 10:11:07 : Checking RNA Input, 0.011 mins elapsed.
2022-03-24 10:11:20 : Found 18601 overlapping gene names from gene scores and rna matrix!, 0.213 mins elapsed.
2022-03-24 10:11:20 : Creating Integration Blocks, 0.213 mins elapsed.
2022-03-24 10:11:20 : Prepping Interation Data, 0.218 mins elapsed.
2022-03-24 10:11:21 : Computing Integration in 1 Integration Blocks!, 0 mins elapsed.
2022-03-24 10:11:21 : Block (1 of 1) : Computing Integration, 0 mins elapsed.
2022-03-24 10:11:25 : Block (1 of 1) : Identifying Variable Genes, 0.063 mins elapsed.
2022-03-24 10:11:29 : Block (1 of 1) : Getting GeneScoreMatrix, 0.133 mins elapsed.

2022-03-24 10:11:37 : GeneScoreMat-Block-1, Class = dgCMatrix
GeneScoreMat-Block-1: nRows = 2000, nCols = 10250
GeneScoreMat-Block-1: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .


2022-03-24 10:11:37 : Block (1 of 1) : Imputing GeneScoreMatrix, 0.266 mins elapsed.

2022-03-24 10:11:37 : addImputeWeights Input-Parameters, Class = list

addImputeWeights Input-Parameters$ArchRProj: length = 1

addImputeWeights Input-Parameters$reducedDims: length = 1
[1] "IterativeLSI"


addImputeWeights Input-Parameters$dimsToUse: length = 30
[1] 1 2 3 4 5 6


addImputeWeights Input-Parameters$scaleDims: length = 0
NULL


addImputeWeights Input-Parameters$corCutOff: length = 1
[1] 0.75


addImputeWeights Input-Parameters$td: length = 1
[1] 3


addImputeWeights Input-Parameters$ka: length = 1
[1] 4


addImputeWeights Input-Parameters$sampleCells: length = 1
[1] 5000


addImputeWeights Input-Parameters$nRep: length = 1
[1] 2


addImputeWeights Input-Parameters$k: length = 1
[1] 15


addImputeWeights Input-Parameters$epsilon: length = 1
[1] 1


addImputeWeights Input-Parameters$useHdf5: length = 1
[1] TRUE


addImputeWeights Input-Parameters$randomSuffix: length = 1
[1] TRUE


addImputeWeights Input-Parameters$threads: length = 1
[1] 1


addImputeWeights Input-Parameters$seed: length = 1
[1] 1


addImputeWeights Input-Parameters$verbose: length = 1
[1] TRUE


addImputeWeights Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"


2022-03-24 10:11:37 : Computing Impute Weights Using Magic (Cell 2018), 0 mins elapsed.
2022-03-24 10:11:37 : Computing Partial Diffusion Matrix with Magic (1 of 2), 0 mins elapsed.
2022-03-24 10:11:46 : Computing Partial Diffusion Matrix with Magic (2 of 2), 0.158 mins elapsed.
2022-03-24 10:11:56 : Completed Getting Magic Weights!, 0.32 mins elapsed.

2022-03-24 10:11:56 : imputeMatrix Input-Parameters, Class = list

imputeMatrix Input-Parameters$mat: nRows = 2000, nCols = 10250
imputeMatrix Input-Parameters$mat: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .


imputeMatrix Input-Parameters$threads: length = 1
[1] 14


imputeMatrix Input-Parameters$verbose: length = 1
[1] FALSE


imputeMatrix Input-Parameters$logFile: length = 1
[1] "ArchRLogs/ArchR-addGeneIntegrationMatrix-ac5dcbaebc-Date-2022-03-24_Time-10-11-07.log"



2022-03-24 10:11:56 : mat, Class = dgCMatrix
mat: nRows = 2000, nCols = 10250
mat: NonZeroEntries = 3684343, EntryRange = [ 0.022 , 229.174 ]
5 x 5 sparse Matrix of class "dgCMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                                 .                                 2.246                                  .                                     .                             .
ISG15                                .                                 .                                      .                                     .                             3.644
TNFRSF18                             4.124                             .                                      1.913                                 .                             .
TNFRSF4                              .                                 .                                      2.249                                 .                             .
CALML6                               .                                 .                                      .                                     .                             .



2022-03-24 10:11:56 : weightList, Class = SimpleList

weightList$w1: length = 1
[1] "HemeTutorial/ImputeWeights/Impute-Weights-ac2e421c7e-Date-2022-03-24_Time-10-11-37-Rep-1"


weightList$w2: length = 1
[1] "HemeTutorial/ImputeWeights/Impute-Weights-ac2e421c7e-Date-2022-03-24_Time-10-11-37-Rep-2"


2022-03-24 10:11:56 : Imputing Matrix (1 of 2), 0 mins elapsed.

2022-03-24 10:11:56 :

2022-03-24 10:11:56 :

2022-03-24 10:11:56 :
2022-03-24 10:12:05 : Imputing Matrix (2 of 2), 0.15 mins elapsed.

2022-03-24 10:12:05 :

2022-03-24 10:12:05 :

2022-03-24 10:12:05 :
2022-03-24 10:12:14 : Finished Imputing Matrix, 0.304 mins elapsed.

2022-03-24 10:12:14 : GeneScoreMat-Block-Impute-1, Class = dgeMatrix
GeneScoreMat-Block-Impute-1: nRows = 2000, nCols = 10250
GeneScoreMat-Block-Impute-1: NonZeroEntries = 20500000, EntryRange = [ 0 , 9.2441243685058 ]
5 x 5 Matrix of class "dgeMatrix"
         scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1 scATAC_PBMC_R1#CGTTCCACAGAGATGC-1 scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 scATAC_PBMC_R1#GTCTACCAGGATCCTT-1 scATAC_PBMC_R1#ATGTCTTTCCATAACG-1
HES4                             0.7374332                         0.8325161                              0.4844393                         1.1526103                         0.8610469
ISG15                            0.6732143                         0.7313002                              0.4624542                         0.5490451                         0.8563325
TNFRSF18                         1.0225179                         0.9803957                              0.9646273                         0.2038027                         0.7665383
TNFRSF4                          0.2278850                         0.6113133                              0.4263749                         0.2423220                         0.4514156
CALML6                           2.0809765                         1.0019240                              1.1559091                         1.0064192                         1.2641407


2022-03-24 10:12:21 : Block (1 of 1) : Seurat FindTransferAnchors, 1.004 mins elapsed.

2022-03-24 10:14:14 : transferAnchors-1, Class = character

transferAnchors-1: length = 1
[1] "An AnchorSet object containing 1387 anchors between the reference and query Seurat objects. \n This can be used as input to TransferData."



2022-03-24 10:14:14 : rDSub-1, Class = matrix

2022-03-24 10:14:14 : rDSub-1, Class = array
                                            LSI1       LSI2       LSI3       LSI4      LSI5
scATAC_BMMC_R1#TCACCTGTCCTCCTGA-1      -4.129696 -0.5075478 -0.9600561  2.4942942 0.9356967
scATAC_PBMC_R1#CGTTCCACAGAGATGC-1      -5.034763 -1.0731695  0.5978558  0.7327498 0.3442506
scATAC_CD34_BMMC_R1#GCTCACTAGTCGTACT-1 -4.044496 -0.2141475 -1.5874410  1.6677084 0.3994403
scATAC_PBMC_R1#GTCTACCAGGATCCTT-1      -4.638248  1.9784523  1.1754489 -0.1663717 0.4113733
scATAC_PBMC_R1#ATGTCTTTCCATAACG-1      -4.851607 -1.5353766  0.8343660  0.0959220 0.3402740

rDSub-1: nRows = 10250, nCols = 30

2022-03-24 10:14:14 : Block (1 of 1) : Seurat TransferData Cell Group Labels, 2.885 mins elapsed.
2022-03-24 10:14:16 : Block (1 of 1) : Seurat TransferData Cell Names Labels, 2.926 mins elapsed.
2022-03-24 10:14:45 : Block (1 of 1) : Saving TransferAnchors Joint CCA, 3.411 mins elapsed.
2022-03-24 10:14:47 : Block (1 of 1) : Completed Integration, 3.434 mins elapsed.
2022-03-24 10:14:48 : Block (1 of 1) : Plotting Joint UMAP, 3.446 mins elapsed.
2022-03-24 10:15:19 : Completed Integration with RNA Matrix, 3.967 mins elapsed.

------- Completed

End Time : 2022-03-24 10:15:19
Elapsed Time Minutes = 4.20759488344193
Elapsed Time Hours = 0.0701266454988056

-------

Thanks!

Zepeng-Mu avatar Mar 24 '22 15:03 Zepeng-Mu

Hi, I'm wondering whether this is because of some stochasticity in Seurat TransferData and it is run twice to get cellGroup and cellName? https://github.com/GreenleafLab/ArchR/blob/968e4421ce7187a8ac7ea1cf6077412126876d5f/R/RNAIntegration.R#L472-L479

Zepeng-Mu avatar Apr 05 '22 15:04 Zepeng-Mu

That seems like a likely culprit. Thanks for pointing that out. We will take a look. Sorry its taking some time but we will get to this eventually.

rcorces avatar Apr 05 '22 19:04 rcorces