seurat Integration + data correction

Hi Seurat Team!

I know that this topic has been addressed in previous issues (#7976 , #5879 ) but I'm still confused about it.

It has been discussed that the cell cycle regression (or other types of correction) should be done after the integration. However, I don't know if this correction will impact the following clustering. Maybe I'm losing something, but I think that the ScaleData() function doesn't modify the assay with the integrated values, which are the ones used for clustering, right?

The code I'm using is as follows (please correct me if it's wrong). It is based on the integration vignette:

# Create object and split
s.object <- CreateSeuratObject(count.table)
s.object[["RNA"]] <- split(s.object[["RNA"]], f = s.object[[integrationFactor]][,integrationFactor])

# Preprocessing
s.object <- NormalizeData(s.object)
s.object <- FindVariableFeatures(s.object)
s.object <- ScaleData(s.object)
s.object <- RunPCA(s.object)

# Integration
s.object <- IntegrateLayers(object = s.object, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
s.object[["RNA"]] <- JoinLayers(s.object[["RNA"]])

# Clustering
s.object <- FindNeighbors(s.object, reduction = "integrated.cca", dims = 1:30)
s.object <- FindClusters(s.object, resolution = 1)
s.object <- RunUMAP(s.object, dims = 1:30, reduction = "pca")

# Cell-cycle score and regression
s.object.corrected <- CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes)
s.object.corrected <- ScaleData(s.object.corrected, vars.to.regress = c("S.Score", "G2M.Score"))

# Clustering with the corrected values
s.object.corrected <- FindNeighbors(s.object.corrected, reduction = "integrated.cca", dims = 1:30)
s.object.corrected <- FindClusters(s.object.corrected, resolution = 1)
s.object.corrected <- RunUMAP(s.object.corrected, dims = 1:30, reduction = "pca")

Thus, the scale.data slots are indeed different:

> s.object@assays$RNA$scale.data[1:3,1:3]
                   dog10-AAACCTGCAGCTCGAC dog10-AAACCTGGTGACCAAG dog10-AAACCTGTCTCGGACG
ENSCAFG00845016176            -0.03958831            -0.03958831            -0.03958831
ENSCAFG00845016988            -0.08154677            -0.08154677            -0.08154677
ENSCAFG00845017128            -0.04624699            -0.04624699            -0.04624699
> s.object.corrected@assays$RNA$scale.data[1:3,1:3]
                   dog10-AAACCTGCAGCTCGAC dog10-AAACCTGGTGACCAAG dog10-AAACCTGTCTCGGACG
ENSCAFG00845016176            -0.11109859             -0.6112979             0.01813269
ENSCAFG00845016988            -0.02641256             -0.6635741            -0.04989789
ENSCAFG00845017128            -0.12496140             -0.5977094             0.01136535

But since the "integrated.cca" reduction is used for clustering and it remains unchanged, the two clustering are the same: > DimPlot(s.object)

> DimPlot(s.object.corrected)

> table(s.object$seurat_clusters)
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
734 530 480 468 450 439 100  82  69  62  62  44  39  33  27 

> table(s.object.corrected$seurat_clusters)
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
734 530 480 468 450 439 100  82  69  62  62  44  39  33  27

However, I don't know how to proceed to do the regression before the integration.

If I try to perform the CellCycleScoring() before splitting the object:

s.object <- CreateSeuratObject(count.table)
s.object <- CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'rowMeans': subscript out of bounds
9. h(simpleError(msg, call))
8. .handleSimpleError(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)), "subscript out of bounds",
base::quote(.subscript.2ary(x, i, , drop = TRUE)))
7. stop("subscript out of bounds")
6. .subscript.2ary(x, i, , drop = TRUE)
5. assay.data[pool, ]
4. assay.data[pool, ]
3. Matrix::rowMeans(x = assay.data[pool, ])
2. AddModuleScore(object = object, features = features, name = name,
ctrl = ctrl, ...)
1. CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes)

If I try to compute the cell cycle scores after normalization, an error appears saying that it can be computed on split objects:

s.object[["RNA"]] <- split(s.object[["RNA"]], f = s.object[[integrationFactor]][,integrationFactor])
s.object.corrected <- NormalizeData(s.object)
s.object.corrected <- FindVariableFeatures(s.object.corrected)
s.object.corrected <- CellCycleScoring(s.object.corrected, s.features = s.genes, g2m.features = g2m.genes)
s.object.corrected <- ScaleData(s.object.corrected, vars.to.regress = c("S.Score", "G2M.Score"))
s.object.corrected <- RunPCA(s.object.corrected)
> s.object.corrected <- CellCycleScoring(s.object.corrected, s.features = s.genes, g2m.features = g2m.genes)
Error in `GetAssayData()`:
! GetAssayData doesn't work for multiple layers in v5 assay.
Backtrace:
 1. Seurat::CellCycleScoring(...)
 2. Seurat::AddModuleScore(...)
 4. SeuratObject:::GetAssayData.Seurat(...)
 6. SeuratObject:::GetAssayData.StdAssay(object = object[[assay]], layer = layer)

How should I proceed then? For me, the idea of correcting for the cell cycle is that it doesn't influence the clustering.

Thank you in advance!

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gdata_3.0.0           rhdf5_2.44.0          Matrix_1.6-4          hdf5r_1.3.8           SeuratDisk_0.0.0.9021 Seurat_5.0.1          SeuratObject_5.0.1   
[8] sp_2.1-2             

loaded via a namespace (and not attached):
  [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3          rlang_1.1.2            magrittr_2.0.3         RcppAnnoy_0.0.21       spatstat.geom_3.2-7   
  [8] matrixStats_1.2.0      ggridges_0.5.4         compiler_4.3.2         png_0.1-8              vctrs_0.6.5            reshape2_1.4.4         stringr_1.5.1         
 [15] crayon_1.5.2           pkgconfig_2.0.3        fastmap_1.1.1          ellipsis_0.3.2         utf8_1.2.4             promises_1.2.1         bit_4.0.5             
 [22] purrr_1.0.2            xfun_0.41              jsonlite_1.8.8         goftest_1.2-3          later_1.3.2            rhdf5filters_1.12.1    Rhdf5lib_1.22.1       
 [29] spatstat.utils_3.0-4   irlba_2.3.5.1          parallel_4.3.2         cluster_2.1.5          R6_2.5.1               ica_1.0-3              stringi_1.8.3         
 [36] RColorBrewer_1.1-3     spatstat.data_3.0-3    reticulate_1.34.0      parallelly_1.36.0      lmtest_0.9-40          scattermore_1.2        Rcpp_1.0.11           
 [43] knitr_1.45             tensor_1.5             future.apply_1.11.0    zoo_1.8-12             sctransform_0.4.1      httpuv_1.6.13          splines_4.3.2         
 [50] igraph_1.6.0           tidyselect_1.2.0       rstudioapi_0.15.0      abind_1.4-5            spatstat.random_3.2-2  codetools_0.2-19       miniUI_0.1.1.1        
 [57] spatstat.explore_3.2-5 listenv_0.9.0          lattice_0.22-5         tibble_3.2.1           plyr_1.8.9             withr_2.5.2            shiny_1.8.0           
 [64] ROCR_1.0-11            Rtsne_0.17             future_1.33.0          fastDummies_1.7.3      survival_3.5-7         polyclip_1.10-6        fitdistrplus_1.1-11   
 [71] pillar_1.9.0           KernSmooth_2.23-22     plotly_4.10.3          generics_0.1.3         RcppHNSW_0.5.0         ggplot2_3.4.4          munsell_0.5.0         
 [78] scales_1.3.0           gtools_3.9.5           globals_0.16.2         xtable_1.8-4           glue_1.6.2             lazyeval_0.2.2         tools_4.3.2           
 [85] data.table_1.14.10     RSpectra_0.16-1        RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1        cowplot_1.1.1          grid_4.3.2            
 [92] tidyr_1.3.0            colorspace_2.1-0       nlme_3.1-163           patchwork_1.1.3        cli_3.6.2              spatstat.sparse_3.0-3  spam_2.10-0           
 [99] fansi_1.0.6            viridisLite_0.4.2      dplyr_1.1.4            uwot_0.1.16            gtable_0.3.4           digest_0.6.33          progressr_0.14.0      
[106] ggrepel_0.9.4          htmlwidgets_1.6.4      htmltools_0.5.7        lifecycle_1.0.4        httr_1.4.7             mime_0.12              bit64_4.0.5           
[113] MASS_7.3-60

Jan 08 '24 09:01 MartaBenegas

I'm having the same issue. Developers, please help!

Mar 01 '24 17:03 SITRR

Hi!

I'm not an expert but I regressed out the cell cycle score difference using the following steps:

Normalize each datasets separately
Join the layers to run CellCycleScoring, because this function doesn't work with split layers
Run scaledata to regress out the cell cycle score and recalculated PCA
Split again the merged object to integrate the datasets

This is my script:

seu <- merge(seurat.list[[1]], seurat.list[2:8], add.cell.ids=samples.newname)
# Normalizing and scaling RNA data
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, selection.method = "vst")
seu <- ScaleData(seu)
seu <- RunPCA(seu)
seu <- RunUMAP(seu, dims=1:50)

# Calculating Cell cycle score
seu <- JoinLayers(seu)
seu <- CellCycleScoring(seu,
                        s.features=s.genes,
                        g2m.features = g2m.genes)

# Regress out cell cycle score
seu$CC.Difference <- seu$S.Score - seu$G2M.Score
seu <- ScaleData(seu, vars.to.regress = "CC.Difference", features=VariableFeatures(seu))
seu <- RunPCA(seu, reduction.name = "pca.ccregress")

# Perform RPCA integration
seu[["RNA"]] <- split(seu[["RNA"]], f=seu$Sample_label)

options(future.globals.maxSize = 8000 * 1024^2)
seu <- IntegrateLayers(object = seu,
                       method = RPCAIntegration,
                       dims=1:50,
                       #assay="RNA"
                       normalization.method = "LogNormalize",
                       orig.reduction = "pca.ccregress",
                       new.reduction = "integrated.rpca",
                        verbose = FALSE)

# re-join layers after integration
seu[["RNA"]] <- JoinLayers(seu[["RNA"]])
seu <- RunUMAP(seu, dims = 1:50, reduction = "integrated.rpca")

If anyone sees something wrong with this process, please tell me!

Mar 08 '24 18:03 Montsecb

Hi @MartaBenegas,

Thanks for your question!

In an effort to more promptly address user issues, we’ve started asking users to direct questions like this to our Discussions board, where community members and developers can provide more targeted assistance.

We’re going to close this issue for now but strongly encourage you to repost your question in the other forum.

We look forward to seeing your post there - thanks for using Seurat!

Jun 24 '24 19:06 Gesmira