Integration + data correction
Hi Seurat Team!
I know that this topic has been addressed in previous issues (#7976 , #5879 ) but I'm still confused about it.
It has been discussed that the cell cycle regression (or other types of correction) should be done after the integration. However, I don't know if this correction will impact the following clustering. Maybe I'm losing something, but I think that the ScaleData() function doesn't modify the assay with the integrated values, which are the ones used for clustering, right?
The code I'm using is as follows (please correct me if it's wrong). It is based on the integration vignette:
# Create object and split
s.object <- CreateSeuratObject(count.table)
s.object[["RNA"]] <- split(s.object[["RNA"]], f = s.object[[integrationFactor]][,integrationFactor])
# Preprocessing
s.object <- NormalizeData(s.object)
s.object <- FindVariableFeatures(s.object)
s.object <- ScaleData(s.object)
s.object <- RunPCA(s.object)
# Integration
s.object <- IntegrateLayers(object = s.object, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
s.object[["RNA"]] <- JoinLayers(s.object[["RNA"]])
# Clustering
s.object <- FindNeighbors(s.object, reduction = "integrated.cca", dims = 1:30)
s.object <- FindClusters(s.object, resolution = 1)
s.object <- RunUMAP(s.object, dims = 1:30, reduction = "pca")
# Cell-cycle score and regression
s.object.corrected <- CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes)
s.object.corrected <- ScaleData(s.object.corrected, vars.to.regress = c("S.Score", "G2M.Score"))
# Clustering with the corrected values
s.object.corrected <- FindNeighbors(s.object.corrected, reduction = "integrated.cca", dims = 1:30)
s.object.corrected <- FindClusters(s.object.corrected, resolution = 1)
s.object.corrected <- RunUMAP(s.object.corrected, dims = 1:30, reduction = "pca")
Thus, the scale.data slots are indeed different:
> s.object@assays$RNA$scale.data[1:3,1:3]
dog10-AAACCTGCAGCTCGAC dog10-AAACCTGGTGACCAAG dog10-AAACCTGTCTCGGACG
ENSCAFG00845016176 -0.03958831 -0.03958831 -0.03958831
ENSCAFG00845016988 -0.08154677 -0.08154677 -0.08154677
ENSCAFG00845017128 -0.04624699 -0.04624699 -0.04624699
> s.object.corrected@assays$RNA$scale.data[1:3,1:3]
dog10-AAACCTGCAGCTCGAC dog10-AAACCTGGTGACCAAG dog10-AAACCTGTCTCGGACG
ENSCAFG00845016176 -0.11109859 -0.6112979 0.01813269
ENSCAFG00845016988 -0.02641256 -0.6635741 -0.04989789
ENSCAFG00845017128 -0.12496140 -0.5977094 0.01136535
But since the "integrated.cca" reduction is used for clustering and it remains unchanged, the two clustering are the same:
> DimPlot(s.object)
> DimPlot(s.object.corrected)
> table(s.object$seurat_clusters)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
734 530 480 468 450 439 100 82 69 62 62 44 39 33 27
> table(s.object.corrected$seurat_clusters)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
734 530 480 468 450 439 100 82 69 62 62 44 39 33 27
However, I don't know how to proceed to do the regression before the integration.
If I try to perform the CellCycleScoring() before splitting the object:
s.object <- CreateSeuratObject(count.table)
s.object <- CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'rowMeans': subscript out of bounds
9. h(simpleError(msg, call))
8. .handleSimpleError(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)), "subscript out of bounds",
base::quote(.subscript.2ary(x, i, , drop = TRUE)))
7. stop("subscript out of bounds")
6. .subscript.2ary(x, i, , drop = TRUE)
5. assay.data[pool, ]
4. assay.data[pool, ]
3. Matrix::rowMeans(x = assay.data[pool, ])
2. AddModuleScore(object = object, features = features, name = name,
ctrl = ctrl, ...)
1. CellCycleScoring(s.object, s.features = s.genes, g2m.features = g2m.genes)
If I try to compute the cell cycle scores after normalization, an error appears saying that it can be computed on split objects:
s.object[["RNA"]] <- split(s.object[["RNA"]], f = s.object[[integrationFactor]][,integrationFactor])
s.object.corrected <- NormalizeData(s.object)
s.object.corrected <- FindVariableFeatures(s.object.corrected)
s.object.corrected <- CellCycleScoring(s.object.corrected, s.features = s.genes, g2m.features = g2m.genes)
s.object.corrected <- ScaleData(s.object.corrected, vars.to.regress = c("S.Score", "G2M.Score"))
s.object.corrected <- RunPCA(s.object.corrected)
> s.object.corrected <- CellCycleScoring(s.object.corrected, s.features = s.genes, g2m.features = g2m.genes)
Error in `GetAssayData()`:
! GetAssayData doesn't work for multiple layers in v5 assay.
Backtrace:
1. Seurat::CellCycleScoring(...)
2. Seurat::AddModuleScore(...)
4. SeuratObject:::GetAssayData.Seurat(...)
6. SeuratObject:::GetAssayData.StdAssay(object = object[[assay]], layer = layer)
How should I proceed then? For me, the idea of correcting for the cell cycle is that it doesn't influence the clustering.
Thank you in advance!
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Madrid
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdata_3.0.0 rhdf5_2.44.0 Matrix_1.6-4 hdf5r_1.3.8 SeuratDisk_0.0.0.9021 Seurat_5.0.1 SeuratObject_5.0.1
[8] sp_2.1-2
loaded via a namespace (and not attached):
[1] deldir_2.0-2 pbapply_1.7-2 gridExtra_2.3 rlang_1.1.2 magrittr_2.0.3 RcppAnnoy_0.0.21 spatstat.geom_3.2-7
[8] matrixStats_1.2.0 ggridges_0.5.4 compiler_4.3.2 png_0.1-8 vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
[15] crayon_1.5.2 pkgconfig_2.0.3 fastmap_1.1.1 ellipsis_0.3.2 utf8_1.2.4 promises_1.2.1 bit_4.0.5
[22] purrr_1.0.2 xfun_0.41 jsonlite_1.8.8 goftest_1.2-3 later_1.3.2 rhdf5filters_1.12.1 Rhdf5lib_1.22.1
[29] spatstat.utils_3.0-4 irlba_2.3.5.1 parallel_4.3.2 cluster_2.1.5 R6_2.5.1 ica_1.0-3 stringi_1.8.3
[36] RColorBrewer_1.1-3 spatstat.data_3.0-3 reticulate_1.34.0 parallelly_1.36.0 lmtest_0.9-40 scattermore_1.2 Rcpp_1.0.11
[43] knitr_1.45 tensor_1.5 future.apply_1.11.0 zoo_1.8-12 sctransform_0.4.1 httpuv_1.6.13 splines_4.3.2
[50] igraph_1.6.0 tidyselect_1.2.0 rstudioapi_0.15.0 abind_1.4-5 spatstat.random_3.2-2 codetools_0.2-19 miniUI_0.1.1.1
[57] spatstat.explore_3.2-5 listenv_0.9.0 lattice_0.22-5 tibble_3.2.1 plyr_1.8.9 withr_2.5.2 shiny_1.8.0
[64] ROCR_1.0-11 Rtsne_0.17 future_1.33.0 fastDummies_1.7.3 survival_3.5-7 polyclip_1.10-6 fitdistrplus_1.1-11
[71] pillar_1.9.0 KernSmooth_2.23-22 plotly_4.10.3 generics_0.1.3 RcppHNSW_0.5.0 ggplot2_3.4.4 munsell_0.5.0
[78] scales_1.3.0 gtools_3.9.5 globals_0.16.2 xtable_1.8-4 glue_1.6.2 lazyeval_0.2.2 tools_4.3.2
[85] data.table_1.14.10 RSpectra_0.16-1 RANN_2.6.1 leiden_0.4.3.1 dotCall64_1.1-1 cowplot_1.1.1 grid_4.3.2
[92] tidyr_1.3.0 colorspace_2.1-0 nlme_3.1-163 patchwork_1.1.3 cli_3.6.2 spatstat.sparse_3.0-3 spam_2.10-0
[99] fansi_1.0.6 viridisLite_0.4.2 dplyr_1.1.4 uwot_0.1.16 gtable_0.3.4 digest_0.6.33 progressr_0.14.0
[106] ggrepel_0.9.4 htmlwidgets_1.6.4 htmltools_0.5.7 lifecycle_1.0.4 httr_1.4.7 mime_0.12 bit64_4.0.5
[113] MASS_7.3-60
I'm having the same issue. Developers, please help!
Hi!
I'm not an expert but I regressed out the cell cycle score difference using the following steps:
- Normalize each datasets separately
- Join the layers to run CellCycleScoring, because this function doesn't work with split layers
- Run scaledata to regress out the cell cycle score and recalculated PCA
- Split again the merged object to integrate the datasets
This is my script:
seu <- merge(seurat.list[[1]], seurat.list[2:8], add.cell.ids=samples.newname)
# Normalizing and scaling RNA data
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, selection.method = "vst")
seu <- ScaleData(seu)
seu <- RunPCA(seu)
seu <- RunUMAP(seu, dims=1:50)
# Calculating Cell cycle score
seu <- JoinLayers(seu)
seu <- CellCycleScoring(seu,
s.features=s.genes,
g2m.features = g2m.genes)
# Regress out cell cycle score
seu$CC.Difference <- seu$S.Score - seu$G2M.Score
seu <- ScaleData(seu, vars.to.regress = "CC.Difference", features=VariableFeatures(seu))
seu <- RunPCA(seu, reduction.name = "pca.ccregress")
# Perform RPCA integration
seu[["RNA"]] <- split(seu[["RNA"]], f=seu$Sample_label)
options(future.globals.maxSize = 8000 * 1024^2)
seu <- IntegrateLayers(object = seu,
method = RPCAIntegration,
dims=1:50,
#assay="RNA"
normalization.method = "LogNormalize",
orig.reduction = "pca.ccregress",
new.reduction = "integrated.rpca",
verbose = FALSE)
# re-join layers after integration
seu[["RNA"]] <- JoinLayers(seu[["RNA"]])
seu <- RunUMAP(seu, dims = 1:50, reduction = "integrated.rpca")
If anyone sees something wrong with this process, please tell me!
Hi @MartaBenegas,
Thanks for your question!
In an effort to more promptly address user issues, we’ve started asking users to direct questions like this to our Discussions board, where community members and developers can provide more targeted assistance.
We’re going to close this issue for now but strongly encourage you to repost your question in the other forum.
We look forward to seeing your post there - thanks for using Seurat!