CyTOF data preprocessing error
Thank you for the nice work flow published on F1000: https://f1000research.com/articles/9-1263/v1
Question 1:
I am able to successfully run the R code upto the code below:
# split SCE by sample
fs <- sce2fcs(sub,
assay = "compexprs",
split_by = "bc_id",
keep_cd = TRUE)
The code above give me an error: Error: Argument 'exprs' must be numeric matrix with colnames attribute set
I checked sub object.
> colnames(sce@assays@data$counts)
NULL
The other two judgement were all TRUE.
What's your suggestion to
Question 2:
My understanding is the values in compexprs is compensated and normalized expr and if I want to do clustering and tSNE/UMAP, I should use the values compexprs after gating. Please let me know whether my understanding is correct? Thanks.
Here is my session info:
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.0.1 reshape2_1.4.4 ggcyto_1.16.0 ncdfFlow_2.34.0
[5] BH_1.72.0-3 RcppArmadillo_0.10.1.0.0 ggplot2_3.3.2 openCyto_2.0.0
[9] mvtnorm_1.1-1 flowWorkspace_4.0.6 flowCore_2.0.0 dplyr_1.0.2
[13] CATALYST_1.12.2 SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1 DelayedArray_0.14.0
[17] matrixStats_0.57.0 Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
[21] IRanges_2.22.1 S4Vectors_0.26.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] readxl_1.3.1 circlize_0.4.10 drc_3.0-1 plyr_1.8.6
[5] igraph_1.2.6 ConsensusClusterPlus_1.52.0 splines_4.0.3 BiocParallel_1.22.0
[9] fda_5.1.5.1 scater_1.16.0 TH.data_1.0-10 digest_0.6.26
[13] viridis_0.5.1 magrittr_1.5 CytoML_2.0.0 cluster_2.1.0
[17] ks_1.11.7 openxlsx_4.2.2 ComplexHeatmap_2.4.2 RcppParallel_5.0.2
[21] R.utils_2.10.1 sandwich_3.0-0 cytolib_2.0.2 jpeg_0.1-8.1
[25] colorspace_1.4-1 rrcov_1.5-5 ggrepel_0.8.2 haven_2.3.1
[29] xfun_0.18 crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.1
[33] hexbin_1.28.1 graph_1.66.0 survival_3.2-7 zoo_1.8-8
[37] glue_1.4.2 flowClust_3.26.0 gtable_0.3.0 nnls_1.4
[41] zlibbioc_1.34.0 XVector_0.28.0 GetoptLong_1.0.4 car_3.0-10
[45] BiocSingular_1.4.0 IDPmisc_1.1.20 Rgraphviz_2.32.0 shape_1.4.5
[49] DEoptimR_1.0-8 abind_1.4-5 scales_1.1.1 Rcpp_1.0.4.6
[53] plotrix_3.7-8 viridisLite_0.3.0 clue_0.3-57 foreign_0.8-80
[57] rsvd_1.0.3 mclust_5.4.6 FlowSOM_1.20.0 tsne_0.1-3
[61] RColorBrewer_1.1-2 ellipsis_0.3.1 farver_2.0.3 pkgconfig_2.0.3
[65] XML_3.99-0.3 R.methodsS3_1.8.1 flowViz_1.52.0 labeling_0.4.2
[69] flowStats_4.0.0 tidyselect_1.1.0 rlang_0.4.7 munsell_0.5.0
[73] cellranger_1.1.0 tools_4.0.3 generics_0.0.2 ggridges_0.5.2
[77] stringr_1.4.0 yaml_2.2.1 zip_2.1.1 robustbase_0.93-6
[81] purrr_0.3.4 RBGL_1.64.0 R.oo_1.24.0 compiler_4.0.3
[85] rstudioapi_0.11 beeswarm_0.2.3 curl_4.3 png_0.1-7
[89] tibble_3.0.4 pcaPP_1.9-73 stringi_1.5.3 forcats_0.5.0
[93] lattice_0.20-41 Matrix_1.2-18 vctrs_0.3.4 pillar_1.4.6
[97] lifecycle_0.2.0 BiocManager_1.30.10 GlobalOptions_0.1.2 BiocNeighbors_1.6.0
[101] data.table_1.13.0 cowplot_1.1.0 bitops_1.0-6 irlba_2.3.3
[105] corpcor_1.6.9 R6_2.4.1 latticeExtra_0.6-29 KernSmooth_2.23-17
[109] gridExtra_2.3 RProtoBufLib_2.0.0 rio_0.5.16 vipor_0.4.5
[113] codetools_0.2-17 MASS_7.3-53 gtools_3.8.2 rjson_0.2.20
[117] withr_2.3.0 mnormt_1.5-7 multcomp_1.4-14 GenomeInfoDbData_1.2.3
[121] hms_0.5.3 grid_4.0.3 DelayedMatrixStats_1.10.0 carData_3.0-4
[125] Rtsne_0.15 base64enc_0.1-3 ellipse_0.4.2 tinytex_0.26
[129] ggbeeswarm_0.6.0
Thank you for the kind words!
-
You said you checked the
subobject, but your line above hasscein it? Could you maybe print boutsceandsubhere so I can have a look what these looks like? -
Yes,
compexprscontain compensated normalized expression values - assuming that you rancompCytof(..., assay = "normcounts")(i.e., running the compensation on the normalized data, not the defaultassay = "counts").
Thank you for your response. This is the print output:
> print(sce)
class: SingleCellExperiment
dim: 63 337525
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(63): 75As CD15 ... 208Pb CD45
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(6): file_id remove ... delta mhl_dist
reducedDimNames(0):
altExpNames(0):
> print(sub)
class: SingleCellExperiment
dim: 3 337525
metadata(4): experiment_info bc_key sep_cutoffs mhl_cutoff
assays(7): counts exprs ... compcounts compexprs
rownames(3): DNA1 DNA2 dead
rowData names(6): channel_name marker_name ... bead_ch is_bc
colnames: NULL
colData names(2): bc_id i
reducedDimNames(0):
altExpNames(0):
I also pasted here my code. Basically the code is from your paper:
#' ## Load library
#'
library(CATALYST)
library(dplyr)
library(flowCore)
library(flowWorkspace)
library(mvtnorm)
library(openCyto)
library(ggcyto)
library(ggplot2)
qc_theme <- list(
theme_bw(base_size = 8), theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.title = element_text(face = "bold"),
axis.text = element_text(color = "black"),
axis.text.x = element_text(angle = 45, hjust = 1)))
#' ## Read `FCS` files
#'
#' A SCE can be constructed using CATALYST’s `prepData()` function, which accepts
#' a path to a directory with one or many FCS files, a character vector of FCS
#' filenames, a single or list of flowFrame(s), or a
#' https://bioconductor.org/packages/3.11/bioc/html/flowCore.html (flowCore package24).
#' By default (transform = TRUE), an arcsinh-transformation with cofactor = 5 is applied to
#' the input (count) data, and the resulting expression matrix is stored in the exprs assay
#' slot of the output SCE:
# construct ’SingleCellExperiment’
data_path = "/depot/bioinf/data/Projects/CyTOF/test_data/"
setwd("/depot/bioinf/data/Projects/CyTOF/test_data/")
fcs <- list.files(data_path, "acquisition", full.names = TRUE)
fcs
(sce <- prepData(fcs, transform = TRUE, cofactor = 5))
## The cofactor used for transformation
(int_metadata(sce)$cofactor)
## dimmencial of counts
class(sce@assays@data$counts)
dim(sce@assays@data$counts)
## cofactor-5 arcsinh-transformed counts
class(sce@assays@data$exprs)
dim(sce@assays@data$exprs)
## How many cells in each file
head(colData(sce))
table(colData(sce))
#cell_num <- data.frame(
# file_id = levels(sce$file_id),
# n_cells = tabulate(sce$file_id))
#cell_num
## we rename the cell metadata variable sample_id to file_id to avoid ambiguity:
i <- match("sample_id", names(colData(sce)))
names(colData(sce))[i] <- "file_id"
# store character vector or channels names
chs <- channels(sce)
# store DNA & live channel indices
dna <- grep("^Ir", chs)
live <- grep("191|194", chs)
#' ## Filtering for stable signal
# plot channels of interest vs. time
coi <- chs[c(dna[1], which(rowData(sce)$use_channel))]
coi
plotScatter(sce, chs = c("Time", coi), label = "both") +
labs(y = "expression") +
scale_x_continuous(
expression("Time ("*10^6~"ms)"),
labels = function(u) u/1e6) +
theme_bw(base_size = 8) + theme(
aspect.ratio = 2/3,
panel.grid = element_blank(),
axis.text = element_text(color = "black"),
strip.background = element_rect(fill = NA))
#' ## Data filtering
## Let's assume we want to remove 0 to 1 10^6 ms
# construct ’GatingSet’
ff <- sce2fcs(sce[dna, ], assay = "exprs")
gs <- GatingSet(flowSet(ff))
# apply rectangular gate to exclude unstable signal
min_t <- 0
max_t <- 1000000
gs_add_gating_method(
gs, alias = "stable",
pop = "-", parent = "root",
dims = paste0("Time,", chs[dna[1]]),
gating_method = "boundary",
gating_args = sprintf("min=c(%s,0),max=c(%s,10)", min_t, max_t))
# plot scatter of DNA vs. Time
ggcyto(gs,
aes_string("Time", chs[dna[1]])) +
geom_hex(bins = 128) +
geom_gate("stable") +
facet_null() + theme_bw() +
ggtitle(NULL) + theme(
legend.position = "none",
panel.grid = element_blank())
# subset selected events
sce_fil <- sce[, gh_pop_get_indices(gs, "stable")]
plotScatter(sce_fil, chs = c("Time", coi), label = "both") +
labs(y = "expression") +
scale_x_continuous(
expression("Time ("*10^6~"ms)"),
labels = function(u) u/1e6) +
theme_bw(base_size = 8) + theme(
aspect.ratio = 2/3,
panel.grid = element_blank(),
axis.text = element_text(color = "black"),
strip.background = element_rect(fill = NA))
#' ## Normalization
# specify path to reference beads
ref_beads <- file.path(data_path, "normalization_beads.fcs")
# apply bead-based normalization
system.time(res <- normCytof(sce,
beads = "dvs", norm_to = ref_beads,
remove_beads = TRUE, overwrite = FALSE))
#' When remove_beads = TRUE (the default), normCytof() will return a list of three SCEs containing filtered,
#' `bead` and `remove` events, respectively, as well as two ggplot objects:
names(res)
table(res$data$is_bead)
table(res$data$remove)
table(res$data$file_id)
# view no. of remaining, bead & removed events
ns <- sapply(res[1:3], ncol)
ps <- sprintf("%1.2f", ns/ncol(sce)*100)
stat_cell_num <- data.frame(t(cbind("# events" = ns, "% of total" = ps)))
stat_cell_num
#' As a first quality control plot, `res$scatter` renders scatter plots of bead channels (x-axis)
#' versus DNA (y-axis), where events identified as beads as well as their expression range are
#' highlighted in color; bead events should have low DNA intensity (since they are not cells) and
#' high intensities across all bead channels.
res$scatter
res$lines
# compute mean bead channel counts for current run
(rowData(res$beads))
is_bead <- rowData(res$beads)$bead_ch # get bead channels
bead_cs <- counts(res$beads)[is_bead, ] # subset counts
rownames(bead_cs) <- chs[is_bead] # use channels as names
(bead_ms <- rowMeans(bead_cs)) # compute means
# read in reference mean bead channel counts
ref <- read.csv(file.path(data_path, "ref_bead_counts.csv"))
# join into single tidy data.frame
df <- bind_rows(ref, bead_ms, .id = "group")
library(reshape2)
df <- melt(df, id.var = "group")
# boxplot of reference vs. current run’s mean bead channel counts
ggplot(df, aes(variable, value)) +
geom_boxplot(data = df[df$group == 1, ]) +
geom_point(data = df[df$group == 2, ],
col = "red", pch = 4, stroke = 1) +
labs(x = "bead channel", y = "mean count") +
qc_theme + ggtitle(
"QC on bead channel counts",
"[-] = reference | x = current run")
#' After normalization, we overwrite the input dataset with the filtered subset
#' that no longer includes bead events, or bead-bead and bead-cell doublets:
sce <- res$data
#' ## Debarcoding
# read in debarcoding scheme
fn <- file.path(data_path, "debarcoding_scheme.csv")
bc_key <- read.csv(fn, row.names = 1, check.names = FALSE)
# all barcodes are positive for exactly 3 barcoding channels
all(rowSums(bc_key) == 3)
# remove empty barcodes from debarcoding scheme
is_empty <- grepl("empty", rownames(bc_key))
bc_key <- bc_key[!is_empty, ]
bc_ids <- rownames(bc_key)
# do preliminary barcode assignments
system.time(sce <- assignPrelim(sce, bc_key, assay = "normexprs"))
# view barcode channels
channels(sce)[rowData(sce)$is_bc]
table(rowData(sce)$is_bc)
# tabulate number of (un)assigned events
tab_bc_id = table(sce$bc_id == 0)
tab_bc_id[2]/(tab_bc_id[1]+tab_bc_id[2])
#' ## Estimation of separation cutoffs
plotYields(sce, which = "0")
sce <- estCutoffs(sce)
metadata(sce)$sep_cutoffs
plotYields(sce, which = "PBMC_R1")
# store preliminary barcode IDs
bc_ids0 <- sce$bc_id
# apply global & sample-specific separation cutoff(s)
sce_glob <- applyCutoffs(sce, sep_cutoffs = 0.15, mhl_cutoff = 30)
sce_spec <- applyCutoffs(sce, mhl_cutoff = 30)
# compare cell yields for both cutoff strategies
c(global = mean(sce_glob$bc_id == 0),
specific = mean(sce_spec$bc_id == 0))
# proceed with sample-specific filtering
sce <- sce_spec
# compute number of events per population
# before vs. after applying separation cutoffs
barplot(rbind(table(bc_ids0), table(sce$bc_id)),
beside = TRUE, ylab = "cell count",
las = 2, cex.axis = 0.5, cex.names = 0.5)
legend("topright", fill = c("black", "grey"),
legend = c("before filtering", "after filtering"))
# event plots for unassigned events
# & barcode population D1
plotEvents(sce, which = c(0, "Tumor_S1", "Tumor_S2"), n = 25)
# plotMahal(sce, which = "0")
#' ## Compensation
# read in pre-computed spillover matrix
sm <- file.path(data_path, "spillover_matrix.csv")
sm <- read.csv(sm, row.names = 1)
# apply NNLS compensation
system.time(
sce <- compCytof(sce, sm, method = "nnls",
assay = "normcounts", overwrite = FALSE))
#' To visually inspect how compensation affects signal intensities, we can generate scatter plots
#' pre- and post-compensation; an exemplary pair of channels is shown in the figure. In such a plot,
#' we can observe a slight positive association between the signals of spill-affected channels,
#' which should be removed upon compensation.
i <- grep("173|174", chs, value = TRUE)
p1 <- plotScatter(sce,
chs = i,
label = "channel",
assay = "normexprs") +
ggtitle("Uncompensated")
p2 <- plotScatter(sce,
chs = i,
label = "channel",
assay = "compexprs") +
ggtitle("Compensated") +
ylab(NULL)
library(patchwork)
wrap_plots(p1, p2)
#' ## Gating
#'
.scatter <- function(gs, chs, gate_id = NULL,
subset = ifelse(is.null(gate_id), "root", "_parent_")) {
p <- ggcyto(gs, max_nrow_to_plot = 1e5,
aes_string(chs[1], chs[2]), subset) +
geom_hex(bins = 100) + facet_wrap(~ name, ncol = 5) +
(if (is.null(gate_id)) list() else geom_gate(gate_id)) +
ggtitle(NULL) + theme_bw(base_size = 8) + theme(
aspect.ratio = 1,
legend.position = "none",
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = NA),
axis.text = element_text(color = "black"),
axis.text.x = element_text(angle = 45, hjust = 1))
suppressMessages(p + coord_equal(expand = FALSE,
xlim = c(-1, 11), ylim = c(-1, 11)))
}
#' ### Gating on cells
# subset DNA & live channels
sub <- sce[union(dna, live), ]
# add metadata variable ’i’ to track cell indices
colData(sub) <- DataFrame(
bc_id = sub$bc_id,
i = seq_len(ncol(sce)))
## split SCE by sample
fs <- sce2fcs(sub,
assay = "compexprs",
split_by = "bc_id",
keep_cd = TRUE)
closing due to inactivity ... happy to get back to it if still relevant! ... fwiw, try to select relevant code chunks/outputs for any software issues in general - the above makes it very hard for developers to pinpoint the issue & help; thanks!