transformGamPoi-Paper
transformGamPoi-Paper copied to clipboard
Reproduce downsample benchmark
Hi,
Thank you for the great work. I had an issue when reproducing the results in the downsampling benchmark and any insight would be highly appreciated!
Basically, I am trying to calculate the KNN graph for CPM transformation and compare it with the overlapped graph. Ideally, all edges in the overlapped should be in the CPM KNN graph but I missed 173 edges. I wonder whether I missed any part during the preprocessing or I missed anything else?
Thank you for your time!
dataset <- "mcSCRB"
logp_cpm_fnc <- function(UMI, sf, alpha){
colsums <- MatrixGenerics::colSums2(UMI)
log1p(t(t(UMI) / colsums * 1e6))
}
calculate_knn <- function(count_mtx, pca_dim=50, k_nearest_neighbors=50){
red_dat <- BiocSingular::runPCA(t(count_mtx), rank = pca_dim, get.rotation = FALSE, BSPARAM = BiocSingular::ExactParam())$x
knn_graph <- BiocNeighbors::findAnnoy(red_dat, k = k_nearest_neighbors, warn.ties = FALSE)$index
knn_graph
}
get_neighbor_df <- function(neighbors_mtx){
# Format the N by k matrix into origin, neighbor pairs
num_nodes <- nrow(neighbors_mtx)
# Initialize an empty dataframe for the neighbor pairs
neighbor_df <- data.frame(origin=integer(), neighbor=integer())
# Loop through each node and its neighbors to fill the dataframe
for (i in 1:num_nodes) {
# Get the current node's neighbors
neighbors <- neighbors_mtx[i, ]
# Create a dataframe of pairs for the current node
current_pairs <- data.frame(origin = rep(i, length(neighbors)), neighbor = neighbors)
# Append to the main dataframe
neighbor_df <- rbind(neighbor_df, current_pairs)
}
return(neighbor_df)
}
# -------------------------
# Overlapped KNN
# -------------------------
file_name <- "downsampling_deepseq_overlap.RDS"
full_knn_list <- readRDS(file_name)
df <- full_knn_list[[dataset]]
df %>% group_by(transformation) %>% mutate(n = n()) %>% select(transformation, n) %>% unique()
list_dfs <- split(df, df$transformation)
common_knn <- Reduce(function(x, y) inner_join(x, y, by = c("origin", "neighbor")), list_dfs)
common_knn_df <- common_knn %>% select(origin, neighbor)
dim(common_knn_df)
# [1] 1085 2
# -------------------------
# CPM KNN in
# -------------------------
file_name <- "GSE103568_JM8_UMIcounts.txt.gz"
count_mtx <- read.table(file_name, header = TRUE, sep = "\t")
count_mtx <- as.matrix(count_mtx)
expressed_cells <- matrixStats::colSums2(count_mtx) > 0
expressed_genes <- matrixStats::rowSums2(count_mtx) > 0
dim(count_mtx)
# [1] 22556 249
count_mtx <- count_mtx[expressed_genes, expressed_cells]
dim(count_mtx)
# [1] 22556 249
cpm_mtx <- logp_cpm_fnc(count_mtx, sf, alpha)
cpm_knn_graph <- calculate_knn(cpm_mtx)
cpm_knn_df <- get_neighbor_df(cpm_knn_graph)
dim(cpm_knn_df)
# [1] 12450 2
cpm_overlap_df <- inner_join(common_knn_df, cpm_knn_df, by = c("origin", "neighbor"))
dim(cpm_overlap_df)
# [1] 912 2
# Rows in common_knn_df but not in cpm_knn_df
common_knn_df %>% anti_join(cpm_knn_df, by = c("origin", "neighbor")) %>% dim()
# [1] 173 2
Hi Da,
thanks for reaching out. Unfortunately, I am currently travelling and would need to look in more depth at the problem that you describe. I will be back in the office in 2 weeks and can take a look then (feel free to ping me again, so that I don't forget).
Just to clarify: is that a problem reproducing some of my analysis or are you trying to extend the benchmark and are encountering some inconsistencies?
Best, Constantin
Hi Constantin,
Thanks for getting back to me. Sure, I will pin you in two weeks. I hope you enjoy the travel! : D
For clarification, I am trying to use the grand truth overlapping in downsampling_deepseq_overlap.RDS and the count matrices to test the performance of a few other transformations that are not in the paper. However, the number of cells for smartSeq3_fibroblasts_alt and smartSeq3_hek seems to be unmatched between the count matrices and the KNN labels.
Hi @const-ae, just a kindly reminder to see if you have come back from the travel : D
Hi Da, thank you for your patience. I am back now and looked in detail into your code. Thank you for providing the reproducible example, this made it a lot easier to figure out what was going wrong.
The difference between your condensed version and the code that I ran for the benchmark was that I filtered the count matrices by the presence of counts in the down-sampled matrix (https://github.com/const-ae/transformGamPoi-Paper/blob/50de7ab4566bfd3b539f22e628e9460eda71be85/benchmark/src/downsampling_benchmark/download_deeply_sequenced_datasets.R#L24). If you include the following snippet:
set.seed(1)
colsums <- colSums2(count_mtx)
downsample_proportion <- 5000 / median(colsums)
downsampled_UMI = as.matrix(scuttle::downsampleMatrix(count_mtx, prop = downsample_proportion, bycol = FALSE))
expressed_cells <- matrixStats::colSums2(downsampled_UMI) > 0
expressed_genes <- matrixStats::rowSums2(downsampled_UMI) > 0
count_mtx <- count_mtx[expressed_genes, expressed_cells]
You will see that the matrix only has 16864 genes. And if you then transform this smaller matrix and search for the kNNs you find that they agree with data from the downsampling_deepseq_overlap.RDS file.
Please, let me know if you are able to reproduce the results.
As you clearly have invested quite a bit of time going through my code, I would be curious to hear for what purpose you are trying to reproduce the exact benchmark results. If you want, we can schedule a brief Zoom call to discuss and see if I can help you with anything else?
Best, Constantin