transformGamPoi-Paper Reproduce downsample benchmark

Hi,

Thank you for the great work. I had an issue when reproducing the results in the downsampling benchmark and any insight would be highly appreciated!

Basically, I am trying to calculate the KNN graph for CPM transformation and compare it with the overlapped graph. Ideally, all edges in the overlapped should be in the CPM KNN graph but I missed 173 edges. I wonder whether I missed any part during the preprocessing or I missed anything else?

Thank you for your time!

dataset <- "mcSCRB" 

logp_cpm_fnc  <- function(UMI, sf, alpha){
  colsums <- MatrixGenerics::colSums2(UMI)
  log1p(t(t(UMI) / colsums * 1e6))
}

calculate_knn <- function(count_mtx, pca_dim=50, k_nearest_neighbors=50){
  red_dat <- BiocSingular::runPCA(t(count_mtx), rank = pca_dim, get.rotation = FALSE, BSPARAM = BiocSingular::ExactParam())$x
  knn_graph <- BiocNeighbors::findAnnoy(red_dat, k = k_nearest_neighbors, warn.ties = FALSE)$index
  knn_graph
}

get_neighbor_df <- function(neighbors_mtx){
  # Format the N by k matrix into origin, neighbor pairs
  num_nodes <- nrow(neighbors_mtx)
  # Initialize an empty dataframe for the neighbor pairs
  neighbor_df <- data.frame(origin=integer(), neighbor=integer())
  # Loop through each node and its neighbors to fill the dataframe
  for (i in 1:num_nodes) {
    # Get the current node's neighbors
    neighbors <- neighbors_mtx[i, ]
    
    # Create a dataframe of pairs for the current node
    current_pairs <- data.frame(origin = rep(i, length(neighbors)), neighbor = neighbors)
    
    # Append to the main dataframe
    neighbor_df <- rbind(neighbor_df, current_pairs)
  }
  return(neighbor_df)
}

# -------------------------
# Overlapped KNN
# -------------------------
file_name <- "downsampling_deepseq_overlap.RDS"
full_knn_list <- readRDS(file_name)

df <- full_knn_list[[dataset]]
df %>% group_by(transformation) %>% mutate(n = n()) %>% select(transformation, n) %>% unique()
list_dfs <- split(df, df$transformation)
common_knn <- Reduce(function(x, y) inner_join(x, y, by = c("origin", "neighbor")), list_dfs)
common_knn_df <- common_knn %>% select(origin, neighbor)
dim(common_knn_df)
# [1] 1085    2

# -------------------------
# CPM KNN in 
# -------------------------
file_name <- "GSE103568_JM8_UMIcounts.txt.gz"
count_mtx <- read.table(file_name, header = TRUE, sep = "\t")
count_mtx <- as.matrix(count_mtx)
expressed_cells <- matrixStats::colSums2(count_mtx) > 0
expressed_genes <- matrixStats::rowSums2(count_mtx) > 0
dim(count_mtx)
# [1] 22556   249
count_mtx <- count_mtx[expressed_genes, expressed_cells]
dim(count_mtx)
# [1] 22556   249
cpm_mtx <- logp_cpm_fnc(count_mtx, sf, alpha)
cpm_knn_graph <- calculate_knn(cpm_mtx)
cpm_knn_df <- get_neighbor_df(cpm_knn_graph)
dim(cpm_knn_df)
# [1] 12450     2
cpm_overlap_df <- inner_join(common_knn_df, cpm_knn_df, by = c("origin", "neighbor"))
dim(cpm_overlap_df)
# [1] 912   2

# Rows in common_knn_df but not in cpm_knn_df
common_knn_df %>% anti_join(cpm_knn_df, by = c("origin", "neighbor")) %>% dim()
# [1] 173 2

Feb 02 '24 19:02 kuang-da

Hi Da,

thanks for reaching out. Unfortunately, I am currently travelling and would need to look in more depth at the problem that you describe. I will be back in the office in 2 weeks and can take a look then (feel free to ping me again, so that I don't forget).

Just to clarify: is that a problem reproducing some of my analysis or are you trying to extend the benchmark and are encountering some inconsistencies?

Best, Constantin

Feb 05 '24 08:02 const-ae

Hi Constantin,

Thanks for getting back to me. Sure, I will pin you in two weeks. I hope you enjoy the travel! : D

For clarification, I am trying to use the grand truth overlapping in downsampling_deepseq_overlap.RDS and the count matrices to test the performance of a few other transformations that are not in the paper. However, the number of cells for smartSeq3_fibroblasts_alt and smartSeq3_hek seems to be unmatched between the count matrices and the KNN labels.

Feb 08 '24 05:02 kuang-da

Hi @const-ae, just a kindly reminder to see if you have come back from the travel : D

Feb 28 '24 22:02 kuang-da

Hi Da, thank you for your patience. I am back now and looked in detail into your code. Thank you for providing the reproducible example, this made it a lot easier to figure out what was going wrong.

The difference between your condensed version and the code that I ran for the benchmark was that I filtered the count matrices by the presence of counts in the down-sampled matrix (https://github.com/const-ae/transformGamPoi-Paper/blob/50de7ab4566bfd3b539f22e628e9460eda71be85/benchmark/src/downsampling_benchmark/download_deeply_sequenced_datasets.R#L24). If you include the following snippet:

set.seed(1)
colsums <- colSums2(count_mtx)
downsample_proportion <- 5000 / median(colsums)
downsampled_UMI = as.matrix(scuttle::downsampleMatrix(count_mtx, prop = downsample_proportion, bycol = FALSE))
expressed_cells <- matrixStats::colSums2(downsampled_UMI) > 0
expressed_genes <- matrixStats::rowSums2(downsampled_UMI) > 0
count_mtx <- count_mtx[expressed_genes, expressed_cells]

You will see that the matrix only has 16864 genes. And if you then transform this smaller matrix and search for the kNNs you find that they agree with data from the downsampling_deepseq_overlap.RDS file.

Please, let me know if you are able to reproduce the results.

As you clearly have invested quite a bit of time going through my code, I would be curious to hear for what purpose you are trying to reproduce the exact benchmark results. If you want, we can schedule a brief Zoom call to discuss and see if I can help you with anything else?

Best, Constantin

Mar 05 '24 15:03 const-ae

transformGamPoi-Paper transformGamPoi-Paper copied to clipboard

Reproduce downsample benchmark

transformGamPoi-Paper
transformGamPoi-Paper copied to clipboard