transformGamPoi-Paper [Question] How to benchmark another transformation

Hi @const-ae,

First up, love the paper!

I was wondering what is the best way to run your benchmarking on a single other transformation? I've implemented an empirical bayesian approach and want to see how it does if possible (https://github.com/mjcasy/scEC/blob/master/R/Normalisation.R).

Thanks

Jul 19 '22 15:07 mjcasy

Hi mjcasy,

I was wondering what is the best way to run your benchmarking on a single other transformation?

The general steps are to

Clone the benchmark folder.

Implement a function that calls your scEC package in transformation_helper.R and add it to the all_transformations list at the end of the script

scEC_fnc <- function(UMI, sf, alpha){
  scEC:::GetFreq(UMI)
}
all_transformations <- list(<...>, newwave = newwave_fnc, scEC = scEC_fnc)

Install the R dependencies by running Rscript -e "renv::restore()"

The next step depends if you have access to a cluster that runs slurm.

If yes:

Modify job_overview.yaml tos very slow transformation approaches.
Set .OUTPUT_FOLDER in run_benchmarks.R to an appropriate location
Run Rscript src/run_benchmarks.R to generate the jobs
Run Rscript src/collect_results.R to collect the results after all jobs have successfully finished

Alternatively, you can call the individual scripts to generate/download data, transform the data and evaluate the results manually. Let's say you want to check the consistency of your method:

Prepare a results folder: e.g. mkdir -p /tmp/transformation_benchmark/{results,duration,data}
Modify the data_folder in consistency_benchmark/download_helper. E.g., data_folder <- "/tmp/transformation_benchmark/data"

Download the dataset

Rscript src/consistency_benchmark/download_10X_datasets.R  \
         --dataset GSE142647 \
         --working_dir /tmp/transformation_benchmark \
         --result_id GSE142647.Rds

Calculate both the k-NN graph for both gene halves

Rscript src/consistency_benchmark/transform_consistency_data.R \
    --transformation scEC \
    --input_id GSE142647.Rds \
    --knn 50 \
    --pca_dim 20 \
    --alpha TRUE \
    --seed 1 \
    --working_dir /tmp/transformation_benchmark \
    --result_id GSE142647_knns.Rds

Calculate the overlap of the two k-NN graphs

Rscript src/consistency_benchmark/calculate_10X_consistency.R \
    --transformation scEC \
    --input_id GSE142647_knns.Rds \
    --dataset GSE142647 \
    --knn 50 \
    --pca_dim 20 \
    --alpha TRUE \
    --seed 1 \
    --working_dir /tmp/transformation_benchmark \
    --result_id GSE142647_consistency.tsv

You can now easily take these steps and wrap them in a for loop over seeds or datasets.

Jul 19 '22 16:07 const-ae

Let me know if you run into any trouble going through the steps. It is always an exciting moment when someone else is running one's code for the first time.

Jul 19 '22 16:07 const-ae

Thanks for the detailed response! Unfortunately the number of jobs launched hit up against admin limits on my institutions server, so will have to run without slurm. Any advice on what bits to run to get a slimmer result?

Jul 21 '22 13:07 mjcasy

Oh, I ran into the same issue with the job limit. I submitted the jobs in batches of a few thousand. In addition, I usually changed the job priority to 'low' in the submission steps of run_benchmarks.R, as our clustered allowed several times more low priority jobs to be submitted.

If you want to reduce the number jobs, modify the job_overview.yaml. By default, each parameter combination mentioned in 'input_data' and 'knn_construction' is considered. So you can for example cut the number of consistency jobs by a factor of 10 if you remove half the datasets and only consider one seed (i.e., replicate). Or you focus only on the parameter combinations discussed in the manuscript, which is also only a fraction of the total 58,000 data points.

Send from my phone, so please excuse any typos.

Jul 21 '22 20:07 const-ae

Hi, thanks for your great work! I wonder if it is possible for me to only run certain methods (for example, GLM based model) on certain datasets. Thanks a lot.

Moreover, I also have a quick question about the benchmarking result:

Did you consider about problems related to cell type disappearing? For example, this dataset which is shown in GLM-PCA. It seems that changing the number of neighbors as a hyper parameter cannot resolve this problem.

Apr 16 '23 16:04 HelloWorldLTY

I wonder if it is possible for me to only run certain methods (for example, GLM based model) on certain datasets. Thanks a lot.

Yes! Take a look at the job_overview.yaml file. It declares which combination of dataset and transformation are benchmarked. You could for example remove all non-GLM-based models. The actual parsing of the file happens in run_benchmarks.R.

Did you consider about problems related to cell type disappearing? For example, this dataset which is shown in GLM-PCA. It seems that changing the number of neighbors as a hyper parameter cannot resolve this problem.

I am not 100% sure I understand the question. I did not specifically consider if cell types disappear (or maybe rather are mixed with other cell types) in the tSNE or UMAP plot. However, the overlapping kNN metric should capture the case if one method is able to better resolve the structure of a dataset (which has to be consistent with the ground truth structure).

Apr 17 '23 07:04 const-ae

Hi, yes, I initially agree with you about using KNN metric to evaluate it. However, do you think, given the plots I have in the paper GLM-PCA, the normalization method based on library size can reduce the resolution of cell types in the existing datasets? It is like a special case or boardline case, I wonder if there are any methods I can use to identify this case. Thanks.

Apr 17 '23 11:04 HelloWorldLTY

Sorry, I don't know if a generic exists to identify such cases.

Apr 17 '23 12:04 const-ae

Hi, another question, now I meet some errors to run Rscript -e "renv::restore()" in a new environment. May I know your R version? Thanks a lot.

Apr 27 '23 19:04 HelloWorldLTY

It's 4.1.2. What kind of errors do you get?

Apr 28 '23 07:04 const-ae

Hi, actually plenty of errors. In most of the cases, its package installing error. I meet this error for stirngi, deseq2, biostringi, etc. Now I use R4.2.3 and I am keeping on trying.

Apr 28 '23 11:04 HelloWorldLTY

Hi mjcasy,

I was wondering what is the best way to run your benchmarking on a single other transformation?

The general steps are to
Clone the benchmark folder.
Implement a function that calls your scEC package in transformation_helper.R and add it to the all_transformations list at the end of the script
scEC_fnc <- function(UMI, sf, alpha){
  scEC:::GetFreq(UMI)
}
all_transformations <- list(<...>, newwave = newwave_fnc, scEC = scEC_fnc)
Install the R dependencies by running Rscript -e "renv::restore()"
The next step depends if you have access to a cluster that runs slurm.

If yes:

Modify job_overview.yaml tos very slow transformation approaches.

Set .OUTPUT_FOLDER in run_benchmarks.R to an appropriate location

Run Rscript src/run_benchmarks.R to generate the jobs

Run Rscript src/collect_results.R to collect the results after all jobs have successfully finished

Alternatively, you can call the individual scripts to generate/download data, transform the data and evaluate the results manually. Let's say you want to check the consistency of your method:
Prepare a results folder: e.g. mkdir -p /tmp/transformation_benchmark/{results,duration,data}

Modify the data_folder in consistency_benchmark/download_helper. E.g., data_folder <- "/tmp/transformation_benchmark/data"
Download the dataset
Rscript src/consistency_benchmark/download_10X_datasets.R  \
         --dataset GSE142647 \
         --working_dir /tmp/transformation_benchmark \
         --result_id GSE142647.Rds
Calculate both the k-NN graph for both gene halves
Rscript src/consistency_benchmark/transform_consistency_data.R \
    --transformation scEC \
    --input_id GSE142647.Rds \
    --knn 50 \
    --pca_dim 20 \
    --alpha TRUE \
    --seed 1 \
    --working_dir /tmp/transformation_benchmark \
    --result_id GSE142647_knns.Rds
Calculate the overlap of the two k-NN graphs
Rscript src/consistency_benchmark/calculate_10X_consistency.R \
    --transformation scEC \
    --input_id GSE142647_knns.Rds \
    --dataset GSE142647 \
    --knn 50 \
    --pca_dim 20 \
    --alpha TRUE \
    --seed 1 \
    --working_dir /tmp/transformation_benchmark \
    --result_id GSE142647_consistency.tsv
You can now easily take these steps and wrap them in a for loop over seeds or datasets.

Hi, I can run your codes and add my own methods via R 4.2.3. Moreover, I wonder if I have my own downloaded datasets, what should I do to use it as a benchmark dataset. Thanks a lot!

Apr 29 '23 20:04 HelloWorldLTY

transformGamPoi-Paper transformGamPoi-Paper copied to clipboard

[Question] How to benchmark another transformation

transformGamPoi-Paper
transformGamPoi-Paper copied to clipboard