transformGamPoi-Paper icon indicating copy to clipboard operation
transformGamPoi-Paper copied to clipboard

[Question] How to benchmark another transformation

Open mjcasy opened this issue 3 years ago • 12 comments

Hi @const-ae,

First up, love the paper!

I was wondering what is the best way to run your benchmarking on a single other transformation? I've implemented an empirical bayesian approach and want to see how it does if possible (https://github.com/mjcasy/scEC/blob/master/R/Normalisation.R).

Thanks

mjcasy avatar Jul 19 '22 15:07 mjcasy

Hi mjcasy,

I was wondering what is the best way to run your benchmarking on a single other transformation?

The general steps are to

The next step depends if you have access to a cluster that runs slurm.

If yes:

  • Modify job_overview.yaml tos very slow transformation approaches.
  • Set .OUTPUT_FOLDER in run_benchmarks.R to an appropriate location
  • Run Rscript src/run_benchmarks.R to generate the jobs
  • Run Rscript src/collect_results.R to collect the results after all jobs have successfully finished

Alternatively, you can call the individual scripts to generate/download data, transform the data and evaluate the results manually. Let's say you want to check the consistency of your method:

  • Prepare a results folder: e.g. mkdir -p /tmp/transformation_benchmark/{results,duration,data}
  • Modify the data_folder in consistency_benchmark/download_helper. E.g., data_folder <- "/tmp/transformation_benchmark/data"
  • Download the dataset
    Rscript src/consistency_benchmark/download_10X_datasets.R  \
             --dataset GSE142647 \
             --working_dir /tmp/transformation_benchmark \
             --result_id GSE142647.Rds
    
  • Calculate both the k-NN graph for both gene halves
    Rscript src/consistency_benchmark/transform_consistency_data.R \
        --transformation scEC \
        --input_id GSE142647.Rds \
        --knn 50 \
        --pca_dim 20 \
        --alpha TRUE \
        --seed 1 \
        --working_dir /tmp/transformation_benchmark \
        --result_id GSE142647_knns.Rds
    
  • Calculate the overlap of the two k-NN graphs
    Rscript src/consistency_benchmark/calculate_10X_consistency.R \
        --transformation scEC \
        --input_id GSE142647_knns.Rds \
        --dataset GSE142647 \
        --knn 50 \
        --pca_dim 20 \
        --alpha TRUE \
        --seed 1 \
        --working_dir /tmp/transformation_benchmark \
        --result_id GSE142647_consistency.tsv
    

You can now easily take these steps and wrap them in a for loop over seeds or datasets.

const-ae avatar Jul 19 '22 16:07 const-ae

Let me know if you run into any trouble going through the steps. It is always an exciting moment when someone else is running one's code for the first time.

const-ae avatar Jul 19 '22 16:07 const-ae

Thanks for the detailed response! Unfortunately the number of jobs launched hit up against admin limits on my institutions server, so will have to run without slurm. Any advice on what bits to run to get a slimmer result?

mjcasy avatar Jul 21 '22 13:07 mjcasy

Oh, I ran into the same issue with the job limit. I submitted the jobs in batches of a few thousand. In addition, I usually changed the job priority to 'low' in the submission steps of run_benchmarks.R, as our clustered allowed several times more low priority jobs to be submitted.

If you want to reduce the number jobs, modify the job_overview.yaml. By default, each parameter combination mentioned in 'input_data' and 'knn_construction' is considered. So you can for example cut the number of consistency jobs by a factor of 10 if you remove half the datasets and only consider one seed (i.e., replicate). Or you focus only on the parameter combinations discussed in the manuscript, which is also only a fraction of the total 58,000 data points.


Send from my phone, so please excuse any typos.

const-ae avatar Jul 21 '22 20:07 const-ae

Hi, thanks for your great work! I wonder if it is possible for me to only run certain methods (for example, GLM based model) on certain datasets. Thanks a lot.

Moreover, I also have a quick question about the benchmarking result:

Did you consider about problems related to cell type disappearing? For example, this dataset which is shown in GLM-PCA. It seems that changing the number of neighbors as a hyper parameter cannot resolve this problem. image

HelloWorldLTY avatar Apr 16 '23 16:04 HelloWorldLTY

I wonder if it is possible for me to only run certain methods (for example, GLM based model) on certain datasets. Thanks a lot.

Yes! Take a look at the job_overview.yaml file. It declares which combination of dataset and transformation are benchmarked. You could for example remove all non-GLM-based models. The actual parsing of the file happens in run_benchmarks.R.

Did you consider about problems related to cell type disappearing? For example, this dataset which is shown in GLM-PCA. It seems that changing the number of neighbors as a hyper parameter cannot resolve this problem.

I am not 100% sure I understand the question. I did not specifically consider if cell types disappear (or maybe rather are mixed with other cell types) in the tSNE or UMAP plot. However, the overlapping kNN metric should capture the case if one method is able to better resolve the structure of a dataset (which has to be consistent with the ground truth structure).

const-ae avatar Apr 17 '23 07:04 const-ae

Hi, yes, I initially agree with you about using KNN metric to evaluate it. However, do you think, given the plots I have in the paper GLM-PCA, the normalization method based on library size can reduce the resolution of cell types in the existing datasets? It is like a special case or boardline case, I wonder if there are any methods I can use to identify this case. Thanks.

HelloWorldLTY avatar Apr 17 '23 11:04 HelloWorldLTY

Sorry, I don't know if a generic exists to identify such cases.

const-ae avatar Apr 17 '23 12:04 const-ae

Hi, another question, now I meet some errors to run Rscript -e "renv::restore()" in a new environment. May I know your R version? Thanks a lot.

HelloWorldLTY avatar Apr 27 '23 19:04 HelloWorldLTY

It's 4.1.2. What kind of errors do you get?

const-ae avatar Apr 28 '23 07:04 const-ae

Hi, actually plenty of errors. In most of the cases, its package installing error. I meet this error for stirngi, deseq2, biostringi, etc. Now I use R4.2.3 and I am keeping on trying.

HelloWorldLTY avatar Apr 28 '23 11:04 HelloWorldLTY

Hi mjcasy,

I was wondering what is the best way to run your benchmarking on a single other transformation?

The general steps are to

The next step depends if you have access to a cluster that runs slurm.

If yes:

  • Modify job_overview.yaml tos very slow transformation approaches.
  • Set .OUTPUT_FOLDER in run_benchmarks.R to an appropriate location
  • Run Rscript src/run_benchmarks.R to generate the jobs
  • Run Rscript src/collect_results.R to collect the results after all jobs have successfully finished

Alternatively, you can call the individual scripts to generate/download data, transform the data and evaluate the results manually. Let's say you want to check the consistency of your method:

  • Prepare a results folder: e.g. mkdir -p /tmp/transformation_benchmark/{results,duration,data}
  • Modify the data_folder in consistency_benchmark/download_helper. E.g., data_folder <- "/tmp/transformation_benchmark/data"
  • Download the dataset
    Rscript src/consistency_benchmark/download_10X_datasets.R  \
             --dataset GSE142647 \
             --working_dir /tmp/transformation_benchmark \
             --result_id GSE142647.Rds
    
  • Calculate both the k-NN graph for both gene halves
    Rscript src/consistency_benchmark/transform_consistency_data.R \
        --transformation scEC \
        --input_id GSE142647.Rds \
        --knn 50 \
        --pca_dim 20 \
        --alpha TRUE \
        --seed 1 \
        --working_dir /tmp/transformation_benchmark \
        --result_id GSE142647_knns.Rds
    
  • Calculate the overlap of the two k-NN graphs
    Rscript src/consistency_benchmark/calculate_10X_consistency.R \
        --transformation scEC \
        --input_id GSE142647_knns.Rds \
        --dataset GSE142647 \
        --knn 50 \
        --pca_dim 20 \
        --alpha TRUE \
        --seed 1 \
        --working_dir /tmp/transformation_benchmark \
        --result_id GSE142647_consistency.tsv
    

You can now easily take these steps and wrap them in a for loop over seeds or datasets.

Hi, I can run your codes and add my own methods via R 4.2.3. Moreover, I wonder if I have my own downloaded datasets, what should I do to use it as a benchmark dataset. Thanks a lot!

HelloWorldLTY avatar Apr 29 '23 20:04 HelloWorldLTY