
How to reproduce Supplementary_Figure_S9 and apply it to another dataset

Licko0909 opened this issue 3 years ago • 5 comments

Hello! I'm very interested in your work, in particular Basenji's application to prioritizing disease variants.

I found the "Disease-Associated Loci" application, in which Basenji was applied to DeepSEA's GWAS Catalog dataset (n=12296) to produce Supplementary_Figure_S9.

I have a similar mutation dataset and want to apply Basenji to it and evaluate its performance, so I would like to reproduce the analysis behind Supplementary_Figure_S9. I have two main questions:

  1. Do you have the Basenji prediction score file or the implementation code behind Supplementary_Figure_S9 available, so that the figure can be reproduced?
  2. What do I need to do to use Basenji to score and prioritize mutations from HGMD data and from my own lab's mutation data?

Thank you for your excellent work, and I look forward to your reply! I will keep an eye on my email.

Licko0909 avatar Jan 06 '22 03:01 Licko0909

To make this figure, I performed the following steps.

  1. Download Supp Table 6 from the DeepSEA paper from here: https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.3547/MediaObjects/41592_2015_BFnmeth3547_MOESM650_ESM.csv
  2. Convert the table to a VCF file (a conversion sketch follows this list).
  3. Compute variant effect predictions by running basenji_sad.py --rc with my model on the VCF. (I used the --stats SAXR option originally, but I haven't found that it matters much.)
  4. Optionally, compute a PCA on the variant effect predictions. (I used 128 components originally.)
  5. Fit a regularized logistic regression model (more recently, I've used random forests) to classify disease versus healthy variants using the prediction features.
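
For step 2, a minimal conversion sketch along these lines would work. The column names ("chr", "pos", "ref", "alt") and the output filename are assumptions, not the actual header of the downloaded table, so adjust them to match the CSV.

```python
# Hypothetical sketch of step 2: convert the DeepSEA supplementary CSV to a
# minimal VCF. The column names ("chr", "pos", "ref", "alt") are assumptions;
# check the header of the downloaded table and adjust accordingly.
import pandas as pd

table = pd.read_csv("41592_2015_BFnmeth3547_MOESM650_ESM.csv")

with open("gwas_catalog.vcf", "w") as vcf_out:
    vcf_out.write("##fileformat=VCFv4.2\n")
    vcf_out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    for i, row in table.iterrows():
        vcf_out.write("%s\t%d\t%s\t%s\t%s\t.\t.\t.\n" %
                      (row["chr"], int(row["pos"]), "snp%d" % i,
                       row["ref"], row["alt"]))
```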

The many lines in the figure represent distinct cross folds. The "Basenji" lines use only my predictions. "Joint" concatenates the DeepSEA predictions, too. I also updated this analysis in my more recent paper's Figure 5 here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008050.
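
For steps 4 and 5 and the cross folds, here is a minimal sketch in that spirit (not the exact code behind the figure). It assumes the variant effect predictions have already been loaded into a variants-by-targets numpy array X and the disease/healthy labels into y; the number of folds is arbitrary here.

```python
# Sketch of steps 4-5: PCA on the variant effect predictions, then a
# regularized logistic regression scored across cross-validation folds.
# X: (num_variants, num_targets) array of Basenji predictions (assumed loaded)
# y: binary labels, 1 = disease variant, 0 = healthy variant (assumed loaded)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    PCA(n_components=128),                                   # optional step 4
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # step 5
)

cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("per-fold AUROC:", np.round(aucs, 3))
```

Swapping LogisticRegression for sklearn.ensemble.RandomForestClassifier gives the random forest variant mentioned above.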

davek44 avatar Jan 14 '22 23:01 davek44

I am very glad to receive your reply, Professor!

I followed your steps and ran the following command:

basenji_sad.py -f data/hg19.ml.fa -g data/human.hg19.genome --h5 -o output/GWAS_benchmark_sad/ --rc --stats "SAXR" --shift "0" models/params_small.txt models/heart/model_best.tf data/7_GWAS_12996/nonconding_GWAS_SNP_demo.vcf

I have a few questions: 1) Is --shift "0" appropriate? 2) Is "models/heart/model_best.tf" an appropriate model? 3) The output SAD results have only 3 columns; how did you deal with them?

Licko0909 avatar Jan 20 '22 03:01 Licko0909

Is it possible that you're looking at an older version of the repository? The basenji_sad command in master is different from what you listed. Check out https://github.com/calico/basenji/blob/master/tutorials/sad.ipynb

The model in the tutorials is merely for demonstration purposes; it has not learned interesting biology. My latest models are available here: https://github.com/calico/basenji/tree/master/manuscripts/cross2020

With the latest model, there will be far more than 3 output columns: 5,313, representing all of the training datasets. How you use them is really up to you. Each column corresponds to a particular dataset, so you might focus on the one that's most interesting to you. If you are dataset-agnostic, you could compute the first principal component and study that. In the analysis that I described above, I treat those columns as features for a logistic regression or random forest trained on disease variants.
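
For example, here is a rough sketch of working with the HDF5 output. It assumes the output file is sad.h5 under the -o directory and that the datasets are named "SAD" and "target_labels"; print the file's keys to confirm what your version actually writes.

```python
# Sketch: inspect the basenji_sad HDF5 output and reduce the 5,313 target
# columns to a per-variant score. Dataset names ("SAD", "target_labels") and
# the output path are assumptions; confirm them against your own output.
import h5py
import numpy as np
from sklearn.decomposition import PCA

with h5py.File("output/GWAS_benchmark_sad/sad.h5", "r") as sad_h5:
    print(list(sad_h5.keys()))                    # confirm dataset names
    sad = sad_h5["SAD"][:].astype("float32")      # (num_variants, num_targets)
    labels = [l.decode() for l in sad_h5["target_labels"][:]]

# Option A: focus on the single dataset you care about
col = 0                                           # replace with your track index
scores_single = sad[:, col]
print(labels[col], scores_single[:5])

# Option B: dataset-agnostic summary via the first principal component
scores_pc1 = PCA(n_components=1).fit_transform(sad)[:, 0]
```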

davek44 avatar Jan 23 '22 23:01 davek44

Q1. I am trying to apply the trained RF classifiers (classifiers/gwas_hm_rf.pkl and mendelian_hm_rf.pkl) to an unseen dataset to get Mendelian disease and GWAS variant classification probabilities, but I was not able to figure out the input format the models require. If you could point me to a user guide or documentation for this, I would be very grateful. Thank you!

sukritikumar4 avatar Apr 11 '22 10:04 sukritikumar4

The random forest models are scikit-learn objects. They have predict functions that take a VxT numpy array of V variants by T Basenji tasks. For the human-only models, T covers the human tasks; for the human-mouse model, T is a concatenation of the human and mouse tasks.
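
For example, a minimal sketch along those lines, assuming a (V, T) numpy array named sad has already been computed with the matching model and column order:

```python
# Sketch: score variants with one of the pre-trained random forest pickles.
# sad is assumed to be a (V, T) numpy array of Basenji variant effect
# predictions whose T columns are ordered like the classifier's training
# features (human tasks only, or human followed by mouse for the "hm" models).
import pickle

with open("classifiers/gwas_hm_rf.pkl", "rb") as f:
    rf = pickle.load(f)

# Recent scikit-learn versions record the expected number of task columns T.
print(getattr(rf, "n_features_in_", "unknown"))

calls = rf.predict(sad)               # hard 0/1 classifications
probs = rf.predict_proba(sad)[:, 1]   # per-variant positive-class probability
```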

davek44 avatar Apr 13 '22 23:04 davek44