How to reproduce Supplementary_Figure_S9 and apply it to other datasets
Hello! I'm very interested in your work, in particular Basenji's application to prioritizing disease variants.
I saw that Basenji has applications such as "Disease-Associated Loci", where Basenji was applied to DeepSEA's GWAS Catalog dataset (n=12296) to produce Supplementary_Figure_S9.
I have a similar mutation dataset and want to apply Basenji to it and evaluate its performance, so I would like to reproduce the work behind Supplementary_Figure_S9. I have two main questions:
- Do you have the Basenji prediction score file or implementation code behind Supplementary_Figure_S9 available for reproducing it?
- What would I need to do to use Basenji to prioritize and score mutations from HGMD data and my own lab's mutation data?
Thank you for your excellent work, and I look forward to your reply! I will keep an eye on my email.
To make this figure, I performed the following steps.
- Download Supp Table 6 from the DeepSEA paper here: https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.3547/MediaObjects/41592_2015_BFnmeth3547_MOESM650_ESM.csv
- Convert the table to a VCF file (see the conversion sketch below).
- Compute variant effect predictions by running basenji_sad.py --rc with my model on the VCF. (I originally used the --stats SAXR option, but I haven't found that it matters much.)
- Optionally, compute a PCA on the variant effect predictions. (I originally used 128 components.)
- Fit a regularized logistic regression model (more recently, I've used random forests) to classify disease versus healthy variants using the prediction features (see the classifier sketch below).
The many lines in the figure represent distinct cross-validation folds. The "Basenji" lines use only my predictions; "Joint" concatenates the DeepSEA predictions, too. I also updated this analysis in Figure 5 of my more recent paper: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008050.
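For the table-to-VCF conversion step, a minimal sketch with pandas; the column names ("chr", "pos", "ref", "alt") are placeholder assumptions, so check the actual headers of the downloaded DeepSEA supplementary table:

```python
import pandas as pd

# Hypothetical sketch: the column names below are assumptions; inspect the
# downloaded DeepSEA supplementary CSV for its real headers.
df = pd.read_csv("41592_2015_BFnmeth3547_MOESM650_ESM.csv")

with open("gwas_snps.vcf", "w") as vcf_out:
    vcf_out.write("##fileformat=VCFv4.2\n")
    vcf_out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    for i, row in df.iterrows():
        vcf_out.write(
            "%s\t%d\tsnp%d\t%s\t%s\t.\t.\t.\n"
            % (row["chr"], int(row["pos"]), i, row["ref"], row["alt"])
        )
```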
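And for the optional PCA plus the cross-fold classifier, a sketch along these lines with scikit-learn; the SAD h5 key, the label file, and the fold count are assumptions to adapt to your own basenji_sad.py output and labels:

```python
import h5py
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Variant effect predictions from basenji_sad.py: a V x T matrix of V variants
# by T training datasets. The "SAD" key matches the default --stats option;
# adjust it if you used SAXR.
with h5py.File("sad.h5", "r") as sad_h5:
    X = np.array(sad_h5["SAD"], dtype="float32")

# Binary labels: 1 for disease variants, 0 for healthy background.
# (User-supplied; "labels.npy" here is a hypothetical file.)
y = np.load("labels.npy")

# Optional dimensionality reduction, mirroring the 128 components mentioned above.
X = PCA(n_components=128).fit_transform(X)

# Cross-fold regularized logistic regression; each fold corresponds to one
# line in the figure. The fold count (8) is an assumption.
aucs = []
for train_i, test_i in StratifiedKFold(n_splits=8, shuffle=True).split(X, y):
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X[train_i], y[train_i])
    test_p = model.predict_proba(X[test_i])[:, 1]
    aucs.append(roc_auc_score(y[test_i], test_p))
print("mean AUROC: %.4f" % np.mean(aucs))
```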
I am very glad to receive your reply, professor!
I followed your steps and did the following:
basenji_sad.py -f data/hg19.ml.fa -g data/human.hg19.genome --h5 -o output/GWAS_benchmark_sad/ --rc --stats "SAXR" --shift "0" models/params_small.txt models/heart/model_best.tf data/7_GWAS_12996/nonconding_GWAS_SNP_demo.vcf
I have a few questions: 1) Is --shift "0" appropriate? 2) Is "models/heart/model_best.tf" appropriate? 3) The output SAD results have 3 columns; how did you deal with them?
Is it possible that you're looking at an older version of the repository? The basenji_sad command in master is different from what you listed. Check out https://github.com/calico/basenji/blob/master/tutorials/sad.ipynb
The model in the tutorials is merely for demonstration purposes; it has not learned interesting biology. My latest models are available here: https://github.com/calico/basenji/tree/master/manuscripts/cross2020
With the latest model, there will be more than 3 output columns. There will be 5,313 output columns, representing all of the training datasets. How you use them is really up to you. Each column corresponds to a particular dataset, so you might focus on the one that's most interesting to you. If you are dataset agnostic, you could compute the first principal component and study that. In the analysis that I described above, I treat those columns as features to a logistic regression or random forest trained on disease variants.
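As a rough illustration of both options, here's a sketch assuming the default basenji_sad.py h5 output; the "SAD" and "target_labels" keys, the file path, and the track-matching string are assumptions to check against your own output:

```python
import h5py
import numpy as np
from sklearn.decomposition import PCA

# Load the V x 5313 SAD matrix from the basenji_sad.py h5 output.
# The dataset keys are assumptions based on typical basenji_sad.py output.
with h5py.File("output/sad.h5", "r") as sad_h5:
    sad = np.array(sad_h5["SAD"], dtype="float32")
    target_labels = [l.decode() for l in sad_h5["target_labels"]]

# Option 1: focus on a single dataset of interest, e.g. the first
# DNase track (the "DNASE" substring is a hypothetical filter).
dnase_i = [i for i, l in enumerate(target_labels) if "DNASE" in l.upper()]
dnase_scores = sad[:, dnase_i[0]]

# Option 2: dataset-agnostic summary via the first principal component.
pc1 = PCA(n_components=1).fit_transform(sad).squeeze()
```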
Q1. I am trying to use the trained RF classifiers that are available to get Mendelian disease and GWAS variant classification probabilities on an unseen dataset; however, I was not able to figure out the input format required by the models (classifiers/gwas_hm_rf.pkl and mendelian_hm_rf.pkl). If you could point me to a user guide or documentation around this, I would be very grateful. Thank you!
The random forest models are scikit-learn objects. They have predict functions which will take a VxT numpy array of V variants and the T Basenji tasks. For the human only models, T is the human tasks. For the human-mouse model, T is a concatenation of the human and mouse tasks.
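A minimal sketch of that interface, using the pickle path from the question; the SAD file names and the disease-class column are assumptions (check rf.classes_ for the actual class ordering):

```python
import pickle

import h5py
import numpy as np

# Load the pickled scikit-learn random forest from the repo's classifiers/ directory.
with open("classifiers/gwas_hm_rf.pkl", "rb") as pkl_f:
    rf = pickle.load(pkl_f)

# Build the V x T feature matrix. For the human-mouse model, concatenate the
# human and mouse task predictions for each variant along the task axis.
# ("sad_human.h5" and "sad_mouse.h5" are hypothetical basenji_sad.py outputs.)
with h5py.File("sad_human.h5", "r") as h5_h, h5py.File("sad_mouse.h5", "r") as h5_m:
    X = np.concatenate([np.array(h5_h["SAD"]), np.array(h5_m["SAD"])], axis=1)

# predict_proba returns per-class probabilities; taking column 1 as the
# disease class is an assumption to verify against rf.classes_.
disease_probs = rf.predict_proba(X)[:, 1]
```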