What to do with smaller genomes?
Hi there
Congratulations on a very nicely written paper. I'm still absorbing it, but I wondered what I ought to do when using it on small genomes. Your documentation says you need 1 million SNPs in your input candidate VCFs, but small genomes commonly won't yield that many variants. Is this ok?
Hi,
Thank you for the kind words - and for posting your question!
BayesTyper requires a set of candidate SNPs for estimating the noise parameters. 1 million is likely far on the conservative side, but was empirically determined to work well on human data, where getting a large number of variants is not an issue - we haven't rigorously tested the effects of using fewer candidates.
Ideally, the noise should be adapted both to the technical (e.g. sequencing) error profile of your experiment and to how much of this noise actually ends up as candidate variants in the input (determined by the level of filtering performed before running BayesTyper). As the number of "training" SNPs decreases, the noise estimates will be increasingly dominated by the noise prior distribution, and we are not sure how this will affect performance. Furthermore, only the mean value of the noise estimates is currently used in the genotype inference, and hence the increased uncertainty caused by fewer "training" SNPs is not propagated to the final genotype posterior.
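To illustrate what "dominated by the prior" means, here is a minimal sketch using a textbook gamma-Poisson conjugate update. This is an assumption made for illustration only, not BayesTyper's actual noise model; the function names, the prior values and the "true" rate are all hypothetical.

```python
# Illustrative only: a textbook gamma-Poisson update, NOT BayesTyper's model.
# With a Gamma(shape, rate) prior on a Poisson rate and n_obs observations
# summing to total_count, the posterior is Gamma(shape + total_count, rate + n_obs).

def gamma_poisson_posterior_mean(prior_shape, prior_rate, total_count, n_obs):
    """Posterior mean of the Poisson rate after n_obs observations."""
    return (prior_shape + total_count) / (prior_rate + n_obs)


def prior_weight(prior_rate, n_obs):
    """Fraction of the posterior mean contributed by the prior mean.

    posterior_mean = w * prior_mean + (1 - w) * sample_mean,
    with w = prior_rate / (prior_rate + n_obs).
    """
    return prior_rate / (prior_rate + n_obs)


if __name__ == "__main__":
    prior_shape, prior_rate = 1.0, 100.0  # hypothetical prior with mean 0.01
    true_rate = 0.05                      # hypothetical rate behind the data
    for n_obs in (100, 10_000, 1_000_000):
        total_count = true_rate * n_obs   # idealised data: sample mean == true_rate
        mean = gamma_poisson_posterior_mean(prior_shape, prior_rate, total_count, n_obs)
        weight = prior_weight(prior_rate, n_obs)
        print(f"n_obs={n_obs:>9}  posterior mean={mean:.4f}  prior weight={weight:.4f}")
```

In this toy setting the prior carries half the weight at 100 observations and almost none at 1 million, which is the effect described above for small candidate sets.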
Our best advice is to just try it out, but be aware that if it does not perform as intended, it may be due to poor noise estimates. It is possible to inspect the noise estimates, as the posterior samples are written to a file for each unit (<prefix>_noise_parameters.txt). You could, for instance, compare their mean value (the last line of the file) with the corresponding genomic rate estimates (<prefix>_genomic_parameters.txt) - the noise means should be significantly smaller.
If you find that it does not perform well, you could consider empirically optimizing the parameters of the noise prior (a gamma distribution) - user-defined parameters can be provided using the --noise-rate-prior option to bayestyper genotype.
Please do not hesitate to post again if you have further questions or comments.
br,
Lasse
Thanks! I will have a go. BTW, I wonder if you can get equally good estimates of noise by using a smaller number of SNPs but more sequencing depth (e.g. splitting the input data in two, or three). Might be useful for smaller genomes, where one gets more depth than is typical with human.
Hi,
Sounds good. You are right that sequencing depth does provide some compensation for the lower number of SNPs; however, the best estimates will, as far as I can see, still be obtained by using all data for the same sample in the same run.
In reality, we do not know whether there will be any problems at all when running on a small set of input variants. If possible, please let us know about your experience running BayesTyper on your data and whether there is something we can do on our side. In the meantime, we will have a look at how we can improve noise estimation and genotyping on sparse candidate sets, e.g. by propagating the posterior uncertainty of the noise parameters to the genotyping step.
br,
Lasse
Further to the above, I'm trying out v1.4.1 on a bacterial genome of size ~5.3MB. I get this error from running bayesTyper genotype:
[18/03/2019 17:03:46] Estimating noise model parameters using 20 parallel gibbs sampling chains each with 350 iterations (100 burn-in) ...
ERROR: Insufficient number of SNV clusters available for Poisson parameter estimation (4349 < 10000); the genome used is likely too small
Are there any options I can change to get it to run? For this example there are 11175 variants in the VCF made when I run bayesTyper cluster on my two input VCFs. But I also have samples where there are <100 variants and if possible I'd like to try bayesTyper on those as well.
Any suggestions please?
Thanks, Martin
If it helps, I just found that v1.3.1 works on the same input data.
Hi Martin,
I am currently working on a new version, which I hope to release later this week, that specifically addresses the issue you mention. In the new release all types of variants (except nested) will be used for noise parameter estimation, which should increase the number of variants that are available for this. Also, the error that you get has been changed to a warning.
Besides this, I am thinking about introducing a new mode where the noise parameters are not estimated prior to genotyping and then fixed, but rather estimated in conjunction with the genotypes. This should work really well for smaller genomes with few variants, since the uncertainty in the noise estimation will be directly propagated into the genotyping step. This mode will, however, require more memory and have a longer run time, but for small genomes this might not matter that much. I can't promise that I will have it ready for this release, but I will see what I can do.
Best,
Jonas
Thanks Jonas, looking forward to the new release.
Hi Martin,
The new release is now available (v1.5). It contains a new mode where noise and genotypes are estimated at the same time (--noise-genotyping). I would recommend using this mode if you run on data with fewer than 100,000 variants (see wiki).
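If it helps when deciding, a rough way to check a candidate set against that 100,000 figure is to count the non-header records in the VCF. The sketch below is only an assumption about how one might do this: the helper names are hypothetical, it is not part of BayesTyper, and it counts records (lines), not individual alleles.

```python
import gzip


def count_vcf_records(lines):
    """Count data records in an iterable of VCF lines (header lines start with '#')."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_file(path):
    """Count records in a plain-text or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_vcf_records(handle)


def below_recommended_threshold(n_variants, threshold=100_000):
    """True if the candidate set is smaller than the threshold mentioned above."""
    return n_variants < threshold


if __name__ == "__main__":
    # Tiny in-memory example so the sketch runs without any input file.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
        "chr1\t250\t.\tC\tT\t.\tPASS\t.\n",
    ]
    n = count_vcf_records(example)
    print(n, below_recommended_threshold(n))
```

For the bacterial example earlier in this thread (11175 variants), the check would come out as below the threshold.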
Let me know if you have any other questions.
Best,
Jonas