Hints on adding refcall POSITIONAL support to polyclone?
Hi,
I was interested in looking at the code for the polyclone caller to see what it would take to make it output VCF records for every site. Would you be able to say a few words about how the calling process relates to the process of outputting VCF records? It seems (perhaps naively) like the calling process should generate the info that would go in the non-variant records whether it is printed or not.
So, looking at the code:
- it looks like the polyclone caller currently returns {} from call_reference(), whereas the individual caller returns more info.
- it looks like the polyclone caller has the required fields in its Latents.

So I'm curious whether you think it would be hard for me to fill out call_reference for the polyclone caller.
Thanks again for any help.
-BenRI
As a side note, given the number of uses of std::transform(), I wonder if using range-v3 could simplify a lot of code. It works with C++14, and it would allow adopting more readable idioms incrementally, without requiring a flag day to convert the more verbose idioms all at once.
https://github.com/ericniebler/range-v3
As a caveat, it doesn't really correspond directly to what is in C++20. I see that there is a new repo called 'rangesnext' that assumes C++20, and implements some new features that are proposed for inclusion in C++23: https://github.com/cor3ntin/rangesnext
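To make the idea concrete, here is the sort of rewrite I have in mind (illustrative only, not code from octopus):

```cpp
#include <range/v3/view/transform.hpp>
#include <range/v3/range/conversion.hpp>
#include <string>
#include <vector>

std::vector<std::string> names(const std::vector<int>& ids)
{
    // Equivalent std::transform idiom:
    //   std::vector<std::string> out;
    //   out.reserve(ids.size());
    //   std::transform(std::cbegin(ids), std::cend(ids), std::back_inserter(out),
    //                  [] (int id) { return std::to_string(id); });
    // With range-v3 the iterator/back_inserter plumbing disappears:
    return ids | ranges::views::transform([] (int id) { return std::to_string(id); })
               | ranges::to<std::vector>();
}
```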
If you just want basic reference calling functionality then you should be able to simply copy the implementation from IndividualCaller to PolycloneCaller. Note, however, that the model implemented is essentially diploid.
Thanks, good to know! I was wondering how much I could use the IndividualCaller implementation as a template.
(At the moment, I don't care about the probability that the homozygosity call is correct, just the depth. I don't even need GVCF format, I just need records printed for non-variant sites.)
However, in the interest of doing a correct implementation -- is the assumption of diploidy just in the last few lines of compute_homozygous_posterior()?
Also, is there a way of determining the local ploidy for a site in the polyclone caller?
> However, in the interest of doing a correct implementation -- is the assumption of diploidy just in the last few lines of compute_homozygous_posterior()?
Correct.
> Also, is there a way of determining the local ploidy for a site in the polyclone caller?

Yes, you can get the inferred local ploidy from the Latents object passed into PolycloneCaller::call_reference. Specifically, PolycloneCaller::Latents::model_log_posteriors_ gives the probability of clonality - if subclonality is indicated then you can get the inferred ploidy from polyploid_genotypes_.
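Roughly, the logic would look something like this (a simplified sketch with stand-in types, not the actual Latents members):

```cpp
#include <vector>

// Stand-in types for illustration only; the real octopus types are richer.
struct ModelLogPosteriors { double clonal, subclonal; };
struct GenotypeStub { unsigned ploidy_; unsigned ploidy() const { return ploidy_; } };

unsigned infer_local_ploidy(const ModelLogPosteriors& model_log_posteriors,
                            const std::vector<GenotypeStub>& polyploid_genotypes)
{
    // If the clonal (single-haplotype) model dominates, the effective ploidy is 1;
    // otherwise take the ploidy of the genotypes fitted by the subclonal model.
    if (polyploid_genotypes.empty() || model_log_posteriors.clonal >= model_log_posteriors.subclonal) {
        return 1;
    }
    return polyploid_genotypes.front().ploidy();
}
```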
OK, so I started by copying over the code from individual caller:
https://github.com/bredelings/octopus/commits/polyclone-refcall
It required implementing PolycloneCaller::Latents::genotype_log_posteriors(), which I may have done correctly. I still need to access the local ploidy and compute the posterior correctly.
However... I tried using it, and no reference calls were shown. I tried compiling with -DCMAKE_BUILD_TYPE=RelWithDebInfo to investigate, but the compilation process took forever and ran out of RAM. I will try again with different flags later, but I was wondering if you had a hint for (a) telling cmake to use -Og -g and not turn on the address sanitizer, and (b) why it might not be showing any reference sites.
I just installed your fork and I do get reference calls from the polyclone caller. Are you sure you checked out your new branch polyclone-refcall after cloning your fork (the default branch is develop)?
Oops! I was indeed on the wrong branch somehow. I am seeing the reference calls now. So, the next thing is to pass the local ploidy down to compute_homozygous_posterior().
In the meantime, I'll note two problems that I've seen:
- An assert triggers when running in debug mode:

  octopus-debug: /home/bredelings/Devel/octopus/src/utils/concat.hpp:82: octopus::MappableBlock<T> octopus::concat(const octopus::MappableBlock<T>&, const octopus::MappableBlock<T>&) [with T = octopus::Genotype<octopus::IndexedHaplotype<> >]: Assertion `is_same_region(lhs, rhs)' failed.

  The call to concat() comes from the line auto genotypes = concat(haploid_genotypes_, polyploid_genotypes_); in PolycloneCaller::Latents::genotype_posteriors().
- Apparently you can't supply the --refcall-filter-expression argument, because something is already supplying it:
$ ~/Devel/octopus/local/gcc-11/octopus -T LT635626 -I api.bam -R ../plasmo-combined.fasta -o test.g.vcf.gz --annotations AD -C polyclone --max-clones 2 --sequence-error-model PCR --threads 4 --refcall-filter-expression 'QUAL < 2 | GQ < 20 | MQ < 10 | DP < 10 | MF > 0.2' --refcall POSITIONAL
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------
[2021-09-30 10:03:36] <INFO> octopus v0.7.4 (polyclone-refcall ec838e75)
[2021-09-30 10:03:36] <INFO> Copyright (c) 2015-2021 University of Oxford
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------
[2021-09-30 10:03:36] <EROR> An unclassified error has occurred:
[2021-09-30 10:03:36] <EROR>
[2021-09-30 10:03:36] <EROR> Option '--refcall-filter-expression' cannot be specified more than
[2021-09-30 10:03:36] <EROR> once.
[2021-09-30 10:03:36] <EROR>
[2021-09-30 10:03:36] <EROR> To help resolve this error submit an error report.
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------
Hi,
I have some questions about the logic in compute_homozygous_posterior(). This applies to the code in the individual caller as well.
So, in this code:
auto het_alt_ln_likelihood = std::inner_product(std::cbegin(reference_ln_likelihoods), std::cend(reference_ln_likelihoods),
                                                std::cbegin(non_reference_ln_likelihoods), 0.0, std::plus<> {},
                                                [] (auto ref, auto alt) { return maths::log_sum_exp(ref, alt) - std::log(2); });
it looks like maths::log_sum_exp(ref, alt) is summing Pr(ref|data) + Pr(not ref|data) on the log scale. But Pr(ref|data) + Pr(not ref|data) will always equal 1.0. I guess you would like Pr(data|ref) + Pr(data|alt)... but it is less clear how to get that from the PHRED score. But I am not totally sure I am following this correctly.
To double check this reasoning, I ran the code in debug mode, and I get:
- reference_qualities[0] = 1
- reference_ln_likelihoods[0] = -1.5814737534084538
- non_reference_ln_likelihoods[0] = -0.23025850929940456
And indeed exp(-1.5814737534084538) + exp(-0.23025850929940456) = 1.0
-BenRI
P.S. I note that you are not using log1p(x) to implement log_sum_exp(), but I'm not sure if that's by design. This is a nice function since it avoids losing precision when x is very small.
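That is, something like this minimal sketch (not suggesting this exact form, just the log1p identity):

```cpp
#include <algorithm>
#include <cmath>

// log(exp(a) + exp(b)) = max(a, b) + log1p(exp(-|a - b|))
double log_sum_exp(double a, double b)
{
    const auto hi = std::max(a, b);
    const auto lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi));
}
```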
I was also thinking that it might be easier to read the code if you implemented a class that stores the log of a value, and then defined operator+, operator*, etc. Then you could write p1 + p2 instead of maths::log_sum_exp(p1, p2).
> In the meantime, I'll note two problems that I've seen:
> - An assert triggers when running in debug mode:
Thanks - I'll see if I can reproduce.
> Apparently you can't supply the --refcall-filter-expression argument, because something is already supplying it:
I think this may be an issue with Boost; it appears Boost incorrectly recognises --refcall as shorthand for --refcall-filter-expression, thus counting the latter twice. Will have a think on what to do.
> it looks like maths::log_sum_exp(ref, alt) is summing Pr(ref|data) + Pr(not ref|data) on the log scale. But Pr(ref|data) + Pr(not ref|data) will always equal 1.0. I guess you would like Pr(data|ref) + Pr(data|alt)... but it is less clear how to get that from the PHRED score.
The calculation is the genotype log likelihood ln p(R | g) - it essentially implements the log diploid form of equation 2 from Heng Li's paper. This will, of course, simplify to -N*log(2) for N reads in the diploid heterozygous case. I think I probably didn't implement this to make it easy to extend for non-diploid cases.
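Written out, the general form of that likelihood is, for a site of ploidy m and a genotype carrying g copies of the reference allele,

$$
\ln p(R \mid g) \;=\; \sum_{j=1}^{N} \ln\!\left(\frac{g}{m}\,p(r_j \mid \mathrm{ref}) \;+\; \frac{m-g}{m}\,p(r_j \mid \mathrm{alt})\right),
$$

so for m = 2, g = 1 each read contributes log_sum_exp(ref, alt) - log(2), which is the code above, and since the per-read reference and non-reference likelihoods here sum to one it collapses to -N*log(2).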
> P.S. I note that you are not using log1p(x) to implement log_sum_exp(), but I'm not sure if that's by design. This is a nice function since it avoids losing precision when x is very small.
Yup, std::log1p is the better option here. Changed in 27b48ff8adcc6ffc6fbf57ec503e45cbc4a1c224. Thanks.
> I was also thinking that it might be easier to read the code if you implemented a class that stores the log of a value, and then defined operator+, operator*, etc. Then you could write p1 + p2 instead of maths::log_sum_exp(p1, p2).
It's an interesting idea but I think it actually makes things a lot more complicated and difficult to read. How is the mixture factor included, (p1 + p2) / 2 or p1 + p2 - log(2)? The latter is inconsistent since p1 + p2 + (-log(2)) has a different meaning. The former requires implementing a full symbolic algebra, including operator chaining (e.g. p1 + p2 + p3), and we still end up with confusing statements like L = (p1 + p2) / 2 where L, p1, and p2 are all Log types.
> The calculation is the genotype log likelihood ln p(R | g) - it essentially implements the log diploid form of equation 2 from Heng Li's paper. This will, of course, simplify to -N*log(2) for N reads in the diploid heterozygous case. I think I probably didn't implement this to make it easy to extend for non-diploid cases.
Hmm... I was deriving this from scratch. I guess this is a valid scaling of the likelihoods and not an error. I see that Heng Li decided to ignore bases that are not ref and alt... that does simplify the transformation from PHRED to likelihoods, but it seems to drop a factor of (1/3) by assuming that any non-ref base is equal to alt. Perhaps that is tradition at this point...
In any case, I probably see how to implement this for polyclone now. Thanks.
> It's an interesting idea but I think it actually makes things a lot more complicated and difficult to read. How is the mixture factor included, (p1 + p2) / 2 or p1 + p2 - log(2)? The latter is inconsistent since p1 + p2 + (-log(2)) has a different meaning. The former requires implementing a full symbolic algebra, including operator chaining (e.g. p1 + p2 + p3), and we still end up with confusing statements like L = (p1 + p2) / 2 where L, p1, and p2 are all Log types.
I have been using this in my own code for a while, and it works fairly well. Any operation on a Log type and a double returns a Log type, so L = (p1 + p2)/2 automatically converts 2 to a Log type.
My code for this is here: https://github.com/bredelings/BAli-Phy/blob/master/src/util/include/util/math/log-double.H I just committed some cleanups, so this version is tested, but not heavily tested.
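The core of the idea is just this kind of thing (a stripped-down sketch, not the actual BAli-Phy class):

```cpp
#include <algorithm>
#include <cmath>

// Stores log(x); operator* adds logs, operator+ uses log-sum-exp, and a double
// on either side is converted through its logarithm by the implicit constructor.
class log_double
{
    double log_value_;
public:
    log_double(double x = 0.0) : log_value_ {std::log(x)} {}  // from linear scale
    static log_double from_log(double lx) { log_double r; r.log_value_ = lx; return r; }
    double log() const { return log_value_; }
    double value() const { return std::exp(log_value_); }

    friend log_double operator*(log_double a, log_double b)  // x*y -> log x + log y
    { return from_log(a.log_value_ + b.log_value_); }
    friend log_double operator/(log_double a, log_double b)  // x/y -> log x - log y
    { return from_log(a.log_value_ - b.log_value_); }
    friend log_double operator+(log_double a, log_double b)  // x+y -> log-sum-exp
    {
        const auto hi = std::max(a.log_value_, b.log_value_);
        const auto lo = std::min(a.log_value_, b.log_value_);
        return from_log(hi + std::log1p(std::exp(lo - hi)));
    }
};

// Usage: L = (p1 + p2) / 2 works because 2 converts to log_double(2.0),
// so the division just subtracts log(2).
```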
I fixed the error when specifying --refcall. I also changed --refcall into a flag so it no longer accepts the BLOCKED or POSITIONAL argument - this is now controlled entirely by --refcall-block-merge-quality (--refcall-block-merge-quality=0 implies POSITIONAL).
Thanks!
I had a little bit of time today to move forward on the math for this.
I. Since you don't have estimates of the allele frequency, it looks like for the diploid case the code in individual_caller.cpp assumes that the prior probabilities Pr(homozygous ref) = Pr(heterozygous ref + alt). In theory you could have:
- Pr(ref|ref) = 1/4
- Pr(ref|alt) = 1/4
- Pr(alt|ref) = 1/4
- Pr(alt|alt) = 1/4
But instead it looks like Pr(ref|ref) = Pr(ref|alt)+Pr(alt|ref). Does this sound correct?
II. To generalize this to ploidy p, I see two routes:
- Pr(genotype) = (1/2)^p
- Pr(number of reference alleles = r) = (1/(p+1)), and Pr(genotype with r ref alleles) = (1/(p+1)) * (1/choose(p,r))
The first approach weights pretty heavily against homozygosity (as ploidy increases) by assuming that the allele frequency is (1/2).
The second approach assigns each number of reference alleles an equal weight, which is hopefully similar to placing a uniform prior on allele frequency. This avoids weighting too heavily against homozygosity. It also matches what you did for diploids. So, I like this approach.
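For example, at p = 2 the second scheme gives

$$
\Pr(\mathrm{ref|ref}) = \tfrac{1}{3}, \qquad \Pr(\mathrm{ref|alt}) = \Pr(\mathrm{alt|ref}) = \tfrac{1}{3}\cdot\tfrac{1}{2} = \tfrac{1}{6}, \qquad \Pr(\mathrm{alt|alt}) = \tfrac{1}{3},
$$

so Pr(homozygous ref) = Pr(heterozygous) = 1/3, matching the diploid behaviour noted above.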
It looks like, for diploids, you can consider the genotypes ref|ref and ref|alt, while ignoring alt|alt.
However, for polyploids, it looks like we can ignore homozygous alt, but not any of the other genotypes. So, we have to consider (2^p)-1 genotypes.
For the polyploid case (i.e. every haplotype has frequency 1/p) this can be done with a sum over p+1 terms.
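Concretely, something like this sketch (not octopus code; it just assumes per-read reference and non-reference log-likelihoods like those already computed in compute_homozygous_posterior()):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double log_sum_exp2(double a, double b)
{
    const auto hi = std::max(a, b), lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi));
}

// Route II for ploidy p with equal haplotype frequencies: a genotype class with
// r reference alleles contributes, per read, log((r/p)*exp(ref_ll) + ((p-r)/p)*exp(alt_ll)),
// each class gets the flat prior 1/(p+1) (which cancels in the posterior ratio),
// and the homozygous-reference posterior is class r = p over the sum of all classes.
double homozygous_ref_posterior(const std::vector<double>& ref_log_likelihoods,
                                const std::vector<double>& alt_log_likelihoods,
                                const unsigned p)
{
    std::vector<double> class_log_likelihoods(p + 1);
    for (unsigned r = 0; r <= p; ++r) {
        double ll {0.0};
        for (std::size_t i {0}; i < ref_log_likelihoods.size(); ++i) {
            ll += log_sum_exp2(std::log(double(r) / p) + ref_log_likelihoods[i],
                               std::log(double(p - r) / p) + alt_log_likelihoods[i]);
        }
        class_log_likelihoods[r] = ll;
    }
    const auto max_ll = *std::max_element(class_log_likelihoods.cbegin(), class_log_likelihoods.cend());
    double norm {0.0};
    for (const auto ll : class_log_likelihoods) norm += std::exp(ll - max_ll);
    return std::exp(class_log_likelihoods[p] - max_ll) / norm;
}
```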
I haven't thought yet about whether we can handle the polyclone case (i.e. unequal frequencies) without summing over 2^p terms...