souporcell icon indicating copy to clipboard operation
souporcell copied to clipboard

Understanding genotype fields added in consensus.py

Open dfirer opened this issue 3 years ago • 1 comments

I'm trying to understand what the added fields T, E, GO, and GN are in the vcf output by consensus.py (cluster_genotypes.vcf) I don't see these tags in the output from freebayes, and that they're added when assigning snps to particular clusters. Currently when I try to annotate the vcf I am getting [W::vcf_parse_format] FORMAT 'T' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'E' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'GO' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'GN' is not defined in the header, assuming Type=String Encountered error, cannot proceed. Please check the error output above. and since I am using bcftools v1.9 the option --force isn't available so I want to add the format info to the vcf header.

An added bonus, FILTER 'BACKGROUND' is also not included in the vcf header.

I am trying to annotate the souporcell vcfs I have so that I can match clusters with the sample vcfs I have (I did not run souporcell with the --known_genotypes tag... trying different methods and will eventually also try doing that) as well as across souporcell runs and I came across shared_samples.py (this was very helpful thank you for writing that) and noticed you were using the output from freebayes rather than the edited vcf created later on. I was curious as to why you chose to use the freebayes output rather than cluster_genotypes.vcf

dfirer avatar Mar 24 '21 19:03 dfirer

Hmm... I am trying to figure this out as well lol. Its been a while since i wrote this code. T and E seem to be about whether the ambient RNA + genotyping model thought this variant was truth vs error? But this might be referring to two different types of errors, will have to inspect code further. GO appears to be the non normalized (log likelihood) for each genotype and the GN appears to be a posterior log probability (so normalized) for each genotype. Okay so reading more and T and E are log likelihoods of true variant vs arising from ambient RNA. So the posterior probabilities would be exp(T - log(exp(T)+exp(E))). Basically how it works is that each cluster after doublet removal has certain allele counts and those could arise from that individual being homozygous ref + some soup, homozygous alt + some soup, het + some soup, or the entire variant could be a false positive in which case all individuals will have the same underlying allele fractions and that is the (E)rror case.

I believe the reason for using freebayes output for shared_samples.py is because I don't think there is a line for line equivalency between the freebayes output and the cluster_genotypes.vcf file because I only output variants used for clustering in the latter. And to get the allele counts from the alt.mtx and ref.mtx, I need the record number which is associated with the freebayes vcf. So admittedly this was something of a shortcut but so far I have found the signal to noise quite high on matching clusters between experiments so I don't think it is a problem.

Hope this helps.

wheaton5 avatar Mar 25 '21 00:03 wheaton5