viral-ngs icon indicating copy to clipboard operation
viral-ngs copied to clipboard

Report read depth in VCF file for samples that do not have alternate alleles at a position

Open haydenm opened this issue 8 years ago • 6 comments

With #356, the VCF file created by intrahost.py's merge_to_vcf reports read depth, at a given position, only for samples that contain alternate alleles at that position. If a sample has no alternate alleles at the position, there is no line for that position in that sample's vphaser output, and thus we do not have easy access to the read depth. However, when comparing allele frequencies for two samples at a position, it is important to know read depth even for a sample that has no alternate alleles.

Consider a sample s that has no alternate allele at a reference position rp. In the merge_to_vcf function, we can use cm (the CoordMapper object) to obtain the position sp in s's assembly that corresponds to rp in the multiple alignment. Then, we would need to obtain the read depth at sp using the alignment of reads back to s's own assembly. Finally, this read depth can be added to iSNVs_read_depth immediately after https://github.com/broadinstitute/viral-ngs/blob/8016f2e214efd072be949283726458584755a377/intrahost.py#L762.

haydenm avatar Jun 21 '16 22:06 haydenm

Hm, can we chat about this one in more detail in the morning? I think the code that produces this file is not well set up for that kind of information in all cases.. You're already getting DP for samples that have no alternate allele as it is right? (As long as some sample has an alt allele at that position)

dpark01 avatar Jun 21 '16 22:06 dpark01

Sure, let's talk about this -- it's not so trivial. I haven't yet run the update from #356 so I'm not certain, but this is something I had in mind when writing that update. I believe we will not get DP for samples that have no alternate alleles at a position, even if some other sample has an alternate allele at that position. As it is, I don't know how merge_to_vcf would know the read depth at those positions for these samples.

haydenm avatar Jun 21 '16 22:06 haydenm

This will be a bit tricky in that merge_to_vcf doesn't actually have this information available in its input files.. maybe we should consider having vphaser_one_sample emit a row for every position in the genome, regardless of whether intrahost variants exist.... it'd be a bit of a rewrite though.

dpark01 avatar Jun 22 '16 13:06 dpark01

I think that's one reasonable option. I'm not sure how vphaser works and easy it would be to make that change.

One other option to consider is for the map_reads_to_self rule (or another rule that depends on its output) to output a depth at each position. There might be a canonical format for this, or just each line is [CHROM]"\t"[POS]"\t"[DEPTH]. I already do this manually after each run with an R script that produces a coverage plot by first calling samtools depth [some options] data/02_align_to_self/{sample}.mapped.bam | awk [some formatting] > data/02_align_to_self/{sample}.mapped.depth. And then these depth files would be input to the merge_to_vcf rule.

I'm happy to help with either approach. There are two clear components -- getting depth and incorporating this into merge_to_vcf -- so perhaps two of us can divide it up. Maybe you and @tomkinsc can decide what you think the best way to go is, and let me know.

haydenm avatar Jun 22 '16 22:06 haydenm

No, the map to self BAMs will be in the wrong coordinate space for your needs. This does connect with Shirlee's desire to have a general interhost SNP caller too. @haydenm do you want to be the one that gives varscan a try? You can start with the alignment and all the BAMs and it should give you a VCF of all the interhost and intrahost variants.

dpark01 avatar Jun 22 '16 23:06 dpark01

Why is it in the wrong coordinate space? I thought the opposite, that it's the coordinate space we want the coverage in. Here's how I understood it, but let me know if any of this is wrong: The information output by vphaser (positions, frequencies, and read depth) are in the space of the map to self. Then merge_to_vcf uses a multiple alignment to map those positions to the reference space -- the VCF file is in the reference space, but the data (frequencies and read depth) come from the "self" space. If we're adding in read depth for additional samples to the VCF file, then the read depth should also come from the "self" space, albeit with positions in reference space (mapped using the cm object). This is how I think it should be, at least.

I'm not yet sure about varscan, or even if I'm the right person to evaluate it for our needs. I'll take a look to get an idea of how long I think it would take to test.

haydenm avatar Jun 23 '16 01:06 haydenm