
feature requests for vg filter and vg call

Open ac2278 opened this issue 3 years ago • 5 comments

It's my understanding that when performing linear alignment of paired-end reads, people typically follow this workflow:

  (1) pre-process the reads in the FASTQ files
  (2) align the reads to a linear reference, generating a BAM
  (3) sort the aligned reads in the BAM file
  (4) remove PCR duplicates
  (5) keep only properly paired reads that meet a minimum mapping quality (MQ)
  (6) call variants

Would it be possible for the vg development team to add these two features?

  (1) a way to remove potential PCR duplicates before running vg call (equivalent to samtools rmdup)
  (2) a way to generate a VCF file (from vg call) using only properly paired reads that meet a minimum MQ

ac2278 · Apr 27 '21 18:04

Hi @jeizenga, is this the proper way to make a feature request?

ac2278 · May 07 '21 17:05

Yes, thanks! Sorry for being slow to respond; I've been tied up prepping a manuscript. I'm hoping to get it out the door pretty soon, at which point I'll have some more bandwidth to help out here.

jeizenga · May 07 '21 17:05

Great, and no worries! I really appreciate the help, @jeizenga. Good luck with the manuscript.

ac2278 · May 07 '21 17:05

Hi @jeizenga, would there happen to be any updates regarding these requests?

ac2278 · Jul 12 '21 18:07

Hi, sorry to have been so unresponsive about this. I've looked into it a bit, and the implementation looks more involved than I expected. However, I've brought the idea up within our group, and there's general agreement that this would be a good addition to our tooling.

If you want to work on developing a pipeline in the interim, one option might be:

  • Use vg surject to produce a BAM
  • Deduplicate the reads with Picard
  • Convert the BAM into a FASTQ
  • Re-map the de-duplicated reads
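As a rough sketch only, the four steps above could be scripted along these lines. Everything here beyond the step order is an assumption: the file names, the `dedup_remap_steps` and `run` helpers, the extra `samtools sort` calls (Picard expects coordinate-sorted input, and `samtools fastq` wants mates grouped by name), the use of a `picard` wrapper script on the PATH, and the choice of `vg map` with an xg and GCSA index for the re-mapping. Swap in whatever mapper and indexes you used originally.

```python
import subprocess
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    argv: List[str]               # command and arguments
    stdout: Optional[str] = None  # file that captures stdout, if the tool writes there


def dedup_remap_steps(xg: str, gcsa: str, gam: str, prefix: str) -> List[Step]:
    """Build the command list for surject -> dedup -> FASTQ -> re-map.

    All output file names are hypothetical and derived from `prefix`.
    """
    bam = f"{prefix}.surjected.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    name_bam = f"{prefix}.dedup.namesorted.bam"
    fq1, fq2 = f"{prefix}.dedup_1.fq", f"{prefix}.dedup_2.fq"
    out_gam = f"{prefix}.dedup.gam"
    return [
        # 1. Project the graph alignments onto the linear reference paths.
        Step(["vg", "surject", "-x", xg, "-b", gam], stdout=bam),
        # 2. Coordinate-sort, then drop duplicates with Picard.
        Step(["samtools", "sort", "-o", sorted_bam, bam]),
        Step(["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
              f"M={metrics}", "REMOVE_DUPLICATES=true"]),
        # 3. Name-sort so mates are adjacent, then write paired FASTQ files.
        Step(["samtools", "sort", "-n", "-o", name_bam, dedup_bam]),
        Step(["samtools", "fastq", "-1", fq1, "-2", fq2,
              "-0", "/dev/null", "-s", "/dev/null", "-n", name_bam]),
        # 4. Re-map the surviving reads to the graph (mapper choice is an assumption).
        Step(["vg", "map", "-x", xg, "-g", gcsa, "-f", fq1, "-f", fq2], stdout=out_gam),
    ]


def run(steps: List[Step]) -> None:
    """Execute each step in order, stopping at the first failure."""
    for step in steps:
        if step.stdout is None:
            subprocess.run(step.argv, check=True)
        else:
            with open(step.stdout, "wb") as out:
                subprocess.run(step.argv, check=True, stdout=out)
```

Calling `run(dedup_remap_steps("graph.xg", "graph.gcsa", "aln.gam", "sample"))` would execute the whole chain, provided vg, samtools, and Picard are installed. Building the commands separately from running them makes it easy to print and inspect them first.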

Alternatively, instead of remapping the reads, you could extract a list of the de-duplicated read names and use vg filter -N to subset the GAM file down to the deduplicated reads. However, I expect that this could be pretty memory inefficient.
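A minimal sketch of the name-list route, assuming the de-duplicated BAM has been dumped to SAM text (for example with `samtools view`). The helper names `unique_read_names` and `filter_command` are hypothetical. One caveat to check against your vg version: as far as I know, `-N` treats each line of the file as a read-name prefix, so a name that is a prefix of another read's name would keep both.

```python
from typing import Iterable, List


def unique_read_names(sam_lines: Iterable[str]) -> List[str]:
    """Collect each read name (SAM column 1) once, in first-seen order."""
    seen = set()
    names = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        name = line.split("\t", 1)[0]
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


def filter_command(names_file: str, gam: str) -> List[str]:
    """Build the vg filter call; vg writes the subset GAM to stdout."""
    return ["vg", "filter", "-N", names_file, gam]
```

The names would be written one per line to a file, and the output of the `vg filter` command redirected to a new GAM. Both mates of a pair share a name in the BAM, so they are recorded only once here.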

jeizenga · Aug 10 '21 01:08