
Echtvar slower on VCF with many samples

Open edg1983 opened this issue 3 years ago • 2 comments

Hi Brent,

I'm integrating echtvar as our standard annotation tool for large cohort VCFs. I've noticed that annotating a VCF with many samples (I have some with up to 10000 individuals, and we may have up to 100000 in the future) is slower than annotating the same variants after removing the sample columns. Is this expected?

Based on this observation, I suppose that echtvar parses values from all the sample columns and that this slows down file reading. Since echtvar annotates variants based on position and alleles, sample information is not needed at all. Would it be possible to make it ignore the sample columns, so that annotation speed is unaffected by the number of samples?

Thanks again for a great tool!

edg1983 avatar May 04 '22 04:05 edg1983

Hi Edoardo, this is expected. The problem is that simply reading and then writing the variants becomes a bottleneck. echtvar uses htslib, so it doesn't actually "parse" the sample parts of variants, but it must still read them from and write them to disk. This will be true for any tool, not just echtvar. If you use BCF instead of VCF, that will help some (maybe up to 3X), but there's not much that can be done about it other than parallelization. I don't plan to add that to echtvar, but you can split the file and run multiple annotation processes.
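For anyone who wants to try the split-and-parallelize route, here is a minimal sketch, not something echtvar provides. It assumes the input is an indexed BCF, that `bcftools` and `echtvar` are on `PATH`, and that splitting by whole chromosome is acceptable. The file names, the `workdir` layout, and the helper names (`chunk_commands`, `concat_command`, `annotate_in_parallel`) are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def chunk_commands(input_bcf, echtvar_zip, region, workdir):
    """Build the two commands for one region: extract it, then annotate it.

    The chunk file naming is a hypothetical convention for this sketch.
    """
    chunk = f"{workdir}/{region}.bcf"
    annotated = f"{workdir}/{region}.anno.bcf"
    # bcftools view -r needs an index on the input file.
    extract = ["bcftools", "view", "-r", region, "-Ob", "-o", chunk, input_bcf]
    annotate = ["echtvar", "anno", "-e", echtvar_zip, chunk, annotated]
    return [extract, annotate], annotated


def concat_command(annotated_chunks, output_bcf):
    """Build the command that joins the annotated chunks, in the order given."""
    return ["bcftools", "concat", "-Ob", "-o", output_bcf, *annotated_chunks]


def annotate_in_parallel(input_bcf, echtvar_zip, regions, workdir, output_bcf, jobs=4):
    """Run extract + annotate for each region concurrently, then concatenate."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    plans = [chunk_commands(input_bcf, echtvar_zip, r, workdir) for r in regions]

    def run(plan):
        commands, annotated = plan
        for cmd in commands:
            subprocess.run(cmd, check=True)
        return annotated

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        annotated_chunks = list(pool.map(run, plans))  # keeps region order
    subprocess.run(concat_command(annotated_chunks, output_bcf), check=True)
```

Something like `annotate_in_parallel("cohort.bcf", "gnomad.echtvar.zip", ["chr1", "chr2"], "chunks", "cohort.anno.bcf")` would then run one pipeline per chromosome. The regions should be listed in the same order as in the file, since `bcftools concat` joins the chunks in the order given. Each process still reads and writes the sample columns for its own chunk, so this spreads the I/O cost across processes rather than removing it.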

brentp avatar May 04 '22 05:05 brentp

Yes, I suspected something like that.

With larger and larger datasets being generated, I'm wondering whether we could gain an advantage from a new implementation of the VCF file format in which variant information and the genotype matrix are kept separate, so that one can annotate and filter based on variant info without reading all the genotypes, and only then access the genotypes.
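With today's tools, this separation can only be approximated by keeping a sites-only copy of the file next to the full one. The sketch below just builds the commands for that. It assumes `bcftools` and `echtvar` are installed; the file names and the helper name `sites_only_commands` are hypothetical. Creating the sites-only file still costs one full pass over the genotypes, so it mainly pays off when the same variants are annotated or filtered more than once.

```python
def sites_only_commands(cohort_bcf, echtvar_zip, sites_bcf, annotated_sites_bcf):
    """Build commands to drop genotypes once, then annotate the small file.

    All paths are placeholders chosen by the caller.
    """
    # -G drops the per-sample genotype columns, leaving only site information.
    drop_genotypes = ["bcftools", "view", "-G", "-Ob", "-o", sites_bcf, cohort_bcf]
    # Annotation then runs on a file with no sample columns to read or write.
    annotate = ["echtvar", "anno", "-e", echtvar_zip, sites_bcf, annotated_sites_bcf]
    return [drop_genotypes, annotate]


if __name__ == "__main__":
    import shlex

    for cmd in sites_only_commands(
        "cohort.bcf", "gnomad.echtvar.zip", "cohort.sites.bcf", "cohort.sites.anno.bcf"
    ):
        print(shlex.join(cmd))
```

The annotated sites file can then be filtered on its own, and the genotypes only need to be touched for the variants that survive filtering.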

Thanks for the BCF suggestion; echtvar is still fast enough anyway. I don't think we will need to parallelize, so it will take at least a few minutes and give me time for a coffee break ;)

edg1983 avatar May 04 '22 05:05 edg1983