Performance improvements

Open sambrightman opened this issue 8 years ago • 3 comments

I haven't looked at this branch for a few weeks, and you should bear in mind that I've never used Cython before. I've rebased it and it passes all current tests.

My observation was that PyVCF is still rather slow in reading & writing large, real-world VCFs (about 6-8x slower than a simplistic split-index-join approach). The individual commits here should be reasonably clear, and I found:

  • using integer instead of string comparisons was worth about 5-10% in both the Python and Cython implementations
  • INFO parsing became the bottleneck, and a very naive Cython version made about 25% difference
  • formatting strings for writing was also slow, and a very naive Cython version made about 20% difference
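For reference, the "simplistic split-index-join approach" used as the baseline above can be sketched roughly as follows. This is only an illustration of the technique, not code from the branch; the function name `rewrite_column` and the sample line are made up for the example.

```python
def rewrite_column(line, index, transform):
    """Split a tab-delimited VCF data line, change one column, and re-join.

    No field is interpreted: INFO, FORMAT and sample columns stay as raw
    strings, which is why this is much cheaper than full parsing.
    """
    fields = line.rstrip("\n").split("\t")
    fields[index] = transform(fields[index])
    return "\t".join(fields) + "\n"


# Hypothetical data line: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
sample_line = "1\t12345\trs1\tA\tG\t50\tPASS\tDP=10;AF=0.5\n"
rewritten = rewrite_column(sample_line, 2, str.upper)  # touch only the ID column
```

The trade-off is that nothing is validated or typed, so it only suits jobs that touch a few columns by position.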

I haven't had much luck with line-profiling to improve things further. One idea might be to lazy-parse the INFO fields – keep them as strings until accessed. They still seem to be a bottleneck even with Cython (large real-world VCFs may contain many annotations, for example).
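A minimal sketch of the lazy-parsing idea, assuming a wrapper object stands in for the parsed INFO dict. `LazyInfo` is a hypothetical name and is not part of PyVCF; values are left as strings here because proper type conversion would need the header metadata.

```python
class LazyInfo:
    """Hypothetical wrapper: keep the raw INFO string, parse on first access."""

    def __init__(self, raw):
        self._raw = raw
        self._parsed = None  # filled in on first lookup

    def _parse(self):
        parsed = {}
        if self._raw != ".":  # "." means the INFO column is empty
            for entry in self._raw.split(";"):
                if not entry:
                    continue
                key, sep, value = entry.partition("=")
                # Entries without "=" are flags; store them as True.
                parsed[key] = value if sep else True
        return parsed

    def __getitem__(self, key):
        if self._parsed is None:
            self._parsed = self._parse()
        return self._parsed[key]

    def __str__(self):
        # This sketch is read-only, so the raw string is always still valid
        # and an untouched record can be written back without any parsing.
        return self._raw


info = LazyInfo("DP=10;AF=0.5;DB")
```

Records that are only filtered on other columns and written back out would never pay the INFO parsing cost. A real version would also need to handle assignment and re-serialise modified fields.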

Downside here is further duplication between Python and Cython, but that seems unavoidable if supporting pure Python remains a priority.

sambrightman avatar Feb 19 '17 14:02 sambrightman

Nice work. The larger problem, it seems to me, is that VCF is madder than a box of frogs as a file format, e.g. it includes at least two incompatible delimited field specs IIRC.

How is tooling support for binary call format these days? Shouldn't that be the target format for performance?

jamescasbon avatar Feb 20 '17 10:02 jamescasbon

Seems to me that it's still worth having the best performance in all use-cases. BCF has been discussed for years, but progress is slow. Shall we either merge or close this? There hasn't been a release for a while either, and one would probably be useful for people.

sambrightman avatar Jan 07 '18 20:01 sambrightman

Merged in https://github.com/dridk/PyVCF3/commit/9e5de0f2ce892167d5d0b13073e39864632fa308

dridk avatar Jan 28 '22 22:01 dridk