PyVCF

Consider using a parser combinator approach

Open jamescasbon opened this issue 12 years ago • 5 comments

I'm still unbelievably angry that VCF even exists, when they could have used hdf5 or something.

Anyway, I'm not convinced that hand-rolled parsers are really where it's at in the 21st century, so maybe we should investigate the performance of a parser combinator approach.

jamescasbon avatar Jan 30 '13 23:01 jamescasbon

I have a prototype of this working for the header using funcparserlib. I think it is quite nice: you get a separation of concerns between a tokenizer, a grammar and an object model. It's clearly slower, but for the header we don't care about speed, we care about correctness. The main sample parsing code will still require hand-rolled loops for performance, but with the improved header model it should be easier to handle the type casting. I will share when I'm not posting from a phone.
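To make the tokenizer / grammar / object model split concrete, here is a minimal sketch in plain Python. It is not the prototype mentioned above and it does not use funcparserlib's API: the combinators (`tok`, `seq`, `alt`, `many`, `fmap`) are hand-written stand-ins, and the names `Info`, `tokenize` and `parse_info` are hypothetical. It assumes a well-formed `##INFO` line with no whitespace outside quoted strings, and it leaves every value as a string (no type casting).

```python
import re
from collections import namedtuple

# Object model (hypothetical): one record per ##INFO header line.
Info = namedtuple("Info", "id num type desc")
Token = namedtuple("Token", "kind value")


class ParseError(Exception):
    pass


# --- Tokenizer: turns the raw line into a flat list of tokens. ---
TOKEN_RE = re.compile(r'''
    (?P<PREFIX>\#\#)
  | (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<NAME>[A-Za-z0-9_.]+)
  | (?P<OP>[=<>,])
''', re.VERBOSE)


def tokenize(line):
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at {pos}")
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# --- Combinators: a parser is a function (tokens, pos) -> (value, new_pos). ---
def tok(kind, value=None):
    def parse(tokens, pos):
        if pos < len(tokens):
            t = tokens[pos]
            if t.kind == kind and (value is None or t.value == value):
                return t.value, pos + 1
        raise ParseError(f"expected {value or kind} at token {pos}")
    return parse


def seq(*parsers):
    def parse(tokens, pos):
        out = []
        for p in parsers:
            v, pos = p(tokens, pos)
            out.append(v)
        return out, pos
    return parse


def alt(*parsers):
    def parse(tokens, pos):
        for p in parsers:
            try:
                return p(tokens, pos)
            except ParseError:
                pass
        raise ParseError(f"no alternative matched at token {pos}")
    return parse


def many(parser):
    def parse(tokens, pos):
        out = []
        while True:
            try:
                v, pos = parser(tokens, pos)
            except ParseError:
                return out, pos
            out.append(v)
    return parse


def fmap(parser, fn):
    def parse(tokens, pos):
        v, pos = parser(tokens, pos)
        return fn(v), pos
    return parse


# --- Grammar: key=value pairs inside <...>, mapped onto the object model. ---
def op(symbol):
    return tok("OP", symbol)


value = alt(fmap(tok("STRING"), lambda s: s[1:-1]), tok("NAME"))
pair = fmap(seq(tok("NAME"), op("="), value), lambda v: (v[0], v[2]))
pairs = fmap(seq(pair, many(fmap(seq(op(","), pair), lambda v: v[1]))),
             lambda v: dict([v[0]] + v[1]))
info_line = fmap(
    seq(tok("PREFIX"), tok("NAME", "INFO"), op("="), op("<"), pairs, op(">")),
    lambda v: Info(v[4]["ID"], v[4]["Number"], v[4]["Type"], v[4]["Description"]),
)


def parse_info(line):
    tokens = tokenize(line)
    result, pos = info_line(tokens, 0)
    if pos != len(tokens):
        raise ParseError("trailing tokens after header line")
    return result
```

Calling `parse_info('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')` returns an `Info` record with the four fields as strings. The grammar reads close to the spec's description of the line, which is the appeal; tolerating real-world spec violations would mean loosening the tokenizer or adding alternatives to the grammar.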

jamescasbon avatar Feb 14 '13 08:02 jamescasbon

Have a look at this gist... https://gist.github.com/jamescasbon/4960150

jamescasbon avatar Feb 15 '13 12:02 jamescasbon

@martijnvermaat any thoughts?

jamescasbon avatar Feb 19 '13 17:02 jamescasbon

Yes, I like this a lot. It would be good to have an idea of what kinds of spec violations are out there and whether or not we can still parse them using this approach. I think parsing all headers in our test data is a first minimal requirement.

If we can still be flexible enough to parse what is being used in practice I'm all for it. I'd be happy to see the regexes go.

By the way, do you know how funcparserlib compares to pyparsing? The latter is the only one I've used (and more popular I think) and while it did everything I needed, I wasn't terribly impressed (especially documentation).

martijnvermaat avatar Feb 19 '13 21:02 martijnvermaat

This is kind of connected with #89, but I was wondering if there is already a multiprocess solution to read the whole VCF into memory, so that it can be moved to a pandas dataframe for operations.

Or some other strategy to read the bulk of the VCF quickly.

mattions avatar Dec 16 '14 10:12 mattions