pyregion
pyregion copied to clipboard
Make ds9 region file parser fast
The docs contain this note: "pyregion is rather slow, likely due to a inefficient parser." ( see http://pyregion.readthedocs.org/en/latest/index.html#documentation )
It's really slow ... parsing 1000 simple circle regions
with open('ds9.reg', 'w') as f:
for _ in range(1000):
f.write('circle(0,0,1)\n')
takes 7 seconds:
%time pyregion.open('ds9.reg')
CPU times: user 7.59 s, sys: 20 ms, total: 7.61 s
Astropy bundles ply, so if someone takes this on it might be worth considering using ply
(currently pyparsing is used).
I'm using pyregion
in a data analysis / image plotting pipeline and region file parsing is actually the bottleneck, but it's acceptable and I'm not very familiar with parsing and region files, so I don't plan to take this on ... just wanted to make a GH issue so this known issue is not forgotten.
@cdeil - can you do some basic profiling (%prun) to see what the bottleneck is?
All the time is spent in pyparsing
.
RegionParser.parseLine
calls self.parser.parseString(l)
where self.parser
is a pyparsing
object created in RegionParser.__init__
(see here).
I have no idea how to optimise that ... this should probably be rewritten with ply
(because it's bundled in astropy, i.e. the extra pyparsing
dependency is avoided) and only then one should start profiling ... but it's some effort and requires someone that knows (or has time to learn) how ply
works.
Thanks for checking!
The slowness is likely due to my lack of knowledge instead pyparsing is inherently slow.
My recollection is that I wanted to keep passing the relevant part of the original string during the parsing, which somehow complicated the code. But not sure if this is the main reason for slow performance.
I note that matplotlib depends on pyparsing. So the pyparsing may not be an extra dependency.
This speed issue in the DS9 parser in pyregion is still of interest.
Next week @joleroi will give a presentation on regions
(but also mentioning pyregion
) at PyAstro17 and has updated a comparison: https://github.com/astropy/regions/pull/123
Long-term it would be nice if we could have one easy to understand parser that's complete and reasonably fast (speed is not the most important thing, it doesn't have to be super fast).
So pyregion uses http://pyparsing.wikispaces.com/ and one could try to understand the parser here and why it's slow (profile and see which part of the parser is slow and if it can be restructured to be faster).
In https://github.com/astropy/regions/blob/master/regions/io/read_ds9.py there's a handwritten parser. I don't know how complete it is or could be made with some work.
To me, from a quick look, the parser in pyregion and regions looks about the same complexity. In either case I think I would have to read and think for an hour or two to grok them. (cc @keflavich and I think @astrofrog and @joleroi as authors)
If someone becomes really interested in this topic and wants to prototype and benchmark parsers, other options worth looking at are http://www.dabeaz.com/ply/ and https://github.com/erikrose/parsimonious . I don't have time now or in the coming weeks, but I might in the summer if no-one figures out and finishes up this DS9 parsing part.
One more thing I forgot to mention: the nice thing about https://github.com/erikrose/parsimonious is that (from the looks of it, I haven't tried) one just writes down the grammar of the text to be parsed in BNF form. Independent of whether parsimonious is used or not (probably not) in the end, having the ds9 grammar written down would be nice, e.g. in the docstring of the parser .py
file, or the developer docs section.
I don't presume the BNF for DS9 region files exists, but it could probably be extracted from the pyparsing parser code in pyregion, which looks a bit similar.
We'll talk this week, but my recollection from last year was either no one understood how to write a grammar for ds9, or we tried and found it incompatible.
With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time. I don't know if the same is true for a grammar-based approach; it's not obvious to me whether ds9 regions follow a single self-consistent grammar.
Probably the existing code could be made more clear by breaking it down into individual region types; right now, there are extra if
statements around for the cases that have indefinite numbers of parameters (e.g., polygons, annuli). I won't object to a BNF-based approach if it can be demonstrated, but I have to see it to believe it.
how to write a grammar for ds9,
What about the pyparsing based parser in pyregion? Is there a problem apart from the speed issue? Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly? If yes, maybe make a new issue in the pyregion tracker (to keep this one focused on the speed problem)?
This is the table with the current status
With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time.
I agree, we should work until the table linked above shows 100% for coverage everywhere for the regions package. Speed is already much better than pyregion, so we probably don't need to worry about that (as @cdeil said). We discussed this already a year ago, and couldn't come up with a complete DS9 grammar.
Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?
See the script to generate the table, there are many warnings like this
UserWarning: Failed to parse : box(18:13:27.558,+31:22:30.76,13.32",6.66",19.98",9.99",45.2623) # font="helvetica 10 bold roman" text={Box Annulus}
warnings.warn("Failed to parse : " + l)
Also, most of the files that cannot be parsed with the regions package, show only a 75% coverage with pyregion. So I would be reluctant to try going down the grammar avenue again.
I didn't know that pyregion is not a complete DS9 region parser and I think wasn't part of the DS9 grammar / parse discussions last year (or forgot).
OK, fine with me to go with a hand-written parser if you think it's a good (or even the only possible) solution.
The pyregions
docs are a bit all over the place.
- On the front page there's a note "pyregion is rather slow, likely due to a inefficient parser. Any contribution will be welcome." and another one pointing to the
regions
package without saying anything about the relation of the two. - At the top of http://pyregion.readthedocs.io/en/latest/getting_started.html there's a statement containing "It reads most of the region files created by ds9.". I think it would be good to give an example of such a case and say what happens in the docs.
The regions
docs also don't really say anything about the relation or regions
and pyregion
.
I think it would be good to add an About
section to both, briefly describing what they are with a few bullet points.
Given the discussion here I think it's roughly: pyregion
is a legacy package, will likely remain as-is. (Would still be nice to increase test coverage a bit, improve docs a bit and make one more release).
regions
is supposed to be a full replacement and contain a full DS9 region parser. I think the "Shape" and "ShapeList" abstractions from pyregion
might be good ones and should become part of the regions.io.ds9
API as an intermediate layer?
@joleroi or @keflavich - Can you do such a docs update this week or should I put it on my TODO list?
Note that @olebole as Debian Astro lead asked me a week ago if he should include pyregion
and / or regions
in Debian. My answer was that "both" would be good. I think there's many pyregion users and it would be good to have it also, at least for the next 5 years, then maybe it can be dropped at some point (like pywcs and pyfits were recently now that there's Astropy).
For regions
I don't think we can make it into Astropy 2.0. It's far from finished / polished / user-tested. It'll be under development for another year and we don't want to commit to stable API like putting as astropy.regions
implies, no?
Agreed that both pyregion and region need to coexist for at least a while.
Could you make the docs thing into a separate issue? Maybe we'll be able to address it this week, but it should be an Issue.
@keflavich - OK. I made a separate issue suggesting the docs improvements here: https://github.com/astropy/regions/issues/124
- coverage of pyregion
I have made a pull request (https://github.com/astropy/pyregion/pull/111) that should improve the coverage. As far as I can tell, it now can parse all the regions in the test. The current version of the regions_pyregion_comparison.py reports coverage around 82%, but this is because the script overestimates the number of defined regions in the file.
- speed of pyregion
I think that the reason that pyregion is slow is likely not due to the inherent performance of the pyparsing, but more likely due to the inefficient grammar I created (and things inbetween). By no means I am an expert on this grammar thing and also the pyparsing.