parsimonious
Decide whether to turn to compilation tricks for speed
Cython, numba, Rust, C, etc.
Parsimonious was originally targeted to pure Python in the days when vendor branches were the norm and compilers were left off of servers for security reasons. Now that wheels are mature, pip supports hash-checking, and servers are often virtual and disposable, we can consider loosening things up.
- [ ] Profile Parsimonious, and see what's hot. What parts should we seek to speed up? Is there anything suitable?
- [ ] Choose a weapon.
- [ ] Write some PRs.
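As a starting point for the profiling step, here is a minimal sketch using the standard library's `cProfile` and `pstats`. The `profile_call` helper and the `match_nested` workload are hypothetical stand-ins, not part of Parsimonious; in practice you would pass a real grammar's `parse` method and a representative input instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the expensive call trees float to the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical stand-in workload: a tiny recursive matcher for nested parens.
def match_nested(text, pos=0):
    """Return the index just past a balanced "(...)" group starting at pos."""
    if pos < len(text) and text[pos] == "(":
        inner_end = match_nested(text, pos + 1)
        if inner_end < len(text) and text[inner_end] == ")":
            return inner_end + 1
        raise ValueError("unbalanced parentheses")
    return pos


if __name__ == "__main__":
    end, report = profile_call(match_nested, "(" * 200 + ")" * 200)
    print(report)
```

Sorting by `"tottime"` instead of `"cumulative"` highlights the functions that are themselves slow rather than the ones that merely sit high in the call tree, which is probably the more useful view when looking for candidates to compile.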
Some starting info on Cython is at https://github.com/erikrose/parsimonious/issues/119#issuecomment-316462408.
A few notes:
- Compiling wheels for a bunch of different platforms can be a pain and adds a lot of complexity to deploying to PyPI. Not distributing wheels is OK for C extensions, but most environments do not have a Rust compiler (and so cannot compile their own wheels locally).
- Parsers tend to be used on untrusted input, so directly using a memory-unsafe language like C seems like a questionable decision from a security perspective. I don't know if Cython has the same issues. An attacker would have to exploit a bug in Cython, not in your C code, so presumably that mitigates it slightly.
- Numba seems very geared towards numerical code; I'm skeptical it'd be helpful.
- This introduces compatibility issues with IronPython, Jython, and possibly PyPy.
Profiling to find hotspots sounds like a great idea, but I'm skeptical the speedup from using native code or a JIT is worth the cost in compatibility and complexity.
Yep. Unsafety is why I listed Rust before C. I don't anticipate C being the winner.
Yep. Rust might be a distribution challenge atm.
AFAIK Numba is a general JITing compiler, able to generate machine code from anything with determinable types. But my information is several years old.
PyPy, IIRC, gives Parsimonious maybe a 2-3x speed boost, likely because it's full of recursion rather than looping, and, when last I checked, it had simplistic JIT-triggering conditions that required a 100th time around a loop. (If we rejiggered to use trampolining, which would also solve RecursionErrors, that might change.)
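To make the trampolining idea concrete, here is a minimal sketch of the general technique, not Parsimonious's actual code. The names `trampoline` and `match_nested` are hypothetical. Each would-be recursive call is `yield`ed to a driver loop that keeps its own explicit stack, so the depth of the input no longer consumes Python stack frames. Whether this shape actually triggers PyPy's JIT sooner is an assumption that would need measuring.

```python
def trampoline(gen):
    """Drive a generator-based recursive function with an explicit stack."""
    stack = [gen]
    value = None
    while stack:
        try:
            # Resume the innermost call, handing it the last sub-result.
            step = stack[-1].send(value)
        except StopIteration as stop:
            # The innermost call returned; pass its result to the caller.
            stack.pop()
            value = stop.value
        else:
            # The call yielded a sub-call; run that next.
            stack.append(step)
            value = None
    return value


def match_nested(text, pos=0):
    """Return the index just past a balanced "(...)" group starting at pos.

    Generator version: it yields a sub-call instead of recursing directly.
    """
    if pos < len(text) and text[pos] == "(":
        inner_end = yield match_nested(text, pos + 1)
        if inner_end < len(text) and text[inner_end] == ")":
            return inner_end + 1
        raise ValueError("unbalanced parentheses")
    return pos


if __name__ == "__main__":
    depth = 100_000  # far beyond the default recursion limit
    print(trampoline(match_nested("(" * depth + ")" * depth)))
```

The trade-off is that every call now goes through a generator and the driver loop, so on CPython this is likely slower per call than plain recursion; the gain is the absence of `RecursionError` on deeply nested input.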
Regarding IronPython etc., I'm not proposing to drop the pure-Python codepaths. If we found some kind of compilation worthwhile, we'd support and test both.
Good points, @lucaswiman. The safety thing is a primary motivation for the Rust parsing framework nom, which I discovered after writing my Cython parser and was happy to see a lot of shared goals.
Btw, here is my Cython parsing experiment (note that the README is mostly notes to myself rather than accurate documentation, and also that it currently only works for Python 3 with unicode data). It contains both the Cython and the equivalent pure Python implementations, so from a distribution perspective you could install that setup without the compiled code; it just wouldn't be as fast.
And I agree that there are still lots of things that can be done to optimize the Python code before resorting to compiled languages.