parslet
automatic whitespace handling
Parslet grammars are littered with whitespace checks, which makes them harder to read, yet leaving the checks out means valid input fails to parse. Take this JavaScript parser as an example: https://github.com/matthewd/capuchin/blob/d47f4b19eb888b6a4fc5428d3d1fdfcdb551b183/lib/capuchin/parser.rb
There is sp? everywhere. There are very few cases where whitespace is not allowed, and decorating those cases with a different operator to join the atoms seems sufficient.
So, this is a feature request for some sort of functionality like this. pyPEG has a skipws option which seems to work ok.
I can see why you would want this, but I am not convinced that we really need it. After all, we can process parslet atoms as if they were data, so appending whitespace to each and every atom will not be hard. This really belongs on the mailing list, and if you provide a patch or an implementation idea, we'll consider it more thoroughly.
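To make the "atoms as data" idea concrete, here is a minimal sketch in plain Ruby. It is a toy model, not parslet code: a sequence is just an array of Regexp objects, and the helper name with_optional_whitespace is made up for illustration. The point is only that a grammar held as data can be post-processed so that every atom tolerates trailing whitespace.

```ruby
# Toy model: a sequence is an array of Regexp "atoms" (these are NOT parslet
# objects). Hypothetical helper: rewrite the sequence so that optional
# whitespace is allowed after every atom, then anchor it to the whole input.
def with_optional_whitespace(atoms)
  parts = atoms.map { |atom| "(?:#{atom.source})\\s*" }
  Regexp.new("\\A#{parts.join}\\z")
end

let_stmt = [/let/, /[a-z]+/, /=/, /\d+/]
loose = with_optional_whitespace(let_stmt)

loose.match?('let x = 1') # true
loose.match?('let x=1')   # true
```

Note that this blanket rewrite also accepts 'letx=1', because the whitespace after the keyword has become optional. That is the kind of case where a grammar still needs explicit control.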
I have some code that implements this: https://github.com/kschiess/parslet/compare/master...mikeyhew:ignore-whitespace. It changes the >> operator so that it consumes zero or more spaces between parslets, and adds << for when you don't want to allow spaces. I've been using it in this project and it has worked well so far, making it more pleasant to write the grammar.
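For readers who have not looked at the branch, the two-operator idea can be sketched with a tiny standalone combinator in plain Ruby. This is not the code from the linked branch and not parslet's API: the Atom class and the re helper are hypothetical, and the only thing taken from the description above is that >> skips whitespace between its operands while << does not.

```ruby
require 'strscan'

# Toy parser atom, NOT parslet's implementation: wraps a block that takes a
# StringScanner and returns the matched string, or nil on failure.
class Atom
  def initialize(&matcher)
    @matcher = matcher
  end

  def match(scanner)
    @matcher.call(scanner)
  end

  # Loose sequence: optional whitespace is skipped between the two atoms.
  def >>(other)
    sequence(other, skip_ws: true)
  end

  # Strict sequence: the two atoms must be directly adjacent.
  def <<(other)
    sequence(other, skip_ws: false)
  end

  # Returns the matched text only if the whole input is consumed, else nil.
  def parse(input)
    scanner = StringScanner.new(input)
    result = match(scanner)
    result if result && scanner.eos?
  end

  private

  def sequence(other, skip_ws:)
    left = self
    Atom.new do |scanner|
      start = scanner.pos
      a = left.match(scanner)
      scanner.skip(/\s*/) if a && skip_ws
      b = a && other.match(scanner)
      if b
        a + b
      else
        scanner.pos = start # backtrack on failure
        nil
      end
    end
  end
end

# Hypothetical helper: an atom that matches a regular expression.
def re(pattern)
  Atom.new { |scanner| scanner.scan(pattern) }
end

ident  = re(/[a-zA-Z]/) << re(/[a-zA-Z0-9]*/)
assign = ident >> re(/=/) >> re(/\d+/)

assign.parse('x1 = 42')  # succeeds: whitespace around "=" is skipped
assign.parse('x 1 = 42') # nil: "x" and "1" are not adjacent
```

With this split, the common case stays terse and only the places where whitespace matters, such as inside an identifier, need the stricter operator.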
@kschiess It would be interesting to hear what you think about the general idea, as well as whether this would break anything. (I think it caused an error with the infix_expression helper already, but didn't spend much time debugging.)
I'll take a look soon.
I like the idea that this is an option you give to the whole parse process. Perhaps we could (as an implementation) create a source that skips whitespace? I do realize this is a problem for a lot of people.
Hi, any progress on this? This would be a valuable addition. Thanks.
We would welcome a PR that solves this; however, we won't be able to dedicate our own time to it.
@kschiess the problem with a global option is that it restricts what you can parse. Even if your grammar is mostly whitespace-insensitive, there are still times when you need >> without whitespace in between. For example, parsing identifiers:
rule(:ident) { match['a-zA-Z'] >> match['a-zA-Z0-9'].repeat }
# how would you do this if the `Source` ignores whitespace?
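The concern can be illustrated without parslet at all. In the plain-Ruby sketch below, skip_whitespace is a hypothetical stand-in for a source that drops whitespace before the grammar sees the input, and IDENT is an ordinary regular expression playing the role of the ident rule. None of these names come from parslet.

```ruby
# Hypothetical stand-in for a "source" that silently drops all whitespace
# before the grammar sees the input.
def skip_whitespace(input)
  input.gsub(/\s+/, '')
end

# Plays the role of the ident rule: a letter followed by letters or digits.
IDENT = /[a-zA-Z][a-zA-Z0-9]*/

def identifiers(input)
  input.scan(IDENT)
end

identifiers('foo bar')                  # => ["foo", "bar"]
identifiers(skip_whitespace('foo bar')) # => ["foobar"]
```

Once the whitespace is gone at the source level, the grammar has no way to tell two adjacent identifiers from one longer identifier.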
I'll merge any kind of solution that doesn't lock people into whitespace-agnostic parsers. The default should be not to ignore whitespace. But I think we can make it easy to have a choice.