
Feature requests: feedback from parsing tweets

Open sayrer opened this issue 6 years ago • 3 comments

I wrote a tweet parser with Pest here: https://github.com/sayrer/twitter-text/blob/master/parser/src/twitter_text.pest

It was a good experience, but two small things would have helped a lot.

  1. Support for a parse() method that takes a character iterator rather than a string. This would let me NFC-normalize the input text without allocating an extra string.

  2. Optional support for more detailed character offsets in Pair (UTF-16 and UTF-32, in addition to the byte offsets). Right now, finding these offsets requires iterating over the input string with str.char_indices after parsing, but I bet Pest could provide them.
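For reference, the post-parse workaround described in item 2 can be sketched in plain Rust. This is only an illustration, not part of pest: the `Offsets` struct and the `convert_offsets` function are hypothetical names, and the sketch assumes the byte offsets are sorted in ascending order and fall on character boundaries (as span boundaries reported by a parser normally would).

```rust
/// One position in a string, measured in three different units.
/// Hypothetical type for illustration; pest does not define it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Offsets {
    /// Byte offset into the UTF-8 string.
    pub utf8: usize,
    /// Offset in UTF-16 code units.
    pub utf16: usize,
    /// Offset in Unicode scalar values (what a UTF-32 index would be).
    pub utf32: usize,
}

/// Converts byte offsets to UTF-16 and UTF-32 offsets in a single pass.
///
/// Assumes `byte_offsets` is sorted ascending and every entry lies on a
/// char boundary of `input`. Conversion stops at the first entry that
/// breaks that assumption.
pub fn convert_offsets(input: &str, byte_offsets: &[usize]) -> Vec<Offsets> {
    let mut out = Vec::with_capacity(byte_offsets.len());
    let mut targets = byte_offsets.iter().copied().peekable();
    let mut utf16 = 0;
    let mut utf32 = 0;

    for (byte_idx, ch) in input.char_indices() {
        // Emit every requested offset that points at this character.
        while targets.peek() == Some(&byte_idx) {
            out.push(Offsets { utf8: byte_idx, utf16, utf32 });
            targets.next();
        }
        utf16 += ch.len_utf16();
        utf32 += 1;
    }

    // Offsets equal to input.len() point just past the last character.
    while targets.peek() == Some(&input.len()) {
        out.push(Offsets { utf8: input.len(), utf16, utf32 });
        targets.next();
    }
    out
}
```

The byte offsets would come from the start and end of each span after parsing. Doing all conversions in one pass keeps the cost linear in the input length rather than rescanning the string per offset.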

sayrer, Feb 15 '19 16:02

Also, a more difficult request: compile long sequences of literal choices to tries. At the bottom of the tweet grammar, I did this manually for TLDs. I haven't tested the difference yet, but I did look at the generated code, and it seemed to call for it.
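To make the trie idea concrete, here is a minimal sketch of matching a set of literals through a byte-level trie instead of trying each alternative in turn. It is not what pest generates; `Trie`, `from_literals`, and `longest_match` are hypothetical names. Note that it returns the longest matching literal, which agrees with PEG ordered choice only when the alternatives are listed longest-first.

```rust
use std::collections::HashMap;

/// A byte-level trie over a set of literals (illustrative only).
#[derive(Default)]
pub struct Trie {
    children: HashMap<u8, Trie>,
    /// True if the path from the root to this node spells a full literal.
    terminal: bool,
}

impl Trie {
    /// Builds a trie containing every literal in `literals`.
    pub fn from_literals(literals: &[&str]) -> Self {
        let mut root = Trie::default();
        for lit in literals {
            root.insert(lit);
        }
        root
    }

    /// Adds one literal, creating nodes along its byte path as needed.
    pub fn insert(&mut self, literal: &str) {
        let mut node = self;
        for &b in literal.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    /// Returns the byte length of the longest literal that is a prefix
    /// of `input`, or `None` if no literal matches.
    pub fn longest_match(&self, input: &str) -> Option<usize> {
        let mut node = self;
        let mut best = if node.terminal { Some(0) } else { None };
        for (i, b) in input.bytes().enumerate() {
            match node.children.get(&b) {
                Some(next) => {
                    node = next;
                    if node.terminal {
                        best = Some(i + 1);
                    }
                }
                None => break,
            }
        }
        best
    }
}
```

Each input byte is examined at most once, so the cost depends on the length of the match rather than on the number of alternatives in the choice.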

sayrer, Feb 15 '19 16:02

The way pest is currently set up, the returned parse tree borrows the original input string, so a streaming API isn't really feasible. The input has to be collected into a string either way, so making that explicit at the API boundary seems ideal.
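The borrowing constraint can be illustrated without pest itself. In the sketch below, `Token`, `split_words`, and `collect_input` are hypothetical stand-ins: `Token<'i>` plays the role of a parse-tree node that holds a slice of the input, and the whitespace splitter stands in for a real parser. Because every node borrows from the input, a character iterator has to be collected into an owned `String` that outlives the tree.

```rust
/// Hypothetical stand-in for a parse-tree node that borrows its input.
#[derive(Debug, PartialEq, Eq)]
pub struct Token<'i> {
    /// Slice of the original input; valid only while the input is alive.
    pub text: &'i str,
    /// Byte offset where the token starts.
    pub start: usize,
    /// Byte offset just past the end of the token.
    pub end: usize,
}

/// Toy "parser": splits on whitespace and returns borrowed tokens.
pub fn split_words<'i>(input: &'i str) -> Vec<Token<'i>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in input.char_indices() {
        match (ch.is_whitespace(), start) {
            // First non-whitespace character opens a token.
            (false, None) => start = Some(i),
            // Whitespace closes the currently open token.
            (true, Some(s)) => {
                tokens.push(Token { text: &input[s..i], start: s, end: i });
                start = None;
            }
            _ => {}
        }
    }
    // Close a token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push(Token { text: &input[s..], start: s, end: input.len() });
    }
    tokens
}

/// Collects a character stream into the owned buffer the tokens borrow from.
pub fn collect_input(chars: impl Iterator<Item = char>) -> String {
    chars.collect()
}
```

A caller would first bind the result of `collect_input` to a variable and then pass a reference to `split_words`. The allocation happens in the caller's code, where it is visible, instead of being hidden inside the parser.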

It might be possible to support a streaming API in the future, with pest 3.0 or otherwise, but streaming introduces a lot of issues. As I understand it, pest is optimized for full-file processing, as you would do for a programming language.

As for the literals, I believe the intent is to utilize logos's lexing plumbing superpowers, which will give us O(1) bytewise lexing for "free" so long as we can get ordered-choice semantics instead of longest-match.

CAD97, Feb 15 '19 18:02

Thanks for the reply. I agree logos should take care of my trie request (excellent!). It's also understandable that streaming might take a while or never happen.

But what about getting UTF-16 and UTF-32 offsets?

sayrer, Feb 15 '19 18:02