logos
Possibility of using logos' core
Hi! First of all, thanks for writing this super cool crate and making Rust's ecosystem more robust.
I'm the original developer of pest and was wondering what would be the best way to take advantage of some of the technology used in logos. It seems like a lot of the simpler production rules could take advantage of a similar tree approach. I'm also working on a higher level framework to improve pest's grammar compilation, but I have yet to decide on an intermediate representation that would deliver the best results.
This is why I'm curious whether you would be interested in perhaps separating the derive into a more generic core crate that would be reusable in other projects. The goal here is to offer a good experience for people new to parsing and to make as much of the technology we write here in the Rust community reusable and well-integrated, so any other ideas or feedback would be most appreciated.
Hey! Thanks a lot. When I first saw Pest I was really blown away by how easy it is to use. I got to talk to a couple of people using it, some of whom are beginning Rustaceans, and their experience is great. Pest is definitely one of those crates that helps drive Rust adoption!
As for extracting the pattern-tree-resolution engine into a core crate - absolutely! I'd be happy to collaborate and see what the requirements of Pest are. I reckon there might be some friction because I'm using Regex syntax as input, but then the tree actually doesn't look anything like Regex, so it should be possible to make it general enough not to have to change the syntax of .pest files.
There are just a few differences between PEG (which pest uses) and a Regex engine:
- everything matches eagerly; parse trees produced are always deterministic
- lookaheads
- named production rules
Maybe there is a simple enough way to make the parsing strategy generic, so as to be able to use eager matching. Adding lookaheads should be straightforward. As for the tagged rules, that can be left for pest to handle; everything below it can be highly optimized eager Regex.
Actually, lookaheads will probably be the only tricky part.
Everything matching eagerly is how Logos works atm (`.*?` will fail to compile), so that's fine. Named production rules should be very easy to do by just swapping my Token markers with a generic.
I'll give extracting the tree resolution stuff into a crate a go this week, then we can try to square the circle of it into pest and see what changes are needed. I should also do some reading of the pest source code to get a better understanding of what it's doing (I have some assumptions, but assumptions tend to suck).
pest's backend is not very stable right now. I'm planning to write an RFC in pest detailing how the next version (3.0) will use an intermediate representation in order to do high-level optimizations, then have this IR generate Rust code that does the parsing. Once this IR is defined, it will be very easy to know exactly what is needed.
After a bit more research, I think the best approach is to have pest deal with lookaheads itself, maybe optimizing them statically if it can. Thus, the kind of IR Logos could help with would be:
- UTF-8 strings
- case-insensitive UTF-8 strings
- UTF-8 character ranges
- sequences like `ab`
- ordered choices `a|b`, where `b` is matched only if `a` fails
- eager bounded repetitions `a{0,n}`
- eager unbounded repetitions `a*`
- any character `.`
- inversions `[^a-z]`
pest 3.0 should be able to statically optimize most expressions and inline most rules such that the work that Logos will do will be clear and precise. It should be able to start parsing from a specific index in a string and return an `Option<usize>` with the match position.
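To make the list above concrete, here is a rough sketch of what such an IR and its entry point might look like. The names `Expr` and `match_at` are hypothetical and are not part of Logos or pest; the matcher is a naive interpreter rather than the optimized tree Logos builds, and case-insensitive matching only folds ASCII letters here. It is only meant to illustrate the eager, ordered semantics and the "start at an index, return `Option<usize>`" shape.

```rust
/// Hypothetical IR node, one variant per expression kind in the list above.
#[derive(Debug, Clone)]
enum Expr {
    Str(String),                             // UTF-8 string
    Insens(String),                          // case-insensitive string (ASCII folding only in this sketch)
    Range(char, char),                       // inclusive character range
    Any,                                     // any single character: .
    Not(Vec<(char, char)>),                  // inversion such as [^a-z]
    Seq(Vec<Expr>),                          // sequence: ab
    Choice(Vec<Expr>),                       // ordered choice: a|b
    Repeat(Box<Expr>, usize, Option<usize>), // a{min,max}; max = None means unbounded
}

/// Tries to match `expr` against `input` starting at byte index `pos`.
/// Returns the byte index just past the match, or `None` on failure.
fn match_at(expr: &Expr, input: &str, pos: usize) -> Option<usize> {
    // `get` fails if `pos` is out of bounds or not on a char boundary.
    let rest = input.get(pos..)?;
    match expr {
        Expr::Str(s) => rest.starts_with(s.as_str()).then(|| pos + s.len()),
        Expr::Insens(s) => {
            let head = rest.get(..s.len())?;
            head.eq_ignore_ascii_case(s).then(|| pos + s.len())
        }
        Expr::Range(lo, hi) => {
            let c = rest.chars().next()?;
            (*lo <= c && c <= *hi).then(|| pos + c.len_utf8())
        }
        Expr::Any => rest.chars().next().map(|c| pos + c.len_utf8()),
        Expr::Not(ranges) => {
            let c = rest.chars().next()?;
            let excluded = ranges.iter().any(|(lo, hi)| *lo <= c && c <= *hi);
            (!excluded).then(|| pos + c.len_utf8())
        }
        // Every item must match, each starting where the previous one ended.
        Expr::Seq(items) => items.iter().try_fold(pos, |p, e| match_at(e, input, p)),
        // Ordered choice: the first alternative that matches wins.
        Expr::Choice(alts) => alts.iter().find_map(|e| match_at(e, input, pos)),
        // Eager repetition: consume as much as possible and never give any back.
        Expr::Repeat(inner, min, max) => {
            let (mut p, mut count) = (pos, 0usize);
            while max.map_or(true, |m| count < m) {
                match match_at(inner, input, p) {
                    Some(next) if next > p => {
                        p = next;
                        count += 1;
                    }
                    // A zero-width match satisfies any minimum; stop to avoid looping forever.
                    Some(_) => {
                        count = count.max(*min);
                        break;
                    }
                    None => break,
                }
            }
            (count >= *min).then(|| p)
        }
    }
}
```

With this sketch, `a|ab` applied to `"ab"` stops after `a` because the first alternative already succeeded, and `a*a` never matches `"aaa"` because the repetition consumes every `a` and does not backtrack. That is the eager PEG behaviour discussed earlier in the thread, as opposed to what a backtracking Regex engine would do.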
Hello @dragostis, I am trying to keep this project alive, and wanted to check whether this issue / feature request is still wanted?
Thanks :-)
At this point, I'm out of the loop and have no time to invest in this, so I'm closing this issue.