lark icon indicating copy to clipboard operation
lark copied to clipboard

Add scanning

Open MegaIng opened this issue 1 year ago • 15 comments

An implement of Lark.scan. Also adds start_pos and end_pos to Lark.parse, Lark.parse_interactive and Lark.lex.

TODO:

  • [x] add example
  • [x] A bit more documentation for what exactly this function does
  • [x] Notes about start_pos and end_pos mirroring the behavior of stdlib re with regard to look behind and look ahead.

But I do think the core logic is pretty stable and I would like a review of that already @erezsh.

Future work:

  • Check if it already works/What needs to be done to make this work with mmap to not have to load the text into memory at all (also involves checking up on the byte parsing implementation)
  • Check to see if I can implement a custom lexer that uses python's stdlib tokenize module, which would have a few benefits especially with regard to the new f string syntax, and how well that would play with this feature.

This PR is based on #1428, so merging it first would be better.

MegaIng avatar Jun 20 '24 01:06 MegaIng

Btw, can we drop support for the now completely unsupported versions 3.6 and 3.7? https://devguide.python.org/versions/

The regex stubs in typshed now uses the / positional only syntax, meaning mypy error exists because the stubs can't be used for 3.6, which we have as our lowest supported version.

MegaIng avatar Jun 20 '24 12:06 MegaIng

Yeah, I think we can. Users of <=3.7 will just stay stuck on this version of Lark, which is fine. But I think it should be in a separate PR.

erezsh avatar Jun 20 '24 13:06 erezsh

There is one design idea that I have, I hope you'll keep an open mind. If we change the definition of text to text: str | TextSlice, where TextSlice has 3 fields: text, start, end.

That would follow the principal of keeping related data together, and also make all the None checks a little less awkward ( I think ).

Yep, sounds like a good idea. I think in a later PR it might also make sense to go a step further and make Lexer (and therefore Lark) generic over the type to parse, which should be quite reasonably possible without breaking changes thanks to Type Var defaults.

MegaIng avatar Jun 20 '24 21:06 MegaIng

make Lexer (and therefore Lark) generic over the type to parse

I'm not sure what exactly you have in mind, but overall it sounds like a positive thing. Are you referring to the text also accepting bytes, or something else?

erezsh avatar Jun 20 '24 21:06 erezsh

Specifically the think I had in mind is some kind of externally generated token stream with a custom lexer.

MegaIng avatar Jun 20 '24 21:06 MegaIng

Hmm.. I'm not sure I get it.

erezsh avatar Jun 20 '24 21:06 erezsh

Imagine using something like stdlib's tokenizer to produce a list of tokens, and using those as the input to Lark.parse instead of the raw text. Or using a bytecode stream from decompiling a python code object as the input to recover structures like loops.

MegaIng avatar Jun 20 '24 22:06 MegaIng

I am going to bed for now, but an observation I am already making is that using TextSlice instead requires touching a lot more code. We have quite a few places where we slightly break layers of abstraction, which was fine because we always found either str or bytes. Now this is TextSlice instead which requires special handling. I think I might start from scratch tomorrow and separate TextSlice and scan into two different PRs.

MegaIng avatar Jun 20 '24 23:06 MegaIng

I think I might start from scratch tomorrow and separate TextSlice and scan into two different PRs.

Sounds like that might be a good idea.

erezsh avatar Jun 21 '24 07:06 erezsh

@MegaIng Do you have plans to submit the TextSlice PR in the near future? If not, maybe I'll take a crack at it.

erezsh avatar Aug 15 '24 14:08 erezsh

Sorry, no, not next-few-days-soon. Forgot about this. It was a bit more work than I thought (since it also needed to touch a chunk of earley code). I am not currently on my main PC, so I can't even provide the work-in-progress, but I would be able to in ~2 days.

MegaIng avatar Aug 15 '24 14:08 MegaIng

Sure, I'll wait until you're back to your PC.

I'm not in a rush, I'm just in a mood to work on something, and might as well do this. The scan feature is pretty cool, so I'll be happy to unblock it if I can.

erezsh avatar Aug 15 '24 14:08 erezsh

If you want to work on something, maybe you can come up with a solution of the issue of multi-path imports, as seen in this SO question?

The problem is that the same terminal gets imported with different names and then ofcourse collides with "itself". I don't really know if there is a good solution to this, maybe you can come up with something.

MegaIng avatar Aug 15 '24 15:08 MegaIng

I gave it some thought, and it's a tricky problem. I wrote my thoughts here: https://github.com/lark-parser/lark/issues/1448

erezsh avatar Aug 15 '24 16:08 erezsh

@erezsh Pushed a version of TextSlice I had into this branch if you want to use that as a jumping off point. The commit is called "temp", but we should probably never merge this specific PR anyway.

MegaIng avatar Aug 17 '24 16:08 MegaIng