lark icon indicating copy to clipboard operation
lark copied to clipboard

Parser performance with ambiguity

Open sternj opened this issue 10 months ago • 16 comments

I'm trying to write a parser that deals with some syntactic elements inside of Markdown without affecting the rest of the document. I've been dealing with a lot of issues surrounding the performance of the grammar. In trying to convert it into LALR(1), I've found that there are a lot of ambiguities. The root of these is that a lot of the syntax uses characters that are legal in other contexts as well-- for instance, the phrase

asdf `{some syntax}`

could (in theory) be validly parsed as either just a string or as one string token asdf and then the rule corresponding to {some syntax}. There are also some multiline constructs and some list handling has to happen. Further complicating things is the fact that I want to preserve whitespace outside of the syntax, so ignoring it isn't really viable

So, the grammar's slow-- there are a good few similar issues to the above. When I tried to change the parser to LALR(1), as a consequence, it did things like regarding a NEWLINE as part of the preceding paragraph rather than as the boundary between two list items. I'm kinda at my wit's end here-- I have a grammar that works well, but is incredibly slow in Earley and is unusable with other parsers. I'd appreciate any guidance on making it run at any speed.

sternj avatar Mar 02 '25 06:03 sternj

markdown is not context free. Use a dedicated, hand written, parser for it, lark will never be able to handle it well.

MegaIng avatar Mar 02 '25 12:03 MegaIng

I think that's a bit too dramatic. It might be possible to create an adequate solution, if he uses Lark through its interactive-parser interface.

erezsh avatar Mar 02 '25 16:03 erezsh

I do have a grammar that behaves correctly and I think I constrained the relevant bit of markdown that's actually lexically relevant enough to make it context free (it's basically just inline code, code blocks, and nested lists, I added a list end marker), it's just super slow and actually needs a lookahead of 2 as things stand, I think. I'll scrub it when I'm back at my desk and post it here. I've also considered modifying the lexer and entering the syntactic elements as individual tokens, but adding a lexeme for something like NEWLINE WS_INLINE* -, at least as far as I can figure out, would make the rules outside of this bespoke list context become quite unwieldy.

On Sun, Mar 2, 2025, 11:01 AM Erez Shinan @.***> wrote:

I think that's a bit too dramatic. It might be possible to create an adequate solution, if he uses Lark through its interactive-parser interface.

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692794462&data=05%7C02%7Cjstern%40umass.edu%7C87107b2069b14646885408dd59a387e6%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765281083731743%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=tcUXFG7pEc3Wupw6eSPs11mw6yJVmokGmjc1GdC18kQ%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH4ZIJSOVG4ZZHSS6HD2SMTORAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSG44TINBWGI&data=05%7C02%7Cjstern%40umass.edu%7C87107b2069b14646885408dd59a387e6%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765281083741286%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=ESvgSsxniaoxchGam5gZWV69Zjp96hGxGiAoOGjWyrY%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***> [image: erezsh]erezsh left a comment (lark-parser/lark#1514) https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692794462&data=05%7C02%7Cjstern%40umass.edu%7C87107b2069b14646885408dd59a387e6%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765281083749973%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=5W4zi2zKibnuQ0utZkPGGSI2gNQwBeH8HzM%2Fa0K3p6k%3D&reserved=0

I think that's a bit too dramatic. It might be possible to create an adequate solution, if he uses Lark through its interactive-parser interface.

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692794462&data=05%7C02%7Cjstern%40umass.edu%7C87107b2069b14646885408dd59a387e6%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765281083758494%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=i6beDl5n0OWG566rruDnLBGKPWzWQcIZAQiKAYFJbgc%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH4ZIJSOVG4ZZHSS6HD2SMTORAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSG44TINBWGI&data=05%7C02%7Cjstern%40umass.edu%7C87107b2069b14646885408dd59a387e6%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765281083767549%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=8PXNBYWwR0i%2B7QjsfGBvMxN1onKBb%2FEXmOumEk83VJ4%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***>

sternj avatar Mar 02 '25 17:03 sternj

Hm, yeah, that restriction might be just about possible, especially if you can disallow ` appearing elsewhere. What normally causes trouble is the actually ambigous stuff about *, **, _, __ and the nestings thereof. (since it's very easy to make this require arbitrary lookahead: ___Hello_ World__ = Hello World vs ___Hello__ World_= Hello World)

MegaIng avatar Mar 02 '25 18:03 MegaIng

All of those I can deal with as just plain text, and while I have double-bracketing as semantically relevant, it can't be arbitrarily nested. As for disallowing back ticks elsewhere, that's not entirely feasible. I have them in constructs like [syntax_start arg0 arbitrary text here syntax_end] and

Arbitrary_code

But they need to be permissible elsewhere, since they can also just be normal markdown. That's why I was thinking about the tokenization approach, since I can disallow the syntax_start and @embed sequences elsewhere without issue

On Sun, Mar 2, 2025, 1:00 PM MegaIng @.***> wrote:

Hm, yeah, that restriction might be just about possible, especially if you can disallow appearing elsewhere. What normally causes trouble is the actually ambigous stuff about *, **, _, __ and the nestings thereof. (since it's very easy to make this require arbitrary lookahead: *Hello World* = ___Hello_ World__ vs*Hello World*= Hello World)

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692838301&data=05%7C02%7Cjstern%40umass.edu%7Cd3ff914da6594e80218a08dd59b4222e%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765352380906681%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=Ewgb0cPtB0LQvQiOQkG%2FKl%2FxmCkbWTBuO6jbKjwnINI%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH22ZYOQ7L74367GWML2SNBMHAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSHAZTQMZQGE&data=05%7C02%7Cjstern%40umass.edu%7Cd3ff914da6594e80218a08dd59b4222e%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765352380922584%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=mUVNls8yv35ZvepEELUUKj3T%2FOT2fF4tVlIhpCi6d18%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***> [image: MegaIng]MegaIng left a comment (lark-parser/lark#1514) https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692838301&data=05%7C02%7Cjstern%40umass.edu%7Cd3ff914da6594e80218a08dd59b4222e%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765352380935503%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=U1pXn1itYUjjAtIDYGAngTQ5PBp3CCeUAVwiEr3X2J0%3D&reserved=0

Hm, yeah, that restriction might be just about possible, especially if you can disallow appearing elsewhere. What normally causes trouble is the actually ambigous stuff about *, **, _, __ and the nestings thereof. (since it's very easy to make this require arbitrary lookahead: *Hello World* = ___Hello_ World__ vs*Hello World*= Hello World)

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692838301&data=05%7C02%7Cjstern%40umass.edu%7Cd3ff914da6594e80218a08dd59b4222e%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765352380951026%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=gljJgAT%2BfGgks1lVUUPANmzBpTQ9p9mHxr30xO8GZDs%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH22ZYOQ7L74367GWML2SNBMHAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSHAZTQMZQGE&data=05%7C02%7Cjstern%40umass.edu%7Cd3ff914da6594e80218a08dd59b4222e%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765352380965322%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=AVmODPwsSS7OuiH0Tg%2FgRcS1iz8NonZ2Eo9dPugKmoc%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***>

sternj avatar Mar 02 '25 18:03 sternj

Alright, here's a sanitized version of the grammar I'm working with. I did have to change some details, so there might be some transcription errors. I know the explicit use of WS_INLINE is a bit suboptimal, but in most of these cases I'm trying to preserve whitespace in the document. There are also some productions of text that I needed to ensure were discrete tokens, so all in all it is a mess of a grammar.

start: content
content: (( simple_unary | intermediate_text | list_construct | codeblock | nothing_in_middle |   text|  inline_item) WS_INLINE* NEWLINE?)+
simple_unary.4: "[[" _simple_unary_item "]]" 

_simple_unary_item.4: SIMPLE_UNARY_ITEM ("|" SIMPLE_UNARY_ITEM ("," SIMPLE_UNARY_ITEM)*)?
SIMPLE_UNARY_ITEM: /[^\[\]]+/ 

intermediate_text.5: "`[intermediate" WS_INLINE+ _some_name? WS_INLINE+ "=" WS_INLINE* "`" WS* content WS* "`intermediate]`"
_some_name: IDENTIFIER
_bullet_list: ((list_item | codeblock) NEWLINE+)+
_CODE_HEAD: "```@lang" WS_INLINE* NEWLINE 
_CODE_TAIL:  "```" NEWLINE+
_LIST_CONSTRUCT_HEAD: _CODE_HEAD "list_construct" NEWLINE _CODE_TAIL

list_construct.3: _LIST_CONSTRUCT_HEAD _bullet_list _LIST_CONSTRUCT_TAIL
_LIST_CONSTRUCT_TAIL: _CODE_HEAD "end_list_construct" NEWLINE _CODE_TAIL
!_bullet:  WS_INLINE* "-" WS_INLINE+
inline_item.3: "[%" /.+?(?=%])/ "%]" | "[%" NEWLINE? _bullet_list  NEWLINE? "%]"
list_item.4: _bullet content (NEWLINE+ WS_INLINE* content)*

nothing_in_middle.4: "`[" WS_INLINE* "nothing_in_middle" WS_INLINE+ IDENTIFIER WS_INLINE* "]`"
IDENTIFIER: /[a-zA-Z_][a-zA-Z_0-9]*/

!text: (_TEXT | "[" | "]" | "`" | "=")+
_TEXT: /[^\n\[\]`=]+/
codeblock.2: "```@lang" WS_INLINE* NEWLINE /[^`]+/ "```"

%import common.NEWLINE
%import common.WS_INLINE
%import common.WS

sternj avatar Mar 02 '25 22:03 sternj

Your grammar is very inefficient. Many of your rules can be joined into a single terminal.

e.g.

!text: (_TEXT | "[" | "]" | "`" | "=")+

Can be written as a single regexp, and it would performance MUCH better than as a rule. Perhaps some of these fragments are needed to handle ambiguity, but I'd bet many of them don't.

erezsh avatar Mar 02 '25 23:03 erezsh

Also you should share how you're running Lark, i.e. what parameters you're using.

erezsh avatar Mar 02 '25 23:03 erezsh

A decent few of them are needed for the ambiguity resolution-- the one you pointed out is one of those. Previous versions of that rule led the tokenizer to have a TEXT token ending with something random here[ and then complain that it expected a [.

As for parameters-- Earley parser, dynamic lexer (all defaults).

sternj avatar Mar 02 '25 23:03 sternj

Any reason you can't do /[|]|=|...?

erezsh avatar Mar 02 '25 23:03 erezsh

That's a good idea, I'll put that in and see what that gets me. I'm also seeing _bullet as a rule and nothing_in_middle as a rule that don't need to be, I'll see whether making simple_unary a terminal still lets me distinguish simple_unary_items. Anything else jump out at you?

sternj avatar Mar 02 '25 23:03 sternj

The biggest ones are WS_INLINE+ WS_INLINE* and all the rest. They need to be a terminal, not a rule.

erezsh avatar Mar 03 '25 00:03 erezsh

How would I do that when I need to use them in the middle of rules? Would I make a terminal with form _TERM: WS_INLINE+ or the like?

On Sun, Mar 2, 2025, 7:28 PM Erez Shinan @.***> wrote:

The biggest ones are WS_INLINE+ WS_INLINE* and all the rest. They need to be a terminal, not a rule.

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692999238&data=05%7C02%7Cjstern%40umass.edu%7Cfaca92e74e2d4b6251f808dd59ea467d%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765584916336484%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=LxIoNPjm2Nf95B60u%2B2Q1k3TnVt2caxnaVQuq5Z%2BFaI%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH6NTMNR5LOXLAL5TYT2SOOZRAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSHE4TSMRTHA&data=05%7C02%7Cjstern%40umass.edu%7Cfaca92e74e2d4b6251f808dd59ea467d%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765584916353925%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=5y6TjYIvjxn3CCPtm41w9tKZtCnFFcua%2BKfUjfMDqO4%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***> [image: erezsh]erezsh left a comment (lark-parser/lark#1514) https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692999238&data=05%7C02%7Cjstern%40umass.edu%7Cfaca92e74e2d4b6251f808dd59ea467d%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765584916363052%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=BAlIO1dMYdhr5eR7tg2g34ZSvQnQqKBHmlmEK%2BfJKX4%3D&reserved=0

The biggest ones are WS_INLINE+ WS_INLINE* and all the rest. They need to be a terminal, not a rule.

— Reply to this email directly, view it on GitHub https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flark-parser%2Flark%2Fissues%2F1514%23issuecomment-2692999238&data=05%7C02%7Cjstern%40umass.edu%7Cfaca92e74e2d4b6251f808dd59ea467d%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765584916377346%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=CkuV%2FzaykU7mlTbeWPHvDaqKiCUp%2BlAzeIJBxivrQ%2Fk%3D&reserved=0, or unsubscribe https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAEWVMH6NTMNR5LOXLAL5TYT2SOOZRAVCNFSM6AAAAABYE4U7FCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJSHE4TSMRTHA&data=05%7C02%7Cjstern%40umass.edu%7Cfaca92e74e2d4b6251f808dd59ea467d%7C7bd08b0b33954dc194bbd0b2e56a497f%7C0%7C0%7C638765584916387593%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=341LrxEf0ARF16Azn%2BvTopX1YD2vvzHN71vyv5BgLl8%3D&reserved=0 . You are receiving this because you authored the thread.Message ID: @.***>

sternj avatar Mar 03 '25 01:03 sternj

Yes, exactly. WS_INLINE+ can be WS_INLINE_REPEAT and WS_INLINE* as WS_INLINE_REPEAT?

erezsh avatar Mar 03 '25 07:03 erezsh

Thank you for all the help there, I'm a few hours into reworking this and have run into a confounding issue (vaguely related).

I have this grammar

start: _LIST_CONSTRUCT_TAIL
TRIPLE_BACKTICK: "`" "`" "`"
_CODE_HEAD: TRIPLE_BACKTICK "@lol" WS_INLINE* NEWLINE 
_CODE_TAIL:  TRIPLE_BACKTICK NEWLINE+ 

_LIST_CONSTRUCT_TAIL: _CODE_HEAD "end_list_construct" NEWLINE _CODE_TAIL

%import common.WS_INLINE
%import common.NEWLINE

(which I have tried with _CODE_HEAD in various forms including

```@lang

and

TRIPLE_BACKTICK: "```"

on this input:

```@lang
end_list_construct
`` `
(note that is not input: space before last backtick to get Markdown to work right on Github) 

and consistently get the error

```@lang

^
Expected one of: 
	* _LIST_CONSTRUCT_TAIL

Any ideas here?

sternj avatar Mar 04 '25 04:03 sternj

Another ping on this because I think it's probably a PIBKAC, but I can't for the life of me tell what it is

sternj avatar Mar 06 '25 02:03 sternj

Sorry, this one slipped through the cracks. I'm closing the issue, but if you still need help let me know.

erezsh avatar Aug 03 '25 11:08 erezsh