lark icon indicating copy to clipboard operation
lark copied to clipboard

Rethinking the import system

Open Niki4tap opened this issue 8 months ago • 18 comments

Hello Lark developers!

I've tried using this library in a personal project, and really liked it, thank you for it!

The problem

...except one thing, which was grammar composition. Lark docs make you think it has a nice module system where you can just import rules and terminals and everything magically works... but for some reason it doesn't.

Rules don't compose well since when you %import them, the grammar module that imported the rule is now responsible for parsing it since the rule has namespace of the current grammar module assigned to it, and funnily enough, every rule that it depends on still gets the imported grammar module prefix.

So if you had a layout like this:

# lib.lark

b: "b"
a: b
# main.lark
%import lib.a

start: a

Parse tree for "b" would look like:

start
	a
		lib__b

which makes approximately 0 sense to me :smile: (this has been discussed before in lark, and in my opinion the "official solution" came out looking as more of a workaround than something to be used and recommended to others)

Main grammar is supposed to parse imported terminal, but none of its children. Which is a very odd way of making the import system work. For example, if multiple grammar modules import the same terminal, they would all be responsible for parsing it, but not its children, which in transformer implementation terms means that importing transformers would have to inherit the implementation for parsing the imported terminal, and then on top of that get merge_transformers'd with the lib transformer (which also wouldn't be responsible for parsing one of its own rules?).

All of this makes working with multiple grammars very inconvenient.

The solution

%use directive.

More is explained in added docs, but here is a quick excerpt:

Imports terminals and rules just as %import does, but also puts the imported rule itself in its namespace and automatically aliases it.

Useful for importing rules that this chunk of grammar uses, but does not define a transformer for (isn't responsible for parsing it).

Changing %import to %use in the example above, and parsing same "b" will now result in:

start
	lib__a
		lib__b

This allows to simply compose transformers with merge_transformers and not worry about inheritance, mixins, or anything else for such a simple example!

Compatibility with %import

%use is currently incompatible with %import, i. e. you cannot do this:

# mixing both not allowed
%import a.b
%use a.c

or in the other direction, but you can import/use on different modules:

# this is ok!
%import a.b
%use b.c

The implementation

I am not going to lie, the implementation could use some design work, but right now it works as is.

Any comments or ideas on how to make it more robust will be appreciated as I am new to this codebase.

Currently it creates new group of aliases called use_aliases and uses that to rewrite all _defines, and that's about it. I am a bit worried it has some hidden bugs I haven't considered, but I hope for them to get ironed out during the code review :)

Other concerns

Of course, all of these issues could be solved with

...using inheritance? Or mixins? Python already has a few mechanisms (and libraries) for that sort of composition.^1

But in my opinion, lark shouldn't impose a hell of mixins, inheritance, and a ton of other OOP boilerplate on its users just to do such a simple thing as basic grammar composition, especially when it can be "just fixed" in a hundred lines or so in lark itself, for basically no cost.

Another option would be to enforce manually prefixing imported rules, which doesn't scale. Like at all.

And I guess you could always just string together all files and implement all rules and terminals in one transformer, but then you would have to manually namespace all rules and terminals, and their uses, not just %imports.

Lastly...

This isn't supposed to be final, I'm open to comments and suggestions! The tests aren't supposed to be final either, I can write some more, the only one I added is just supposed to show how it all works :)

P.S.: I've also added minimal .editorconfig to the repository since my editor always wanted to convert everything to tabs, maybe it will also help other contributors

Niki4tap avatar Mar 08 '25 09:03 Niki4tap

Hi @Niki4tap !

It's possible that the current design isn't ideal. IIRC when I made the choice, I was trying to match the behavior of importing terminals, and I didn't take nested transformers into consideration.

The question remains, whether both behaviors are useful, or only the name spaced behavior is correct. Because if the latter is true, we should just change the behavior of import, instead of introducing a new keyword.

Regarding the implementation, I think it might be a bit over complicated. Unless I'm missing something, "use" can use the existing import mechanism, with the only difference of renaming the imported rule at the end.

@MegaIng thoughts?

erezsh avatar Mar 08 '25 12:03 erezsh

Hi,

The question remains, whether both behaviors are useful, or only the name spaced behavior is correct. Because if the latter is true, we should just change the behavior of import, instead of introducing a new keyword.

Yeah I wasn't sure about this, hence the new keyword, and even if that's the case, I'm not sure about backwards compatibility, maybe something depends on the current behavior.

The renaming should allow for switching back to old behavior if we go with changing the %import, but it might be a bit too inconvenient. With two separate directives it should be fine.

Regarding the implementation, I think it might be a bit over complicated. Unless I'm missing something, "use" can use the existing import mechanism, with the only difference of renaming the imported rule at the end.

Yes, that's what I was trying to do, but I didn't find a way to rename the rule in the current grammar, as changing aliases would seemingly affect only imported grammars, so I felt like I had to implement that myself and that's what I did.

I'll look around some more but if you have any specific pointers, it would be appreciated.

Also I'll look at the CI fails in a bit.

Niki4tap avatar Mar 08 '25 15:03 Niki4tap

Had to work around types for a bit, but I think CI should be green now.

Niki4tap avatar Mar 08 '25 16:03 Niki4tap

I think we should at some point reconsider the entire import system. There are a few issues I can think of:

  • weird inconsistencies between importing rules and importing tokens. Some inconsistencies is ok, but it should be clear what they are.
  • Better handling of diamong-shaped imports - these are sometimes broken
  • Better interfacing for subgrammars - i.e. what is mentioned in this PR.

For the last part, I think something like this would be good:

class SubGrammarTransformer(Transformer):
    ...

class MainGrammarTransformer(Transformer):
    sub_grammar = SubGrammarTransformer

## or

class MainGrammarTransformer(Transformer):
    def __init__(self, ...):
        self.sub_grammar = SubGrammarTransformer()

And whatever renaming scheme we decide upon is handled internally.

I haven't read the code of this PR, but I don't like the general idea adding an additional directive - I think having two directives with similar behavior is a bad idea. Instead we should consider how we can make a breaking change here.

MegaIng avatar Mar 08 '25 22:03 MegaIng

In the past we made breaking changes by adding a switch, such as legacy_imports = True, with a deprecation warning, and then once we release v2, say a few months down the line, we could switch it to False. (it doesn't have to be a huge release like v1 was).

I agree we should fix problems with diamond-shaped imports, but it doesn't have to be in this PR.

The biggest "fix" I'd like is for %ignore to be importable (by being local to rules), but that's a big and complicated task.

erezsh avatar Mar 08 '25 23:03 erezsh

weird inconsistencies between importing rules and importing tokens. Some inconsistencies is ok, but it should be clear what they are.

Actually, I don't think I've noticed any?

Better handling of diamong-shaped imports - these are sometimes broken

Again, maybe the luck has spared me, but I think I have a couple in my project, and I haven't had any issues, except aforementioned namespacing troubles.

Better interfacing for subgrammars - i.e. what is mentioned in this PR. For the last part, I think something like this would be good:

That's how it would look like if the proposed namespacing changes are applied:

class LeafGrammarTransformer(Transformer): ...

class SubGrammarTransformer(Transformer): ...

class MainGrammarTransformer(Transformer): ...

def make_transformer(...) -> Transformer:
	leaf_grammar_transformer = LeafGrammarTransformer()
	sub_grammar_transformer = merge_transformers(SubGrammarTransformer(), leaf = leaf_grammar_transformer)
	return merge_transformers(MainGrammarTransformer(), sub = sub_grammar_transformer, leaf = leaf_grammar_transformer)

And mostly how it looks like right now if your grammar is 1 level deep, if it's more, you still have to do this, but also mixin imported implementations, and manually namespace in the grammar.

I haven't read the code of this PR, but I don't like the general idea adding an additional directive - I think having two directives with similar behavior is a bad idea. Instead we should consider how we can make a breaking change here.

I'm completely fine with changing current %import behavior.

In the past we made breaking changes by adding a switch, such as legacy_imports = True, with a deprecation warning, and then once we release v2, say a few months down the line, we could switch it to False

Sounds fine!

I agree we should fix problems with diamond-shaped imports, but it doesn't have to be in this PR.

I can take a peek at it, if pointed to the problem directly, since I haven't encountered one in the wild.

The biggest "fix" I'd like is for %ignore to be importable (by being local to rules), but that's a big and complicated task.

Yeah, this sounds like a good idea, though I'm not sure how it would work, since %ignores don't define anything %importable. Also, yes, that sounds like a big task since that needs lexer integration and being huge is not the aim of this PR :sweat_smile:


I can change the implementation of %import to do what %use does in this PR right now (automatic namespacing and aliasing of imported terminals), I think it would be a good start to rethinking the import system, and then we can see if the changes look good to merge or if any other ideas pop up.

Also would be nice to see what tests fail or trip up on this, since we probably want to consider those usecases.

Niki4tap avatar Mar 09 '25 06:03 Niki4tap

Dove a bit deeper into the implementation, made the change to %import instead of adding a new directive, and rewrote the older implementation since I thought I would get a bit more insight into the codebase and would end up with a better version, but it seems like I've just reimplemented the old :p

Simple stuff works, but a few tests are failing (33 failed, 959 passed, 134 skipped), I've migrated most of them, but some have weird errors I couldn't quite trace back to the source.

I'll try to fix them as soon as I'll have some more time, but for now feel free to look at the implementation (I probably missed something important)

Failing test cases:

TestGrammar::test_import_custom_sources3
TestGrammar::test_list_grammar_imports
TestGrammar::test_override_rule

# these all fail with basically the same error
Test{Parser}{Lexer}::test_relative_import
Test{Parser}{Lexer}::test_relative_import_rename
Test{Parser}{Lexer}::test_relative_multi_import
Test{Parser}{Lexer}::test_error_with_interactive_parser

Niki4tap avatar Mar 09 '25 17:03 Niki4tap

We do want to be backwards compatible, and changing the tests kind of beats the purpose of having them in the first place :)

erezsh avatar Mar 09 '25 17:03 erezsh

We do want to be backwards compatible, and changing the tests kind of beats the purpose of having them in the first place :)

Oh yeah, I forgot about that, we probably want to test the new implementation as well though, so I might need to do the same thing as with the lexers/parsers, but with old and new imports

Niki4tap avatar Mar 09 '25 17:03 Niki4tap

A couple updates:

1. Added an option, and refactored some of the code (again), to allow for legacy imports.

2. Hacked together a solution to run tests that involve importing with both legacy imports and new ones.

With that, only 1 old test is failing, the cache one, I'll look into it hopefully tomorrow. Not sure why, but it seems like it's still trying to load the grammar with legacy_import option set to False, even though I pass it as True. Would love some pointers around the caching system, I did see the _LOAD_ALLOWED_OPTIONS in lark.py but adding in there didn't seem to do much.

3. Well, I've got good news and bad ones: Good news is: I think I've figured out the reason other tests are failing. Bad news is: it's complicated.

Imagine a grammar layout like this:

# lib.lark

a: "b"
# nested.lark

%import lib.a
# main.lark
start: a

%import nested.a

Important thing to note is that %import doesn't define a rule/terminal, it only declares it, and while the old system deals with this just fine since it rewrites the names of children items (rules/terminals), I've made the new one to rewrite the rules/terminals it defines instead, and expect the imported names to be correct.

So the new import system would expect definitions to look like:

start
	nested__a
		nested__lib__a

But what happens in reality is:

start
	nested__a # expected
	nested__lib__a # actual (declaration)
		nested__lib__a # (definition)

And it throws an error with nested_a used but not found.

From here I see 2 potential solutions: 1. Make %imports strong, i. e. each %import defines a rule/terminal, not just declare it 2. On grammar load, return defined rules/terminals separately from re-imported ones

I personally prefer the second one (although it might be a little complicated), but I'm not exactly sure about it, and I want to hear some opinions on it.

Related: This is also why I added start rule to some tests, while I don't think the second option would solve it, the first one would. But I think using an %imported rule as a start isn't a good practice anyways.

P. S. I've also added venv to .gitignore, was a little annoying that git kept trying to commit that

Niki4tap avatar Mar 15 '25 18:03 Niki4tap

Hello again, I think I've fixed all of the tests now.

Turns out cache test wasn't an issue, but output was different due to defining the start rule.

Interactive parser test just needed the namespacing.

And I worked around the re-import issue by doing none of what I have proposed before, but linking new names to old definitions (override now overwrites the existing definition instead of creating a new one and replacing the old one).

With that all the tests are passing on my machine now and pre-commit also doesn't error so I think the CI should be green again.

Nothing to be done on my side for now, I think this should be ready for a review.

cc @erezsh

Niki4tap avatar Mar 26 '25 07:03 Niki4tap

What a weird type error, I don't think I even touched that code in this pr, but should be fixed anyways.

Niki4tap avatar Mar 26 '25 09:03 Niki4tap

Before I go deeply into the implementation, can you explain in a few words why it took changing so many lines? I mean, if the only difference in behavior is renaming the imported rule, why was it necessary to update so many different parts?

erezsh avatar Mar 26 '25 10:03 erezsh

I've split up the changes across commits, so it would be more reviewable, if you're looking just at the diff number github gives you in the pr it hides a lot of things. I think most of the changes are test-related, the commit with the actual implementation has the diff numbers of +151 -30 and a lot of additions are because I kind of duplicated the load_grammar method to handle both old and new grammar.

Niki4tap avatar Mar 26 '25 10:03 Niki4tap

For now, I'm talking mostly about the changes in load_grammar.py

I kind of duplicated the load_grammar method

That's part of what I'm asking. Why did you need to duplicate it?

erezsh avatar Mar 26 '25 10:03 erezsh

Ah I guess I didn't necessarily need to, when I was starting out I thought it would be littered with if checks so I thought it would be cleaner to separate them, but looking at it now it doesn't look like a lot is changed, so I can probably unify them if that's preferred.

Niki4tap avatar Mar 26 '25 10:03 Niki4tap

Yes, minimal code is preferred..

erezsh avatar Mar 26 '25 14:03 erezsh

Unified the function and rebased.

Also set legacy_import to True by default and adapted a few tests I overlooked.

Niki4tap avatar Mar 27 '25 07:03 Niki4tap

Hi there, I'm also having trouble with grammar composition and this being merged would be very helpful! Is there anything that needs to be done here for this to get merged @erezsh?

cademirch avatar Sep 15 '25 23:09 cademirch

If I remember correctly, only thing missing was documentation and probably rewriting examples to the new import system.

The implementation should work, I used my fork for my own project and it worked as I've expected, but the PR probably still needs a proper look.

I no longer have a need in this PR, but would be fine doing whatever is needed to push it over the finish line :)

Niki4tap avatar Sep 24 '25 18:09 Niki4tap

For PRs to make it into Lark, they need to be solid and of high quality. Passing the tests is a good start, but it is not a strong enough qualifier, when so many of the tests themselves have been changed in this PR.

I'm not convinced that this PR took the best approach to solve this problem. And more importantly, I'm not convinced that it doesn't break backwards compatibility, which would be a very bad thing.

So, while I do appreciate the intention and the effort it took to write this, I'm going close this PR.

If anyone wants to make another attempt at this issue, they are welcome to it. But please understand that such changes need to be done very carefully and methodically.

erezsh avatar Sep 24 '25 20:09 erezsh