lark
lark copied to clipboard
Grammar linter
I get a lot of use from linters in programming python generally, such as pylint, flake8, mypy, bandit.
Would it be possible to provide linters for lark grammars, which would give suggestions on constructing good grammars?
I would be very happy to accept such a contribution. I don't have much experience with writing linters or interfacing with editors for that purpose. But if anyone writes such a tool, I'll be happy to add all sorts of heuristics to it.
I think typically linters just provide a command-line executable which takes a target directory or file path as an argument, and outputs messages on stdout. The editors handle these messages themselves. Pylint for example outputs in one of several formats, e.g. http://pylint.pycqa.org/en/latest/user_guide/output.html
I suppose a lark linter could be a separate linter executable altogether for .lark files, or just a pylint plugin which could parse the first argument of lark.Lark(). A pylint plugin might be simplest for easily getting it up and running, I'm not sure.
I've never done this before so I don't really know what I'm doing, but based on some examples I made an attempt and it seems to work. I'd expect it to just work in any project or editor by adding "myplugin" to the pylintrc file.
# mypackage/myplugin.py
import astroid
import pylint.checkers
import pylint.interfaces
def register(linter):
"""Register checkers."""
linter.register_checker(LarkChecker(linter))
BASE_ID = 33
class LarkChecker(pylint.checkers.BaseChecker):
__implements__ = (pylint.interfaces.IAstroidChecker,)
name = "my-lark-checker"
UNDEFINED_REFERENCE = "undefined-reference"
EXCESSIVE_AMBIGUITY = "excessive-ambiguity"
msgs = {
f"E{BASE_ID}01": (
"%s: something is undefined",
UNDEFINED_REFERENCE,
"All references must be defined",
),
f"E{BASE_ID}02": (
"%s: grammar is excessively ambiguous",
EXCESSIVE_AMBIGUITY,
"Make it clearer",
),
}
@pylint.checkers.utils.check_messages(UNDEFINED_REFERENCE, EXCESSIVE_AMBIGUITY)
def visit_call(self, node):
"""Called for every function call in the source code."""
if isinstance(node, astroid.node_classes.Attribute):
if node.func.expr.name != "lark":
return
if node.func.attrname != "Lark":
return
if isinstance(node, astroid.node_classes.Call):
if node.func.name != "Lark":
return
first_argument = node.args[0]
# Take the first inferred guess about the value of the grammar.
grammar = next(first_argument.infer()).value
# Bad!
if len(grammar) < 3:
self.add_message(self.UNDEFINED_REFERENCE, args=node.func.name, node=node)
if len(grammar) > 5:
self.add_message(self.EXCESSIVE_AMBIGUITY, args=node.func.name, node=node)
Running it over this code
# tmp/example.py
from lark import Lark
grammar = """
SOMETHING
"""
Lark(grammar)
with
$ pylint --load-plugins mypackage.myplugin tmp/example.py
gives
************* Module example
tmp/example.py:1:0: C0111: Missing module docstring (missing-docstring)
tmp/example.py:3:0: C0103: Constant name "grammar" doesn't conform to UPPER_CASE naming style (invalid-name)
tmp/example.py:8:0: E3302: Lark: grammar is excessively ambiguous (excessive-ambiguity)
---------------------------------------------------------------------
Your code has been rated at -13.33/10 (previous run: 2.50/10, -15.83)
@jtrakk Have you made any progress with it?
Using pyglint (a small convenience wrapper for pylint), it can be relatively concise to add checkers. Pyglint relies on Python 3; it wouldn't be hard to build a similar API that also supports Python 2. I don't have time to do a Py2 version at the moment, but I think this is a working approach.
"""Basic lints for lark grammars."""
import re
import typing as t
import attr
import astroid
import pyglint
# Setup
###############
group = pyglint.CheckerGroup("pylint-lark")
@attr.s(auto_attribs=True, order=False)
class GrammarCheckerRegistry:
"""Registry for collecting grammar checkers.."""
functions: t.List[t.Callable] = attr.ib(factory=list)
def checker(self, function):
self.functions.append(function)
return function
@group.check(astroid.node_classes.Call)
def check_grammar(checker, node) -> t.Iterator[pyglint.Message]:
"""Generate messages for Lark grammars."""
if isinstance(node, astroid.node_classes.Attribute):
if node.func.expr.name != "lark":
return
if node.func.attrname != "Lark":
return
if isinstance(node, astroid.node_classes.Call):
if node.func.name != "Lark":
return
first_argument = node.args[0]
# Take the first inferred guess about the value of the grammar.
node = next(first_argument.infer())
# Assumes the grammar is provided as a string literal. If the grammar is loaded from a
# file or with importlib.resources, code can be added here to handle that case.
grammar = node.value
for function in registry.functions: # pylint: disable=not-an-iterable
for msg in function(grammar, node):
msg = attr.evolve(msg, line=msg.line + node.lineno - grammar.count("\n"))
yield msg
def register(linter):
"""Register checkers."""
checker = pyglint.make_pylint_checker(group)
linter.register_checker(checker(linter))
registry = GrammarCheckerRegistry()
# Define problems
##########################
BAD_NAME = group.problem(
name="bad-name",
text="The name '{identifier}' is against the guidelines.",
explanation="It's a good idea to have a useful and descriptive name. For example, Counter instead of ctr.",
)
# Define checkers
############################
@registry.checker
def metasyntactic_names(grammar: str, node) -> t.Iterator[pyglint.Message]:
"""Identify some metasyntactic names literally."""
for lineno, line in enumerate(grammar.splitlines()):
for match in re.finditer("(foo|bar|baz|quux)", line):
yield pyglint.message(
problem=BAD_NAME,
node=node,
line=lineno,
col_offset=match.start(),
identifier=match.group(1),
)
@registry.checker
def short_names(grammar: str, node) -> t.Iterator[pyglint.Message]:
"""Identify short names."""
for lineno, line in enumerate(grammar.splitlines()):
for match in re.finditer(r"\b([A-Za-z])\b", line):
yield pyglint.message(
problem=BAD_NAME,
node=node,
line=lineno,
col_offset=match.start(),
identifier=match.group(1),
)
An example module to be linted.
"""A basic test for the linter."""
from lark import Lark
GRAMMAR = """foo
SOMETHING : x
"""
Lark(GRAMMAR)
Running it:
$ python -m pylint --load-plugins examples.larklint examples/lark_linted.py
************* Module examples.lark_linted
examples/lark_linted.py:5:0: E3219: The name 'foo' is against the guidelines. (bad-name)
examples/lark_linted.py:6:12: E3219: The name 'x' is against the guidelines. (bad-name)
Keep in mind that linting can occur by looking at the text of the grammar, or the parse tree from the grammar.
Another possible idea (and similar to linting) would be to provide an optimizer for grammars. Lark parses the grammar and outputs the parse tree which is processed and produces:
- information about suspicious constructs
- places where a grammar change could make it "more correct/less ambiguous"
- information about which of the types of parsers (lalr, early, etc.) could parse this grammar
- could rewrite the parse tree to be "optimal" for the given grammar, and output the new grammar.