lark Grammar linter

I get a lot of use from linters in programming python generally, such as pylint, flake8, mypy, bandit.

Would it be possible to provide linters for lark grammars, which would give suggestions on constructing good grammars?

Mar 03 '19 19:03 jtrakk

I would be very happy to accept such a contribution. I don't have much experience with writing linters or interfacing with editors for that purpose. But if anyone writes such a tool, I'll be happy to add all sorts of heuristics to it.

Mar 03 '19 19:03 erezsh

I think typically linters just provide a command-line executable which takes a target directory or file path as an argument, and outputs messages on stdout. The editors handle these messages themselves. Pylint for example outputs in one of several formats, e.g. http://pylint.pycqa.org/en/latest/user_guide/output.html

I suppose a lark linter could be a separate linter executable altogether for .lark files, or just a pylint plugin which could parse the first argument of lark.Lark(). A pylint plugin might be simplest for easily getting it up and running, I'm not sure.

Mar 03 '19 19:03 jtrakk

I've never done this before so I don't really know what I'm doing, but based on some examples I made an attempt and it seems to work. I'd expect it to just work in any project or editor by adding "myplugin" to the pylintrc file.

# mypackage/myplugin.py

import astroid

import pylint.checkers
import pylint.interfaces


def register(linter):
    """Register checkers."""
    linter.register_checker(LarkChecker(linter))


BASE_ID = 33


class LarkChecker(pylint.checkers.BaseChecker):

    __implements__ = (pylint.interfaces.IAstroidChecker,)

    name = "my-lark-checker"

    UNDEFINED_REFERENCE = "undefined-reference"
    EXCESSIVE_AMBIGUITY = "excessive-ambiguity"

    msgs = {
        f"E{BASE_ID}01": (
            "%s: something is undefined",
            UNDEFINED_REFERENCE,
            "All references must be defined",
        ),
        f"E{BASE_ID}02": (
            "%s: grammar is excessively ambiguous",
            EXCESSIVE_AMBIGUITY,
            "Make it clearer",
        ),
    }

    @pylint.checkers.utils.check_messages(UNDEFINED_REFERENCE, EXCESSIVE_AMBIGUITY)
    def visit_call(self, node):
        """Called for every function call in the source code."""

        if isinstance(node, astroid.node_classes.Attribute):
            if node.func.expr.name != "lark":
                return
            if node.func.attrname != "Lark":
                return

        if isinstance(node, astroid.node_classes.Call):
            if node.func.name != "Lark":
                return

        first_argument = node.args[0]
        # Take the first inferred guess about the value of the grammar.
        grammar = next(first_argument.infer()).value

        # Bad!
        if len(grammar) < 3:
            self.add_message(self.UNDEFINED_REFERENCE, args=node.func.name, node=node)
        if len(grammar) > 5:
            self.add_message(self.EXCESSIVE_AMBIGUITY, args=node.func.name, node=node)

Running it over this code

# tmp/example.py

from lark import Lark

grammar = """
SOMETHING
"""


Lark(grammar)

with

$ pylint --load-plugins mypackage.myplugin tmp/example.py

gives

************* Module example
tmp/example.py:1:0: C0111: Missing module docstring (missing-docstring)
tmp/example.py:3:0: C0103: Constant name "grammar" doesn't conform to UPPER_CASE naming style (invalid-name)
tmp/example.py:8:0: E3302: Lark: grammar is excessively ambiguous (excessive-ambiguity)

---------------------------------------------------------------------
Your code has been rated at -13.33/10 (previous run: 2.50/10, -15.83)

Mar 03 '19 21:03 jtrakk

@jtrakk Have you made any progress with it?

Jan 16 '20 12:01 erezsh

Using pyglint (a small convenience wrapper for pylint), it can be relatively concise to add checkers. Pyglint relies on Python 3; it wouldn't be hard to build a similar API that also supports Python 2. I don't have time to do a Py2 version at the moment, but I think this is a working approach.

"""Basic lints for lark grammars."""

import re
import typing as t

import attr
import astroid
import pyglint

# Setup
###############

group = pyglint.CheckerGroup("pylint-lark")


@attr.s(auto_attribs=True, order=False)
class GrammarCheckerRegistry:
    """Registry for collecting grammar checkers.."""

    functions: t.List[t.Callable] = attr.ib(factory=list)

    def checker(self, function):
        self.functions.append(function)
        return function


@group.check(astroid.node_classes.Call)
def check_grammar(checker, node) -> t.Iterator[pyglint.Message]:
    """Generate messages for Lark grammars."""

    if isinstance(node, astroid.node_classes.Attribute):
        if node.func.expr.name != "lark":
            return
        if node.func.attrname != "Lark":
            return

    if isinstance(node, astroid.node_classes.Call):
        if node.func.name != "Lark":
            return

    first_argument = node.args[0]
    # Take the first inferred guess about the value of the grammar.
    node = next(first_argument.infer())

    # Assumes the grammar is provided as a string literal. If the grammar is loaded from a
    # file or with importlib.resources, code can be added here to handle that case.
    grammar = node.value

    for function in registry.functions:  # pylint: disable=not-an-iterable
        for msg in function(grammar, node):
            msg = attr.evolve(msg, line=msg.line + node.lineno - grammar.count("\n"))
            yield msg


def register(linter):
    """Register checkers."""
    checker = pyglint.make_pylint_checker(group)
    linter.register_checker(checker(linter))


registry = GrammarCheckerRegistry()

# Define problems
##########################


BAD_NAME = group.problem(
    name="bad-name",
    text="The name '{identifier}' is against the guidelines.",
    explanation="It's a good idea to have a useful and descriptive name. For example, Counter instead of ctr.",
)


# Define checkers
############################


@registry.checker
def metasyntactic_names(grammar: str, node) -> t.Iterator[pyglint.Message]:
    """Identify some metasyntactic names literally."""
    for lineno, line in enumerate(grammar.splitlines()):
        for match in re.finditer("(foo|bar|baz|quux)", line):
            yield pyglint.message(
                problem=BAD_NAME,
                node=node,
                line=lineno,
                col_offset=match.start(),
                identifier=match.group(1),
            )


@registry.checker
def short_names(grammar: str, node) -> t.Iterator[pyglint.Message]:
    """Identify short names."""

    for lineno, line in enumerate(grammar.splitlines()):
        for match in re.finditer(r"\b([A-Za-z])\b", line):
            yield pyglint.message(
                problem=BAD_NAME,
                node=node,
                line=lineno,
                col_offset=match.start(),
                identifier=match.group(1),
            )

An example module to be linted.

"""A basic test for the linter."""

from lark import Lark

GRAMMAR = """foo
SOMETHING : x
"""


Lark(GRAMMAR)

Running it:

$ python -m pylint --load-plugins examples.larklint examples/lark_linted.py
************* Module examples.lark_linted
examples/lark_linted.py:5:0: E3219: The name 'foo' is against the guidelines. (bad-name)
examples/lark_linted.py:6:12: E3219: The name 'x' is against the guidelines. (bad-name)

Jan 17 '20 22:01 jtrakk

Keep in mind that linting can occur by looking at the text of the grammar, or the parse tree from the grammar.

Another possible idea (and similar to linting) would be to provide an optimizer for grammars. Lark parses the grammar and outputs the parse tree which is processed and produces:

information about suspicious constructs
places where a grammar change could make it "more correct/less ambiguous"
information about which of the types of parsers (lalr, early, etc.) could parse this grammar
could rewrite the parse tree to be "optimal" for the given grammar, and output the new grammar.

Apr 04 '22 17:04 julie777

lark lark copied to clipboard

Grammar linter

lark
lark copied to clipboard