
Panic Mode Recovery at End of File

Open · alanbarr opened this issue 4 years ago · 1 comment

Background

Ideally, I want to be able to parse out some specially formatted C++ comments and the functions they are documenting (think a bespoke form of Doxygen).

After some reading, it sounded a lot like a lexer/parser had already solved the hard part of this.

The possible problem is that I'm trying to be lazy and ignore all the surrounding C++ code. So, outside of my golden comment blocks (and, later, the function being documented), there's a sea of syntax errors.

I was hoping I could easily pull out the interesting parts and ignore everything else. I'm starting to think this might be outside the intended operating conditions of such a parser, though...

Sly

I've been testing out Sly, which I've confirmed will easily do what I want when there is no unexpected text.

However, I can't quite seem to get the error handling for this rather extreme case to do what I'd like. Currently, the problem appears to be when the unexpected text falls between a valid statement and the EOF.

Looking at the state debugfile, it looks like I need either a COMMENT_OPEN or an $end to reduce what should be a complete expression on the stack. However, I'm entering error() handling before hitting the end of the file, and I wonder if I need to be signaling this somehow.

I've got some simplified test code below.

Test Code

#! /usr/bin/env python3

from sly import Parser
from sly import Lexer
from pprint import pprint


class CommentLexer(Lexer):
    tokens = {COMMENT_OPEN, COMMENT_CLOSE, WORD, SEMI}

    COMMENT_OPEN = r"/\* COMMENT:"
    COMMENT_CLOSE = r"\*/"
    WORD = r"[^; \*\t\n\r\f\v]+"
    SEMI = r";"

    ignore_asterisk = r"\*"
    ignore_newline = r"\n"
    ignore_space = r" "

    def ignore_newline(self, t):
        self.lineno += t.value.count("\n")

    def error(self, t):
        print("Line %d: Bad character %r" % (self.lineno, t.value[0]))
        self.index += 1


class CommentParser(Parser):
    tokens = CommentLexer.tokens
    debugfile = "comment_parser.out"

    def __init__(self):
        self.comments = []

    @_("comment_doc comment_doc")
    def comment_doc(self, p):
        pass

    @_("COMMENT_OPEN string COMMENT_CLOSE")
    def comment_doc(self, p):
        print("#########")
        print(f"Got: {p.string}")
        print("#########")
        self.comments.append(p.string)
        return p.string

    @_("string string")
    def string(self, p):
        return p[0] + " " + p[1]

    @_("WORD")
    def string(self, p):
        return p.WORD

    def error(self, p):
        pprint(p)

        if not p:
            print("Hit the end of the file!")
            return

        print(f"Syntax error at type: {p.type} value: {p.value} line: {p.lineno}")
        while True:
            tok = next(self.tokens, None)

            if tok is None:
                print("Error Tok: Hit None")
                return tok

            if tok.type == "COMMENT_OPEN":
                print("Error Tok: Found new comment")
                return tok

            print(f"Ignoring: {tok.type}")


def test_one_comment_recovery_after():
    lexer = CommentLexer()

    test_data = """
    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    /* I don't care about this one. */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1


def test_one_comment_recovery_before():
    lexer = CommentLexer()

    test_data = """
    /* I don't care about this one. */

    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1

alanbarr commented Aug 04 '19 23:08

Trying to continue parsing after a syntax error is going to be messy; your better bet is to tokenize everything and discard what you don't need.

alberth commented Feb 11 '20 12:02