Panic Mode Recovery at End of File
Background
Ideally, I want to be able to parse out some specially formatted C++ comments and the function each one documents (think of it as a bespoke form of Doxygen).
After some reading, it sounded like a lexer/parser had already solved the hard part of this.
The possible problem is that I'm trying to be lazy and ignore all the surrounding C++ code. So, outside of my golden comment blocks (and, later, the function being documented), there's a sea of syntax errors.
I was hoping I could easily pull out the interesting parts and ignore everything else. I'm starting to think this might be outside the intended operating conditions of such a parser, though...
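For concreteness, the kind of input I have in mind looks roughly like this (a made-up illustration; the test code below only handles the comment part so far):

/* COMMENT: Frobnicates the given widget in place. */
void frobnicate(Widget& w);

Everything outside a block marked with COMMENT: should be ignored.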
Sly
I've been testing out Sly, which I've confirmed will easily do what I want when there is no unexpected text.
However, I can't quite get its rather extreme error handling to do what I'd like. Currently, the problem appears when the unexpected text sits between a valid statement and the EOF.
Looking at the generated debugfile, it looks like I need either a COMMENT_OPEN or an $end to reduce what should be a complete expression on the stack. However, I'm entering error() handling before hitting the end of the file, and I wonder if I need to be signaling this somehow?
I've got some simplified test code below.
Test Code
#! /usr/bin/env python3
from sly import Lexer, Parser
from pprint import pprint
class CommentLexer(Lexer):
    tokens = {COMMENT_OPEN, COMMENT_CLOSE, WORD, SEMI}

    COMMENT_OPEN = r"/\* COMMENT:"
    COMMENT_CLOSE = r"\*/"
    WORD = r"[^; \*\t\n\r\f\v]+"
    SEMI = r";"

    # Ignored characters: lone asterisks, newlines, and spaces.
    ignore_asterisk = r"\*"
    ignore_newline = r"\n"
    ignore_space = r" "

    # Attach an action to the ignore_newline pattern to track line numbers.
    def ignore_newline(self, t):
        self.lineno += t.value.count("\n")

    def error(self, t):
        print("Line %d: Bad character %r" % (self.lineno, t.value[0]))
        self.index += 1
class CommentParser(Parser):
    tokens = CommentLexer.tokens
    debugfile = "comment_parser.out"

    def __init__(self):
        self.comments = []

    # A document is any sequence of comment blocks. NOTE: this doubly
    # recursive rule is ambiguous and yields shift/reduce conflicts.
    @_("comment_doc comment_doc")
    def comment_doc(self, p):
        pass

    @_("COMMENT_OPEN string COMMENT_CLOSE")
    def comment_doc(self, p):
        print("#########")
        print(f"Got: {p.string}")
        print("#########")
        self.comments.append(p.string)
        return p.string

    @_("string string")
    def string(self, p):
        return p[0] + " " + p[1]

    @_("WORD")
    def string(self, p):
        return p.WORD

    def error(self, p):
        pprint(p)
        if not p:
            # Sly calls error() with None once the token stream is exhausted.
            print("Hit the end of the file!")
            return
        print(f"Syntax error at type: {p.type} value: {p.value} line: {p.lineno}")
        # Panic mode: discard tokens until the start of the next comment block.
        while True:
            tok = next(self.tokens, None)
            if tok is None:
                print("Error Tok: Hit None")
                return tok
            if tok.type == "COMMENT_OPEN":
                print("Error Tok: Found new comment")
                # The returned token becomes the parser's new lookahead.
                return tok
            print(f"Ignoring: {tok.type}")
def test_one_comment_recovery_after():
    lexer = CommentLexer()
    test_data = """
/* COMMENT: This is the
only comment string I'd
like to parse out
*/
/* I don't care about this one. */
"""
    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1


def test_one_comment_recovery_before():
    lexer = CommentLexer()
    test_data = """
/* I don't care about this one. */
/* COMMENT: This is the
only comment string I'd
like to parse out
*/
"""
    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1
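For reference, I run these with pytest (test_comments.py is just a placeholder filename):

python3 -m pytest test_comments.py -s

The "before" test passes, since error() can resynchronize on the following COMMENT_OPEN; the "after" test is the failing EOF case described above.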
Trying to continue parsing after a syntax error is going to be messy; your better option is to tokenize everything and discard what you don't need.
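A minimal sketch of that suggestion, reusing the CommentLexer from the question and skipping the parser entirely (extract_comments is a name made up for illustration): walk the token stream and keep only the WORD tokens that fall between a COMMENT_OPEN and its matching COMMENT_CLOSE.

def extract_comments(text):
    """Collect the text of /* COMMENT: ... */ blocks, ignoring everything else."""
    lexer = CommentLexer()
    comments = []
    words = None  # None means we are currently outside a COMMENT: block
    for tok in lexer.tokenize(text):
        if tok.type == "COMMENT_OPEN":
            words = []  # start collecting words for this block
        elif tok.type == "COMMENT_CLOSE" and words is not None:
            comments.append(" ".join(words))
            words = None  # back to discarding everything
        elif words is not None and tok.type == "WORD":
            words.append(tok.value)
        # Any other token (stray junk, SEMI, unmarked comments) is dropped.
    return comments

Run over either test_data above, this returns the single documented comment string regardless of what surrounds it, with no error recovery needed.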