lark Terminal Editing before Compression

Open clementfaisandier opened this issue 1 year ago • 3 comments

The Problem

I am trying to apply one general grammar to various types of text files, specifically code and documentation files in languages such as Python, C, and LaTeX. These all use different comment characters, and because my grammar depends heavily on comments, the COMMENT_CHAR terminal must be set to the right value for each file I need to parse.

I recommend that Lark allow users to change or set terminal values before terminals are processed and compressed.

Alternatives

I have already tried using edit_terminals to produce this behavior. However, the edit_terminals callback runs after terminal values have been processed and compressed into a minimal set of tokens. Although one could modify the resulting complex regexes, doing so requires the user to understand how Lark works behind the scenes, and it would be clunky and error-prone.

Although one could also modify the input grammar before passing it to lark, text processing should be lark's responsibility. Having to parse a grammar and modify it so lark can parse the user's files seems a bit backwards.
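
For concreteness, the grammar-rewriting workaround could look roughly like the sketch below. set_terminal is a hypothetical helper, not part of Lark. It assumes the terminal is defined on a single line of the form NAME: ... with no priority suffix, and that escaping backslashes and double quotes is enough to produce a valid Lark string literal.

```python
import re


def set_terminal(grammar_text: str, name: str, literal: str) -> str:
    """Return grammar_text with terminal `name` redefined as a string literal.

    Hypothetical helper: it rewrites the grammar source text and knows
    nothing about Lark's internals.
    """
    # Assumption: escaping backslashes and double quotes is sufficient
    # for the literal to be read back as a plain string.
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    new_definition = f'{name}: "{escaped}"'

    # Match only a line that starts with the terminal name followed by ':',
    # so uses of the name inside other definitions are left alone.
    pattern = re.compile(rf"^{re.escape(name)}\s*:.*$", flags=re.MULTILINE)

    # A function is used as the replacement so backslashes in the new
    # definition are not interpreted as regex group references.
    result, count = pattern.subn(lambda _match: new_definition, grammar_text)
    if count != 1:
        raise ValueError(f"expected exactly one definition of {name}, found {count}")
    return result
```

The returned text would then be passed to lark.Lark(...) in place of the original grammar file contents, once per comment style.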

Context

This is my original post, which includes an example that illustrates the problem:

I'm having an issue using edit_terminals: I'm finding that Lark is compressing the terminals I've defined before I use edit_terminals.

This is my grammar; the terminal I am looking to modify is COMMENT_CHAR to support multiple languages:

start: (snippet | LINE)*

snippet: snippet_marker LINE* 

// TODO: Evaluate if the prefix for each token should be in or out of the token.
snippet_marker.1: PREFIX MARKER _IWS* /.+/ SUFFIX
PREFIX: _IWS* COMMENT_CHAR _IWS*
MARKER: _SPIDER
SUFFIX: _EOL
_DEFINITION_TOKENS: CONTEXT _IWS* TOPIC _IWS* CONTENT_TYPE
_BOOLEAN_FLAGS: _DEFINITION_TOKENS? (LINK | EMBEDDING)          // Makes the definition tokens filtering options
CONTEXT: "#" /\w{1,16}/     // What's the general theme?
TOPIC: "@" /\w{1,16}/       // What is this snippet about?
CONTENT_TYPE: "$" /[DRAC]/  // Documentation, Reasoning, API, Code
LINK: "?"
EMBEDDING: "!"

// Resources

_SPIDER: "//\(oo)/\\"
_BORING: "SNIPPET"
_ROBOT: "[o_o]"

LINE: _IWS* COMMENT_CHAR? _SENTENCE? _EOL
_SENTENCE: (_WORD _IWS+)* _WORD
_WORD: /\S+/
COMMENT_CHAR: "TO BE OVERRIDE BY PROGRAM- DO NOT REMOVE - DO NOT USE ANOTHER TOKEN FOR COMMENT CHAR"

_EOL: _IWS* _NL
_IWS: /[\t ]/
_NL: /\r?\n/

This is the Python code:

import lark

def terminal_callback(terminal_definition):
    print(terminal_definition)

with open('grammar.lark', 'rt') as file:
    parser = lark.Lark(file.read(), edit_terminals=terminal_callback)

with open('sandbox/src/base_calc.py', 'rt') as file:
    parser.parse(text=file.read())

But the output is:

TerminalDef('PREFIX', '(?:[\t ])*TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR(?:[\t ])*')
TerminalDef('MARKER', '//\\\\\\(oo\\)/\\\\')
TerminalDef('SUFFIX', '(?:[\t ])*\r?\n')
TerminalDef('LINE', '(?:[\t ])*(?:TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR)?(?:(?:\\S+(?:[\t ])+)*\\S+)?(?:[\t ])*\r?\n')
TerminalDef('_IWS', '[\t ]')
TerminalDef('__ANON_0', '.+')

Clearly, COMMENT_CHAR was absorbed into PREFIX (and LINE), which makes it difficult to edit consistently.

It's possible I'm not using this correctly, but I also feel the edit_terminals callback should run before compression; otherwise users need to predict how terminals will be compressed in order to use it consistently.
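
To show what patching the compressed patterns would involve, here is a minimal sketch that operates on plain regex strings only. It assumes Lark escapes string literals the way Python's re.escape does, which is consistent with the output above but remains an assumption. PLACEHOLDER and patch_pattern are hypothetical names, and writing the patched string back into a TerminalDef is not shown.

```python
import re

# Hypothetical placeholder literal used for COMMENT_CHAR in the grammar.
PLACEHOLDER = "COMMENT CHAR PLACEHOLDER"


def patch_pattern(regex: str, comment_char: str) -> str:
    """Swap the escaped placeholder inside a composite regex for a real marker.

    Assumption: the placeholder appears in the composite regex exactly as
    re.escape(PLACEHOLDER) would render it.
    """
    escaped_placeholder = re.escape(PLACEHOLDER)
    if escaped_placeholder not in regex:
        return regex  # this terminal never referenced the placeholder
    return regex.replace(escaped_placeholder, re.escape(comment_char))


# A composite pattern shaped like the PREFIX terminal in the output above.
prefix_regex = "(?:[\t ])*" + re.escape(PLACEHOLDER) + "(?:[\t ])*"
patched = patch_pattern(prefix_regex, "//")

# Check that the patched pattern accepts an indented C-style comment marker.
print(re.fullmatch(patched, "  // ") is not None)
```

Every terminal that referenced COMMENT_CHAR (here PREFIX and LINE) would need the same patch, and knowing which ones those are is the part that depends on predicting how terminals get inlined.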

Thank you, Clement

Sep 03 '24 13:09 clementfaisandier
