gocc
Feature: Auto suggestion
Hi. I have a use case where I am using gocc for a simple query language for my application. However, the end user of the application does not have in-depth knowledge of the syntax defined in the BNF file.
I thought it might be useful to use the parser to help the end user with the query language: capture the output when the parser errors and present the list of expected tokens it generates to my end user.
I created a small example for demonstration: autosuggest example. The additional code I added to the parser is just this bit: SuggestParse
As you can see, I just wrap the scanner and force an INVALID token at the end. This forces the parser to generate an error and supply a list of suggested tokens.
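The wrapping idea can be sketched in isolation. This is a minimal stand-in, not gocc's actual generated code: the `Token`, `Scanner`, and `suggestScanner` types here are simplified assumptions that only illustrate the trick of replacing the first EOF with an INVALID token so the parser's error path reports its expected-token set.

```go
package main

import "fmt"

// Hypothetical minimal token types; gocc's generated token package differs.
type Type int

const (
	EOF Type = iota
	INVALID
	IDENT
)

type Token struct {
	Type Type
	Lit  string
}

type Scanner interface {
	Scan() Token
}

// sliceScanner replays a fixed token stream, then reports EOF forever.
type sliceScanner struct {
	toks []Token
	pos  int
}

func (s *sliceScanner) Scan() Token {
	if s.pos < len(s.toks) {
		t := s.toks[s.pos]
		s.pos++
		return t
	}
	return Token{Type: EOF}
}

// suggestScanner wraps another scanner and replaces the first EOF with a
// single INVALID token, forcing the parser into its error path so it
// reports the set of tokens it expected at the end of the input.
type suggestScanner struct {
	inner    Scanner
	injected bool
}

func (s *suggestScanner) Scan() Token {
	t := s.inner.Scan()
	if t.Type == EOF && !s.injected {
		s.injected = true
		return Token{Type: INVALID, Lit: "\x00"}
	}
	return t
}

func main() {
	base := &sliceScanner{toks: []Token{{IDENT, "find"}, {IDENT, "author"}}}
	s := &suggestScanner{inner: base}
	for {
		t := s.Scan()
		fmt.Println(t.Type, t.Lit)
		if t.Type == EOF {
			break
		}
	}
}
```

A real implementation would hand `suggestScanner` to the generated parser in place of the normal scanner and collect the expected tokens from the resulting parse error.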
There is a small demo you can build for the example:
go build github.com/goccmack/gocc/example/autosuggest/cmd/demo
./demo
You can submit your text with enter
and it will supply a list of suggested tokens the parser is expecting.
One obvious issue is the string representation of the token might not be the same as the actual characters it will expect. For example, running the demo with:
find author with "test" at timerange
ExpectedTokens:
[0] int_lit
[1] dayMonth
[2] number
[3] space
In this case the dayMonth token actually expects '-' _decimal_digit _decimal_digit.
In order to present this to an end user I would have to translate the suggested tokens into their character representation. An additional table containing the regular expression for each token might be a bit more human friendly.
dayMonth -> "-[0-9][0-9]"
This can then be presented to an end user.
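Such a table could be used like the sketch below. The map contents and the `translate` helper are illustrative assumptions (only `dayMonth -> "-[0-9][0-9]"` comes from the example above; the other patterns are made up), showing how a generated token-to-pattern table could rewrite the expected-token list before it is shown to the user.

```go
package main

import "fmt"

// Hypothetical generated table mapping token names to a human-readable
// character pattern. Only the dayMonth entry comes from the example
// above; the others are placeholders.
var tokenPattern = map[string]string{
	"dayMonth": "-[0-9][0-9]",
	"number":   "[0-9]+",
	"int_lit":  "0|-?[1-9][0-9]*",
}

// translate renders a list of expected token names for an end user,
// falling back to the raw token name when no pattern is known.
func translate(expected []string) []string {
	out := make([]string, 0, len(expected))
	for _, name := range expected {
		if p, ok := tokenPattern[name]; ok {
			out = append(out, fmt.Sprintf("%s (%s)", name, p))
		} else {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	fmt.Println(translate([]string{"dayMonth", "space"}))
}
```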
Would this be a feature you would want to include in gocc? I have no problem finishing the work. I just want to hear your thoughts before I continue.
I think this would be a great feature that will help with adoption of and onboarding for new DSLs.
@mewmew @shivansh @goccmack @sangisos What do you think?
Definitely looks like a cool application of Gocc! Does it need to be included in the main Gocc tool though? Or can it be a standalone tool? Personally, I'd prefer to keep the Gocc implementation rather minimal, and would warmly invite a larger ecosystem surrounding Gocc parsers and related grammars. If there are specific technical limitations that prevent autosuggest from being a separate tool, then we could specify and discuss those.
Cheers, Robin
I like the idea but it seems to me that by wrapping the standard parser with the suggest parser the original semantics of parsing a complete sentence that ends with EOF becomes a bit obfuscated. The suggest parser has the feeling of a stream parser, which is subtly different. It seems cleaner to me to separate the two cases and provide two parser constructors: 1) the existing complete string parser, and 2) the new stream parser. The latter is the wrapped parser suggested by jmarais. All properly documented of course.
What do you think?
Doei, Marius
Since the suggest parser is not a proper stream parser, I would also suggest calling it a "suggest parser" rather than a "stream parser". We might want to consider a proper stream parser in the future.
I think this would be a great feature that will help with adoption of and onboarding for new DSLs.
Yes. That is the main motivation behind this suggestion.
@mewmew
Does it need to be included in the main Gocc tool though? [...] I'd prefer to keep the Gocc implementation rather minimal. [...] If there are specific technical limitations that prevent autosuggest from being a separate tool, then we could specify and discuss those.
I understand the sentiment. I thought it might be a suitable place to add it since you are already parsing the BNF syntax which one would need in order to produce token-to-characters translations. I also thought there could be some re-usable piece that might be able to enhance error messages in the current generated parser.
@goccmack
I like the idea but it seems to me that by wrapping the standard parser with the suggest parser the original semantics of parsing a complete sentence that ends with EOF becomes a bit obfuscated.
Yes, at the moment it is just a hack to convey the core idea. I will admit that I have limited parser (and parser-generator) experience, so there might be a better place to hook something like this in. Perhaps there is a better way to use the lexer?
It seems cleaner to me to separate the two cases and provide two parser constructors: 1) the existing complete string parser, and 2) the new stream parser. The latter is the wrapped parser suggested by jmarais. All properly documented of course.
I like that idea. Different constructors definitely provide a more intuitive API.
It could be that the 'auto suggestion' does not need to be a feature of gocc itself. I think I might need some additional generic structures/methods in order to enable the feature:
- A token-to-characters translation map. It would be beneficial for gocc to generate this because it could improve error messages. If I then generate or add an 'auto suggester' it would mean one does not have to re-parse the BNF syntax in order to provide this translation.
- If a stream parser is added, one could use that. However, I think there will still be a need for a token-to-characters translation map. a) Would backspacing be a problem with a stream parser? Currently I get around this by just re-parsing every time. b) There might be more nuances regarding a stream parser.
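The re-parse-on-every-edit approach mentioned in (a) can be sketched as follows. The `editor` type and `SuggestFunc` are hypothetical names for illustration: the editor keeps the whole input buffer and re-runs the suggester after every edit, so backspace is just another edit and no parser state ever needs to be rewound.

```go
package main

import "fmt"

// SuggestFunc stands in for the generated suggest parser: it takes the
// full input so far and returns the expected next tokens.
type SuggestFunc func(input string) []string

// editor keeps the whole input buffer and re-runs the suggester after
// every edit; backspace needs no special handling because the entire
// buffer is re-parsed each time.
type editor struct {
	buf     []rune
	suggest SuggestFunc
}

func (e *editor) Type(r rune) []string {
	e.buf = append(e.buf, r)
	return e.suggest(string(e.buf))
}

func (e *editor) Backspace() []string {
	if len(e.buf) > 0 {
		e.buf = e.buf[:len(e.buf)-1]
	}
	return e.suggest(string(e.buf))
}

func main() {
	var calls []string
	e := &editor{suggest: func(s string) []string {
		calls = append(calls, s)
		return nil
	}}
	e.Type('a')
	e.Type('b')
	e.Backspace()
	fmt.Println(calls) // [a ab a]
}
```

This is O(n) work per keystroke, which is fine for short query strings; a true stream parser would avoid the re-parse but would need checkpointing to undo edits.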
I added a method to resolve the tokens to their character definitions using the LexPart.RegDefs.
Here is the example:
TokenTranslations
Here is the output on my BNF example:
characterMap
Note: No special formatting has been applied to the character sequences. This will/can be changed.
I also updated the demo to automatically read the characters as you press them; however, it only works on Linux for now. For the same example as above, you will now get the character sequences as suggestions:
find author with "test" at timerange
ExpectedTokens:
[0] '0'|['-']('1'-'9'){'0'-'9'}
[1] '-''0'-'9''0'-'9'
[2] '0'-'9'{'0'-'9'}
[3] '/''/'{}'\n'|' '|'\t'|'\n'|'\r'{' '|'\t'|'\n'|'\r'}
compared to the case without token-to-character translation:
find author with "test" at timerange
ExpectedTokens:
[0] int_lit
[1] dayMonth
[2] number
[3] space
Sorry guys. I was very busy this week and next week will be no better. @awalterschulze @mewmew: Do we go ahead with a separate parser constructor method?
Hi! I'm also busy in the weeks to come. Mid-term exams coming up, and will be travelling throughout Korea :) I trust your decision for this feature and agree that it has useful applications.
Cheers, Robin
In theory I also agree with adding a token-to-characters translation map to gocc, but I think it would be better as a token-to-regex map, where we then also provide a function that parses the regex and returns the first characters for each regex.
The reason for this change is that I can think of many ways to provide suggestions, for example:
'0'|['-']('1'-'9'){'0'-'9'}
int_lit
int_lit where int_lit is '0'|['-']('1'-'9'){'0'-'9'}
`0` | `-` | '1'-'9'
`0`,`-` | '1'-'9'
int_lit where firsts are '0', '-', '1'-'9'
int_lit where firsts are '0', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9'
I think we should provide functions and generated code to aid the user in providing suggestions and make it as easy as possible to do so, while still providing the flexibility of choosing how the suggestions are reported.
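Computing "firsts" from the lex structure, as in the last two variants above, can be sketched with a tiny AST. This is a stand-in for gocc's actual lex AST (the types `Char`, `Range`, `Alt`, `Seq`, `Opt`, `Rep` are assumptions that mirror the grammar's character, range, alternation, sequence, optional, and repetition constructs):

```go
package main

import "fmt"

// A tiny stand-in for gocc's lex AST (LexPart.RegDefs): just enough
// structure to compute the possible first characters of a token.
type Expr interface{}

type Char struct{ R rune }       // 'x'
type Range struct{ Lo, Hi rune } // 'a'-'z'
type Alt struct{ Exprs []Expr }  // a | b
type Seq struct{ Exprs []Expr }  // a b
type Opt struct{ E Expr }        // [a]
type Rep struct{ E Expr }        // {a}

// firsts returns the chars/ranges that can start a match of e; nullable
// reports whether e can match the empty string (in which case a
// following sequence element also contributes firsts).
func firsts(e Expr) (set []Expr, nullable bool) {
	switch x := e.(type) {
	case Char:
		return []Expr{x}, false
	case Range:
		return []Expr{x}, false
	case Alt:
		for _, sub := range x.Exprs {
			f, n := firsts(sub)
			set = append(set, f...)
			nullable = nullable || n
		}
		return set, nullable
	case Seq:
		for _, sub := range x.Exprs {
			f, n := firsts(sub)
			set = append(set, f...)
			if !n {
				return set, false
			}
		}
		return set, true
	case Opt:
		f, _ := firsts(x.E)
		return f, true
	case Rep:
		f, _ := firsts(x.E)
		return f, true
	}
	return nil, false
}

func main() {
	// int_lit : '0' | ['-'] ('1'-'9') {'0'-'9'}
	intLit := Alt{[]Expr{
		Char{'0'},
		Seq{[]Expr{Opt{Char{'-'}}, Range{'1', '9'}, Rep{Range{'0', '9'}}}},
	}}
	f, _ := firsts(intLit)
	fmt.Println(f) // '0', '-', '1'-'9' as rune values
}
```

On the int_lit definition this yields exactly the "firsts are '0', '-', '1'-'9'" rendering from the list above; a reporting layer could then format the same set in any of the other styles.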
PS: It does not necessarily have to be regexes, but it must preserve as much of the original lex structure as possible, so that we can provide options for how to report the error.