gocc icon indicating copy to clipboard operation
gocc copied to clipboard

Feature: Auto suggestion

Open jmarais opened this issue 5 years ago • 11 comments

Hi. I have a use case where I am using gocc for a simple query language for my application. However, the end user of the application does not have in depth knowledge on the syntax defined in the bnf file.

I thought it might be useful to try and use the parser in order to help the end user with the query language by capturing the output when the parser errors and just providing the list of expected tokens it generates to my end user.

I created a small example here for demonstration: autosuggest example The additional code I added to the parser is just this bit: SuggestParse

As you can see, I just wrap the scanner and force a INVALID token at the end. This forces the parser to generate an error and supply a list of suggested tokens.

There is a small demo you can build for the example:

go build github.com/goccmack/gocc/example/autosuggest/cmd/demo
./demo

You can submit your text with enter and it will supply a list of suggested token the parser is expecting.

One obvious issue is the string representation of the token might not be the same as the actual characters it will expect. For example, running the demo with:

find author with "test" at timerange
ExpectedTokens:
[0] int_lit
[1] dayMonth
[2] number
[3] space

In this case the dayMonth token actually expects '-' _decimal_digit _decimal_digit. In order to present this to an end user I would have to translate the suggested tokens to their character representation. An additional table that could contain the regex expression for a token might be a bit more human friendly.

dayMonth -> "-[0-9][0-9]"

This can then be presented to an end user.

Would this be a feature you would want to include in gocc? I have no problem finishing the work. I just want to hear your thoughts before I continue.

jmarais avatar Apr 03 '19 18:04 jmarais

I think this would be a great feature that will help with adoption of and onboarding for new DSLs.

awalterschulze avatar Apr 06 '19 07:04 awalterschulze

@mewmew @shivansh @goccmack @sangisos What do you think?

awalterschulze avatar Apr 06 '19 07:04 awalterschulze

Definitely looks like a cool application of Gocc! Does it need to be included in the main Gocc tool though? Or can it be a stand alone tool? Personally, I'd prefer to keep the Gocc implementation rather minimal, and would warmly invite a larger ecosystem surrounding Gocc parsers and related grammars. It there are specific technical limitations that prevents autosuggest from being a separate tool, then we could specify and discuss those.

Cheers, Robin

mewmew avatar Apr 06 '19 14:04 mewmew

I like the idea but it seems to me that by wrapping the standard parser with the suggest parser the original semantics of parsing a complete sentence that ends with EOF becomes a bit obfuscated. The suggest parser has the feeling of a stream parser, which is subtly different. It seems cleaner to me to separate the two cases and provide two parser constructors: 1) the existing complete string parser, and 2) the new stream parser. The latter is the wrapped parser suggested by jmarais. All properly documented of course.

What do you think?

Doei, Marius

goccmack avatar Apr 07 '19 08:04 goccmack

Since the suggest parser is not a proper stream parser I would also suggest calling it "suggest parser" rather than "stream parser". We might want to consider a proper stream parser in future.

goccmack avatar Apr 07 '19 09:04 goccmack

I think this would be a great feature that will help with adoption of and onboarding for new DSLs.

Yes. That is the main motivation behind this suggestion.

@mewmew

Does it need to be included in the main Gocc tool though? I'd prefer to keep the Gocc implementation rather minimal It there are specific technical limitations that prevents autosuggest from being a separate tool, then we could specify and discuss those.

I understand the sentiment. I thought it might be a suitable place to add it since you are already parsing the BNF syntax which one would need in order to produce token-to-characters translations. I also thought there could be some re-usable piece that might be able to enhance error messages in the current generated parser.

@goccmack

I like the idea but it seems to me that by wrapping the standard parser with the suggest parser the original semantics of parsing a complete sentence that ends with EOF becomes a bit obfuscated.

Yes, At the moment it is just a hack in order to convey the core idea. I will admit that I have limited parser (and parser generation) experience so there might be a better place to hook something like this in. Perhaps there is a better way to use the lexer?

It seems cleaner to me to separate the two cases and provide two parser constructors: 1) the existing complete string parser, and 2) the new stream parser. The latter is the wrapped parser suggested by jmarais. All properly documented of course.

I like that idea. Different constructors definitely provides a more intuitive API.

It could be that one does not need the 'auto suggestion' as a feature of gocc. I think I might need some additional generic structures/methods in order to enable the feature.

  1. A token-to-characters translation map. It would be beneficial for gocc to generate this because it could improve error messages. If I then generate or add an 'auto suggester' it would mean one does not have to re-parse the BNF syntax in order to provide this translation.
  2. If a stream parser is added one could use that. However I think there will still be a need for a token-to-characters translation map. a) Would backspacing be a problem with a stream parser? Currently I get around this by just re-parsing every time. b) There might be more nuances regarding a stream parser?

jmarais avatar Apr 08 '19 08:04 jmarais

I added a method to resolve the tokens to their character definitions using the LexPart.RegDefs. Here is the example: TokenTranslations Here is the output on my BNF example: characterMap Note: No special formatting has been applied to the character sequences. This will/can be changed.

I also updated the demo to automatically read in the characters you press, however it only works on linux now. But for the same example as above, you will be able to get the character sequence as suggestions:

find author with "test" at timerange
ExpectedTokens:
[0] '0'|['-']('1'-'9'){'0'-'9'}
[1] '-''0'-'9''0'-'9'
[2] '0'-'9'{'0'-'9'}
[3] '/''/'{}'\n'|' '|'\t'|'\n'|'\r'{' '|'\t'|'\n'|'\r'}

compared to the case without token-to-character translation:

find author with "test" at timerange
ExpectedTokens:
[0] int_lit
[1] dayMonth
[2] number
[3] space

jmarais avatar Apr 09 '19 07:04 jmarais

Sorry guys. I was very busy this week and next week will be no better. @awalterschulze @mewmew: Do we go ahead with a separate parser constructor method?

goccmack avatar Apr 13 '19 13:04 goccmack

Sorry guys. I was very busy this week and next week will be no better. @awalterschulze @mewmew: Do we go ahead with a separate parser constructor method?

Hi! I'm also busy in the weeks to come. Mid-term exams coming up, and will be travelling throughout Korea :) I trust your decision for this feature and agree that it has useful applications.

Cheers, Robin

mewmew avatar Apr 13 '19 14:04 mewmew

In theory I also agree with adding a token-to-characters translation map to gocc. But I think it would be better as a token-to-regex map. Where we then also provide a function to parse the regex and return first characters for the regexes.

The reason for this change is that I can think of many ways to provide suggestions, for example:

'0'|['-']('1'-'9'){'0'-'9'}
int_lit
int_lit where int_lit is '0'|['-']('1'-'9'){'0'-'9'}
`0` | `-` | '1'-'9'
`0`,`-` | '1'-'9'
int_lit where firsts are '0', '-', '1'-'9'
int_lit where firsts are '0', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9'

I think we should provide functions and generated code to aid the user in providing suggestions and make it as easy as possible to do so, but still provide the flexibility of choosing the way the suggestions are reported.

awalterschulze avatar Apr 13 '19 15:04 awalterschulze

PS

It does not necessarily have to be regexes, but it must preserve as much as possible of the original lex structure, so that we can provide options of how to report the error.

awalterschulze avatar Apr 13 '19 15:04 awalterschulze