lark icon indicating copy to clipboard operation
lark copied to clipboard

Using lark for a language server

Open rmcgregor1990 opened this issue 5 years ago • 22 comments

Hey, I'm wondering how conducive lark would be for creating a language server implementation for a given DSL.

For some background:

  • I have a custom DSL I'm using to describe device interfaces and then using jinja2 to generate the specific interface into the target language
  • The main implementation of this obviously in Python
  • I'm using antlr4 to define the grammar, which is used to generate a python and typescript implementations.
  • I have written a basic language server extension for vscode in typescript using the antlr completion engine antlr4-c3. This provides live validation and auto-completion which is really useful.

Problems:

  • Antlr's python implementation is quite slow (~350ms to parse + transform a sample file compared to ~50ms with lark)
  • I'm having to duplicate a lot of work between python & typescript, though admittedly they are solving slightly different problems.

I would quite like to use to Lark and implement the language server as part of the main python package if possible.

I realize a language sever requires some specific parser features namely:

  • Error tolerance: This can partly be resolved by making the grammar itself more error tolerant and solving those errors later. I notice lark also an error puppet which can maybe be used for this purpose.
  • Support for auto completion information: Its is namely done by parsing an incomplete source file up to the current caret position and extracting both the next possible tokens and the currently parser rule context from the parser. This can be combined with info from the last successful parse to produce auto completion options.

I'm wondering if lark as any support for these features? And If so how I might go about implementing them?

If something like this is possible, it might also be cool to write a language server for the lark grammar itself to aid development.

Cheers, Rob

rmcgregor1990 avatar Sep 07 '20 12:09 rmcgregor1990

I already worked on a lsp for lark itself, but found the same problems you did. In theory there is currently a few projects running that improve exactly these points. (Error recovery/partial parsing) (as example #652 go in that direction, and there are already a few merged PR concerning this).

It would be interesting what ideas you have on how the API should look like.

MegaIng avatar Sep 07 '20 15:09 MegaIng

Hi Rob,

Yes, I believe that it should be possible to use Lark for a language server. I think the error tolerance is already pretty good, as it's possible to ignore any error and continue parsing. The main difficulty is in continuing correctly, but I think that's a challenge for most highlighters, because user intent is hard to guess. Lark does allow you to try many different continuations, and backtrack, and as long as you don't overuse it, it shouldn't degrade performance too much.

Completion information might be more tricky, depending on the language. I have some ideas regarding how to do it, which I'll have to take some time to explain fully. But I already have some code to show that it's possible.

However, doing it incrementally poses yet another challenge, and I don't know yet if it could be adapted to support it or not.

I'm also personally interested in pursuing this direction for my own projects, so if you'd like to help me tackle it and provide a language-server interface for Lark that's generic (as much as that's possible), I'll be happy to join forces and help make that happen.

Cheers, Erez

erezsh avatar Sep 07 '20 16:09 erezsh

Correction: Some errors might not be recoverable right now, but they will be soon :)

erezsh avatar Sep 07 '20 16:09 erezsh

Hey thanks for the info,

Firstly I would be interested in helping to make this happen. It would be awesome to able to easily create language servers in python, and as far as I can tell there isn't any obvious starting points for that at the moment.

I think I might have to play around with lark a bit more to fully understand how error recovery might work. I agree it might require some additional information about the langauge to get right. However antlr does seem to have some builtin free error recovery, so maybe I should look more into how that functions.

Regarding the completion API, in antlr-c3 it looks a bit like this:

class CompletionCandidates:
    # a map of the next possible token to a list of tokens which immediately follow it, 
    # if they exist (I haven't personally found any use for the following token list with my grammar)
    tokens: Mapping[TokenId, List[TokenId]]
    # a map of the current parser rule to the rule stack at that point
    rules: Mapping[RuleId, List[RuleId]]

def collect_candidates(
    parse_tree,
    ignored_tokens: List[TokenId],  # the token blacklist
    preferred_rules: List[RuleId],   # the rule whitelist
    caret_token_idx: int) -> CompletionCandidates: ...

The tokens give you keyword completion and you use the rules + their contexts + previous parse information to do more advanced autocompletion of identifer names etc.

Of course you have to locate the caret token index yourself and you seem to end up which lots of parser rule name aliases in your grammar so they can be filtered out specifically during completion.

Do you think something similar could be achievable with lark?

For my specific DSL lark is fast enough that I don't think incremental parsing is required. Just re-evaluating the entire document with every change will be performant enough.

If you'd like to make this happen, what do you think the first steps should be?

As a side node, if we're going to do this, I think I would quite like to make my own python lsp implementation using asyncio and the new TypedDict to define the json interface as a standalone package. Though this pygls seems to exist which might also work fine.

Cheers, Rob

rmcgregor1990 avatar Sep 07 '20 19:09 rmcgregor1990

Awesome, so let's see how we can make that happen!

I'm not sure why they need ignored_tokens or preferred_rules. My current approach is to parse normally, and when I encounter an EOF error, insert a MARKER token (if it expected a NAME), and unfold the state so it produces a legal tree (based on grammar). Then you can interpret the tree, pushing variables into scope like you would normally, and see what's available in scope when you reach MARKER. It seems to work pretty well even for non-trivial languages. The only language-specific part is the interpreter (which admittedly might be a lot of work, but doesn't need anything but the AST)

I think the first thing to do is to define an API. Maybe we can base it on the jupyter API? https://jupyter-client.readthedocs.io/en/stable/wrapperkernels.html#MyKernel.do_complete

We might have to extend it a little bit.

Then we need

  1. a vscode plugin that knows how to use that API

  2. A python class that provides that API given a grammar + language specific code.

If you can take care of part 1, I think I can do part 2 (using the method I described above)

But of course the point is to use the collective wisdom, since my use case might a little different than yours.

How does that sound?

erezsh avatar Sep 07 '20 20:09 erezsh

  1. a vscode plugin that knows how to use that API

That is in theory what a LSP is for. This is quite simple to implement, but it might be worth to create a whole extra package, that depends on Lark, dealing with everything in regards to API/Server site state keeping/caching.

MegaIng avatar Sep 07 '20 21:09 MegaIng

Just to be clear I'm talking about implementing language-server-protocol with python acting as the server. This is an editor agnostic json-rpc interface. Hence the vscode plugin part is super simple as it just runs the python based language server as a subprocess and connects to it with vscodes builtin language server client.

rmcgregor1990 avatar Sep 07 '20 21:09 rmcgregor1990

  1. a vscode plugin that knows how to use that API

That is in theory what a LSP is for. This is quite simple to implement, but it might be worth to create a whole extra package, that depends on Lark, dealing with everything in regards to API/Server site state keeping/caching.

Yes, I think there is another python package in the middle here to implement the language server json API + all the document caching / change tracking. I'm not sure why it needs to depend on lark in any way though.

rmcgregor1990 avatar Sep 07 '20 21:09 rmcgregor1990

My idea was to implement as much generic code as possible in this package already, e.g. syntax highlighting/error messages, but that is not required.

MegaIng avatar Sep 07 '20 21:09 MegaIng

I see, I was thinking that was up to the specfic language implementation as I'm not entirely sure how that could be done in a generic way. When implementing my (admittedly) very simple language server, those were the sort of operations I performed using the intermediate representation of my DSL and didn't invole the parser at all.

rmcgregor1990 avatar Sep 07 '20 22:09 rmcgregor1990

Especially syntax error, and code folding are easy to do in general with little help from language specific code. (well, code folding needs a little bit, but still)

MegaIng avatar Sep 07 '20 22:09 MegaIng

Syntax highlighting is an interesting one. Vscode uses textmate grammers as it main highlighting engine and then supports additional semantic syntax highlighting via the language server protcol. I'm not sure its intended to do all syntax highlighting via the language server protocol, I think it idea is they are ment to be additive.

Normal structural syntax highlighting could be done via lark I guess, but I imagine that would produce a lot of traffic. Semantic syntax highlighting requires language specific knowledge outside of the grammar, and definately a lanaguage specific thing.

I agree syntax error should be almost forwarded directly from lark, but there really isn't much code in the middle there. Its just a range of text and a message.

rmcgregor1990 avatar Sep 07 '20 22:09 rmcgregor1990

Lark has a simple interface for reporting syntax errors. It's somewhat limited, but it works in many situations. That's what Lark itself uses to report errors in the grammar. (see this example: https://github.com/lark-parser/lark/blob/master/examples/advanced/error_reporting_lalr.py)

For syntax highlighting, it's possible to automatically generate a stream of tokens. Then all you need is a dictionary translating token types (as they appear in the grammar) to the editor's color classes. Regarding vscode, perhaps it's possible to automatically generate textmate grammars for most of the terminals?

Code folding might also be fairly easy (propagate_positions really helps for that). You'll just have to list the rules that you want to fold by.

While none of them are huge tasks, together they seem a little daunting, especially for those who don't know all the little tricks Lark provides. So it would be nice if a single library provided a well-defined API for all of them. It will certainly remove an obstacle of doubt and encourage users to implement editor support.

erezsh avatar Sep 07 '20 22:09 erezsh

Lark has a simple interface for reporting syntax errors. It's somewhat limited, but it works in many situations. That's what Lark itself uses to report errors in the grammar. (see this example: https://github.com/lark-parser/lark/blob/master/examples/advanced/error_reporting_lalr.py)

Looks good, all the language server needs is line, column and error message.

For syntax highlighting, it's possible to automatically generate a stream of tokens. Then all you need is a dictionary translating token types (as they appear in the grammar) to the editor's color classes. Regarding vscode, perhaps it's possible to automatically generate textmate grammars for most of the terminals?

Emitting a simple textmate grammar the terminals should be super easy, of course the user will have to define the scopes of each manually (this is a good explaination of how it works), but it would definately be a useful starting point.

Code folding might also be fairly easy (propagate_positions really helps for that). You'll just have to list the rules that you want to fold by.

Vscode provides folding support based on indent, regex or bracket matching out of the box, just using the language configuration json. But this might be useful for more complex languages.

Ok, when I have some time I'll implement a simple vscode example extention which uses lark to provide some live syntax errors (probably just the first one in the file) and we can use that example to further this discussion.

rmcgregor1990 avatar Sep 07 '20 22:09 rmcgregor1990

I also wouldn't ignore that a LSP might be useful for more than just VSCode. Anything from notepad++ to intelijj and vim can support that.

MegaIng avatar Sep 07 '20 22:09 MegaIng

Yes, if we could support the general case it would be really great.

erezsh avatar Sep 08 '20 07:09 erezsh

I've create the example vscode language server extension. It uses the lark json error reporting example to provide json syntax errors on files with the lark-json language mode. Hopefully the readme should have enough info to get it up and running.

https://github.com/rmcgregor1990/vscode-lsp-lark

rmcgregor1990 avatar Sep 08 '20 16:09 rmcgregor1990

I think it might be helpful to define an example DSL grammar (something with a little bit of complexly to make autocompletion interesting) and attempt to implement all our desired language server features for it using lark. I think experimenting with such a thing will help us get a good idea of how such features might be implemented and what a common lark API might look like.

rmcgregor1990 avatar Sep 08 '20 17:09 rmcgregor1990

I have been thinking for a couple of weeks about write a lsp for lark. I know #331 is about that but it's focus is on Lint.

This could be done as an example for lsp implementation with lark. I mean, all users of lark should know lark syntax to write his grammars. So it makes sense to use lark as an example.

Lark BNFE grammar is less complex than others languages, and still is not so simple, so it makes a great target to expose some tricks.

So we could use the lsp to solve #331 (and we could use #331 to begin).

We could change #76 from lark to lark lsp as configurable option and it makes sence to have such a thing as part of a lsp (I know, backtracking is hard, in fact it makes python regex a contex sensitive grammar, right?).

We could add it to piode online editor.

We could have go to definition and hover. To small grammar it dosen't matter, but on big ones would be great!

We could even add some extra features like #559 so client editor could render selected rules on demand, save as file and open or insert in place (editor based). I just want to make my editor do this, so you can ignore it if you want (or propose more crazy! one like having the parsed tree as image an go to rule in editor on click at node, and yes, i want to do this too).

What did you all think?

omega16 avatar Sep 18 '20 05:09 omega16

I like the idea of using https://github.com/openlawlibrary/pygls, or something of that sort. I haven't played around with it yet but it looks nice. I'm still not sure how to use it to provide definitions, or code folding, and if it can provide syntax highlighting or if that has to be done separately.

It makes sense to start with the Lark grammar itself, as it's indeed pretty simple, and useful for everyone using Lark.

It's still an open question to me, how to do as much of these things simply from looking at the grammar. And how to let the user extend/override the automatic default.

  • linting should be pretty simple with match_examples, I think. And maybe in many cases the default error will be good enough (sometimes something like missing "," is all you need).

  • error recovery - does it makes sense to have a default behavior of "on error resume next"? Or will it cause problems in some cases and should we leave it off.

  • Go to definition should be pretty simple to implement with a visitor, but I don't think it can be done automatically.

erezsh avatar Sep 18 '20 09:09 erezsh

I'm still not sure how to use it to provide definitions, or code folding, and if it can provide syntax highlighting or if that has to be done separately.

Definitions & folding are pretty straight forward. For example support for goto definitions might like something like:

from pygls.features import DEFINITION
from pygls.server import LanguageServer
from pygls.types import TextDocumentPositionParams, Location

server = LanguageServer()

@server.feature(DEFINITION)
def definition(params: TextDocumentPositionParams) -> Location:
    """Returns the location of a symbol definition"""
    # lookup the ident under the cursor
    name = get_symbol_name_at_position(params.textDocument.uri, params.position)
    # lookup the ident within the document
    symbol = lookup_symbol(params.textDocument.uri, name)
    return symbol.location

Syntax highlighting is not currently part of the official Language Server Protocol spec. VSCode just seem to have their own extension to provide Semantic Highlighting. I'm happy to have a go writing a Lark to TextMate grammar convertor at some point, if that would be useful.

I think making a Lark LSP as a starting point is a good idea, having a working example will greatly aid us in figuring out how much can be generated just from the input grammar. Plus the finished product would be useful to everyone.

How ideally would you like to start development of such a thing? on a branch or in a separate repo?

rmcgregor1990 avatar Sep 18 '20 12:09 rmcgregor1990

I went ahead and created a repo in the lark-parser organization: https://github.com/lark-parser/lark-language-server. I sent invites to everyone in this conversation.

Its aim is to provide language server services for all IDEs, for any Lark grammar (as much as possible).

It's fine to add IDE-specific code to it, but it should be marked/located as such.

Since this is a joint effort, I think we should all work in our own branches, and merge to master only through pull-requests.

erezsh avatar Sep 18 '20 14:09 erezsh