geany icon indicating copy to clipboard operation
geany copied to clipboard

Add support for Kotlin tags

Open dolik-rce opened this issue 3 years ago • 15 comments

This is a third incarnation of PR to add support for Kotlin language. Previous discussions can be found in https://github.com/geany/geany/pull/2778 and https://github.com/dolik-rce/geany/pull/1.

Included changes:

  • Small modification of update-ctags.sh, so it can import peg based parsers from ctags.
  • Pull Kotlin parser from ctags and integrate it into tagmanager.
  • Rename filetypes.Kotlin.conf to filetypes.kotlin, to make it work correctly with tagmanager.
  • Added test for kotlin tags.

Disclaimer: The parser was written (and is maintained) by me.

dolik-rce avatar Nov 29 '21 15:11 dolik-rce

Looks good to me in general. I'm not familiar with the peg parsers - I assume these are generated and one has to always run make before so the sources get created, right?

techee avatar Dec 05 '21 21:12 techee

I'm not familiar with the peg parsers - I assume these are generated and one has to always run make before so the sources get created, right?

Yes, they are not committed in the git, but they are generated by packcc as needed. At the very least make peg/kotllin.c should be called, otherwise the update script fails. Should I add the call in the script?

dolik-rce avatar Dec 06 '21 05:12 dolik-rce

Yes, they are not committed in the git, but they are generated by packcc as needed. At the very least make peg/kotllin.c should be called, otherwise the update script fails. Should I add the call in the script?

Maybe a good idea, yes. It could be easily forgotten otherwise.

techee avatar Dec 06 '21 13:12 techee

I just noticed the size of the parser - 25klocs and 1MB of sources. Have you checked how much the size of stripped libgeany.so increases by this patch?

techee avatar Dec 06 '21 20:12 techee

@techee maybe its time for "somebody" to add the capability for uctags to load parsers from dll :-)

Although the latest Ubuntu uctags package has an installed size of just 2Mb, and we don't use it all (yet :-)

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

elextr avatar Dec 07 '21 00:12 elextr

I just noticed the size of the parser - 25klocs and 1MB of sources. Have you checked how much the size of stripped libgeany.so increases by this patch?

Yes, the generated code is humongous. I guess it is price for implementing full Kotlin grammar. I have tried to check the library size with make && strip --strip-unneeded. Current master is about 3.57 MB, this branch yields 3.72 MB, so roughly 150kB. I agree it is quite a lot, but still manageable I hope.

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

Frankly, the speed could be better. This has already have been discussed when I was adding the parser to ctags and it even resulted in some improvements in PackCC (which is used to generate the code), but it is still slower than handwritten parsers. You can have a look at https://github.com/universal-ctags/ctags/pull/2866 and https://github.com/universal-ctags/ctags/issues/2878 if you're interested.

Anyway, the pauses during editing are not noticeable for reasonably sized files. The only situation where the speed might be annoying is when user opens big kotlin project. But I'm not sure how this actually works, does Geany always process all open files on start? Or is there some kind of caching? I think some of the projects-related plugin even allowed to set option to index all files, even those that are not open. This could then take quite some time.

dolik-rce avatar Dec 07 '21 08:12 dolik-rce

does Geany always process all open files on start? Or is there some kind of caching? I think some of the projects-related plugin even allowed to set option to index all files, even those that are not open. This could then take quite some time.

IIUC yes, no, yes, probably

elextr avatar Dec 07 '21 08:12 elextr

Implementing the whole grammar has problems for this use-case, not only speed, but also robustness. It is being used on files that are being edited and may be eronneous or incomplet.

Unless the parser can recover from an error it is going to lose all or many symbols in more cases if it parses things that are not essential to detecting the tags since it will then find more errors. It is a problem with many languages, not just kotlin (see for example #2916), but the fact many tag parsers skip lots of code reduces the impact of partly edited code (outside of declarations).

This is also why full fat IDEs to try to insist that at least the the structural syntax is complete (eg {}s match) so their parsers have some handles to resync and recover after errors.

elextr avatar Dec 07 '21 09:12 elextr

The kotlin parser can recover from errors. If it fails to parse something, it continues until it finds next parseable top-level object (e.g. function or class). The backtracking parser is actually quite good for this.

I did try at first to write simple, handwritten parser that only gets the important tags. However the problem with Kotlin is that the grammar is quite complex and it lead to way too many inaccuracies. That is why I had to implement it all in the end. Also, it was very hard to track scope, when only some parts of the code were parsed.

dolik-rce avatar Dec 07 '21 09:12 dolik-rce

Yes, the generated code is humongous. I guess it is price for implementing full Kotlin grammar. I have tried to check the library size with make && strip --strip-unneeded. Current master is about 3.57 MB, this branch yields 3.72 MB, so roughly 150kB. I agree it is quite a lot, but still manageable I hope.

It's not that bad, I guess I'm still stuck with the "it must fit a floppy" mentality.

Implementation Time Avg. per file Tags added
C 4 s 66 us 567695
RegExp 155 s 2.5 ms 527908
Peg 523s 8.6 ms 782500

This is quite surprising, I would have expected the performance to be somewhere between a hand-written parser and a regex parser but it probably depends on the grammar and the lookahead it has to perform. Have you tried profiling the parser to see if there's not some unexpected bottleneck? With hand-written parsers I used the following:

https://wiki.geany.org/howtos/profiling/gperftools

Unless the parser can recover from an error it is going to lose all or many symbols in more cases if it parses things that are not essential to detecting the tags since it will then find more errors. It is a problem with many languages, not just kotlin (see for example #2916), but the fact many tag parsers skip lots of code reduces the impact of partly edited code (outside of declarations).

To be fair, hand-written ctags parsers aren't that great at error recovery either - when you type an extra { at the beginning of a file, it won't get parsed.

techee avatar Dec 07 '21 15:12 techee

Of more interest is how fast (slow) is it, remembering it runs while editing, will Kotlists get big pauses?

If the table above is valid and from what I remember the C parser speed is about 50MB of sources per second and the kotlin parser is about 100x slower than the C parser, it gives us 500KB/s for kotlin. And the pauses should be just something like 0.2s not to be noticeable so about 100KB kotlin source files would be roughly the maximum for Geany.

techee avatar Dec 07 '21 16:12 techee

This is quite surprising, I would have expected the performance to be somewhere between a hand-written parser and a regex parser but it probably depends on the grammar and the lookahead it has to perform. Have you tried profiling the parser to see if there's not some unexpected bottleneck?

I did a lot of profiling on PackCC and it turns out it is not very efficient with memory allocations (see https://github.com/universal-ctags/ctags/pull/2866#issuecomment-780332841), the backtracking is constantly allocating and deallocating small chunks of memory. So, more rules leads to more backtracking, which leads to more allocations, which slows it down. I have even proposed couple possible optimizations, but I did not have from enough manpower to actually implement it (sadly, I am not very proficient in C).

I might look into the grammar optimization in future, so the parser might become somewhat faster (especially, if it slows Geany down, that would annoy me a lot). Probably could beat the RegExp based ones (for comparable grammars). According to creator of ctags, the peg parser is about 15-30 times slower than handwritten parser (citation from https://github.com/universal-ctags/ctags/pull/2866#issuecomment-778786809). Even with grammar optimizations, it would still probably be about an order of magnitude slower than let's say current C parser.

dolik-rce avatar Dec 07 '21 18:12 dolik-rce

Hmm, I'm afraid I don't have very good news regarding the performance. I tried it on Raspberry Pi 400 (which should be slightly faster than regular Raspberry Pi 4) with the following file which IMO isn't that big

https://github.com/JetBrains/Exposed/blob/master/exposed-core/src/main/kotlin/org/jetbrains/exposed/sql/SQLExpressionBuilder.kt

and the lag was already very annoying (to the point that I'd much rather have parsing disabled than editing this way). IMO Raspberry Pi 4 is a reasonable low-end machine Geany should support without big speed problems (apart from that Geany is the semi-official Raspberry Pi editor as it's installed by default) and the 500 LOC file isn't anything extreme either. So I'm not sure whether this parser should be used in Geany. It's shame that PEG parsers are so slow currently - if they were e.g. 10x slower than hand-written parsers, it would be fine I think but 100x is a little too much.

By the way I also tried this file

https://github.com/JetBrains/Exposed/blob/master/exposed-core/src/main/kotlin/org/jetbrains/exposed/sql/SQLExpressionBuilder.kt

which seems to contain many anonymous functions and it seems to be a common pattern in Kotlin files and seeing the symbol tree filled with <lambda> doesn't seem that useful to me so maybe these shouldn't be reported to Geany. In any case, the performance issue seems to be a bigger problem though.

techee avatar Dec 12 '21 09:12 techee

That is really surprising :frowning_face: I'd expect RPi 4 to handle this much better. I don't have any slower hardware readily available right now, but I have tested it with cpulimit -l 50 -i geany SQLExpressionBuilder.kt, which on my i5 should mean it should get about 1.2GHz and was still pretty smooth. There might of course also be some difference due to memory speed.

Well, I started this as a "Christmas project" last year, so I guess I'll just have a another one for this year: to make it faster.

... seems to contain many anonymous functions and it seems to be a common pattern in Kotlin files and seeing the symbol tree filled with doesn't seem that useful to me so maybe these shouldn't be reported to Geany.

Yes, lambdas can be quite common in Kotlin code and you're right, that they are probably useless in the symbol tree most of the time. I think the only valid reason to keep them is that there might be something else declared in their scope. Perhaps we could filter out all lambdas without any child nodes?

dolik-rce avatar Dec 12 '21 10:12 dolik-rce

There might of course also be some difference due to memory speed.

That's quite probable.

Well, I started this as a "Christmas project" last year, so I guess I'll just have a another one for this year: to make it faster.

I know those! #3032 is Christmas 2018 :-).

I think the only valid reason to keep them is that there might be something else declared in their scope.

That's a good point. Filtering tags should in theory be possible, the question is how many such things we should do in Geany. But maybe anonymous functions are quite a common pattern to deserve this special handling. I'm just preparing a pull request that improves handling of anonymous tags so I might add something like that.

techee avatar Dec 12 '21 10:12 techee