
[question] Plans to resync the lexers with Scintillua?

Open moesasji opened this issue 3 years ago • 12 comments

Looking through the outstanding issues a lot appear related to outdated lexers and the most logical approach seems to be to resync the lexers with upstream if at all possible.

However I see that the format used by vis is now the legacy format for Scintillua with a migration guide here: https://orbitalquark.github.io/scintillua/api.html#migrating-legacy-lexers.

Is it realistic to move to the new format of the lexers as used by Scintillua seeing that those have seen considerable updates ( https://github.com/orbitalquark/scintillua/tree/default/lexers ) or is there a different plan in place?

moesasji avatar Dec 10 '20 02:12 moesasji

Thanks for bringing this up.

I initially chose Scintillua because it is flexible, relatively lightweight, easy to integrate and had a lot of existing syntax definitions. Basically I was looking for a simple highlighting scheme where I can outsource the actual maintenance work. Towards that goal I also submitted changes from our community back upstream (see e.g. here, here and here).

Then upstream decided to embrace an object-oriented API. I am not sure whether I like the new approach. At the time I remarked that the existing _rules table emphasizes the importance of rule ordering for PEGs by grouping them into a single table. Ever since, the two implementations have diverged.

That is mostly because I didn't spend the time to look at the upstream changes properly and the existing code worked "good enough". As an aside, for me personally syntax highlighting isn't of utmost importance; it is useful to indicate runaway strings etc., but I don't need much more than that. However, as indicated by the number of filed issues, the user base probably thinks differently. Also, a lot (the majority?) of external contributions are related to it.
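To illustrate the earlier remark about rule ordering in PEGs: a PEG uses ordered choice, so the first alternative that matches wins and later ones are never tried. The following is only a minimal Python sketch of that idea using regular expressions, not Scintillua or LPeg code; the names `first_match`, `keyword` and `identifier` are made up for the example.

```python
import re

def first_match(rules, text, pos=0):
    """Return (name, lexeme) for the first rule matching at pos, or None.

    Rules are tried strictly in list order, mimicking PEG ordered choice.
    """
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m:
            return name, m.group(0)
    return None

# Hypothetical rules: a keyword rule and a more general identifier rule.
keyword = ("keyword", r"(?:if|else|while)\b")
identifier = ("identifier", r"[A-Za-z_]\w*")

print(first_match([keyword, identifier], "while x"))  # ('keyword', 'while')
print(first_match([identifier, keyword], "while x"))  # ('identifier', 'while')
```

With the general rule listed first, the keyword rule is shadowed completely, which is why keeping all rules visible in one ordered table can be helpful.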

The API changes should mostly be mechanical. However, conceptually our integration is probably a bit more performance sensitive than in the upstream case. We currently do not maintain a token cache, meaning we (re)start lexing from scratch after every redraw. As a result, we added an upper bound on lexing time (see 15d213e4b6e33670cb50d472ad3f532245ebcc3b) and removed some especially slow rules (e.g. #797, #726).
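As a rough illustration of what an upper bound on lexing time means, here is a small Python sketch. It is not the mechanism used in the commit above, just the general idea: check a deadline while producing tokens and give up with a partial result once it has passed. The function name `lex_with_budget`, the whitespace "tokenizer" and the injectable `clock` parameter are assumptions made for the example.

```python
import time

def lex_with_budget(text, budget_seconds, clock=time.monotonic):
    """Tokenize text on whitespace until the time budget is used up.

    Returns (tokens, finished). finished is False if the deadline passed
    before all input was processed, so the caller can fall back to
    partially highlighted output. clock is injectable to ease testing.
    """
    deadline = clock() + budget_seconds
    tokens = []
    for word in text.split():
        if clock() > deadline:
            return tokens, False  # out of time: keep what we have so far
        tokens.append(word)
    return tokens, True
```

Without a token cache every redraw pays the full lexing cost again, so a bound like this trades completeness of highlighting for a predictable redraw time.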

To scope the work, one would have to go through both repositories, compare them, and list:

  1. lexers only present in our tree
  2. lexers only present upstream
  3. lexers with diverging rule changes

Points 1) and 2) should be fairly simple based on the source file names and file type associations. Point 3) is more complex: the rather mechanical API changes that preserve the existing rules aren't particularly important. Of interest are the logical modifications, i.e. everything which would need to be reapplied on top of upstream. Some issues might already be fixed upstream, others should be fixed differently, etc. This would need some coordination. One approach is to go through our commit history up to the last synchronization point:

git log --stat --oneline dc5f5a45a2315011ebeeb0a56a7434ead292dc96...HEAD lua/lexers/
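Points 1) and 2) above boil down to a set comparison of file names. A minimal Python sketch, assuming both lexer directories are checked out locally (the helper names `lua_files` and `classify` are made up for this example):

```python
from pathlib import Path

def lua_files(directory):
    """Return the set of *.lua file names directly inside directory."""
    return {p.name for p in Path(directory).glob("*.lua")}

def classify(ours_names, upstream_names):
    """Split lexer file names into the three groups listed above."""
    ours, upstream = set(ours_names), set(upstream_names)
    return {
        "only_ours": sorted(ours - upstream),      # point 1
        "only_upstream": sorted(upstream - ours),  # point 2
        "in_both": sorted(ours & upstream),        # candidates for point 3
    }

# Example call; the upstream path is a hypothetical local checkout:
# classify(lua_files("lua/lexers"), lua_files("../scintillua/lexers"))
```

The "in_both" group only narrows down point 3); whether the rules of a given lexer actually diverge still needs a manual look at the history.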

To summarize, the goal is still the same: have a simple, flexible highlighting scheme with the least amount of (long term) work for myself. If this can be accomplished by sharing resources with upstream even better.

martanne avatar Dec 10 '20 12:12 martanne

> To summarize, the goal is still the same: have a simple, flexible highlighting scheme with the least amount of (long term) work for myself. If this can be accomplished by sharing resources with upstream even better.

Thanks for the extensive response and thoughts on this. If I understand it correctly, tracking the new upstream format used by Scintillua isn't an option for performance reasons, which unfortunately implies additional maintenance effort compared to just tracking upstream code. Or am I misunderstanding your response?

btw) In particular the third point sounds very painful if it isn't a priority for your own usage, as it requires some insight into the reason/motivation behind changes made in either code base. The lexers already look substantially different, even if the changes made might be only mechanical in nature. (I had a quick look at the ansi_c one to get an impression of the differences)

answer to point 1 - lexers only present in vis tree

clojure.lua dsv.lua elm.lua fantom.lua fstab.lua gemini.lua git-rebase.lua julia.lua meson.lua networkd.lua pony.lua reason.lua routeros.lua spin.lua strace.lua systemd.lua xs.lua zig.lua

answer to point 2 - lexers only present in upstream tree

jq.lua mediawiki.lua txt2tags.lua

moesasji avatar Dec 10 '20 17:12 moesasji

> If I understand it correctly tracking the new upstream format used by Scintillua isn't an option for performance reasons

No, that is not what I intended to express. I meant we should rebase our (performance sensitive) changes on top of upstream.

> In particular the third point sounds very painful

Yes it needs some effort and more importantly discussion with upstream. For each lexer one would have to: import the current Scintillua version, go through our git history for the file in question and apply the same changes, create a patch/pull request for upstream.

I am not sure how much the actual lexer rules have diverged. ansi_c is probably a bad example in this regard, because it is one of the most used and changed ones.

martanne avatar Dec 10 '20 20:12 martanne

Thanks for the clarification, this now makes sense to me. The first step appears to be to make vis understand the new lexer format, as just swapping out a lexer loses the highlighting.

moesasji avatar Dec 10 '20 21:12 moesasji

> First step in this appears to be to make vis understand the new lexer format as just swapping out a lexer loses the highlighting.

Looking through the code I suspect the following calls to lexer._TOKENSTYLE require adapting to be able to do the boring work to gradually switch away from the legacy lexers:

https://github.com/martanne/vis/blob/master/lua/vis.lua#L266 https://github.com/martanne/vis/blob/master/lua/vis-std.lua#L66

Unfortunately Lua is new to me....

moesasji avatar Dec 11 '20 19:12 moesasji

I rebased our changes on top of the most recent upstream lexer.lua and pushed the result to the scintillua branch. It isn't really tested, but at least in theory should understand both lexer formats.

Based on your list above, it seems like our community is at least as active as upstream? Albeit our lexers might be of more dubious quality. I tend to generously merge changes in this area because they are self-contained and somewhat hard to test without example files and familiarity with the format. Also we have some fairly obscure stuff which might not be of general interest. I still haven't looked at the modifications of individual lexers and what kind of improvements upstream developed. But I guess if we wanted, we could also do it the other way around and backport those ...

Maybe some past contributors would like to comment on their preferred format? We also have a mailing list for those of you who prefer that.

martanne avatar Dec 15 '20 17:12 martanne

> I rebased our changes on top of the most recent upstream lexer.lua and pushed the result to the scintillua branch. It isn't really tested, but at least in theory should understand both lexer formats.

A quick try in swapping out a legacy lexer for an upstream one seems to work as expected, so for me your change does the job.

> Based on your list above, it seems like our community is at least as active as upstream?

Looks like it, but that might just be a reflection of the type of user that is attracted by vis, i.e. users more willing or able to contribute a lexer they need. It could very well be that upstream has more people actually using the lexers, as SciTE used to be pretty popular among Windows users and their mailing list appears pretty active (far more than the vis one). Also, on that mailing list there are a lot of questions and activity around lexers. Note that both Geany and Anjuta actually use Scintilla as well (https://texteditors.org/cgi-bin/wiki.pl?ScintillaEditorFamily)

> But I guess if we wanted, we could also do it the other way around and backport those ...
>
> Maybe some past contributors would like to comment on their preferred format? We also have a mailing list for those of you who prefer that.

I think this is really your call, but goes a bit against the idea of trying to minimize work for you.

Whatever way you go: to help with some of the lexer quality issues, it might be worth starting to use the lexer tests that are part of upstream, see: https://github.com/orbitalquark/scintillua/blob/default/tests.lua

btw) a quick look at the book on Lua did manage to put me off. It really isn't a language I want to touch in my spare time. Sorry!

moesasji avatar Dec 19 '20 10:12 moesasji

> Note that both Geany and Anjuta actually use Scintilla as well

I am not really familiar with either environment, but don't they typically use the C++ lexers?

> goes a bit against the idea of trying to minimize work for you.

That would indeed be undesirable. Generally fragmentation of already small communities should be avoided. Anyway, somebody would have to go through the respective changes and merge one into the other. I am still hoping somebody else will step up and do the actual work.

> btw) a quick look at the book on Lua did manage to put me off. It really isn't a language I want to touch in my spare time. Sorry!

That is a pity. I think it is well suited for what we are using it for and an easy way to contribute something.

martanne avatar Dec 19 '20 15:12 martanne

> I am not really familiar with either environment, but don't they typically use the C++ lexers?

Both would be editors that are popular in the gnome/xfce community, but a quick check shows that both indeed use the C++ lexers.

> That would indeed be undesirable. Generally fragmentation of already small communities should be avoided. Anyway, somebody would have to go through the respective changes and merge one into the other. I am still hoping somebody else will step up and do the actual work.

With the changes you've made I'll at least try to make a start as keeping things more in sync would make sense to not fragment more than needed.

> btw) a quick look at the book on Lua did manage to put me off. It really isn't a language I want to touch in my spare time. Sorry!
>
> That is a pity. I think it is well suited for what we are using it for and an easy way to contribute something.

It is indeed well suited for what you need; it is just that some of the choices they've made in the grammar and use of symbols would drive me nuts.

moesasji avatar Dec 19 '20 16:12 moesasji

The current effort can be tracked in the scintillua branch. The remaining TODO items are:

  • [ ] theme review: some lexers use pre-defined styles which our (default) themes should provide, zenburn in particular is missing some of these. That is for example the reason why some markdown elements are not properly highlighted, even though the lexer matches them.
  • [ ] performance patches in html, xml and wsf lexers: those have been rebased, but are not entirely correct. They are also not all identical, the original patch 7e9e0a2ca868aaa214fb38a79fe71da34d6e00da only changed the in_tag definition while the subsequent baa51e934ce057af5b5be829d6a73a3e8b4c03d0 also uncommented its use, but only in the html lexer.
  • [ ] bash here document patches, those are worked on upstream.
  • [ ] dsv, in legacy format, not picked up by upstream, is of limited use, can probably be removed? @eworm-de you contributed that initially, do you use it?
  • [ ] gemini, in legacy format, not picked up by upstream, seems mostly copied from markdown lexer, could probably be improved, the style section is useless for vis. I would suggest to remove it for now, @lanodan can contribute it upstream and it will eventually trickle down to us.
  • [ ] strace and git-rebase, in new format, not picked up by upstream, they are a bit special in that they are not typical file formats, but program output. Check whether upstream is interested in them, otherwise maintain them ourselves.

@moesasji thanks again for the initial work, maybe you could check whether I missed anything?

martanne avatar Jan 27 '21 09:01 martanne

>   • [ ] dsv, in legacy format, not picked up by upstream, is of limited use, can probably be removed? @eworm-de you contributed that initially, do you use it?

We use this for user and group files (passwd, shadow, group and gshadow). Wondering if adding the file extension .csv makes sense.

So I use it whenever editing one of the above files. Guess I could live without...

eworm-de avatar Jan 27 '21 10:01 eworm-de

>   • [ ] gemini, in legacy format, not picked up by upstream, seems mostly copied from markdown lexer, could probably be improved, the style section is useless for vis. I would suggest to remove it for now, @lanodan can contribute it upstream and it will eventually trickle down to us.

Sounds good to me, can confirm on it being heavily based on the markdown lexer.


Also with my packager hat on, I'm wondering if it would make sense to have the option to load lexers from a system-installed scintillua? Maybe just as an extra load path, vendoring usually being frowned upon.

lanodan avatar Aug 25 '22 11:08 lanodan

@ninewise, #1018 has been merged, so this is probably obsolete.

mcepl avatar Nov 29 '22 23:11 mcepl