
🚀 Syntax highlighting of diffs should derive from the source file

Open mnemnion opened this issue 4 years ago • 7 comments

There is a subtle behavior exhibited by delta, which one might never run into, but which I see constantly.

Rather than describe it in terms of my bespoke language and highlighter, imagine a file in GitHub Flavored Markdown which contains a long block of, e.g., JavaScript. Edits to the JavaScript within a diff will be highlighted in terms of the basic Markdown syntax, rather than the JavaScript subsyntax.

The brute solution to this is to parse the diff first, syntax highlight the complete document, then reconstruct the diff from the syntax highlighted document. This is... tractable, I think.
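A minimal sketch of that brute approach, in Python rather than Rust and with a toy stand-in for syntect: `label_lines` only tracks which fenced language each line falls under, and `label_hunk` is a hypothetical helper that labels the whole file first and then slices out the hunk's lines. Neither name comes from delta or syntect.

```python
import re

# Built without a literal triple backtick so this example can sit inside a fence.
TICKS = "`" * 3
FENCE = re.compile(r"^`{3}(\w*)\s*$")


def label_lines(lines, outer="markdown"):
    """Label each line with the syntax in effect, tracking fenced code blocks."""
    labels = []
    inner = None  # language of the fenced block we are inside, if any
    for line in lines:
        match = FENCE.match(line)
        if match:
            # The fence line itself belongs to the outer syntax.
            labels.append(outer)
            inner = None if inner is not None else (match.group(1) or "text")
        else:
            labels.append(inner or outer)
    return labels


def label_hunk(full_file_lines, new_start, new_count):
    """Label the whole file, then keep only the hunk's lines (1-based start)."""
    labels = label_lines(full_file_lines)
    return labels[new_start - 1 : new_start - 1 + new_count]


document = [
    "# Notes",
    "",
    TICKS + "javascript",
    "const a = 1;",
    "const b = 2;",
    TICKS,
    "",
    "Done.",
]

# A hunk touching lines 4-5 carries no fence, so on its own it looks like Markdown.
fragment_labels = label_lines(document[3:5])
# With the whole file as context, the same lines are recognised as JavaScript.
full_file_labels = label_hunk(document, 4, 2)

print(fragment_labels)
print(full_file_labels)
```

Labelled in isolation, the two hunk lines come out as `markdown`; labelled with the full file as context, they come out as `javascript`. That difference is the behaviour this issue is about.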

Given how syntect works, the brute solution might be the only solution. I'd venture that it qualifies as the right thing, and it would qualify as a modest but daily improvement in my workflow.

It's also wasted on many formats, and essential for some others. I'd imagine this should be configured on a per-format basis.

Alas, my Rust is far below the level at which I could submit a proof of concept. I do have some experience with parser combinators if and when it comes time to rummage through a diff.

mnemnion avatar Jan 05 '22 16:01 mnemnion

Hi @mnemnion, thanks for this. The problem you describe is similar to https://github.com/dandavison/delta/issues/117 isn't it? (I'm also curious what the real scenario where you're hitting this is.)

My off-the-cuff reaction is that it would be rather a large change -- looking for files on disk and reading them and syntax highlighting them and thus obtaining the correct highlighting for the hunk fragments -- in return for a somewhat rare win.

Also, IMO, delta should always be very fast, since users are just sitting there staring at a prompt waiting for output, and I worry that such a change could damage performance.

Something to bear in mind also is that delta is a unix filter in the sense that it just reads stdin (and gitconfig) and writes to stdout. So the "original files" might not exist. Indeed someone might be piping entirely fictional diff hunks at delta! So the full-file highlighting would always have to fall back to fragment highlighting.

Am I being pessimistic or missing some tricks? I don't want to say never, but if what I described above is accurate it sounds like a really big code change and I worry about performance.

dandavison avatar Jan 05 '22 17:01 dandavison

Vague speculation: Maybe the way to architect such a thing would be to create an entirely new project named syntect-server or something, whose job is to watch files on disk and syntax highlight them, and expose an API for querying the highlighting at a given byte range in the file. Then the code change in delta could be small -- replacing the direct syntect library call with an IPC call to the syntect server -- and perhaps the IPC could be fast enough over a suitable local socket.
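To make the "query the highlighting at a given byte range" part concrete, here is a rough in-process sketch in Python. Everything in it is hypothetical: `Span`, `HighlightIndex` and `query` are illustrative names, the spans are assumed to be non-overlapping, and the file watching and socket transport are left out entirely.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # byte offset, inclusive
    end: int    # byte offset, exclusive
    scope: str  # e.g. "keyword"; the scope names here are made up


class HighlightIndex:
    """Highlighting for one file, queryable by byte range (hypothetical API)."""

    def __init__(self, spans):
        # Assumes the spans do not overlap one another.
        self.spans = sorted(spans, key=lambda s: s.start)
        self._starts = [s.start for s in self.spans]

    def query(self, start, end):
        """Return the spans overlapping [start, end), clipped to that range."""
        # Only spans that begin before `end` can overlap the range.
        upper = bisect_right(self._starts, end - 1)
        return [
            Span(max(s.start, start), min(s.end, end), s.scope)
            for s in self.spans[:upper]
            if s.end > start
        ]


index = HighlightIndex([
    Span(0, 5, "keyword"),
    Span(6, 7, "variable"),
    Span(10, 11, "number"),
])
print(index.query(3, 8))
```

A client such as delta would send the byte range covered by a hunk and get back only the spans that touch it, so the reply stays small even when the file is large.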

I don't know much about LSP (Language Server Protocol) but my tentative conclusion from skimming things about it is that it doesn't, or doesn't yet, support syntax highlighting? If it did, then perhaps what I'm imagining is delta talking to an LSP server.

dandavison avatar Jan 05 '22 18:01 dandavison

Thank you Dan, you raise a number of good points, particularly about the complexity of implementation.

I'm often surprised by how fast something like pretty-printing a megabyte of Lua table actually is. I mean preparing the string; streaming it to the tty can be a bottleneck, depending on the terminal.

Quite right about the general case, but I can imagine an implementation which is optimistic and still fairly simple. The diff generally has a short relative path for the file; can we resolve this to the file itself? If so, we could highlight from that file using the line info. If not, we fall back to what delta does today: there's no need to let the perfect be the enemy of the good.
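A minimal Python sketch of that optimistic lookup, under some loud assumptions: the input is a unified diff with `+++ b/path` headers, the file on disk matches the new side of the diff, and `hunk_context` is a hypothetical helper, not anything in delta. It also ignores the corner case of an added line whose content happens to look like a `+++ ` header.

```python
import re
from pathlib import Path

NEW_FILE = re.compile(r"^\+\+\+ (?:b/)?(.+)$")
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def hunk_context(diff_lines, repo_root="."):
    """Yield (path, new_start, new_count, file_lines_or_None) for each hunk.

    file_lines is None when the file cannot be resolved, which tells the
    caller to fall back to highlighting the hunk fragment on its own.
    """
    path = None
    file_lines = None
    for line in diff_lines:
        match = NEW_FILE.match(line)
        if match:
            path = match.group(1)
            file_lines = None
            if path != "/dev/null":  # deleted files have no new side to read
                try:
                    text = (Path(repo_root) / path).read_text(encoding="utf-8")
                    file_lines = text.splitlines()
                except (OSError, UnicodeDecodeError):
                    file_lines = None  # missing or unreadable: fall back
            continue
        match = HUNK.match(line)
        if match and path is not None:
            start = int(match.group(1))
            # A hunk header without a count means exactly one line.
            count = int(match.group(2)) if match.group(2) is not None else 1
            yield path, start, count, file_lines
```

The caller highlights `file_lines` and slices out lines `start` to `start + count - 1` when they are available, and otherwise highlights the hunk text as it does now. The cost is one file read per file in the diff, not one per hunk.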

This does involve more calls to the file system, and that can be quite slow indeed. It wouldn't be a step to take casually, and it isn't necessary for many languages, although the various mentions in #117 mark the problem as prevalent.

I'm interested in the intersection of LSP support and syntax highlighting for my own reasons, so if you spot any discussion on here about it, please tag me in.

mnemnion avatar Jan 06 '22 18:01 mnemnion

Perhaps what I said was out of date -- I believe that the LSP spec does now have a section relevant to syntax highlighting: https://microsoft.github.io/language-server-protocol/specification#textDocument_semanticTokens

dandavison avatar Jan 06 '22 19:01 dandavison

One lighter alternative to a full-fledged language server is treesitter.

The main purpose of treesitter is "incremental parsing" -- though the "incremental" bit isn't necessary for delta. I don't know if it's still light enough to handle a large number of files in arbitrary languages. But it's a bit less complex and much, much faster than using LSP, which is going to be very slow to start up, and rather difficult to package and distribute with a small tool like delta. As a bonus, treesitter should correctly highlight nested language code (e.g. Python nested within Markdown).

For example, neovim can be configured to use treesitter for syntax highlighting. The plugin "nvim-telescope" uses treesitter to give syntax highlighting previews of files.

YodaEmbedding avatar Jan 20 '22 02:01 YodaEmbedding