about slow syntax highlighting

Open ces42 opened this issue 1 year ago • 20 comments

Description

vimtex's syntax highlighting is a bit slow at times. It's not terrible, but if I open a large tex file and scroll up and down with my touchpad it is noticeably not smooth. I've tried to look at the output of :syntime report to see if there's anything that can be improved. Here's the output:

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.222296   81139  74810   0.000459    0.000003  texMathDelim       [()[\]]\|\\[{}]
  0.146481   21673  49      0.000507    0.000007  texLigature        \v%(``|''|,,)
  0.105568   25166  79      0.000556    0.000004  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.090826   21627  0       0.000207    0.000004  texMathZoneLI      \%(\\\@<!\)\@<=\\(
  0.084873   70591  61790   0.000521    0.000001  texMathSuperSub    [_^]
  0.060983   74427  498     0.000408    0.000001  texMathZoneTI      \\\\\|\\\$
  0.060684   38720  30212   0.000448    0.000002  texMathOper        [-+=/<>|]
  0.039987   51544  45011   0.000128    0.000001  texMathCmd         \\\a\+
  0.022891   26430  6532    0.000241    0.000001  texComment         %.*$
  0.022842   25192  16306   0.000065    0.000001  texMathDelimMod    \\\(left\|right\)\>
  0.021043   23855  4193    0.000384    0.000001  texMathGroup       \\\\\|\\}
  0.020043   32248  17368   0.000079    0.000001  texCmd             \\[a-zA-Z@]\+
  0.020011   5365   209     0.000412    0.000004  texCommentAcronym  \v<(\u|\d){3,}s?>
  0.019337   5194   3       0.000303    0.000004  texCommentURL      \w\+:\/\/[^[:space:]]\+
  0.018543   21627  0       0.000066    0.000001  texCmdConditionalINC \\\w*@ifnextchar\>
  0.016475   21627  0       0.000101    0.000001  texCmdLigature     \v\\%([ijolL]|ae|oe|ss|AA|AE|OE)\ze[^a-zA-Z@]
  0.016086   21627  0       0.000052    0.000001  texSynIgnoreZone   ^\c\s*% VimTeX: SynIgnore\%( on\| enable\)\?\s*$
  0.014915   16338  3614    0.000067    0.000001  texMathArg         \\\\\|\\}
  0.014743   21627  0       0.000047    0.000001  texCmdSpaceCode    \v\\%(math|cat|del|lc|sf|uc)code`
  0.014660   21627  38      0.000082    0.000001  texMathZoneEnv     \\begin{\z(cd\*\?\)}
  0.014640   14477  12771   0.000071    0.000001  texMathTextAfter   \w\+
  0.014227   22104  563     0.000064    0.000001  texCmdCRef         \v\\%(%(label)?c%(page)?|C)ref>
  0.014106   25136  0       0.000119    0.000001  texCmdRef          \\\(page\|eq\)ref\>
  0.014085   23605  6905    0.000066    0.000001  texCmdEnv          \v\\%(begin|end)>
  0.013869   21627  0       0.000103    0.000001  texCmdLigature     \v\\%([ijolL]|ae|oe|ss|AA|AE|OE)$
  0.013798   25136  0       0.000148    0.000001  texComment         ^\s*\\iffalse\>
  0.013527   25136  0       0.000438    0.000001  texCmdRef          \\v\?ref\>
  0.013382   25136  0       0.000124    0.000001  texComment         ^\s*%\s*!.*
  0.012572   21627  0       0.000061    0.000001  texCmdPart         \\\(front\|main\|back\)matter\>
  0.012235   25172  48      0.000079    0.000000  texSpecialChar     \\[,;:!>]
  0.011866   21654  103     0.000117    0.000001  texCmdConditional  \\\(if[a-zA-Z@]\+\|fi\|else\)\>
  0.011844   21627  0       0.000047    0.000001  texConditionalTrueZone ^\s*\\iftrue\>

First of all, I think the very slow \v%(``|''|,,) can be replaced by the equivalent \([`',]\)\1, which was slightly faster for me, averaging 4us instead of 7us.
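
Before comparing timings, it may be worth checking that the two ligature patterns really agree. The following is only a sketch: the sample strings are made up for illustration and are not taken from the profiled file. Save it to a file and :source it, since it uses script-local variables.

```vim
" Sketch: check that both ligature patterns find a match at the same offset.
" The sample strings are illustrative assumptions.
let s:old = "\\v%(``|''|,,)"
let s:new = "\\([`',]\\)\\1"
let v:errors = []
for s:text in ["``quoted''", "a,,b", "it's", "`'", "plain text"]
  " match() returns the byte index of the first match, or -1 if none.
  call assert_equal(match(s:text, s:old), match(s:text, s:new), s:text)
endfor
echo empty(v:errors) ? 'patterns agree' : v:errors
```

This only compares match positions on a handful of strings, so it does not prove equivalence in general and says nothing about speed.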

I was very confused by \%(\\\@<!\)\@<=\~. Am I correct in understanding that

  • it is equivalent to \\\@<!\~
  • the point of making it more complicated is that it will match faster (with a naive regex engine): After finding a ~, it will only try to check if there's a backslash before the ~ once, instead of trying to match every substring ending before ~ against the regex \\?
    If so then the same behavior could be achieved with \\\@1<!\~ which looks simpler. Unfortunately it doesn't seem to give a speedup.

Another point is that this regex will parse something like a\\~b wrongly. This is more relevant for parsing something like \\\(a^2\) -- this is valid latex but vimtex's highlighting currently doesn't recognize the math mode (OTOH I don't know why anyone would ever write that). The regex \%(\\\@<!\%(\\\\\)*\)\@<=\\( would fix this, checking if there's an even number of backslashes before the \(. Same goes for detecting ~. The performance of this seems to be slightly worse than \%(\\\@<!\)\@<=\\( though. I got 11us vs 9us.
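
The a\\~b case can be tried outside of the syntax engine with match(). This is a small sketch with sample strings chosen for illustration; the "even" pattern is the even-number-of-backslashes variant applied to ~.

```vim
" Sketch: where does each pattern find an unescaped ~ ?
" Single-quoted strings keep backslashes literal.
let s:simple = '\\\@<!\~'
let s:even   = '\%(\\\@<!\%(\\\\\)*\)\@<=\~'
for s:text in ['a~b', 'a\~b', 'a\\~b']
  " match() returns the byte index of the first match, or -1 if none.
  echo printf('%-8s simple=%d even=%d', s:text,
        \ match(s:text, s:simple), match(s:text, s:even))
endfor
```

For a\\~b the simple lookbehind reports no match, because the ~ directly follows a backslash even though that backslash belongs to the preceding \\.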

Do you use a latexmkrc file?

No

VimtexInfo

System info:
  OS: Ubuntu 23.10
  Vim version: NVIM v0.10.0-dev-2175+g85a041716
  Has clientserver: true
  Servername: /run/user/1000/nvim.242708.0

VimTeX project: m
  base: m.tex
  root: /home/ca/vim
  tex: /home/ca/vim/m.tex
  main parser: current file verified
  document class: article
  packages: accents aliascnt aliasctr amsbsy amsfonts amsgen amsmath amsopn amssymb amstext amsthm atbegshi atbegshi-ltx atveryend atveryend-ltx autonum auxhook bigintcalc bitset calc cleveref color csquotes enumitem epstopdf-base etex etextools etoolbox geometry gettitlestring graphics graphicx hycolor hypcap hyperref iftex ifthen ifvtex infwarerr inputenc intcalc keyval kvdefinekeys kvoptions kvsetkeys letltxmacro ltxcmds mathrsfs mathtools mhsetup mleftright nameref parseargs pdfescape pdftexcmds pgf pgfcomp-version-0-65 pgfcomp-version-1-18 pgfcore pgffor pgfkeys pgfmath pgfrcs pgfsys refcount rerunfilecheck rotating textpos tgpagella thm-amsthm thm-autoref thm-kv thm-listof thm-patch thm-restate thmtools tikz tikz-cd todonotes trig uniquecounter url xcolor xkeyval
  source files:
    m.tex
    ../texmf/tex/latex/preamble.tex
  compiler: latexmk
    engine: -pdf
    options:
      -verbose
      -file-line-error
      -synctex=1
      -interaction=nonstopmode
    callback: 1
    continuous: 1
    executable: latexmk
  viewer: Zathura
    xwin id: 0
  qf method: LaTeX logfile

ces42 avatar Jan 26 '24 19:01 ces42

To test the slow \(\) more, I just replaced all $-math in my tex file with \(\) and got some pretty bad times:

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN           
  0.339383   8439   8439    0.000610    0.000040  texMathZoneLI      \%(\\\@<!\)\@<=\\)
  0.117477   12558  2847    0.000994    0.000009  texMathZoneLI      \%(\\\@<!\)\@<=\\(

I don't understand why finding the closing \) is so slow but it seems like replacing

  execute 'syntax region texMathZoneLI matchgroup=texMathDelimZoneLI'
          \ 'start="\%(\\\@<!\)\@<=\\("'
          \ 'end="\%(\\\@<!\)\@<=\\)"'
          \ 'contains=@texClusterMath'
          \ l:conceal

by

  execute 'syntax region texMathZoneLI matchgroup=texMathDelimZoneLI'
          \ 'start="\%(\\\@<!\)\@<=\\("'
          \ 'skip="\\\\"'
          \ 'end="\\)"'
          \ 'contains=@texClusterMath'
          \ l:conceal

in vimtex/autoload/vimtex/syntax/core.vim makes it much better (and also fixes some wrong highlighting in e.g. \(x^2\\\))
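
To see the effect of skip= in isolation, something like the following could be tried in a scratch buffer. It is a stripped-down sketch, not the VimTeX rule: DemoMath is a hypothetical group name, and the start pattern omits the lookbehind used above.

```vim
" Sketch: a stand-alone region that uses skip= instead of a lookbehind.
" DemoMath is a hypothetical group name, not part of VimTeX.
syntax clear
" skip consumes every \\ pair, so a \) that follows \\ still ends the region.
syntax region DemoMath start="\\(" skip="\\\\" end="\\)"
highlight link DemoMath Special
```

With a line such as \(x^2\\\) followed by ordinary text, the highlighted region should stop at the closing \) rather than running on.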

ces42 avatar Jan 26 '24 19:01 ces42

vimtex's syntax highlighting is a bit slow at times. It's not terrible, but if I open a large tex file and scroll up and down with my touchpad it is noticeably not smooth. I've tried to look at the output of :syntime report to see if there's anything that can be improved. Here's the output:

Thanks for looking into this and for providing some profiling numbers!

First of all, I think the very slow \v%(``|''|,,) can be replaced by the equivalent \([`',]\)\1, which was slightly faster for me, averaging 4us instead of 7us.

Could you check the original pattern without the group, i.e.

  syntax match texLigature "``\|''\|,,"

I would think it should be faster still, but it would be nice to see how it compares to your current numbers. (I've pushed an update that does this already, because I can't see how it would not be an improvement. But I'm curious if your suggested version may be even faster.)

I was very confused by \%(\\\@<!\)\@<=\~.

Not surprising. It's quite complicated; perhaps needlessly so. I have to admit that it does look equivalent to \\\@<!\~. I'm updating that now.

Am I correct in understanding that …

  • the point of making it more complicated is that it will match faster (with a naive regex engine): After finding a ~, it will only try to check if there's a backslash before the ~ once, instead of trying to match every substring ending before ~ against the regex \\?

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Another point is that this regex will parse something like a\\~b wrongly.

I've pushed a simplification of the pattern now, and it seems to work well on a\\~b.

This is more relevant for parsing something like \\\(a^2\) -- this is valid latex but vimtex's highlighting currently doesn't recognize the math mode (OTOH I don't know why anyone would ever write that). The regex \%(\\\@<!\%(\\\\\)*\)\@<=\\( would fix this, checking if there's an even number of backslashes before the \(. Same goes for detecting ~. The performance of this seems to be slightly worse than \%(\\\@<!\)\@<=\\( though. I got 11us vs 9us.

I've tested this a little bit further, and I believe that the complexity is not really needed here. \\ is already matched early as a texTabularChar. I'm therefore pushing a further simplification on this that I believe should also work as expected and improve things somewhat.

To test the slow \(\) more I just replaced all $-math in my tex file with \(\) and I got some pretty bad time: …

I've simplified this even further. How do the timings look now?

lervag avatar Jan 26 '24 21:01 lervag

A quick test seems to indicate that \%([`',]\)\1 (average 3.0us) might be faster than ``\|''\|,, (average 4.7us). But I'm not sure this sample is representative.

ces42 avatar Jan 26 '24 22:01 ces42

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Some data for this: I created some files with a couple of lines like 999 times i and then a single ~ (in math mode). This should be a worst-case scenario for lookbehinds. These are the results

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN          
  0.846513   1964   933     0.002901    0.000431  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.070227   1944   928     0.000472    0.000036  texSpecialChar     \\\@1<!\~
  0.052229   1529   727     0.000437    0.000034  texSpecialChar     \\\@<!\~                                                      
  1.125284   2296   1072    0.003568    0.000490  texSpecialChar     \%(\\\@<!\%(\\\\\)*\)\@<=\~
  0.000506   2146   1004    0.000017    0.000000  texSpecialChar     \~

So it seems like the change you already pushed is better than the way it was. However the pattern \\\@<!\~ is wrong in situations like a\\~b so maybe it would be preferable to just match \~ and rely on texTabularChar matching double backslashes first. This leads to somewhat weird highlighting of strings like \~, but that's not valid tex in math-mode anyway.

ces42 avatar Jan 26 '24 22:01 ces42

Here's another idea that might improve syntax highlighting performance. Currently there are a lot of syntax definitions that match specific commands. It might be faster to just have a syntax group that matches commands, i.e. \\[a-zA-Z@]\+, and then have this syntax group contain all the specific commands, e.g. texCmdAccent, texCmdLigature. A quick test with those two syntax groups looks quite promising.
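
A minimal sketch of that idea, for a scratch buffer. DemoCmd and DemoCmdRef are hypothetical group names used only for illustration; this is not how VimTeX defines its groups.

```vim
" Sketch: one generic command rule that contains the specific rules.
" DemoCmd and DemoCmdRef are hypothetical group names, not VimTeX's.
syntax clear
" Generic rule: tried at top level, matches any control word.
syntax match DemoCmd "\\[a-zA-Z@]\+" contains=DemoCmdRef
" Specific rule: 'contained' means it is only tried inside DemoCmd.
syntax match DemoCmdRef "\\\%(page\|eq\)ref\>" contained
highlight link DemoCmd Statement
highlight link DemoCmdRef Identifier
```

The expected benefit is that the specific rule is no longer tested at every position in the buffer, only where a command has already been found. Whether that pays off in practice would have to be confirmed with :syntime.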

ces42 avatar Jan 26 '24 22:01 ces42

A quick test seems to indicate that \%([`',]\)\1 (average 3.0us) might be faster than ``\|''\|,, (average 4.7us). But I'm not sure this sample is representative.

Interesting. I can't understand why it would be faster, but I'll switch based on your evidence.

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Some data for this: I created some files with a couple of lines like 999 times i and then a single ~ (in math mode). This should be a worst-case scenario for lookbehinds. These are the results

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN          
  1.125284   2296   1072    0.003568    0.000490  texSpecialChar     \%(\\\@<!\%(\\\\\)*\)\@<=\~
  0.846513   1964   933     0.002901    0.000431  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.070227   1944   928     0.000472    0.000036  texSpecialChar     \\\@1<!\~
  0.052229   1529   727     0.000437    0.000034  texSpecialChar     \\\@<!\~                                                      
  0.000506   2146   1004    0.000017    0.000000  texSpecialChar     \~

Ok, so the current version is very fast now. That's good. But …

So it seems like the change you already pushed is better than the way it was. However the pattern \\\@<!\~ is wrong in situations like a\\~b so maybe it would be preferable to just match \~ and rely on texTabularChar matching double backslashes first. This leads to somewhat weird highlighting of strings like \~, but that's not valid tex in math-mode anyway.

Yes, you are right. I'm sorry for first insisting otherwise. I think using the "trivial" \~ is really fine here, because \\ is properly matched already as texTabularChar and \~ is matched as texCmdAccent. In math mode this latter command does not exist and will typically be an error anyway, so why worry about it?

Here's another idea that might improve syntax highlighting performance. Currently there are a lot of syntax definitions that match specific commands. It might be faster to just have a syntax group that matches commands, i.e. \\[a-zA-Z@]\+, and then have this syntax group contain all the specific commands, e.g. texCmdAccent, texCmdLigature. A quick test with those two syntax groups looks quite promising.

Yes, you may be right. But it does seem like a large amount of work to do this. And in my experience, syntax performance is not really a big issue?

lervag avatar Jan 28 '24 20:01 lervag

I'll close this, but feel free to continue the discussion.

lervag avatar Jan 28 '24 20:01 lervag

From your original list of slow patterns, it seems we should consider the texMathDelim pattern. Do you have any ideas on this one?

lervag avatar Jan 28 '24 20:01 lervag

Also: If you care to share a nice example file with which you are now testing the syntax speed, that would be nice. I'm thinking of adding an example to the test files so that I have a nice way to reproduce timings.

lervag avatar Jan 28 '24 21:01 lervag

I've added a very tiny example here: https://github.com/lervag/vimtex/commit/2477b879251fa8ec61dd017702a099c6048ea0ef#diff-b6fcc94b4e1e1c06afd70f6fa03d63100d069eed363f356fe23eb30bbe2af033

lervag avatar Jan 28 '24 21:01 lervag

I modified your script slightly:

set nolazyredraw
let LINES = line('$')
syntime on
for s:x in range(2*LINES/winheight(0))
  execute "normal! \<C-d>"
  redraw!
endfor

and ran it on the thesis.tex example file included with vimtex. The top syntimes are

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.358374   58463  7456    0.000086    0.000006  texLength          \<\d\+\([.,]\d\+\)\?\s*\(true\)\?\s*\(bp\|cc\|cm\|dd\|em\|ex\|in\|mm\|pc\|pt\|sp\)\>
  0.152289   135299 83807   0.000049    0.000001  texCmd             \\[a-zA-Z@]\+
  0.066881   53552  0       0.000042    0.000001  texComment         ^\s*\\iffalse\>
  0.044951   53552  0       0.000043    0.000001  texComment         ^\s*%\s*!.*
  0.043604   82835  35604   0.000042    0.000001  texOptSep          ,\s*
  0.038223   180644 31556   0.000038    0.000000  texOpt             \]
  0.038166   58993  3851    0.000034    0.000001  texArg             \\\\\|\\}

This is interesting, because for one of my files I get

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.017515   11303  10559   0.000023    0.000002  texMathDelim       [()[\]]
  0.010766   11929  100     0.000008    0.000001  texMathZoneTI      \\\\\|\\\$
  0.010017   10471  9433    0.000030    0.000001  texMathSuperSub    [_^]
  0.006673   9823   9159    0.000013    0.000001  texMathCmd         \\\a\+
  0.006661   5076   4084    0.000018    0.000001  texMathOper        [-+=/<>|]
  0.003620   3829   516     0.000006    0.000001  texMathGroup       \\\\\|\\}
  0.002590   2689   0       0.000019    0.000001  texCmdConditionalINC \\\w*@ifnextchar\>
  0.002480   4763   2925    0.000009    0.000001  texCmd             \\[a-zA-Z@]\+
  0.002314   2457   1411    0.000012    0.000001  texMathDelimMod    \\\(left\|right\)\>

So which particular rules take long might vary from case to case. Anyway, I also plotted the results from thesis.tex (image: plot_thesis), and I think looking at the top syntimes might be barking up the wrong tree. It seems like the large number of (fast) syntax rules is a bigger issue than some individual slow ones.

ces42 avatar Feb 19 '24 18:02 ces42

I modified your script slightly:

set nolazyredraw
let LINES = line('$')
syntime on
for s:x in range(2*LINES/winheight(0))
  norm! �
  redraw!
endfor

So, the idea here is to scroll through a file, right? So it's norm! <c-f> or something?

and ran it on the thesis.tex example file included with vimtex. The top syntimes are …

This is interesting, because for one of my files I get …

So which particular rules take long might vary from case to case. Anyway I also plotted the results … and I think looking at the top syntimes might be barking up the wrong tree. It seems like the large number of (fast) syntax rules is a bigger issue than some individual slow ones.

The thesis.tex file is not really a very good example of a common LaTeX project. First, it does not contain very much math. Next, the content is repeated several times to increase the length of the file so that it becomes much bigger than most projects.

Thus, it is not so strange that there are big differences in which rules take long.

Further, the main thing we want is for a single screen render to be quick. For this, we want low average (and slowest) times for all rules. We don't want slow rules, or at least we want them to be very rare.
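
One way to look for such outliers is to capture the :syntime report output and filter it by the SLOWEST column. This is a rough sketch: SlowRules is a hypothetical helper name, and the parsing assumes the column layout shown in the reports in this thread.

```vim
" Sketch: list rules whose SLOWEST time is at or above a threshold (seconds).
" SlowRules is a hypothetical helper; it assumes the report columns are
" TOTAL COUNT MATCH SLOWEST AVERAGE NAME PATTERN.
function! SlowRules(threshold) abort
  let l:slow = []
  for l:line in split(execute('syntime report'), "\n")
    let l:cols = split(l:line)
    " Data rows start with a decimal number; skip header and summary lines.
    if len(l:cols) < 6 || l:cols[0] !~# '^\d\+\.\d\+$'
      continue
    endif
    if str2float(l:cols[3]) >= a:threshold
      call add(l:slow, l:cols[5] . ' ' . l:cols[3])
    endif
  endfor
  return l:slow
endfunction
```

After a profiling run with syntime on, something like :echo SlowRules(0.0003) would list the rule names together with their slowest single match time.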

lervag avatar Feb 20 '24 22:02 lervag

I found some time to spend on this today. I ended up using the source of this paper https://arxiv.org/abs/1512.07213 to time things with that scrolling script (the non-printable character is ^D). It seems like I was able to get a 20% speedup by trying to reduce the number of syntax rules created. I've put my changes in the faster-syntax branch on my fork.

Most of it is just "merging" regular expressions, although I also tried changing the vimtex#syntax#core#new_env function so that it only creates one (big) syntax rule for texMathEnvBgnEnd, texMathZoneEnv and texMathError (so every time the function is called I delete the old syntax rule and replace it). For this I had to limit what you can do with vimtex#syntax#core#new_env when {'math': v:true}; in particular, you can't pass the __predicate argument. I'm not exactly sure what the use case of that is. The new function just throws errors whenever the combination of arguments could cause trouble. I think this probably doesn't limit functionality too much.

Looking at the code probably makes things more clear than I can explain here.

ces42 avatar Jul 15 '24 19:07 ces42

Interesting. With your branch I do get a very noticeable speedup on my example:

image

Now, I notice you do a lot of different stuff, e.g. changing to the old regex engine. It's a little bit hard to read which of your changes are the most significant. But I'm beginning to think that one of the most significant factors is the number of rules. Thus, as you say, reducing the number of rules by using more complex regexes seems to be a useful trick.

lervag avatar Jul 31 '24 21:07 lervag

Could you explain the timings you've added in your commits? E.g.

image

image

Are the numbers the current runtime? If so, it seems to be increasing with the commits and the latest one is the slowest. Clearly, that's not the correct understanding, but perhaps you see my confusion?

lervag avatar Jul 31 '24 21:07 lervag

Now, it looks like you've done a very good and thorough job here. I believe it may be a good idea to add a comment to the top of the core.vim file that summarizes some of the key reflections here?

Also, I am wondering if you are proposing that I merge this or if you want to open a PR with your work more cleaned up?

lervag avatar Jul 31 '24 21:07 lervag

Could you explain the timings you've added in your commits? E.g.

They are just the runtimes of test.vim (using that arxiv paper I linked as main.tex) on my computer (while fixing cpu frequency). They are not very meaningful by themselves, I was just adding them to keep track of how much mileage I was getting out of every commit.
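
For reference, a wall-clock measurement of this kind can be taken with reltime(). The following is only a sketch of the approach and is not the actual content of test.vim; it reuses the scrolling loop posted earlier in this thread.

```vim
" Sketch: time one scroll-through of the current buffer.
" Not the real test.vim; the loop mirrors the scrolling script above.
let s:start = reltime()
normal! gg
for s:i in range(2 * line('$') / winheight(0))
  " Scroll down half a page and force a full redraw.
  execute "normal! \<C-d>"
  redraw!
endfor
echo 'elapsed: ' . reltimestr(reltime(s:start)) . ' s'
```

As noted above, such numbers are mainly useful for comparing commits on the same machine under the same conditions.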

ces42 avatar Jul 31 '24 21:07 ces42

Ok, thanks for clarifying. How about my other questions?

lervag avatar Aug 01 '24 17:08 lervag