elisp-tree-sitter
Highlighting in REPL (performance)
Hi,
I've been using emacs-tree-sitter for Python highlighting for a couple of months now, and am very happy with it - thanks!
I develop a lot with org-babel and emacs-jupyter, keeping an org-file and REPL open side-by-side, sending code back and forth.
I use the following to get tree-sitter highlighting in the REPL:
(add-to-list 'tree-sitter-major-mode-language-alist '(jupyter-repl-mode . python))
(add-to-list 'tree-sitter-major-mode-language-alist '(inferior-python-mode . python))
(add-hook 'tree-sitter-after-on-hook #'tree-sitter-hl-mode)
I notice that as the REPL buffer grows in length, everything slows down proportionally, the reason for which is tree-sitter. Obviously only a fraction of the output in the REPL is actually code that can be parsed and should be semantically highlighted (although I don't mind the highlighting that is produced even for output).
Occasionally clearing the buffer helps, but would there be an option to make this more performant?
keeping an org-file and REPL open side-by-side
It is known that org-mode highlighting works by opening a hidden buffer, pasting the code there, and getting back the text with the highlights (this comes from the Q&A at EmacsConf about tree-sitter), so it is bound to get slower the more you add to it.
I think the workaround for this is to turn off emacs-tree-sitter in your org-mode file. The solution to the performance problem is having org-mode integrate emacs-tree-sitter for its syntax highlighting.
Actually the issue is less with org-mode, and more with the REPL. The org files are fast enough, there is not that much code in them. You could reproduce this without using org-mode at all. The REPL session has lots of output, and executes a lot more code than is finally written to the org file. Since I hooked tree-sitter to the REPL-mode, I think tree-sitter tries to parse the entire buffer as Python, which must slow it down. It would help already if I could somehow distinguish code from output and parse only that, perhaps occasionally clearing the buffer.
This is an interesting use of tree-sitter-hl!
In theory, it should only re-parse the parts that have changed. Maybe jupyter-repl-mode or inferior-python-mode are doing something special with text changes.
Does it happen when you navigate around the buffer, when you type something, or when you send the code to the buffer?
What are the minor modes enabled in the REPL buffer?
Is it easily reproducible with some Python fragments? For example, would it be enough to repeatedly paste a large piece of Python code?
If not, there are several things you can try to troubleshoot it yourself:
- M-x profiler-start, use it for a while, M-x profiler-stop.
- Trace the parsing and highlighting functions:
;; To see whether it parses from scratch too often (last arg is nil).
(trace-function #'tsc-parse-chunks)
;; To see whether it parses incrementally, but with regions that are
;; too large (the difference between 3rd and 4th args).
(trace-function #'tsc-edit-tree)
;; To see whether it highlights regions that are too large.
(trace-function #'tree-sitter-hl--highlight-region)
- Add logging advices to parsing and highlighting functions, if the above is too noisy:
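;; Log only full (from-scratch) parses, i.e. calls without an old tree.
(define-advice tsc-parse-chunks (:before (_parser _input_fn old-tree) log-full-parse)
  (unless old-tree
    (message "[%s] tsc-parse-chunks full" (buffer-name))))
;; Log the old and new sizes (in bytes) of each edited region.
;; The second argument is named beg-byte (no underscore) because it is used below.
(define-advice tsc-edit-tree (:before (_tree beg-byte old-end-byte new-end-byte &rest _) log-edit-size)
  (message "[%s] tsc-edit-tree %s %s" (buffer-name)
           (- old-end-byte beg-byte)
           (- new-end-byte beg-byte)))
;; Log the size of each region that gets highlighted.
(define-advice tree-sitter-hl--highlight-region (:before (beg end &rest _) log-hl-size)
  (message "[%s] tree-sitter-hl--highlight-region %s" (buffer-name) (- end beg)))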
Thanks for the detailed response! I do like the fact that in theory, I can get great syntax highlighting inside the Jupyter REPL with just a hook - and actually bypass a font-lock bug with emacs-jupyter in the process (https://github.com/nnicandro/emacs-jupyter/issues/219) [this issue has a fix that disables traditional font-lock in the REPL buffer, but ts brings it back better].
I have the following modes enabled:
Auto-Compression Auto-Encryption Beacon Column-Number Company Counsel Eldoc
Electric-Indent File-Name-Shadow Font-Lock Global-Eldoc Global-Flycheck
Global-Font-Lock Global-Git-Commit Global-Hl-Line Global-Tree-Sitter
Global-Visual-Line Ivy Ivy-Rich Jupyter-Repl-Interaction
Jupyter-Repl-Persistent Line-Number Magit-Auto-Revert Mouse-Wheel
Override-Global Pdf-Occur-Global Projectile Savehist Shell-Dirtrack Show-Paren
Smartparens Smartparens-Global Transient-Mark Tree-Sitter Tree-Sitter-Hl
Visual-Line Which-Key Winner
That's a lot, but I can isolate the slowdown to tree-sitter.
I just did some more checks, and find that so far I can actually only reproduce this with Jupyter-REPL when dealing with large pandas DataFrames.
To reproduce, I simply load a big DataFrame and return it, i.e.:
First, separately:
import pandas as pd
df = pd.read_csv("many_rows.csv")
Now:
df
Every time I execute df again, displaying the output takes longer.
After doing this a couple of times, even simple things like 1+1 will display with a delay. Turning tree-sitter-mode off makes it fast again.
I think the lag mainly accumulates when my functions/executions return DataFrame output. (Note that tree-sitter does not actually highlight the output in the Jupyter REPL, just the code.)
The behaviour is different in the normal inferior Python shell: executing the code is instantaneous, tree-sitter highlights the output as if it was Python code, and repeating this many times slows down that buffer in general (vs. just the apparent execution inside Jupyter-REPL). Again though, turning tree-sitter off makes the buffer speedy again.
I don't have time to trace this myself right now, but hope to get to that still this week.
Hi, sorry it took me a long time to get to this! I've done some profiling now with the code from above (making a large pandas dataframe and repeatedly printing it).
I believe the main problem is that tree-sitter--do-parse is called many times. Here is a chunk of the profiler report:
- jupyter-repl-insert-prompt          232  15%
  - tree-sitter--after-change         156  10%
    - tree-sitter--do-parse           156  10%
Note that there are many (intermediate?) actions done by the Jupyter REPL that cause tree-sitter to parse:
Almost every jupyter-repl call there leads to a parse (I didn't unfold everything, but you get the gist).
The trace of parse-chunks (trace-function #'tsc-parse-chunks) resulted in 16 individual outputs (I'm not familiar with tracing so I may not properly describe this): 8 when I press RET to execute the line that prints the dataframe, and another 8 once the dataframe has been printed:
1 -> (tsc-parse-chunks #<user-ptr ptr=0x5606bb9feee0 finalizer=0x7fa35c71a600> tsc--buffer-input #<user-ptr ptr=0x5606c53039e0 finalizer=0x7fa35c71a910>)
1 <- tsc-parse-chunks: #<user-ptr ptr=0x5606b7726ef0 finalizer=0x7fa35c71a910>
Tracing the edit tree does the same (16 outputs), with output like this:
1 -> (tsc-edit-tree #<user-ptr ptr=0x5606b8b927d0 finalizer=0x7fa35c71a910> 42295 42295 42296 #1=(794 . 58) #1# (795 . 0))
1 <- tsc-edit-tree: nil
The highlight-region trace has output such as this, again many entries:
1 -> (tree-sitter-hl--highlight-region 54381 54439 nil)
1 <- tree-sitter-hl--highlight-region: (jit-lock-bounds 54282 . 54609)
Other than the profile report, which leads me to believe that too many REPL actions trigger re-parsing, I don't know how to interpret the output.
I believe the main problem is that tree-sitter--do-parse is called many times.
Yes, it can be a problem. It also depends on whether these parses are from-scratch or incremental (and if incremental, whether the region to re-parse is too large). The third approach of advising the parsing and highlighting functions would help getting more data on this.
Here a chunk of the profiler report:
- jupyter-repl-insert-prompt          232  15%
  - tree-sitter--after-change         156  10%
    - tree-sitter--do-parse           156  10%
Would it be possible for you to copy the whole profiler report, as text?
making a large pandas dataframe and repeatedly printing it
How large should this be? If you have a CSV file, I can try reproducing this myself.
I will do this soon and also prepare a test case with which you can reproduce the issue!
Here's the full profiler report: https://gist.github.com/timlod/6111d2f96357f55b1d2cc1976ac9740f
The advice output is:
[*jupyter-repl[python 3.8.6]-env*] tree-sitter-hl--highlight-region 3
Error during redisplay: (jit-lock-function 53070) signaled (wrong-type-argument number-or-marker-p nil)
[*jupyter-repl[python 3.8.6]-env*] tree-sitter-hl--highlight-region 500 [7 times]
[*jupyter-repl[python 3.8.6]-env*] tree-sitter-hl--highlight-region 203
Now, to reproduce, you can use the following:
- Open http://download.geonames.org/export/dump/countryInfo.txt
- Copy all text starting at ISO ISO3 ISO-Numeric fips ... (do not include the leading # on that line) and write it to a file, e.g. test.tsv
- Open a REPL and import pandas as pd
- Load the file as df = pd.read_table("test.tsv")
- Just display df many times (just type df + RET, maybe 20 or so times)
You should notice the REPL slowing down - at first you may see no delay, later the cell indicator on the left will show a brief [*] before displaying. The slowdown is starker with bigger and varied dataframes, but you should be able to notice it with this test case.
Let me know if this helps!
Thanks, I'll check it out.
Maybe jupyter-repl-mode or inferior-python-mode are doing something special with text changes.
In jupyter-repl-mode the only text changes that are done are to add a special text property that distinguishes REPL input from REPL output and to re-insert newlines, marking them as continuing REPL input to another line.
In the profiler report that @timlod showed it seems that lots of parsing calls happen because of text property changes and insertions that are done to maintain the structure of the REPL buffer. These happen around the jupyter-repl-insert-prompt calls that are shown in the traceback.
I think the main issue is that the REPL buffer contains regions of output, not meant to be fontified in any programming language, but tree-sitter nonetheless wants to fontify those regions when it is called in after-change-functions.
The way that jupyter-repl-mode uses the fontification functions is that it narrows down to the REPL input cell, the region that is guaranteed to be in the programming language of the REPL buffer, and allows the fontification function of the language to do what it needs.
Would there be a way to have tree-sitter work on narrowed regions of a buffer in a similar way instead of the whole buffer?
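For illustration, the narrowing approach described above can be sketched with core Emacs primitives only. This is a rough sketch, not the actual jupyter-repl-mode code: my/fontify-narrowed and its fontify-fn argument are hypothetical names standing in for whatever fontification function the language provides.

```elisp
;; Hypothetical helper (not part of jupyter-repl-mode or tree-sitter).
(defun my/fontify-narrowed (beg end fontify-fn)
  "Call FONTIFY-FN with the buffer narrowed to the region BEG..END.
FONTIFY-FN is an assumed placeholder for a language fontification
function that takes a start and an end position."
  (save-restriction
    ;; While narrowed, FONTIFY-FN cannot see REPL output outside the cell.
    (narrow-to-region beg end)
    (funcall fontify-fn (point-min) (point-max))))
```

Because of save-restriction, the previous narrowing (or lack of it) is restored once the function returns, so the rest of the REPL buffer is unaffected.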
I think the main issue is that the REPL buffer contains regions of output, not meant to be fontified in any programming language, but tree-sitter nonetheless wants to fontify those regions when it is called in after-change-functions.
Yeah, I also think tree-sitter-mode shouldn't be enabled directly in these buffers. It's better for the integration to be done through a new minor mode, e.g. tree-sitter-jupyter-repl-mode (or optionally in jupyter-repl-mode itself).
Would there be a way to have tree-sitter work on narrowed regions of a buffer in a similar way instead of the whole buffer?
That's useful functionality that I wanted to add, but haven't started on yet. There's tsc-set-included-ranges, but for this use case restriction-by-narrowing is probably more suitable.
The way that jupyter-repl-mode uses the fontification functions is that it narrows down to the REPL input cell, the region that is guaranteed to be in the programming language of the REPL buffer and allows the fontification function of the language to do what it needs.
I think I can give a try at implementing the integration. I have some questions to get started:
- How do we provide jupyter-repl-mode the fontification function?
- How does the fontification function know whether the input cell it's working on is the currently editable cell?
- Are there cell-editing hooks, as opposed to before-change-functions and after-change-functions, which run for things like inserting the output cells as well?
How do we provide jupyter-repl-mode the fontification function?
The fontification function is copied over from the value of font-lock-defaults for the major-mode of the REPL language in the function jupyter-repl-initialize-fontification. The only thing jupyter-repl-mode does with the major-mode fontification function is call it on input cell regions, see jupyter-repl-font-lock-fontify-region. So what happens in the case of @timlod is that the major-mode of the REPL language has tree-sitter enabled and the fontification function gets propagated over to the REPL buffer's fontification function. And the tree-sitter fontification function doesn't work because we don't have the necessary buffer local variables set up.
How does the fontification function know whether the input cell it's working on is the currently editable cell?
This is done by marking the currently editable text as being part of an input cell through the field text property. If text has been marked with a field text property of cell-code then it is part of an input cell. The output cell regions don't have such a text property. The marking is done in an after-change-functions function. See jupyter-repl-do-after-change and the associated jupyter-repl-after-change. Once the cell code has been marked, the function jupyter-repl-map-cells is used to iterate over regions of input/output cells by the jupyter-repl-mode fontification function. This is how input/output regions are detected by the fontification process.
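As a rough illustration of the marking scheme just described, the input cell regions can be recovered by scanning for the field text property. The sketch below is not jupyter-repl-map-cells; my/cell-code-regions is a hypothetical helper that assumes only what is stated above, namely that input cell text carries a field property of cell-code and output text does not.

```elisp
;; Hypothetical helper, written against core Emacs text-property functions.
(defun my/cell-code-regions (&optional beg end)
  "Return a list of (START . END) regions whose `field' property is `cell-code'.
Scan from BEG to END, defaulting to the accessible part of the buffer."
  (let ((pos (or beg (point-min)))
        (end (or end (point-max)))
        regions)
    ;; Find the next position marked as input cell code, if any.
    (while (setq pos (text-property-any pos end 'field 'cell-code))
      ;; The region stops where the property value changes, or at END.
      (let ((stop (or (text-property-not-all pos end 'field 'cell-code) end)))
        (push (cons pos stop) regions)
        (setq pos stop)))
    (nreverse regions)))
```

Each returned region could then be handed to a fontification function, leaving output regions untouched.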
Are there cell-editing hooks, as opposed to before-change-functions and after-change-functions, which run for things like inserting the output cells as well?
There aren't any hooks for insertions of input/output cells. But I would be happy to look into how they could be added if I had a reason on why they would be needed.
Please let me know if there is any other information that I could provide. Thanks for looking into this.
There aren't any hooks for insertions of input/output cells. But I would be happy to look into how they could be added if I had a reason on why they would be needed.
I mean hooks for edits within an input cell, not new cell insertions. tree-sitter's incremental parsing mode needs to precisely track all text modifications. Cell-editing hooks would help with filtering out uninteresting buffer modifications.
I'm thinking of an integration where the parse tree is associated with the "active input cell", while previous read-only cells are not touched, as they were already highlighted.
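Lacking dedicated cell-editing hooks, one way to approximate the filtering is a buffer-local after-change function that ignores changes outside input cells. The following is only a sketch under assumptions: the my/ names are hypothetical, the counter stands in for whatever the integration would do with an interesting edit, and it assumes the function runs after the REPL has already marked the changed text with the field property.

```elisp
;; Hypothetical names throughout; only core Emacs functions are used.
(defvar-local my/cell-edit-count 0
  "Number of changes seen inside input cells (for illustration only).")

(defun my/track-cell-edits (beg _end _old-len)
  "Count a change only if it starts in text marked as input cell code.
Changes to output regions, which lack the `cell-code' field, are ignored."
  (when (eq (get-text-property beg 'field) 'cell-code)
    (setq my/cell-edit-count (1+ my/cell-edit-count))))

(defun my/enable-cell-edit-tracking ()
  "Install `my/track-cell-edits' buffer-locally in the current buffer."
  (add-hook 'after-change-functions #'my/track-cell-edits nil t))
```

A real integration would replace the counter with its own change tracking for the active input cell, and would need to handle deletions and text that has not been marked yet.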