haskell-mode icon indicating copy to clipboard operation
haskell-mode copied to clipboard

Degrading scroll performance in terminal mode

Open dminuoso opened this issue 3 years ago • 10 comments

The Problem

Since I switched to emacs I noticed how scrolling sporadically became worse. At times, it would take 3-4s to redraw 40 lines of code.

Apparent relevant factors

After doing a lot of analyzing I narrowed it down to the following combination of things:

  1. Emacs must run in terminal, in GUI either the problem is not as apparent or its non-existant.
  2. gc-cons-threshold must be set to a high value. LSP users likely have something like (setq gc-cons-threshold 100000000)
  3. haskell-mode and font-lock-mode must be enabled
  4. Do some text editing

Watch drawing lines become slower. The presence of constructor names seems especially bad. Note, that the entire line is drawn much slower, so if you vertically tile your emacs to have some unrelated buffer on the right hand side, then the entire line with many data constructors on the left hand window will be drawn slowly.

  1. If you then cause a (garbage-collect), scrolling becomes fast again. test.hs.gz

Reproduce

Locally I used emacs 27.1 installed through the emacs nix overlay:

Click to see/use my nixos module

{ config, pkgs, ... }:

let
  emacsOverlay = import (builtins.fetchTarball {
  # Descriptive name to make the store path easier to identify
    name = "emacs-overlay-2020-09-11";
    # Commit hash for nixos-unstable as of 2018-09-12
    url = https://github.com/nix-community/emacs-overlay/archive/a69588a3f7de6d68f20cea21562ab7f6f91a400a.tar.gz;
    # Hash obtained using `nix-prefetch-url --unpack <url>`
    sha256 = "180jp31zfrlhnz490x6hckz6z3bvfxwz1613cg4zmf0lrflrl3gp";
  });

  myEmacs = pkgs.emacs;
  emacsWithPackages = (pkgs.emacsPackagesGen myEmacs).emacsWithPackages;
  emacs = emacsWithPackages (epkgs: (with epkgs.melpaStablePackages; [
  # melpa stable
    ripgrep
    magit
    # dockerfile-mode
    # markdown-mode
    # yaml-mode
    # json-mode
    # go-mode
    # nix-mode
    use-package
    projectile
    # direnv
    # swiper
    # ox-gfm
    # pdf-tools
  ]) ++ (with epkgs.melpaPackages; [
  # melpa unstable
    solarized-theme
    gruvbox-theme
    company
    counsel
    counsel-projectile
    ivy
    evil
    evil-collection
    evil-magit
    haskell-mode
    lsp-mode
    lsp-ui
    lsp-haskell
    lsp-treemacs
    treemacs
    flycheck
  ]) ++ (with epkgs.elpaPackages; [
  # elpa packages
    undo-tree
  ]) ++ [
  # main packages
  ]);
in

{
  nixpkgs.overlays = [emacsOverlay];

  services.emacs.enable = true;
  services.emacs.install = true;
  services.emacs.defaultEditor = true;
  services.emacs.package = emacs;
}

  1. Download the attached test file and edit with emacsclient -nw
  2. Ensure you do not have undo-tree-mode enabled. This minor mode causes GCs during the test case, leading to inconsistent behavior.
  3. Set (setq garbage-collection-messages t) to have emacs tell you if a GC is run Note: This is just for your own sanity so you do know if for some reason there's an unexpected GC
  4. Set (setq gc-cons-threshold most-positive-fixnum) Note: We use the highest value to ensure the GC doesn't interfere with our testcase due to some unforseen circumstances. It's an unreasonable value for the real world.
  5. Repeatedly cut+paste the block below the definition of info until the end of the file. Ideally keep a small terminal (say 80x24), this lets you see the redisplay of the definitions of info Attr ... nicely during cut+paste. Repeating this 4-5 times is enough to make performance absolutely unbearable.

Watch the performance degrade with every cut+paste.

  1. Run a (garbage-collect) and watch performance be fast again.

Notes of interest:

  1. If the block at the end is wrapped inside a comment, performance stays stable during cut+paste
  2. ~~If the block at the end is not font locked (as far as I can tell), performance degrades during cut+paste~~ I'm beginning to think there's font-locking on whatever context is there - and the code behind it introduces markers, perhaps per-character or line.
  3. The redisplay degradation is particularly linked to the font locking on data constructors. If we add more data constructors in those definitions of info, then the performance degradation is much worse.
  4. The redisplay performance extends to the entire frame, not just the buffer. Two ways to demonstrate this: If you resize the frame to full screen, watch how every line including toolbar and mode line is drawn slowly too. You can also create horizontal/vertical split windows, and watch how these get redrawn slowly too when the frame is switched between full screen and windowed.
  5. The problem is reproducible with undo-tree-mode present, (setq gc-cons-threshold 100000000) and normal editing of text. Often writing a 60 line region can cause quite bad performance degradation already, it's just more inconsistent. This I want to point out before someone makes a comment about how repeatedly cut+paste on 10k lines of code is not normal emacs usage. This is just a synthesized test case with a clear communicatable description how to reproduce.
  1. The emacs profiler does not help at all.
  2. During degraded performance, the CPU usage spikes to 100% while scrolling.

Profiler output

If the performance has degraded, running the profiler while scrolling up and down for a minute gives these CPU and Memory reports

CPU Profiler output

Click me to expand!

 Function                                                  CPU samples    %                                                                                                                                                                                                                                                                                                
- command-execute                                                3417  99%                                                                                                                                                                                                                                                                                                 
 - call-interactively                                            3417  99%                                                                                                                                                                                                                                                                                                 
  - funcall-interactively                                        3284  96%                                                                                                                                                                                                                                                                                                 
   - execute-extended-command                                    2300  67%                                                                                                                                                                                                                                                                                                 
    - sit-for                                                    2287  66%                                                                                                                                                                                                                                                                                                 
     - redisplay                                                 2287  66%                                                                                                                                                                                                                                                                                                 
      + redisplay_internal (C function)                             1   0%                                                                                                                                                                                                                                                                                                 
    + command-execute                                              12   0%                                                                                                                                                                                                                                                                                                 
     scroll-down-command                                          534  15%                                                                                                                                                                                                                                                                                                 
   - scroll-up-command                                            450  13%                                                                                                                                                                                                                                                                                                 
    + scroll-up                                                     1   0%                                                                                                                                                                                                                                                                                                 
  + byte-code                                                     133   3%                                                                                                                                                                                                                                                                                                 
+ evil-repeat-post-hook                                             1   0%                                                                                                                                                                                                                                                                                                 
  evil--jump-hook                                                   1   0%                                                                                                                                                                                                                                                                                                 
+ timer-event-handler                                               1   0%                                                                                                                                                                                                                                                                                                 
+ ...                                                               0   0%      

Memory profiler output

Click me to expand!

 Function                                                        Bytes    %                                                                                                                                                                                                                                                                                                
- command-execute                                           8,755,700  99%                                                                                                                                                                                                                                                                                                 
 - call-interactively                                       8,755,700  99%                                                                                                                                                                                                                                                                                                 
  - byte-code                                               4,469,695  50%                                                                                                                                                                                                                                                                                                 
   - read-extended-command                                  4,469,695  50%                                                                                                                                                                                                                                                                                                 
    - completing-read                                       4,469,695  50%                                                                                                                                                                                                                                                                                                 
     - ivy-completing-read                                  4,469,695  50%                                                                                                                                                                                                                                                                                                 
      - ivy-read                                            4,469,695  50%                                                                                                                                                                                                                                                                                                 
       - read-from-minibuffer                               2,628,499  29%                                                                                                                                                                                                                                                                                                 
        + ivy--queue-exhibit                                2,068,301  23%                                                                                                                                                                                                                                                                                                 
        + command-execute                                      49,096   0%                                                                                                                                                                                                                                                                                                 
        + redisplay_internal (C function)                      12,416   0%                                                                                                                                                                                                                                                                                                 
        + minibuffer-inactive-mode                              7,392   0%                                                                                                                                                                                                                                                                                                 
        + #<compiled 0x472e8f9>                                 1,056   0%                                                                                                                                                                                                                                                                                                 
        + undo-auto--add-boundary                               1,056   0%                                                                                                                                                                                                                                                                                                 
        + undo-auto--undoable-change                               72   0%                                                                                                                                                                                                                                                                                                 
       + ivy--reset-state                                     256,404   2%                                                                                                                                                                                                                                                                                                 
       + ivy-call                                               1,056   0%                                                                                                                                                                                                                                                                                                 
       + ivy--update-prompt                                     1,024   0%                                                                                                                                                                                                                                                                                                 
  - funcall-interactively                                   4,286,005  48%                                                                                                                                                                                                                                                                                                 
   - execute-extended-command                               4,259,293  48%                                                                                                                                                                                                                                                                                                 
    - command-execute                                       3,791,263  43%                                                                                                                                                                                                                                                                                                 
     - call-interactively                                   3,791,263  43%                                                                                                                                                                                                                                                                                                 
      - funcall-interactively                               3,791,247  43%                                                                                                                                                                                                                                                                                                 
         profiler-stop                                      3,790,592  43%                                                                                                                                                                                                                                                                                                 
         profiler-start                                           655   0%                                                                                                                                                                                                                                                                                                 
    + sit-for                                                 273,374   3%                                                                                                                                                                                                                                                                                                 
   + scroll-down-command                                       15,960   0%                                                                                                                                                                                                                                                                                                 
   + scroll-up-command                                         10,752   0%                                                                                                                                                                                                                                                                                                 
+ redisplay_internal (C function)                              23,224   0%                                                                                                                                                                                                                                                                                                 
+ jit-lock--antiblink-post-command                              8,360   0%                                                                                                                                                                                                                                                                                                 
+ timer-event-handler                                           6,200   0%                                                                                                                                                                                                                                                                                                 
  evil--jump-handle-buffer-crossing                             2,112   0%                                                                                                                                                                                                                                                                                                 
+ global-font-lock-mode-check-buffers                           1,056   0%                                                                                                                                                                                                                                                                                                 
+ internal-echo-keystrokes-prefix                               1,056   0%                                                                                                                                                                                                                                                                                                 
  ...                                                               0   0%                 

GC Output

I first did cut+paste to degrade performance, then executed two (garbage-collect) in a row

The first, which fixed performance, gave us the following output

First GC (After degraded performance): Click me to expand!

((conses 16 4979055 645610)
 (symbols 48 93078 29)
 (strings 32 158480 15678)
 (string-bytes 1 8876291)
 (vectors 16 92580)
 (vector-slots 8 1958997 416166)
 (floats 8 336 1527)
 (intervals 56 1006222 9317)
 (buffers 1000 42))                                                                                                                                                

Second GC, immediately again. This is just to observe whether the first GC has collected huge amounts of memory. Note: It did not. The output is very similar to that of the run before.

First GC (After degraded performance): Click me to expand!

((conses 16 4947118 644997)
 (symbols 48 93078 29)
 (strings 32 158477 15588)
 (string-bytes 1 8875403)
 (vectors 16 92579)
 (vector-slots 8 1958985 414138)
 (floats 8 336 1527)
 (intervals 56 1006220 9301)
 (buffers 1000 42))                                                                                                                                            

dminuoso avatar Jun 19 '21 09:06 dminuoso

It seems that 75% of the time is spent in encode_coding_object

Attached you can find the perf_events report and flamegraph.

perf_report.gz out.svg.gz out.perf.gz

 Samples: 4K of event 'cycles'
# Event count (approx.): 112485425315
#
# Children      Self       Samples  Command          Shared Object          Symbol                                        
# ........  ........  ............  ...............  .....................  ..............................................
#
    99.67%     0.00%             0  .emacs-27.1-wra  [unknown]              [k] 0x41fd89415541f689
            |
            ---0x41fd89415541f689
               __libc_start_main
               main
:
               recursive_edit_1
               command_loop
               internal_catch

               internal_condition_case
               command_loop_1
               |          
               |--77.34%--read_key_sequence
               |          |          
               |           --77.28%--read_char
               |                     |          
               |                      --77.22%--redisplay_internal
               |                                |          
               |                                |--75.83%--update_frame
               |                                |          update_frame_1
               |                                |          update_frame_line
               |                                |          tty_write_glyphs
               |                                |          |          
               |                                |           --75.81%--encode_terminal_code

               |                                |                      --75.76%--encode_coding_object
               |                                |          

dminuoso avatar Jun 19 '21 11:06 dminuoso

gprof agrees:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 96.23    160.67   160.67    32275     4.98     4.98  encode_coding_object

5ms per call into encode_coding_object makes for absolute horrid performance I guess.

dminuoso avatar Jun 19 '21 12:06 dminuoso

Underlying issue for the test-case appears to be a large accumulation of unreachable markers in the buffer. encode_coding_object traverses the singly linked of markers at least once, potentially twice.

https://git.savannah.gnu.org/cgit/emacs.git/tree/src/coding.c?h=emacs-27.1&id=86d8d76aa36037184db0b2897c434cdaab1a9ae8#n8281

In the degenerate case this singly linked list can contain millions of elements, causing the slowdown.

What is haskell-font-lock.el doing to make this problem so apparent, even when editing smaller code regions?

dminuoso avatar Jun 19 '21 14:06 dminuoso

Hi. i was asked by the issue author to help with confirming/reproducing the issue.

i'd like to report back that, after following the explicit steps and conditions here i can reproduce the issue on my machine to degrade scroll performance to the point of being sluggish in terminal mode (emacs -nw) (did not test gui)

it seems to get a lot better / came back to normal after i ran (garbage-collect), which also goes with the expected observation in the issue.

my emacs is compiled from source.

i was also asked to provide the output of C-h v system-configuration-options , which is as follows

system-configuration-options is a variable defined in ‘C source code’.

Its value is
"--with-native-compilation --with-json --with-pgtk --with-modules --with-mailutils --without-pop --without-xwidgets"

String containing the configuration options Emacs was built with.

hope it helps!

Edit:- My exact emacs version would be:-

This is GNU Emacs 28.0.50 (build 3, x86_64-pc-linux-gnu, GTK+ Version 3.24.29, cairo version 1.17.4)
 of 2021-06-04
Copyright (C) 2021 Free Software Foundation, Inc.

arjunkathuria avatar Jun 19 '21 19:06 arjunkathuria

~~I suspect this to be the culprit:~~

https://github.com/haskell/haskell-mode/blob/b35e8547b67f3d2c4d38ec7e7e516e3c0925352f/haskell-lexeme.el#L86-L131

~~Is it possible that the frequent call of this during font locking generates a lot of match data with markers, left for GC? If that's the case, why don't I notice the scrolling problem in GUI mode?~~

dminuoso avatar Jun 19 '21 22:06 dminuoso

It seems that in GUI mode there's never a call into encode_coding_object.

So to summarize: It seems that we generate a lot of unreachable markers in haskell-lexeme.el with ~~looking-at~~ haskell-lexeme-looking-at-qidsym. These markers cause a slow-down in encoding_coding_object which traverses the list of markers once or twice. That function is then called on redisplay on terminal mode, causing a delay on drawing each line. However, the function is never called in GUI mode.

A garbage collect would instantly bring the performance back to normal, but especially lsp-users are likely to want a high gc-cons-threshold. Also, generating this much markers causes a noticeable increase in GC delay.

Normally this wouldn't bother me as much, but I regularly trip into this the problem while doing very normal code editing - granted the scrolling speed does not get as bad as in the text-case after 4-5 cut+paste, but it's noticably sluggish nevertheless.

Assuming my hypothesis is correct:

As a user workaround I can see

  1. Reducing gc-cons-threshold
  2. Using something like gcmh
  3. Switching to GUI mode

dminuoso avatar Jun 19 '21 22:06 dminuoso

I can reproduce the slow-down, and the first thing to try would indeed be to cut down on the generation of markers – haskell-lexeme.el generates way more markers than necessary. haskell-lexeme-looking-at-token may be the worst of the lot.

Note that the match data is not stored as markers internally in Emacs, only as positions (integers) and an optional reference to the buffer to which they apply (unless the last match was against a string). This means that (match-data) will reify each position as a marker if the last match was against a buffer, which is quite a waste if done frequently. Using a non-nil INTEGERS argument eliminates that (at the cost of getting the buffer as an extra element at the end, but conses are much cheaper).

It also means that set-match-data can safely be called with a list of integers – there should be no need to create markers just to set the match data.

More fundamentally, I'm not sure if this obsessive use of mutating the match data everywhere to return information really is the best design. I'd just return a list like (TOKEN START END ...) etc, and use pcase to pattern-match on it. It would probably even be faster.

mattiase avatar Aug 16 '21 08:08 mattiase

Thanks @mattiase, that makes sense. I'm probably the most active maintainer currently, but still not very active: I'm unlikely to have time to be able to integrate these suggestions myself, but I'd be happy to help shepherd a PR into the repo.

purcell avatar Aug 16 '21 09:08 purcell

Two reasons why you shouldn't apply that PR I just wrote: it hasn't been tortured properly (leaving that to you), and by doing so we may never find and fix the performance bug in Emacs that is triggered by having large amounts of markers in a buffer.

mattiase avatar Aug 16 '21 15:08 mattiase

For the record, the performance problem has largely been solved in Emacs 28. Since it won't be released for a while, it probably makes sense to do something along the lines of PR #1760 as well.

mattiase avatar Aug 18 '21 11:08 mattiase