haskell-language-server

[documentation] Experiment using concurrent garbage collector

Open mpickering opened this issue 4 years ago • 13 comments

If a GC happens during a request there can be a reasonably big pause if you have a largish heap.

We should try using the concurrent GC to attempt to reduce pause times.

mpickering avatar Mar 27 '20 21:03 mpickering

I gave -xn -I1 a try and got all kinds of crashes after a few minutes of loading the ghcide codebase in VSCode:

  • Unknown closure type error
  • Seg fault
  • Lock up

I also checked that these problems do not reproduce without -xn. Nor can I reproduce them with -xn alone, i.e. without -I1 (ghcide is built with -rtsopts -I0).

Will open tickets upstream for @bgamari and @osa1 to investigate

EDIT: Updated to account for -I1

pepeiborra avatar Apr 04 '20 10:04 pepeiborra

Backtrace from gdb:

Reading symbols from /home/pepe/scratch/ghcide/dist-newstyle/build/x86_64-linux/ghc-8.10.1/ghcide-0.1.0/x/ghcide/build/ghcide/ghcide...
[New LWP 23766]
[New LWP 23298]
[New LWP 23321]
[New LWP 23301]
[New LWP 23310]
[New LWP 23326]
[New LWP 23320]
[New LWP 23330]
[New LWP 23318]
[New LWP 23380]
[New LWP 23322]
[New LWP 23316]
[New LWP 23383]
[New LWP 23424]
[New LWP 23302]
[New LWP 23314]
[New LWP 23311]
[New LWP 23308]
[New LWP 23312]
[New LWP 23669]
[New LWP 23317]
[New LWP 23611]
[New LWP 23454]
[New LWP 23315]
[New LWP 23299]
[New LWP 23381]
[New LWP 23313]
[New LWP 23300]
[New LWP 23332]
[New LWP 23319]
[New LWP 23329]
[New LWP 23389]
[New LWP 23323]
[New LWP 23349]
[New LWP 23328]
[New LWP 23327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libthread_db.so.1".
Core was generated by `/home/pepe/scratch/ghcide/dist-newstyle/build/x86_64-linux/ghc-8.10.1/ghcide-0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000003e8bdbd in nonmovingSweepMutLists ()
[Current thread is 1 (Thread 0x7f5f02928700 (LWP 23766))]
(gdb) bt
#0  0x0000000003e8bdbd in nonmovingSweepMutLists ()
#1  0x0000000003e6a47c in nonmovingMark_.constprop.0 ()
#2  0x0000000003e6a655 in nonmovingConcurrentMark ()
#3  0x00007f5fbbe78edd in start_thread () from /nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libpthread.so.0
#4  0x00007f5fbbbbaa4f in clone () from /nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libc.so.6

pepeiborra avatar Apr 04 '20 17:04 pepeiborra

Thanks @pepeiborra! I'm looking into it (although do open a GHC ticket as well)

bgamari avatar Apr 04 '20 17:04 bgamari

Raised https://gitlab.haskell.org/ghc/ghc/issues/18016

pepeiborra avatar Apr 04 '20 17:04 pepeiborra

Fix is in this MR: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3186

Ben reports

•bgamari> nonmoving collector reduces the average gen1 pause of ghcide from >350ms to ~10ms
6:36 PM <mpickering> That sounds promising
6:36 PM <mpickering>  How much residency is there?
6:37 PM <•bgamari> maximum goes from 1s to 60ms
6:37 PM <•bgamari> bytes copied goes down by a factor of 8

mpickering avatar May 02 '20 14:05 mpickering

For the record, this measurement was taken during an ad hoc editing session against the lens library. I am currently working on a more systematic measurement.

bgamari avatar May 02 '20 16:05 bgamari

@pepeiborra the GHC MR is merged. Did it fix the problems with the concurrent garbage collector?

jneira avatar Oct 05 '20 07:10 jneira

I haven't checked

pepeiborra avatar Oct 05 '20 09:10 pepeiborra

I have checked again with ghc 8.10.3 and it seems to work pretty well. The crashes are gone. I have seen one crash with --nonmoving-gc -A128M enabled, but it's extremely hard to reproduce and could be something else:

ghcide: internal error: SMALL_MUT_ARR_PTRS_FROZEN_CLEAN object (0x4228156f38) entered!
    (GHC version 8.10.3 for x86_64_apple_darwin)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

I also collected some performance numbers using the benchmark suite. Overall, the timings were similar or slightly worse in most benchmarks. I wasn't able to find a set of GC flags using --nonmoving-gc that showed an improvement over the ones we use right now (-A128M -qg -I0). But I did notice that -qg, which disables the parallel GC, is a net loss - all the benchmarks are faster without it.

For the edit experiment in the lsp-types example the chart below shows the live bytes over time (as reported by -S) for various configurations over 100 samples:

  • upstream: -A128M -qg -I0
  • Adefault: -qg -I0
  • parallelGC: -A128M -I0
  • A64: -A64M -qg -I0
  • nmA64: --nonmoving-gc -A64M
  • nmAdefault: -qg -I0 --nonmoving-gc

[image: chart of live bytes over time for the configurations listed above]

Branch to reproduce: https://github.com/pepeiborra/ide/tree/benchmark-rts-opts-nm

pepeiborra avatar Jan 03 '21 23:01 pepeiborra

@pepeiborra Thanks for looking into this. I will ask about the panic you are seeing.

Isn't the idea behind using the nonmoving GC to reduce pause times? This matters for us because if a pause happens while serving a request, the user will notice it. For example, if you hover and a GC kicks in for 1-2s, the hover response is delayed by that long. With the nonmoving GC the pauses should be shorter, which gives a smoother experience for the user even if overall it's slightly slower.
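
To make the pause concern concrete, here is a minimal sketch (not taken from ghcide or HLS) that samples the RTS statistics from inside a program and tracks the longest GC it observes. It uses `GHC.Stats` from base and only produces numbers when the program is run with `+RTS -T`. The workload and the `nsToMs` helper are hypothetical stand-ins. Sampling only sees the most recent collection at each poll, so it can miss pauses between samples, and how closely this figure tracks the actual pause under `--nonmoving-gc` is an assumption worth checking.

```haskell
module Main (main) where

import Control.Exception (evaluate)
import Control.Monad (forM_)
import Data.Int (Int64)
import Data.IORef (modifyIORef', newIORef, readIORef)
import GHC.Stats (GCDetails (..), RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Convert nanoseconds (the unit used by GHC.Stats) to milliseconds.
nsToMs :: Int64 -> Double
nsToMs ns = fromIntegral ns / 1e6

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats are disabled; rerun with +RTS -T"
    else do
      worst <- newIORef 0
      forM_ [1 .. 50 :: Int] $ \i -> do
        -- Hypothetical workload: allocate enough to provoke some collections.
        _ <- evaluate (length (show (product [1 .. toInteger (i * 200)])))
        stats <- getRTSStats
        -- 'gc' holds the details of the most recent collection only.
        modifyIORef' worst (max (gcdetails_elapsed_ns (gc stats)))
      w <- readIORef worst
      putStrLn ("longest sampled GC: " ++ show (nsToMs w) ++ " ms")
```

Running the same binary once with the default collector and once with `+RTS -T --nonmoving-gc` gives a rough side-by-side of the worst sampled collection.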

mpickering avatar Jan 04 '21 08:01 mpickering

Yes, that's correct. To measure pauses, the benchmark suite needs to be extended to report the maximum time per request, in addition to the total time it reports today.
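
A rough sketch of what reporting both figures could look like, assuming each benchmark step can be timed individually. This is not the actual benchmark suite code: `Summary`, `summarize`, `timed`, and the workload in `main` are hypothetical names introduced for illustration.

```haskell
module Main (main) where

import Control.Exception (evaluate)
import Control.Monad (forM)
import GHC.Clock (getMonotonicTime)

-- | Total and worst-case latency of a run, in seconds.
data Summary = Summary
  { totalTime :: Double
  , maxTime   :: Double
  } deriving (Eq, Show)

-- | Fold per-request latencies into a summary. An empty run yields zeros.
summarize :: [Double] -> Summary
summarize ts = Summary (sum ts) (maximum (0 : ts))

-- | Time a single action with the monotonic clock.
timed :: IO a -> IO Double
timed act = do
  start <- getMonotonicTime
  _ <- act
  end <- getMonotonicTime
  pure (end - start)

main :: IO ()
main = do
  -- Hypothetical stand-in for a sequence of LSP requests.
  samples <- forM [1 .. 100 :: Int] $ \i ->
    timed (evaluate (sum [1 .. i * 1000]))
  print (summarize samples)
```

The total hides a single long pause among many fast requests, while the maximum exposes it, which is the quantity the nonmoving collector is meant to improve.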

pepeiborra avatar Jan 04 '21 08:01 pepeiborra

https://twitter.com/monadiccheng/status/1539583255317446658 by @TerrorJack

haskell-language-server with --nonmoving-gc is a lot smoother, when heap size goes beyond 10GiB

fishtreesugar avatar Jun 24 '22 11:06 fishtreesugar

Since this works, can someone raise a PR to add this to the documentation for experimentation-friendly users?

hasufell avatar Jul 13 '22 17:07 hasufell