
Absolutely hammers my M1

tolgraven opened this issue 3 years ago · 8 comments

Basically, on a largish project (ESP-IDF-based), my laptop grinds to a halt while clangd is indexing. Even vim gets really sluggish, iStat takes something like 15 seconds just to open, and it's no fun.

I've wrapped clangd in a nice -n 19 shell script, with kitty, tmux, and nvim on the other end of things, but it makes little difference. Load average hits over 100, which, uh, what? I guess it's just too much to handle no matter what the priority (though I don't understand why). In any case, 128 threads certainly seems like... a lot? Unless most are I/O-bound and blocking or something, but they're obviously all just going at it. I don't get how that would be at all efficient, even when it doesn't bring a system down.
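
For reference, the nice wrapper mentioned above is roughly:

  #!/bin/sh
  # run clangd at the lowest scheduling priority
  exec nice -n 19 clangd "$@"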

I'm not sure if this is an M1 issue; I haven't really experienced nice not working before, though I struggle to find anyone else talking about it. There is definitely a slowdown, albeit not as severe, when doing an actual build with gcc, and that has been bugging me a bit, but then even a full build finishes so quickly on this machine. Trying to do heavy refactoring, on the other hand, is a real struggle, and turning off reindexing probably isn't much of an alternative when making major changes.

What to do?

System information

Macbook Pro 16" M1 Pro

Output of clangd --version:

clangd version 13.0.0 (https://github.com/espressif/llvm-project 7b5afb55f5c7959a2903978a25774c75172e8741)
Features: mac+debug+xpc
Platform: arm64-apple-darwin21.2.0

(espressif xtensa fork of llvm)

Editor/LSP plugin:

nvim 0.7.0-dev

Operating system:

macOS 12.1

tolgraven avatar Dec 24 '21 01:12 tolgraven

You can control the number of threads used for background indexing with the -j=N command-line argument to clangd (similar to -j in make and some other build systems).

$ clangd --help | grep '\-j'
  -j=<uint>                       - Number of async workers used by clangd. Background index also uses this many workers.
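
For example, to cap it at four workers (four is an arbitrary choice here; anything at or below your physical core count is reasonable):

  $ clangd -j=4

When clangd is launched by an editor plugin rather than by hand, the same flag goes wherever the plugin's server command is configured.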

HighCommander4 avatar Dec 24 '21 04:12 HighCommander4

Ouch, my bad. That's what I get for only looking for "jobs" and "threads". Thanks a bunch.

Any idea why it defaults to such a humongous amount though?

tolgraven avatar Dec 24 '21 14:12 tolgraven

Good point about defaults.

The threads that will grind away are the background index threads. (There are others, but they shouldn't be constantly loaded.)

The number of those threads is equal to -j, and the default is the detected number of physical cores. (This was tuned for performance on Intel, where hyperthreading made things worse.)

It's possible that:

  • this is too aggressive on laptops in general
  • this isn't appropriate for the M1 (I guess we're counting both the big and little cores)
  • the core-counting function isn't working right on M1 (unfortunately we don't log this)

Can I ask:

  • can you confirm your CPU is a 10-core M1 Pro (not Max)?
  • could you find out what the default -j ends up being on your machine? E.g. run without -j, count threads with ps M <pid> | wc -l, then do the same with a few -j values, looking for a close match? (See the sketch just below.)
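
Something like this should do for the counting, assuming pgrep picks out the right clangd process (note wc will count ps's one-line header too):

  $ ps M $(pgrep -n clangd) | wc -l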

(I'll add some logging for next time)

sam-mccall avatar Dec 24 '21 16:12 sam-mccall

Indeed a Pro 10-core (as is the Max, fwiw; only the graphics differ). Without -j I got a total of 128 threads. With -j=8 I get 76, and with -j=10 I get 78. So the core counting does indeed appear to somehow go horribly wrong.

I don't think using the low-power cores should be a problem, though. But 60 workers on 10 cores, yeah, that would do it!

It hardly needs mentioning, but with a sane -j not only does the system keep running smoothly, clangd itself is also massively quicker.

Edit: I've done some digging and am really confused. If I'm following along correctly, the absence of hyper-threading should eventually lead down the path of just reading hw.physicalcpu, which my sysctl correctly reports as 10. So I don't understand what could go wrong here.
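
For reference (hw.ncpu and hw.logicalcpu might be worth ruling out too, in case clangd ends up reading one of those instead):

  $ sysctl hw.physicalcpu
  hw.physicalcpu: 10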

tolgraven avatar Dec 26 '21 21:12 tolgraven

This sounds really bad. I'm not sure how many of the extra threads are background indexing (there are other spawned threads that mostly sit idle) but clearly -j is being inferred incorrectly.

@hokein @kadircet does one of you have an M1 laptop to repro this on? It may need to be a max/pro but quite likely not.

It hardly needs mentioning, but with a sane -j not only does the system keep running smoothly, clangd itself is also massively quicker.

Right - halving the threads on x64 (avoiding HT) actually improved indexing throughput, let alone interactive responsiveness...

sam-mccall avatar Dec 27 '21 00:12 sam-mccall

An update, but not an enlightening one: @hokein tried this out on a (first-gen) M1, and the thread count, responsiveness, and sysctl output were all normal (8 cores detected).

Can I ask you to run with CLANGD_TRACE=trace.json (you can set the environment variable before launching the editor) and upload the file?

  • This will include the nature of the threads and what each one was doing.
  • It will contain details of the code you have open so this is only suitable for public code.
  • If it's convenient, a trace with a known -j value would be a useful baseline too.
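
For example, from a shell (the trace path is arbitrary; the point is just that the editor inherits the variable):

  $ export CLANGD_TRACE=/tmp/clangd-trace.json
  $ nvim some_file.cpp
  # let the background index churn for a while, quit, then upload /tmp/clangd-trace.json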

sam-mccall avatar Jan 04 '22 14:01 sam-mccall

I run with -j=4 on my 10-core M1 Mac Pro.

  503 85964 85885   0 10:38AM ??         6:21.97 /Library/Developer/CommandLineTools/usr/bin/clangd -j=4

Activity Monitor only shows two processes running clangd:

 Process  % CPU   CPU Time  Threads  PID    User
 clangd   39.4    4:11.34   8        85964  bneradt
 clangd   96.4    10:25.25  14       85957  bneradt

Related to this, indexing seems to take far longer on my M1 Mac Pro than on my Intel Mac. The one process above is using most of a CPU, so the processing is intensive, but indexing takes many minutes rather than the seconds it took before. I hadn't timed it previously because I never needed to, but it feels like this is taking one or two orders of magnitude longer to complete.

FWIW, I'm indexing Apache Traffic Server: https://github.com/apache/trafficserver.git

If you brew install openssl@1.1 (along with bear, automake, pcre, pcre2... you might need some others; the configure output should be helpful), you can compile with:

autoreconf -fi
./configure --prefix /var/tmp/ats_build --with-openssl=/opt/homebrew/opt/openssl@1.1 --enable-experimental-plugins --enable-example-plugins
bear -- make -j10
make install

I then run clangd on the resultant compile_commands.json. I do so via nvim's coc plugin.
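
For anyone reproducing without an editor in the loop, recent clangd releases also have a --check mode that parses a single file against the discovered compile_commands.json (substitute any source file from the build):

  $ cd trafficserver
  $ clangd --check=src/traffic_server/traffic_server.cc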


Update

I suppose my observations here are actually a duplicate of the already-filed #1119, which I just found. That would explain why my threads are limited to two cores (the efficiency ones) and why indexing is as slow as it is.

Thanks!

bneradt avatar May 16 '22 16:05 bneradt

Can this be closed, then?

Trass3r avatar Sep 10 '22 17:09 Trass3r