online icon indicating copy to clipboard operation
online copied to clipboard

coolwsd freezes regularly

Open doobry-systemli opened this issue 1 year ago • 9 comments

Describe the Bug

We're running latest Collabora Online CODE edition (24.04.4-2) installed on a Debian Bookworm server and integrated into Nextcloud.

Since several releases (probably more or less since the upgrade 24.04.) coolwsd freezes on a regular basis. The coolwsd webservice no longer responds and requests run into timeouts.

Not sure if it's related, but this log entry seems to show up before the freeze:

coolwsd[479202]: frk-442418-442418 2024-06-24 11:56:39.708686 +0200 [ forkit ] INF  Forkit initialization complete: setting log-level to [warning] as configured.| kit/ForKit.cpp:789
coolwsd[479202]: kit-479202-479202 2024-06-24 18:05:59.423378 +0200 [ kitbroker_0de ] ERR  ToMaster-2479: error: cmd=getclipboard kind=nodocloaded| common/Session.hpp:163
coolwsd[479202]: kit-479202-479202 2024-06-24 18:05:59.423424 +0200 [ kitbroker_0de ] WRN  ToMaster-2479: Skipping unload on incomplete view [ToMaster-2479], viewId: -1| kit/ChildSession.cpp:159
coolwsd[442405]: wsd-442405-479513 2024-06-24 18:05:59.424148 +0200 [ docbroker_0de ] ERR  ToClient-2479: getclipboard error failure: nodocloaded| wsd/ClientSession.cpp:1806

Apart from that, this log appears from time to time, but not shortly before or after the freeze:

coolwsd[442405]: wsd-442405-476075 2024-06-24 18:24:31.027392 +0200 [ docbroker_0cd ] ERR  #67: Read failed, have 0 buffered bytes (ECONNRESET: Connection reset by peer)| net/Socket.hpp:1146

Desktop

(Please complete the following information)

  • Collabora version: 24.04.4-2
  • OS and version: Debian Bookworm

doobry-systemli avatar Jun 24 '24 19:06 doobry-systemli

https://github.com/CollaboraOnline/online/issues/9318 could be the same problem :thinking:

doobry-systemli avatar Jun 24 '24 19:06 doobry-systemli

Hi @doobry-systemli , do you think you could try to profile with the watchdog enabled to see what's the backtrace of this hang?

Here is how to use the watchdog:

  1. You need to set the COOL_WATCHDOG environment variable to something non-empty, like '1'.
  2. Use the scripts/watchdog-cool from this repo to run 'perf' with suitable parameters, this will give you a flamegraph that shows what are the places where we hang for too long.

Hopefully that would give us a picture of the situation. Could you please try to run the watch and paste the generated SVG here, so we can improve the situation? Thanks.

vmiklos avatar Jun 25 '24 09:06 vmiklos

Hey @vmiklos, sure I can do this. Am I supposed to run the watchdog-cool script once coolwsd froze? Or before? And the COOL_WATCHDOG variable probably needs to be set in the environment that starts coolwsd, i.e. in the systemd unit, right?

doobry-systemli avatar Jun 25 '24 10:06 doobry-systemli

@vmiklos also, which time should I give as argument to watchdog-cool?

Or would this need to be called before the freeze happens and run until the freeze happens? Then I'd probably have to pass something like 5h to it as for now I don't have a clear reproducer.

doobry-systemli avatar Jun 25 '24 11:06 doobry-systemli

watchdog-cool is meant to be long-running, like you run it for 1h and then you can see when the watchdog triggered in that timefame. So you start it proactively, not when you already have a hang.

The COOL_WATCHDOG env var has to be set on the coolwsd process yes, e.g. in the systemd unit, correct.

The example timeout is 1h, so perhaps try with that. That means you start it proactively and you hope you get a hang in the next hour. If it doesn't happen in the next hour, you can run the script (just a wrapper around 'perf', really) again, with an increased timeframe. If something is not clear, just ask. :-) Thanks.

vmiklos avatar Jun 25 '24 11:06 vmiklos

Thanks @vmiklos. I started it now. One additional question: is it ok to abort the perf process once the freeze happens? Or do I have to wait until the process finishes with the specified timeframe?

doobry-systemli avatar Jun 25 '24 12:06 doobry-systemli

You can press Ctrl-C any time to stop collecting if you think the interesting hang happened already.

The important part is to look at the generated (interactive) SVG and if it has meaningful function names, then attach it to this issue. Thanks.

vmiklos avatar Jun 26 '24 07:06 vmiklos

Here's a flamegraph of watchdog-cool running for 5 hours and a freeze happening after +/- 2 hours.

watchdog

doobry-systemli avatar Jun 26 '24 08:06 doobry-systemli

Hrm, my hope was that the watchdog would show what kind of request causes the hang, but this is not a lot of data.. unless @caolanm thinks otherwise.

vmiklos avatar Jun 28 '24 14:06 vmiklos

Same here. Really problematic in production mode.

Desktop

Collabora version: 24.04.4.2-1
OS and version: Debian Stable (Bookworm)

How can we help in identifying or bypassing the issue?

bateast avatar Jul 01 '24 11:07 bateast

I wonder if you have the coolwsd-dbgsym package installed? If not, then installing that may give more fine grained results that might be more helpful

caolanm avatar Jul 01 '24 13:07 caolanm

I wonder if you have the coolwsd-dbgsym package installed? If not, then installing that may give more fine grained results that might be more helpful

Thanks. The package indeed was not installed yet. We installed it now and restarted the coolwsd service. (Un)fortunately the hang/freeze didn't happen within the last days. I wonder why, but after the initial catch with the flamegraph from above it didn't happen again since we started to try to catch it :zany_face:

One change we had to do was to reboot the kernel without lockdown mode, because otherwise /sys/kernel/tracing/ was empty despite tracefs being mounted. Usually we boot our VMs with lockdown=confidentiality as kernel boot parameter. Maybe that's related? :thinking:

doobry-systemli avatar Jul 01 '24 13:07 doobry-systemli

One change we had to do was to reboot the kernel without lockdown mode, because otherwise /sys/kernel/tracing/ was empty despite tracefs being mounted. Usually we boot our VMs with lockdown=confidentiality as kernel boot parameter. Maybe that's related? 🤔

Hello, still present here (and again, really problematic in a production mode for small organization). Only quick and efficient solution found up to now: monitoring and fast server reboot on first hang detected.

by the way, our boot parameters (about lockdown): cat /proc/cmdline BOOT_IMAGE=/@rootfs/boot/vmlinuz-6.1.0-22-cloud-amd64 root=UUID=2ff0bc0c-642d-4aeb-985e-e6a9e1cd6856 ro rootflags=subvol=@rootfs console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 So, no, lockdown is not activated…

bateast avatar Jul 04 '24 06:07 bateast

with the mentioned coolwsd-dbgsym (or the equivalent coolwsd-debuginfo + coolwsd-debugsource rpms for the rpm-using distros) when it hangs if you try:

sudo gdb --q --n --ex 'thread apply all backtrace full' --batch --pid=$(pidof coolwsd) > /tmp/full.backtrace.log and attach the resulting /tmp/full.backtrace.log that might be useful

caolanm avatar Jul 04 '24 13:07 caolanm

Ah - so ... the problem here seems simple; the new server-side downloading clipboard code (which avoids the browser's various well meaning policy issues forbidding it to connect to other servers) does:

ClientSession.cpp: http::Request httpRequest(Poco::URI(url).getPathAndQuery()); if (!httpSession->asyncRequest(httpRequest, docBroker->getPoll()))

And you'd think that asyncRequest was fully async; but unfortunately the DNS is synchronous; is there a reason why a DNS lookup would take an unreasonably long time on your machine ? Either way I think we end up where it says:

HttpRequest::asyncRequest -> HttpRequest::connect() -> net::connect which does:

struct addrinfo* ainfo = nullptr;
struct addrinfo hints;
std::memset(&hints, 0, sizeof(hints));
const int rc = getaddrinfo(host.c_str(), port.c_str(), &hints, &ainfo);

synchronously.

mmeeks avatar Jul 04 '24 14:07 mmeeks

Then again - why that would block the websrv thread instead of the poll of the DocumentBroker is something of an unusual mystery.

mmeeks avatar Jul 04 '24 14:07 mmeeks

@mmeeks would that DNS lookup happen on the server side, or on the client side? If the second was true, it would explain that the issue gets triggered by some particular clients and if they stop using Collabora the problem doesn't appear any longer.

mejo- avatar Jul 04 '24 15:07 mejo-

Thanks @mejo- we think that this could be cause by non-asynchrnous DNS during a connect as above - which could block a DocumentBroker thread (but not the whole coolwsd). I believe Caolan is looking at making that fully async - in case it's the issue. We have also found a potentially related threading issue with a similar trace that we have a potential fix (or two) for that may be related. Thinking and testing is ongoing there.

mmeeks avatar Jul 08 '24 21:07 mmeeks

I feel there is a good chance that https://github.com/CollaboraOnline/online/commit/d6b15b9f2de67de2ae99321e65380495acbebd17 will help here given the similarity of the callstack seen in the motivating case there

caolanm avatar Jul 11 '24 14:07 caolanm

Possibly; I'm not hyper-convinced that the checkFileInfoSync path is ever taken; but at least if by the time we're unwinding the stack there the CheckFileInfo class is cleaned up - then at least we don't de-reference a freed 'this' =)

mmeeks avatar Jul 11 '24 15:07 mmeeks

We caught the issue another time, this time with coolwsd-debug being installed. Doesn't look like the flamegraph has more info this time, but here it is:

watchdog

doobry-systemli avatar Jul 16 '24 12:07 doobry-systemli

The originally anonymous [coolwsd] symbol now appears as CheckFileInfo::checkFileInfo which is good to see as the concrete hanging location.

caolanm avatar Jul 16 '24 12:07 caolanm

I think now in 24.04.5-2 that this exact glitch shouldn't happen. Let me know if it reappears in that version though.

caolanm avatar Jul 22 '24 13:07 caolanm