libkiwix icon indicating copy to clipboard operation
libkiwix copied to clipboard

Better unexpected exception handling?

Open rgaudin opened this issue 3 years ago • 6 comments

Using a nightly kiwix-serve on the hidden selection of ZIM files I frequently get the following output:

2022-04-25T14:05:33Z ===== Unhandled unknown error

The container is then restarted. Looking at the exit code (139), it looks like a segfault.

What I know:

  • Library loads fine. it's periodically refreshed as I'm using --monitorLibrary but the crashes are uncorrelated.
  • Some ZIM files may be broken. There are development ZIM files there and I know for a fact that some of them lack the Counter metadata for instance.
  • I'm running an alpine:3 container with the latest kiwix-tools-linux x86_64 nightly binaries.

Is there a way to have better log on unhandled exceptions? Although it wouldn't help in segfault cases.

https://github.com/kiwix/libkiwix/blob/dc42f831c0723290f47a0514dfdce23f34bde451/src/server/internalServer.cpp#L365-L370

The crashes are frequent but not systematic so I don't know where to start looking at. What would you suggest?

rgaudin avatar Apr 25 '22 14:04 rgaudin

Except debugging, there is nothing we can do.

The full catch is https://github.com/kiwix/libkiwix/blob/dc42f831c0723290f47a0514dfdce23f34bde451/src/server/internalServer.cpp#L361-L369 It means that if we have a exception which inherit from std::exception we print the content of the exception, which would allow us to have some clue about what going wrong. The catch (...) is the last option to not have the server crashing, it will catch anything "not a std::exception" (by opposition to the previous catch). It is when we have something raised but we don't even know what it is.

The container is then restarted. Looking at the exit code (139), it looks like a segfault.

A segfault is not a exception. The program "simply" crash. Who is restarting the container ? (For my own curiosity, it will not help for this)

What would you suggest?

Run the server using a debugger and try to reproduce the bug. At server crash, debugger will stop and you will be able to see the call-stack and where it crashes.

mgautierfr avatar Apr 25 '22 16:04 mgautierfr

OK, I think I'll need to use the debugger to find out what's going on. It seems that I don't have the log of the culprit request (if the culprit is request-bound) as the last logs and crash time don't really matches.

container is running in the k8s cluster which restarts it.

rgaudin avatar Apr 25 '22 16:04 rgaudin

Library loads fine. it's periodically refreshed as I'm using --monitorLibrary but the crashes are uncorrelated.

Do you mean that the crashes happen even without --monitorLibrary?

Is there a way to have better log on unhandled exceptions?

Before returning a HTTP 500 INTERNAL SERVER ERROR response more debug info is output when run in --verbose mode:

https://github.com/kiwix/libkiwix/blob/dc42f831c0723290f47a0514dfdce23f34bde451/src/server/internalServer.cpp#L305-L312

veloman-yunkan avatar Apr 28 '22 08:04 veloman-yunkan

No, I've only used --monitorLibrary on this instance so I can't say it happens without. I meant to say that the crashes are not happening at the time of the library reloading (which are known).

I'm always running it in verbose mode and I see many INTERNAL ERROR, Unhandled error as well as crashes. Because there aren't that many requests on this instance, I though the crash and those unhandled logs were related but it seems those are not.

Exceptions don't crash the process but it seems segfaults happens very early in that I don't have the initial request details printed.

I'll run it via gdb next and will report the results

rgaudin avatar Apr 28 '22 08:04 rgaudin

Because there aren't that many requests on this instance, I though the crash and those unhandled logs were related but it seems those are not.

They may be related. As this server use broken zim files, the first request with the INTERNAL ERROR may leave the process in a incoherent state and the next request "uses" this incoherent state and segfault.

mgautierfr avatar Apr 28 '22 12:04 mgautierfr

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 23:07 stale[bot]