
coredump files get corrupted

bhalevy opened this issue

As seen in https://github.com/scylladb/scylla-enterprise/issues/3968, coredump files that were collected could not be opened by gdb.

The symptom looks like:

Failed to read a valid object file image from memory.

And additional printouts that are probably unrelated:

Trying host libthread_db library: search-path /opt/scylladb/libreloc/libthread_db.so.1.
open failed: No such file or directory.
thread_db_load_search returning 0

@fruch theorized that, since scylla was observed to crash in quick succession, newer coredumps may have caused the previous coredumps to be truncated by coredumpctl.

We need to make sure in testing that if scylla crashed, the scylla service won't start until the previous coredump has been collected (and archived on s3/gc? can we do that?).

@avikivity is that something we can/should do on production machine images?
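
As a sketch of the testing-side idea only (the drop-in path, glob, and timeout are assumptions, and the interaction with Restart= would need checking, so treat this as a starting point, not a recipe):

# hypothetical drop-in: /etc/systemd/system/scylla-server.service.d/99-wait-for-core.conf
[Service]
# budget for the collector to archive the previous core; 10 minutes is arbitrary
TimeoutStartSec=600
# hold back startup while an uncollected core is still on disk
ExecStartPre=/bin/sh -c 'while ls /var/lib/systemd/coredump/core.scylla.* >/dev/null 2>&1; do sleep 5; done'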

bhalevy · Mar 07 '24 13:03

here's one example of such an occurrence:

# SCT starts compressing 
2024-02-25T22:13:05.713+00:00 multi-dc-rackaware-tablets--db-node-7a99b942-5   !NOTICE | sudo[17453]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/pigz --fast --keep /var/lib/systemd/coredump/core.scylla.112.7ca8b11c592f4c37b5e4e54b56772b37.17368.1708899083000000
...
# systemd-coredump decides this file should be removed (probably because of no space, or some other calculation it might do)
2024-02-25T22:13:15.213+00:00 multi-dc-rackaware-tablets--db-node-7a99b942-5     !INFO | systemd-coredump[17550]: Removed old coredump core.scylla.112.7ca8b11c592f4c37b5e4e54b56772b37.17368.1708899083000000.
...
# SCT finishes the compression
2024-02-25T22:14:23.213+00:00 multi-dc-rackaware-tablets--db-node-7a99b942-5     !INFO | sudo[17453]: pam_unix(sudo:session): session closed for user root

The default values in coredump.conf are probably not good enough for us in some cases: https://manpages.debian.org/testing/systemd-coredump/coredump.conf.5.en.html

I would recommend running with ExternalSizeMax=infinity as a start.
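
For example, as a drop-in (the file name is an assumption; systemd-coredump reads its configuration on each invocation, so no daemon restart should be needed):

# hypothetical drop-in: /etc/systemd/coredump.conf.d/99-sct.conf
[Coredump]
# never truncate stored cores because of their size
ExternalSizeMax=infinity
# note: the "Removed old coredump" vacuuming in the log above is governed by
# MaxUse= and KeepFree=, so those may need raising as well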

fruch · Mar 07 '24 14:03

A possible workaround for now, to be able to upload the coredump, is to change the Coredump event to a CRITICAL event, which will stop the test and hopefully prevent the next crash that would corrupt the coredump upload. (It will also leave the machine up and running.)

@ShlomiBalalis please try that.

roydahan · Mar 10 '24 11:03

Please report the bug to systemd so it can be fixed at the origin, in addition to working around it.

avikivity · Mar 10 '24 15:03

It's terrible to leave a machine up when there's a coredump.

avikivity · Mar 10 '24 16:03

Please report the bug to systemd so it can be fixed at the origin, in addition to working around it.

systemd specifically states that on-abnormal includes coredumps: https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Restart=

and we have been using it since 2016: https://github.com/scylladb/scylladb/commit/1b49c0ce19d22b891f3cd4161cfe2a809ab64c06
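
For reference, the setting in question in scylla-server.service is simply:

[Service]
# per the man page above, on-abnormal restarts after unclean signals
# (the coredump case), watchdog failures, and timeouts
Restart=on-abnormal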

what bug do you want to report to systemd?

fruch · Mar 10 '24 16:03

@fruch Continuing from https://github.com/scylladb/scylla-enterprise/issues/4246#issuecomment-2147245586:

other than that, I don't have any fix for this situation.

@fruch Come on, there are at least two ways to fix the problem (that SCT fails to upload any coredumps from bugs which cause a crash loop) without touching scylla-server.service at all:

  1. At the end of the test, after Scylla is stopped for the final time, collect the most recent core (if it's still present and hasn't been collected yet). The last core couldn't have been evicted by any newer core, so it should be fully present on disk. You lose nothing, and it gives you at least one core from the crash loop.

might work, if the timing of calling systemctl stop scylla matches perfectly and scylla isn't again in the middle of starting and crashing

I'm saying that we can download the most recent core after the test, when we are free to stop the Scylla service, because it's no longer needed. If we call systemctl stop scylla, systemd isn't going to restart it. (Right?). So I think there is no timing to speak of here?

  2. Prevent systemd from truncating the core file while SCT is processing (compressing and uploading) it. Systemd uses this code for deleting the core files during cleanup: https://github.com/systemd/systemd/blob/e1c3ac1f67c746827814eb446a8857955921b494/src/basic/fs-util.c#L711-L712 It deliberately truncates files which have only one link, but doesn't truncate files which have more than one link. So you should be able to preserve the coredump from automatic truncation by creating a hardlink (outside of the coredump directory) to it.

so one should count on this undocumented behavior, based on observation of the code?

Yes. It's very unlikely to ever change, and even if it does, the worst thing that can happen is that we will have to adjust SCT again.
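
A minimal sketch of the hardlink trick from point 2 (the keep directory and core filename are placeholders; the link must live on the same filesystem as the coredump directory for ln to succeed):

# a second link raises the file's link count above 1, so the cleanup code
# linked above unlinks the name but no longer truncates the data
sudo mkdir -p /var/lib/systemd/coredump-keep
sudo ln /var/lib/systemd/coredump/core.scylla.NNN /var/lib/systemd/coredump-keep/core.scylla.NNN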

michoecho · Jun 04 '24 23:06

@fruch Continuing from scylladb/scylla-enterprise#4246 (comment):

other than that, I don't have any fix for this situation.

@fruch Come on, there are at least two ways to fix the problem (that SCT fails to upload any coredumps from bugs which cause a crash loop) without touching scylla-server.service at all:

  1. At the end of the test, after Scylla is stopped for the final time, collect the most recent core (if it's still present and hasn't been collected yet). The last core couldn't have been evicted by any newer core, so it should be fully present on disk. You lose nothing, and it gives you at least one core from the crash loop.

might work, if the timing of calling systemctl stop scylla matches perfectly and scylla isn't again in the middle of starting and crashing

I'm saying that we can download the most recent core after the test, when we are free to stop the Scylla service, because it's no longer needed. If we call systemctl stop scylla, systemd isn't going to restart it. (Right?). So I think there is no timing to speak of here?

either way we will upload all of the cores, and you'll still need to look for the correct, complete one. I still think there is a timing issue: you might call stop after scylla has already restarted and cleared the previous core, but before the core from the new process has been written. The logic by which systemd evacuates them isn't clear enough to me to say for sure that the process you suggest would work better.

  2. Prevent systemd from truncating the core file while SCT is processing (compressing and uploading) it. Systemd uses this code for deleting the core files during cleanup: https://github.com/systemd/systemd/blob/e1c3ac1f67c746827814eb446a8857955921b494/src/basic/fs-util.c#L711-L712 It deliberately truncates files which have only one link, but doesn't truncate files which have more than one link. So you should be able to preserve the coredump from automatic truncation by creating a hardlink (outside of the coredump directory) to it.

so one should count on this undocumented behavior, based on observation of the code?

Yes. It's very unlikely to ever change, and even if it does, the worst thing that can happen is that we will have to adjust SCT again.

I've tried this idea in: https://github.com/scylladb/scylla-cluster-tests/commit/6c00c31f94a6399bfe3ab7b3d6d4d611f48ffc8f

and a run with it: https://argus.scylladb.com/test/a9ff9a8c-6b1d-43f4-afa2-3e31cdee2d9a/runs?additionalRuns[]=1039e962-4edd-482e-9a20-f937ddd865be

there are plenty of coredumps there, but I don't know how to tell which one is correct and which one isn't
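
One way to triage them, as a sketch (assumes the cores are decompressed and that /usr/bin/scylla is the matching binary; a truncated core typically reproduces the "Failed to read a valid object file image from memory." symptom quoted at the top of this issue):

for c in core.scylla.*; do
    # open each core in batch mode and look for the truncation symptom
    if gdb --batch /usr/bin/scylla "$c" 2>&1 | grep -q 'Failed to read a valid object file image'; then
        echo "$c: likely truncated"
    else
        echo "$c: loads in gdb"
    fi
done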

fruch · Jun 05 '24 07:06

how about prioritizing https://github.com/scylladb/scylladb/pull/18854 and seeing if we still hit problems with it? It should improve timings and maybe help us avoid corruption.

soyacz · Jun 05 '24 07:06

how about prioritizing scylladb/scylladb#18854 and seeing if we still hit problems with it? It should improve timings and maybe help us avoid corruption.

I said exactly that in https://github.com/scylladb/scylla-enterprise/issues/4246#issuecomment-2146545425

one of the things it's waiting on is for SCT to not compress it yet again, i.e. to identify that the core is already compressed

fruch · Jun 05 '24 07:06

how about prioritizing scylladb/scylladb#18854 and seeing if we still hit problems with it? It should improve timings and maybe help us avoid corruption.

I said exactly that in scylladb/scylla-enterprise#4246 (comment)

one of the things it's waiting on is for SCT to not compress it yet again, i.e. to identify that the core is already compressed

compression can be easily identified (e.g. the file ends with .zst), so it should be a quick fix that we can implement even before merging ^.

soyacz · Jun 05 '24 07:06

how about prioritizing scylladb/scylladb#18854 and seeing if we still hit problems with it? It should improve timings and maybe help us avoid corruption.

I said exactly that in scylladb/scylla-enterprise#4246 (comment). One of the things it's waiting on is for SCT to not compress it yet again, i.e. to identify that the core is already compressed.

compression can be easily identified (e.g. the file ends with .zst), so it should be a quick fix that we can implement even before merging ^.

I'd recommend using the file utility to determine if the file is already compressed. For example:

bhalevy@lt tmp$ file foo.gz
foo.gz: gzip compressed data, last modified: Wed Jun  5 07:55:24 2024, from Unix, original size modulo 2^32 24

bhalevy@lt tmp$ file foo.zst
foo.zst: Zstandard compressed data (v0.8+), Dictionary ID: None
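
A sketch of wiring that into the compression step ($core is a placeholder; the MIME names are what recent versions of file report, so treat them as assumptions):

# skip recompression when file(1) already reports a compressed payload
case "$(file --brief --mime-type "$core")" in
    application/zstd|application/gzip|application/x-xz|application/x-lz4)
        echo "$core is already compressed, skipping" ;;
    *)
        pigz --fast --keep "$core" ;;
esac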

bhalevy · Jun 05 '24 07:06

  2. Prevent systemd from truncating the core file while SCT is processing (compressing and uploading) it. Systemd uses this code for deleting the core files during cleanup: https://github.com/systemd/systemd/blob/e1c3ac1f67c746827814eb446a8857955921b494/src/basic/fs-util.c#L711-L712 It deliberately truncates files which have only one link, but doesn't truncate files which have more than one link. So you should be able to preserve the coredump from automatic truncation by creating a hardlink (outside of the coredump directory) to it.

so one should count on this undocumented behavior, based on observation of the code?

Yes. It's very unlikely to ever change, and even if it does, the worst thing that can happen is that we will have to adjust SCT again.

I've tried this idea in: 6c00c31

and a run with it: https://argus.scylladb.com/test/a9ff9a8c-6b1d-43f4-afa2-3e31cdee2d9a/runs?additionalRuns[]=1039e962-4edd-482e-9a20-f937ddd865be

there are plenty of coredumps there, but I don't know how to tell which one is correct and which one isn't

@fruch What you did isn't enough. Creating the hardlink will only prevent the original core file from being "truncated". It won't prevent it from being unlinked. If you want to minimize possible races, you shouldn't access the file (when compressing or uploading) via the original name, but only via the hardlink. Creating a hardlink to the original file also won't prevent the compressed copy from being truncated and unlinked. So it can still get truncated while you are compressing or uploading it.

A proper sequence would look like this (a shell sketch follows the list):

  1. Hardlink the newest core file.
  2. If the hardlink succeeded, wait until systemd finishes writing it. Otherwise, ignore the core.
  3. Compress the core (with the hardlink as the input, and with an output filename that doesn't fall under systemd-coredump's GC rules).
  4. Remove the hardlink.
  5. Upload the compressed file.
  6. Remove the compressed file.
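
A minimal shell sketch of the sequence above (all paths, the fuser-based wait, and the upload destination are assumptions, not SCT's actual code):

# 1. hardlink the newest core; the link directory must be on the same filesystem
CORE=$(ls -t /var/lib/systemd/coredump/core.scylla.* 2>/dev/null | head -1)
KEEP=/var/lib/systemd/coredump-keep
sudo mkdir -p "$KEEP"
if [ -n "$CORE" ] && sudo ln "$CORE" "$KEEP/${CORE##*/}"; then
    # 2. wait until no process (i.e. systemd-coredump) still has the file open
    while sudo fuser -s "$KEEP/${CORE##*/}"; do sleep 1; done
    # 3. compress from the hardlink to a name outside systemd-coredump's GC rules
    sudo pigz --fast --stdout "$KEEP/${CORE##*/}" > "/var/tmp/${CORE##*/}.gz"
    # 4. drop the hardlink now that we have our own copy
    sudo rm "$KEEP/${CORE##*/}"
    # 5. upload the compressed file (the bucket is hypothetical), then 6. clean up
    aws s3 cp "/var/tmp/${CORE##*/}.gz" s3://example-bucket/coredumps/
    rm "/var/tmp/${CORE##*/}.gz"
fi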

how about prioritizing scylladb/scylladb#18854 and seeing if we still hit problems with it? It should improve timings and maybe help us avoid corruption.

@soyacz "It should improve timings and maybe help" isn't good enough. SCT's current code only works if it's able to notice, compress and upload the new core faster than Scylla is able to restart and dump a new core. Delaying the restart until compression finishes will "help" with that, but SCT still has to notice and fully upload the core before the next crash happens. I don't like the odds of that.

michoecho · Jun 05 '24 10:06

put the hardlink attempt into a PR: https://github.com/scylladb/scylla-cluster-tests/pull/7689

fruch · Jun 18 '24 20:06