
flux-fsck tool needed

garlick opened this issue 10 months ago

Problem: Flux fails catastrophically when it cannot write to /var/lib/flux. Generally when this occurs, manual intervention from a flux developer is required. This is error prone, will not scale, and will not be possible in all environments.

Develop a script similar to an "fsck" that can be run by sys admins after such an event.

Some requirements:

  • Always provide a no-intervention route to a working system, even if jobs may be lost
  • Make a heroic effort not to lose the list of drained nodes
  • Get an emergency dump from the running system, if possible
  • Ensure that new job IDs are higher numbered than historical ones to avoid problems with duplicate IDs in logs
  • Generate a report of active jobs, if possible
  • Create RESTORE symlink so flux can start from the most recent known good dump (not necessarily the emergency dump)
  • If there is no known good dump, create one with only the resource.eventlog and checkpoint.job-manager keys.

garlick avatar Jan 30 '25 17:01 garlick

Right now, I'm picturing a C program similar to flux dump rather than a script. It would operate directly on the content store, like flux dump. We could call it in rc1 before loading the KVS.

When it encounters dangling references, it could offer to

  • unlink a key
  • pull in an older version of a key from another checkpoint

Then when it's all done it could conditionally write the new root ref as the new checkpoint. Alternatively, it could revert to an earlier checkpoint (which it would need to check first), for example if the number of unlinked keys is above some threshold, or do nothing.

The advantage of operating on the content store is that any remediation doesn't actually take effect until the checkpoint is updated. That means if it encounters write errors on the content store, or if the number of errors is excessive, it can abort without side effects.
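To make that concrete, here is a minimal sketch of what the verification pass could look like, assuming hypothetical helpers (load_blob, treeobj_decode, treeobj_refs) rather than any real flux-core API; the key property is that the walk only counts problems and writes nothing.

#include <stdlib.h>
#include <jansson.h>

/* Placeholders for this sketch -- not real flux-core functions. */
int load_blob (const char *blobref, void **blob, size_t *size);
json_t *treeobj_decode (const void *blob, size_t size);
json_t *treeobj_refs (json_t *obj);  /* array of blobref strings to follow */

/* Walk all blobrefs reachable from 'blobref', counting dangling ones.
 * Nothing is modified, so the tool can abort at any point without
 * side effects.  Returns -1 only on internal error. */
static int verify_tree (const char *blobref, int *nmissing)
{
    void *blob;
    size_t size;
    json_t *obj, *refs, *entry;
    size_t index;
    int rc = 0;

    if (load_blob (blobref, &blob, &size) < 0) {
        (*nmissing)++;                  /* dangling reference */
        return 0;
    }
    if (!(obj = treeobj_decode (blob, size))) {
        free (blob);
        return -1;
    }
    refs = treeobj_refs (obj);
    json_array_foreach (refs, index, entry) {
        if ((rc = verify_tree (json_string_value (entry), nmissing)) < 0)
            break;
    }
    json_decref (obj);
    free (blob);
    return rc;
}

Only once the walk completes (and the number of missing blobs is acceptable) would a new root ref be written as the checkpoint; otherwise the tool exits and the old checkpoint stands.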

garlick avatar Apr 15 '25 15:04 garlick

(side note, thoughts / questions below are me brainstorming #6777 a bit)

You mention possibly using flux fsck in rc1. Do you imagine the process to be something like this:

flux restore dump
flux fsck || { echo "content bad .... fix it"; exit 1; }
module load kvs

In this scenario, I imagine the user will be required to run flux fsck manually, and there's an interactive thing for them:

> flux fsck /my/flux/statedir
dir missing reference sha1-431473894721432 - recover? (y/n) (user hits yes)
- recovering dir from previous checkpoint 2025-01-05 3:00

And we can imagine tons of options: recover from any checkpoint, recover only from a list of checkpoints I say are ok, just list how many thingies are bad, find me the most recent checkpoint that is good, etc. etc.

As I've been thinking about this, I guess I see two things with checkpoint recovery when we have > 1 checkpoint stored

  1. investigating which checkpoint to recover from

I think this is where my head is at in regards to #6777

  2. doing the actual recovering

This is more with the fsck tool

But as I think about it more ... I'm not sure if situation 1 above matters. A) it could be in flux fsck and B) I'm not sure that people care about investigating different checkpoints outside of testing? They will gladly checkpoint to anything that is 100% good? (Edit: as I thought about this more, "investigate" might simply be "tell me if checkpoint good or bad" and that's enough)

chu11 avatar Apr 23 '25 20:04 chu11

Wanted to start a design discussion on what we would like flux-fsck to look like. First, summarizing several PRs I submitted.

  • #6935

this is not a super important extension, just to test additional checkpoints in the checkpoint history

  • #6931
  • #6941

These two PRs together are what I considered to be the beginning of the "meat" of a flux-fsck tool: look for a checkpoint that is valid, and once one is found valid, consider updating the checkpoint to that root reference.

Considerations:

  • Perhaps --checkpoint should not be an option? Or perhaps it should be hidden? Or it should be "automatic" with the --scan option?

  • We should strongly consider "Yes/No" input from the user if a checkpoint is to be updated during a scan of checkpoint history. But I figured that would be for a follow on PR.

  • We may wish for specific keys, such as job stdio, to be skippable during fscks. That too could be a follow on PR.

  • And as mentioned above, perhaps some keys could be "corrected" from a prior checkpoint. Again, something to consider but I figured for a follow on PR after the above.

chu11 avatar Aug 01 '25 21:08 chu11

When we had all the problems on El Cap earlier, the main thing we were doing manually was unlinking keys that had dangling blobrefs. So my first thought on remediation is to automate those fixes, sort of following the traditional fsck model, something like

  • create a lost+found directory on first bad key, if needed
  • move the bad key to the lost+found with the bad references removed
  • repeat until all metadata is valid
  • write out a new checkpoint at the end

Then make it so we can add flux-fsck to the start-up sequence before the kvs is loaded.
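A rough sketch of that loop, with every type and helper as a placeholder (none of this is actual flux-fsck code), might be:

#include <stdbool.h>

/* Placeholders only -- not real flux-fsck internals. */
struct fsck;
struct badkey { const char *path; };
struct badkey *next_bad_key (struct fsck *f);
void strip_bad_refs (struct badkey *bk);
int mkdir_key (struct fsck *f, const char *path);
int move_key (struct fsck *f, const char *from, const char *todir);
int write_checkpoint (struct fsck *f);

int repair_pass (struct fsck *f)
{
    struct badkey *bk;
    bool have_lostfound = false;

    while ((bk = next_bad_key (f))) {          /* key with dangling blobref(s) */
        if (!have_lostfound) {
            if (mkdir_key (f, "lost+found") < 0)
                return -1;                     /* created lazily on first bad key */
            have_lostfound = true;
        }
        strip_bad_refs (bk);                   /* drop the unreadable blobrefs */
        if (move_key (f, bk->path, "lost+found") < 0)
            return -1;                         /* unlinks the original location */
    }
    return write_checkpoint (f);               /* nothing takes effect before this */
}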

On rollback to a previous checkpoint: that is a very heavy-handed operation that potentially sacrifices a lot of good data. It seems like it may be a useful last-resort option but not a default one. I could picture maybe offering it interactively, like

flux-fsck: KEY has dangling blobref
flux-fsck: KEY has dangling blobref
...
flux-fsck: will unlink 27 keys with dangling blobrefs
flux-fsck: last checkpoint would roll back time 33 minutes
flux-fsck: what would you like to do? s=save, r=rollback, q=quit without saving (S/r/q)?

Then if rolling back, the check above would just repeat starting with the checkpoint.

Maybe we could abort startup if more than some threshold number of keys are bad and tell the sys admin to run the check manually, where these heavier options could be offered. The admin could also choose to start the instance in recovery mode and do things manually as before.

garlick avatar Aug 04 '25 13:08 garlick

write out a new checkpoint at the end

Do you think we should limit this to certain keys, e.g. job data? Or perhaps the first round can be everything. Then perhaps a Y/N to confirm "hey, you ok with this?"

Edit: and I assume symlinks from the old key to the new lost+found key.

Edit2: hmmm, a lost and a found directory? B/c not everything that is lost can be recovered (e.g. a bad treeobj is simply bad ... )

chu11 avatar Aug 04 '25 17:08 chu11

As discussed offline, linking the broken key back to PATH is probably wrong since it multiplies the failure modes that have to be handled. Also, I think one lost+found directory is sufficient. If a key cannot be partially recovered it could either just be unlinked or put in lost+found with an empty value.

garlick avatar Aug 06 '25 18:08 garlick

Per a comment in #6953, I wanted to discuss potential repairs to a dir with a bad `dirref` entry.

random example for discussion purposes.

{
  "data": {
    "dir1": {
      "data": [
        "sha1-f4c281ad60d2d1b03d49298bcb4c31be3bb7d10d"
      ],
      "type": "dirref",
      "ver": 1
    },
    "dir2": {
      "data": [
        "sha1-c602ad77ee3908cef9af1c649eb54958d773dc4f"
      ],
      "type": "dirref",
      "ver": 1
    },
    "dir3": {
      "data": [
        "sha1-8bea3af1aebac68c848c04bfc3d383ad145f5419"
      ],
      "type": "dirref",
      "ver": 1
    }
  },
  "type": "dir",
  "ver": 1
}

So let's say one of the references (say the one for dir3) is bad. Do we:

A) repair by removing the bad dirref and put the repaired dir back in the KVS? (see the sketch after this list)

  • trickiness: medium to high, gotta update and repair all references back to the root.

B) repair and put into lost+found?

  • trickiness: lower than A, less need to "recurse" and update everything (I think).

  • there could be very unlikely corner cases here if users wrote some unexpected "dir" treeobjs to the KVS instead of letting KVS do everything as dirrefs.

C) do nothing

  • AFAICT, there are relatively few issues with a bad dirref in a dir (in contrast to a bad valref). A lookup through a bad dirref leads to a lookup error not so different from a key that doesn't exist. In some confirmed cases, it'll be an ENOENT either way.

  • commands like flux kvs ls and flux kvs dir still work, but "flux kvs dir -R" would fail at some point.

    • I bet this can maybe be worked around with some "if errno == ENOENT" checks
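For reference, here is a rough sketch of the option A propagation mentioned above: drop the bad entry from its containing dir, then re-store every directory on the path back to the root so each parent's dirref stays consistent, and finally point the checkpoint at the new root blobref. All helpers and the BLOBREF_MAX constant are placeholders, not real flux-core API.

#include <jansson.h>

#define BLOBREF_MAX 128     /* placeholder size for a blobref string */

/* Placeholders -- not real flux-core functions. */
struct fsck;
int store_treeobj (struct fsck *f, json_t *dir, char *ref, size_t refsz);
void set_dirref (json_t *dir, const char *name, const char *ref);
int write_checkpoint_rootref (struct fsck *f, const char *rootref);

/* dirs[0] is the root dir treeobj, dirs[depth-1] is the dir containing
 * the bad entry; names[i] is the name of dirs[i] within dirs[i-1]. */
int remove_bad_dirref (struct fsck *f, json_t **dirs, const char **names,
                       int depth, const char *badname)
{
    char ref[BLOBREF_MAX];

    /* drop the entry with the dangling reference (e.g. "dir3") */
    json_object_del (json_object_get (dirs[depth - 1], "data"), badname);

    /* re-store each modified dir and patch its parent's dirref entry */
    for (int i = depth - 1; i > 0; i--) {
        if (store_treeobj (f, dirs[i], ref, sizeof (ref)) < 0)
            return -1;
        set_dirref (dirs[i - 1], names[i], ref);
    }
    /* finally re-store the root and update the checkpoint */
    if (store_treeobj (f, dirs[0], ref, sizeof (ref)) < 0)
        return -1;
    return write_checkpoint_rootref (f, ref);
}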

chu11 avatar Aug 22 '25 18:08 chu11

I think A is the sane thing to do, unlink the bad dirref. Isn't it the same problem as removing a valref?

garlick avatar Aug 22 '25 23:08 garlick

I was thinking about next steps with flux-fsck.

  1. I think we want to add it to rc1. There are questions on how to do so:
  • if there are errors, warn user or exit with error?
  • or repair?
  • or ask user if they should repair?

Since the --repair option is currently considered experimental, perhaps we could start with the lowest hanging fruit? Just run flux-fsck and output that there is data corruption?

  2. support configuration of some "special keys", where errors on those keys should definitely cause an error exit rather than just a warning, e.g. resource.eventlog needs special attention

chu11 avatar Sep 17 '25 19:09 chu11

I'd lean towards adding it to the rc1 without the repair option, and without taking any action for now - just get any output in the logs.

However, since system instances normally start from a clean dump (after garbage collection, which is automatic), and user instances normally start empty, having it in place without any issues for a while is not going to tell us much about how good it is unless we start having production system instances dying unexpectedly.

So we might want to think about improving testing to the point where we feel confident that --repair will help not hurt, and then just turn it on.

By the way, we may want to detect when we've restored from a dump and skip the check, since the database is brand new in that case.

garlick avatar Sep 17 '25 20:09 garlick

So we might want to think about improving testing to the point where we feel confident that --repair will help not hurt, and then just turn it on.

Outside of intentionally corrupting things like I do in the current unit tests, I wasn't sure what could be done.

Then I remembered that a while back (I can't find an issue, if we even had one tied to it) there was a false-positive corruption case on El Cap. The sqlite database was copied for backup while flux was in a bad state but still running, i.e. "cp foo.sqlite bar.sqlite". It led to some unexpected corruption in the copied backup. Perhaps we can do something like that to try and generate a more "realistic" corruption-like scenario. (Edit: perhaps running throughput tests while copying the sqlite file, obviously not using a production copy)

chu11 avatar Sep 18 '25 17:09 chu11

To be more controlled another way might be to just add a content-backing.delete operation and a test tool that can mangle arbitrary KVS keys?
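The test-tool side of that could be tiny. A minimal sketch, assuming a hypothetical content-backing.delete RPC that takes a blobref (the RPC does not exist today, and the payload shown is a guess):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <flux/core.h>

/* Test-only tool: ask the content backing store to drop one blob so
 * flux-fsck has a dangling reference to find. */
int main (int argc, char **argv)
{
    flux_t *h;
    flux_future_t *f;
    int rc = 1;

    if (argc != 2) {
        fprintf (stderr, "Usage: content-mangle BLOBREF\n");
        return 1;
    }
    if (!(h = flux_open (NULL, 0))) {
        perror ("flux_open");
        return 1;
    }
    if (!(f = flux_rpc_pack (h, "content-backing.delete", FLUX_NODEID_ANY, 0,
                             "{s:s}", "blobref", argv[1]))
        || flux_rpc_get (f, NULL) < 0)
        fprintf (stderr, "content-backing.delete: %s\n",
                 f ? future_strerror (f, errno) : strerror (errno));
    else
        rc = 0;
    if (f)
        flux_future_destroy (f);
    flux_close (h);
    return rc;
}

Pointing flux-fsck at the instance afterwards would then exercise the dangling-reference paths without any hand-built sqlite corruption.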

garlick avatar Sep 18 '25 20:09 garlick

To be more controlled another way might be to just add a content-backing.delete operation and a test tool that can mangle arbitrary KVS keys?

I brought that up in another issue somewhere, where it would be convenient if a flux content remove existed. I initially hesitated as it seemed to have a limited use case (i.e. for just a few testing scenarios).

Are you thinking we should add it so we could do a far more rigorous test, e.g. corrupt 50% of the KVS entries in some stress-like test?

chu11 avatar Sep 18 '25 21:09 chu11

No, I just figured that we had very few tests because it's such a pain to set up a corrupted entry, and that as a result maybe we didn't have appropriate coverage. However, maybe we do!

What do you feel like we need to get comfortable with --repair? If we're there already then maybe we should just turn it on.

garlick avatar Sep 19 '25 17:09 garlick

What do you feel like we need to get comfortable with --repair? If we're there already then maybe we should just turn it on.

To be honest, I'm not sure. An additional "stress test" (i.e. a lot of errors over a very big KVS) might be good, but I think there's a finite amount that can be done in unit tests. Could definitely try to generate corrupted sqlites in the ways described above for some additional coverage.

chu11 avatar Sep 23 '25 00:09 chu11

There might be some good ideas here :-)

https://sqlite.org/howtocorrupt.html

garlick avatar Sep 23 '25 16:09 garlick

As an aside, it might be nice to have a --job-aware option or something that translates the KVS keys in the jobs directory to f58, and potentially moves whole job directories to lost+found instead of the individual keys. But that would be a nice-to-have, not a must-have, IMHO.

Comment from @garlick in #7082

chu11 avatar Oct 02 '25 20:10 chu11

Something for the future: would it be helpful if flux fsck dumped a report somewhere on failure (e.g. timestamped in statedir)?

Comment from @grondo in #7114

chu11 avatar Oct 07 '25 16:10 chu11

I think flux-fsck is far enough along that we can close this issue. I'll open up new issues for many of the features / add-ons that have been discussed above.

chu11 avatar Oct 07 '25 16:10 chu11