stratisd Handle error case when device(s) show up that belong to an existing activated pool

It is theoretically possible that the following scenario could occur:

The pool is modified, and the metadata is updated on only a subset of the disks.
stratisd is started, and only a subset of the disks is available, all of which carry the outdated metadata.
At some point later, the device(s) with the newer metadata become available.

Potential outcome(s):

If stratisd doesn't have udev add support it is blissfully unaware of the issue and continues to operate under the false assumption that all is good.
If stratisd does have udev add support it gets the udev add, evaluates that the device pool id is already active and ignores it, yielding the same outcome as if stratisd didn't have udev add support.
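
The second outcome can be sketched as follows. This is only an illustration in Rust, not stratisd's actual code: `ActivePool`, `AddOutcome`, `handle_add`, and the integer pool ids and timestamps are hypothetical stand-ins (real pool ids are UUIDs).

```rust
use std::collections::HashMap;

/// Hypothetical record of an active pool and the metadata timestamp it was
/// set up from. Not a stratisd type.
pub struct ActivePool {
    pub metadata_time: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AddOutcome {
    /// The pool id is already active, so the event is dropped.
    Ignored,
    /// The pool is not active yet, so the device is kept for later setup.
    Queued,
}

/// Handles a udev add event the way the second outcome describes: only the
/// pool id is consulted. The device's metadata timestamp is never compared
/// with the pool's, so a device carrying newer metadata goes unnoticed.
pub fn handle_add(
    active: &HashMap<u32, ActivePool>,
    pool_id: u32,
    _device_metadata_time: u64,
) -> AddOutcome {
    if active.contains_key(&pool_id) {
        AddOutcome::Ignored
    } else {
        AddOutcome::Queued
    }
}
```

With a pool set up from metadata at time 10, a late device reporting time 20 for the same pool id is still ignored, which is exactly the silent failure described.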

Ultimately we need to be able to identify this error case and correctly figure out what actions to take to correct it. At the moment we could identify it with udev add support, but we don't know what action(s) we should take. As one previous colleague of mine stated, "Don't check for error conditions you don't know how to handle" :-)

Feb 09 '18 17:02 tasleson

Note that stratisd can currently distinguish between two cases: the device is already in the pool but was found again, or the device is not yet in the pool and has been found for the first time.

1. The device is already in the pool.
   a. All our data about the device matches the data we already have. Maybe this means everything is fine. Or maybe not.
   b. Something doesn't match between the blockdev we have now and the one in the complete pool. This is a definite panic, I would think.
2. The device is not already in the pool.
   a. It has metadata newer than the pool's. In that case the pool must be wrong. We should remove the pool from pools, reinsert it into incomplete pools, and rerun the whole setup thing to see if it is now fine.
   b. Its metadata is older than the pool's metadata. So it belonged to the pool and was removed. Currently, that is impossible, so a panic is the right choice. Later, it may be more possible, as we will be able to remove cache devs from the pool.
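
A minimal sketch of the case analysis above, in Rust. Every name here (`DeviceReport`, `Action`, `classify`) is hypothetical and not taken from stratisd, and plain integers stand in for metadata timestamps. The thread does not say what should happen when a device outside the pool has metadata exactly as old as the pool's, so that case is marked as not covered rather than guessed.

```rust
use std::cmp::Ordering;

/// Hypothetical summary of a newly reported block device.
#[derive(Debug, Clone, Copy)]
pub struct DeviceReport {
    /// The device is already recorded as a member of the active pool.
    pub already_in_pool: bool,
    /// The device's data matches what the pool recorded for it.
    /// Only meaningful when `already_in_pool` is true.
    pub matches_recorded: bool,
    /// Timestamp of the metadata found on the device.
    pub metadata_time: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Known device, data matches: nothing obviously wrong.
    NothingToDo,
    /// Known device, data differs: treated as a fatal inconsistency.
    Mismatch,
    /// New device with newer metadata: demote the pool and rerun setup.
    RerunSetup,
    /// New device with older metadata: currently thought impossible.
    StaleDevice,
    /// New device with equal metadata time: not discussed in the thread.
    NotCovered,
}

/// Maps a device report onto the four cases listed above.
pub fn classify(report: DeviceReport, pool_metadata_time: u64) -> Action {
    if report.already_in_pool {
        if report.matches_recorded {
            Action::NothingToDo
        } else {
            Action::Mismatch
        }
    } else {
        match report.metadata_time.cmp(&pool_metadata_time) {
            Ordering::Greater => Action::RerunSetup,
            Ordering::Less => Action::StaleDevice,
            Ordering::Equal => Action::NotCovered,
        }
    }
}
```

Returning an `Action` instead of panicking inside `classify` is a design choice of this sketch: it leaves open what the caller should actually do in each case.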

Feb 09 '18 18:02 mulkieran

The device is not already in the pool. a. It has metadata newer than the pool's. In that case the pool must be wrong. We should remove the pool from pools, reinsert it into incomplete pools, and rerun the whole setup thing to see if it is now fine.

Once a pool is up and active and IO has been done on it, there is a good chance that data loss/corruption has already occurred. One cannot simply go from one state to another and back without ramifications. IMHO we need to prevent this case from happening, not try to deal with it when it does. We should know without any ambiguity whether a pool is complete or not.

If I understand this correctly, this all stems from an optimization where we choose not to write the latest metadata to all disks, to speed up metadata updates. IMHO we need to revisit that discussion and determine how we can close this error case.

Feb 09 '18 19:02 tasleson

OK, so can we just write metadata to all blockdevs and worry about this at the time we officially start supporting enough blockdevs to make writing to all blockdevs a bad idea? Such a time might never come...

Feb 09 '18 19:02 agrover

I think we can theorize a scenario where we are writing to all storage devices and still end up in this situation. This is akin to solving the RAID5 write hole.

Feb 12 '18 16:02 tasleson

Now that stratisd is responding to Change as well as Add events, block_evaluate is encountering quite benign instances where the block device being evaluated is already in a complete and fully functioning pool. I think that it's time to address this bug, in order to avoid pumping out warn messages about an utterly normal situation.

Sep 04 '18 15:09 mulkieran
