zfs We could handle failing to load some non-essential data better

Describe the feature would like to see added to OpenZFS

Someone rolled up in IRC with a failure mode of hitting the spa_load_failed case in here:

        error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
            &spa->spa_l2cache.sav_object, B_FALSE);
        if (error != 0 && error != ENOENT)
                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
        if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
                ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
                if (load_nvlist(spa, spa->spa_l2cache.sav_object,
                    &spa->spa_l2cache.sav_config) != 0) {
                        spa_load_failed(spa, "error loading l2cache nvlist");
                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
                }

                spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
                spa_load_l2cache(spa);
                spa_config_exit(spa, SCL_ALL, FTAG);
        } else if (error == 0) {
                spa->spa_l2cache.sav_sync = B_TRUE;
        }

Which, while not good, seems like we could just pave over the l2arc object, and move on in life, in at least some cases, similar to how we can do zpool import -m with missing slogs.

I don't know that this is an especially common case, since I could find literally no other reports of anyone ever hitting this, but I think handling this more nicely is probably reasonable, even if it doesn't come up too often.

How will this feature improve OpenZFS?

If we can throw out the vdev, failing to load its metadata probably should not, on its own, eat the pool.

Additional context

Apr 06 '24 20:04 rincebrain

It seems not a missing device case, but actually a pool metadata corruption. But yea, if we could recover it clean, it could be nice. Just unlike missing device it should be much more rare scenario.

Apr 07 '24 01:04 amotin

Sure, I wasn't saying the device was the corrupted part.

But I cut a horrible patch that let them import the pool and drop the device, so I know it's possible for this to work better without the pool turning into a pile of ash, and this feels like we shouldn't fail on this specific case, since it's not like the L2ARC can ever contain data we must have.

Apr 07 '24 01:04 rincebrain

Yeah, my first thought before I read the entire description was "import -m but for cache devices". Seems straightforward and worthwhile.

Apr 07 '24 04:04 robn