ceph-nvmeof
Guarantee data integrity when doing path failover and path failback
Background
The nvmeof GW should support multi-path. Right now the plan is to support 2 paths from the initiator to the namespace. For that, we will use 2 GWs, so the initiator will be able to reach a namespace through either of the 2 GWs (after we add ANA support, one of the paths will be the optimized one and the other the non-optimized one). This issue discusses how to implement path failover (or failback) without causing data integrity issues. For example, this is a scenario that can lead to data corruption:
Network nodes:
- Initiator/host - storage consumer
- GW1 - gateway 1
- GW2 - gateway 2
- Ceph - storage cluster
Network links:
- ig1 - link from Initiator to GW1
- ig2 - link from Initiator to GW2
- g1c - link from GW1 to Ceph
- g2c - link from GW2 to Ceph
Failure scenario
Initial state: Initiator connected via ig1 to GW1, all network links are up.
Test summary: Initiator writes (lbaX, data0), then (lbaX, data1).
Expected outcome: Ceph data (lbaX, data1)
Actual outcome: Ceph data (lbaX, data0)
Timeflow
1. Initiator sends to GW1 io: write (lbaX, data0)
2. Link g1c goes down, GW1 is still holding the io (lbaX, data0) in the Rados/RBD lib
3. Link ig1 goes down, failover trigger
4. Initiator performs failover, connects to GW2 via ig2
5. Initiator, via GW2, re-queues the uncommitted (lbaX, data0) and writes new data (lbaX, data1)
6. Ceph commits both io writes, Ceph state (lbaX, data1)
7. Link g1c goes up, Rados/RBD lib on GW1 commits to Ceph io: write (lbaX, data0), which got stuck in step #2
8. Ceph commits the IO, Ceph state (lbaX, data0) ← data corruption
Solution options
So far I don't think there is a good solution. I'm opening this issue for discussion.
- Use an exclusive lock on the volume - there are a few issues with this approach:
- In an active-active scenario (like ESX clusters), if one of the ESX initiators cannot utilize the optimized path, it will need to take over the lock, which means that the lock will start jumping from GW to GW. I don't think this is very efficient performance-wise.
- In round robin multipath mode, the initiator might want to spread the IOs across the paths, which also results in the lock constantly travelling from GW to GW.
- Self 'blocklist'. The idea here is based on the assumption that when the host decides to fail over, it will first destroy (i.e. disconnect from) the controller, or that the controller will have a keep alive timeout for this initiator. We are checking whether this assumption holds for Linux initiators. Assuming it does, we could potentially know from SPDK that the controller is being destroyed, and then check whether we have any inflight IOs in librbd. If we do, we can blocklist the volumes that have inflight IOs. We can currently think of some issues with this approach as well:
- We need to verify host behavior - Linux and ESX.
- Do we have any SPDK 'hook' to know that the controller is being destroyed, and to run code in that context?
- What if the GW doesn't have a connection to Ceph - how can we blocklist then?
- What if the initiator doesn't have any connection to the GW?
- Remote 'blocklist'. I heard @idryomov raising that idea, but I'm not sure how it should work. Ilya, can you elaborate?
I currently believe that option 2 is the right one conceptually, but there are gaps in implementing it that we need to discuss.
@sdpeters @oritwas @idryomov @leonidc @baum, @jdurgin
In NVMe, if the host ever has two writes to the same LBA in flight simultaneously, the behavior is undefined. NVMe does not guarantee commands complete in order. The only exception is compare and write (a single fused NVMe command). See NVMe Base Spec 2.0c, section 3.4.1. Even without any link failures, the write of data1 above could complete before the write of data0 (or lbaX could get part of data0 and part of data1). So let's modify your scenario so the host does not submit the write(lbaX, data1) until its write(lbaX, data0) has completed.
Your scenario also has two separate link failures (g1c and ig1). Let's consider those separately and then see what happens when they occur together.
The g1c failure is the simplest. The failure of g1c is not directly visible to the host. If the host submits write(lbaX, data0) via ig1 (to GW1), then g1c fails before the write completes, all the host sees is a write that's still not complete. If GW1 can't restore g1c quickly, it should fail all in-flight commands with one of the NVMe error codes indicating it should be retried on another controller. Of course before it does that it must ensure that any librbd writes it has in flight for those commands will not complete after the fail response is sent to the host.
The ig1 failure has some more room for uncertainty that needs to be addressed.
That scenario would look like:
1. host write(lbaX, data0) via ig1 (GW1)
2. host sees ig1 fail (e.g. link down on local NIC) (GW1 does not see the connection to the host fail yet, first write(lbaX, data0) still in flight)
3. host retries write(lbaX, data0) via ig2 (GW2)
4. GW2 completes write(lbaX, data0) to host (via ig2) (Has the write(lbaX, data0) via ig1 completed by this time? This is the potential race here)
5. host write(lbaX, data1) via ig2 (GW2)
The retried write(lbaX, data0) starts at GW2 at some time after the original write(lbaX, data0) started on GW1. Normally they should take about the same time to complete, so the original (via GW1) should be completed before the retried write (via GW2) is completed. This is not guaranteed, though.
NVMe includes parameters for the delays applied to retried IOs, and reestablishing connections to controllers. On first inspection it appears a host may be allowed to retry an IO on an alternate controller immediately in this scenario. I'll consult with some people who have actually implemented this and see if I'm misinterpreting this. It may be that in this example the host would have to (try to) reestablish ig1 and retry that IO to the same controller. If that was the case, we probably wouldn't have a race.
I don't think this pattern is unique to NVMe, or to multi-node block storage servers. Anything with two paths might have commands in flight through one of them after its initiator lost that connection.
Back to your scenario where ig1 and g1c fail near the same time: Whether the host sees ig1 fail before the gateway sees g1c fail or vice-versa, this reduces to the ig1 fail case above.
The idea here is based on an assumption that when the host decides to failover, it will first destroy (i.e. disconnect) from the controller, or that the controller will have a Keep alive time out from this initiator.
When a host disconnects from a controller there's a process that includes completing or aborting all in-flight commands, deleting the IO and admin queues, and terminating the transport connection. When all that's done there will be no in-flight commands in the controller.
Both NVMe/TCP and NVMe/RDMA require the use of keep alive time out (KATO). The length and granularity of KATO are negotiated. The keep alive timer should expire in both the host and controller within the KATO time granularity.
Section 3.9 of the NVMe base spec says that on KATO expire a controller stops processing commands from this transport connection and terminates the transport connection. The spec doesn't say what to do about commands that are still in progress in the controller.
When KATO occurs on the host, section 3.9 says "the host assumes all outstanding commands are not completed and re-issues commands as appropriate". If the host notices the connection has failed before the KATO expires, I don't see a requirement in the spec that the host waits for KATO to expire before it re-issues the commands that were in-flight to that controller. Hosts should probably do that.
Of course if the host notices the connection fail way before KATO expires, and manages to re-establish a connection to that same controller soon after that, I think it can resubmit all those IOs without introducing any races.
We are checking if this assumption is right for Linux initiators. Assuming this is right, we could potentially somehow know from the SPDK that the controller is being destroyed, and then we can check to see if we have any inflight IOs in librbd. If we do have, we can blocklist the volumes that have inflight IOs.
If KATO expires SPDK knows the controller is being destroyed. Until then, unless the host is still connected and starts winding down the connection, I don't see any way for the GW SPDK app to know the host thinks the connection is failed.
In the host-initiated disconnect above the in-flight commands are cleaned up with an NVMe abort command. I've just spent a little while staring at the SPDK code that handles KATO and destroys all the queue pairs. I was looking for where the NVMF code would abort the bdev IOs for in-progress transport requests on a qpair it's destroying. I see where it destroys all the transport request objects, but don't see it invoking the bdev abort function. That's odd, because that seems to leave in-progress bdev IOs with a dangling reference to a destroyed transport request. I must be missing something.
I was hoping this could be the signal we need to handle the in-flight librbd writes. If NVMF invoked the abort function in all the bdev IOs in progress, the RBD bdev could either cancel the librbd requests for those or determine it can't do that within the abort timeout and fence (blocklist) itself. Currently the RBD bdev doesn't implement the abort function.
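For context, here is a rough sketch of what implementing abort in the rbd bdev might look like, assuming SPDK's usual pattern of delivering aborts as SPDK_BDEV_IO_TYPE_ABORT requests through the module's submit path. `bdev_rbd_try_cancel_io` and `bdev_rbd_self_blocklist` are hypothetical helpers introduced only for illustration; today bdev_rbd does not advertise abort support at all.

```c
#include "spdk/bdev_module.h"

/* Hypothetical helpers: try to cancel the librbd request backing 'to_abort',
 * or fence (blocklist) this gateway's RADOS client if that can't be done
 * in time. Neither exists in today's bdev_rbd. */
extern bool bdev_rbd_try_cancel_io(struct spdk_bdev_io *to_abort);
extern void bdev_rbd_self_blocklist(void *rbd_ctx);

/* Sketch of an abort branch in the rbd bdev submit path. */
static void
bdev_rbd_handle_abort(void *rbd_ctx, struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev_io *to_abort = bdev_io->u.abort.bio_to_abort;

	if (bdev_rbd_try_cancel_io(to_abort)) {
		/* librbd request cancelled; complete the original IO as aborted. */
		spdk_bdev_io_complete(to_abort, SPDK_BDEV_IO_STATUS_ABORTED);
		spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_SUCCESS);
	} else {
		/* Can't guarantee the write won't land later: fence ourselves. */
		bdev_rbd_self_blocklist(rbd_ctx);
		spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_FAILED);
	}
}
```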
@sdpeters, thanks for the deep analysis.
From our fail-over experiments: when the host experiences an IO delay issue and has at least 2 available paths, it just internally disconnects GW1 (the GW1 SPDK transport gets socket errors on all QPairs of the controller) and immediately switches to GW2. (My measurements were done using an internal iobench tool that dumps IO statistics every 2 seconds; I haven't seen zero stats.)
I believe the problem may be solved internally in the SPDK bdev; those IOs (XFERS) that are sent to Ceph do not present an issue.
I agree with @sdpeters and think that cancelling outstanding bdev IOs would fix the major data integrity issue during host fail-over.
File nvmf.c code extract, spdk_nvmf_qpair_disconnect:

```c
/* ... */
{
	if (!TAILQ_EMPTY(&qpair->outstanding)) {
		SPDK_DTRACE_PROBE2(nvmf_poll_group_drain_qpair, qpair, spdk_thread_get_id(group->thread));
		qpair->state_cb = _nvmf_qpair_destroy;
		qpair->state_cb_arg = qpair_ctx;
		nvmf_qpair_abort_pending_zcopy_reqs(qpair);
		nvmf_qpair_free_aer(qpair);
		/* <== here we may add a generic handler for cancelling outstanding bdev IOs */
		if (!TAILQ_EMPTY(&qpair->outstanding)) {
			/* Q is not empty after synchronous deletion of AEN and zero-copy IOs */
			/* TODO */
		}
		return 0;
	}
}
```
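As a rough illustration of what such a generic handler could look like, here is a sketch built on the existing NVMe-oF target convention of submitting bdev IOs with the spdk_nvmf_request pointer as their abort tag. `nvmf_qpair_cancel_outstanding_bdev_io` and `nvmf_request_get_bdev_ctx` are hypothetical names; the real code would have to locate the namespace descriptor and IO channel from the poll group state. This is only a sketch, not a proposed patch.

```c
/* Completion callback for the abort IOs themselves. */
static void
abort_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
	spdk_bdev_free_io(bdev_io);
}

/* Hypothetical generic handler: walk the qpair's outstanding requests and
 * ask the backing bdev (bdev_rbd in our case) to abort each one. Assumes the
 * bdev IOs were submitted with the spdk_nvmf_request pointer as bio_cb_arg,
 * so spdk_bdev_abort() can find them by that tag. */
static void
nvmf_qpair_cancel_outstanding_bdev_io(struct spdk_nvmf_qpair *qpair)
{
	struct spdk_nvmf_request *req, *tmp;

	TAILQ_FOREACH_SAFE(req, &qpair->outstanding, link, tmp) {
		struct spdk_bdev_desc *desc;   /* namespace bdev descriptor */
		struct spdk_io_channel *ch;    /* per-poll-group IO channel */

		/* Hypothetical lookup; in real code desc/ch come from the
		 * subsystem poll group state for req's namespace. */
		if (!nvmf_request_get_bdev_ctx(req, &desc, &ch)) {
			continue;
		}

		/* For bdev_rbd, this is where librbd cancellation or a
		 * self-blocklist would have to happen, since it doesn't
		 * implement abort today. */
		spdk_bdev_abort(desc, ch, req, abort_done, NULL);
	}
}
```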
The g1c failure is the simplest. The failure of g1c is not directly visible to the host. If the host submits write(lbaX, data0) via ig1 (to GW1), then g1c fails before the write completes, all the host sees is a write that's still not complete. If GW1 can't restore g1c quickly, it should fail all in-flight commands with one of the NVMe error codes indicating it should be retried on another controller. Of course before it does that it must ensure that any librbd writes it has in flight for those commands will not complete after the fail response is sent to the host.
@sdpeters I think that the main issue is that librbd might have inflight IOs in its queues, and if we could know that the host is going to retry these on another path, we could take action. The issue is - how do we know? That's why we were thinking of relying on the KATO/disconnect, then getting some indication from SPDK, and doing what you describe above.
The retried write(lbaX, data0) starts at GW2 at some time after the original write(lbaX, data0) started on GW1. Normally they should take about the same time to complete, so the original (via GW1) should be completed before the retried write (via GW2) is completed. This is not guaranteed, though.
@sdpeters I think we should not be concerned about this case. I believe that RBD should guarantee the atomicity of a single write (also something we need to confirm). But if it does promise the atomicity of a single write, then I don't see why this case is a concern.
I think it is possible to distinguish a "regular" host disconnect from a disconnect caused by IO being delayed in librbd.
- Add a new "io_start_tick" parameter to the struct spdk_nvmf_request. Initialize it.
- Define a new virtual function "handle_outstanding_io" in rbd_fn_table. This function calls the rbd blocklist if the IO latency is bigger than some defined tolerance (20 sec).
- In the same place that I pointed to in the previous comment (instead of the TODO), the new function nvmf_qpair_handle_outstanding_io is called. It gets the bdev and ns_descriptor from the request.
- nvmf_qpair_handle_outstanding_io calls a function from the bdev module that calls bdev->fn_table->handle_outstanding_io.
- Implement de-bouncing logic in fn_table->handle_outstanding_io for protection from multiple calls for the same bdev.
Concerns: What happens to the outstanding IOs of the QPair after the blocklist is invoked? Would the IO completion callbacks of the nvmf domain be called?
The retried write(lbaX, data0) starts at GW2 at some time after the original write(lbaX, data0) started on GW1. Normally they should take about the same time to complete, so the original (via GW1) should be completed before the retried write (via GW2) is completed. This is not guaranteed, though.
@sdpeters I think we should not be concerned about this case. I believe that RBD should guarantee the atomicity of a single write (also something we need to confirm). But if it does promise the atomicity of a single write, then I don't see why this case is a concern.
@caroav what I meant there was not that the two writes of data0 would interfere with each other, but that the host isn't actually guaranteed the first write of data0 via GW1 is complete when it sees the second attempt to write data0 via GW2 complete. The host needs that guarantee to ensure its write of data1 doesn't race with its first attempt to write data0.
It's very likely that the first write(lbaX, data0) attempt through GW1 will have completed (or failed) by the time the write via GW2 completes, but I don't see how that is guaranteed in NVMe.
For that matter I don't recall how iSCSI or FC guarantee this either. I suspect this issue is addressed somehow but I can't find it documented anywhere.
I think it is possible to distinguish a "regular" host disconnect from a disconnect caused by IO being delayed in librbd.
- Add a new "io_start_tick" parameter to the struct spdk_nvmf_request. Initialize it.
- Define a new virtual function "handle_outstanding_io" in rbd_fn_table. This function calls the rbd blocklist if the IO latency is bigger than some defined tolerance (20 sec).
- In the same place that I pointed to in the previous comment (instead of the TODO), the new function nvmf_qpair_handle_outstanding_io is called. It gets the bdev and ns_descriptor from the request.
- nvmf_qpair_handle_outstanding_io calls a function from the bdev module that calls bdev->fn_table->handle_outstanding_io.
- Implement de-bouncing logic in fn_table->handle_outstanding_io for protection from multiple calls for the same bdev.
@leonidc Something similar already exists: spdk_bdev_set_timeout(). That will invoke a callback on each IO that hasn't completed on the specified bdev within the specified time. That could be used to trigger the self blocklist action.
We probably also want the abort-on-KATO behavior described above. That requires some NVMF changes and the rbd bdev to implement bdev IO abort. This way the bdev can consider taking the self-blocklist action on KATO or an IO timeout, whichever happens first.
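For reference, a minimal sketch of how the timeout hook could be wired up, assuming the rbd bdev module registers it on its own descriptor. `rbd_io_timeout_cb` and `rbd_self_blocklist` are hypothetical names, and restricting the action to writes is a policy choice, not something the API requires.

```c
#include "spdk/bdev.h"

/* Hypothetical: fence this gateway's RADOS client (self-blocklist). */
extern void rbd_self_blocklist(void *rbd_bdev_ctx);

/* Called by the bdev layer for every IO on this descriptor that hasn't
 * completed within the timeout passed to spdk_bdev_set_timeout(). */
static void
rbd_io_timeout_cb(void *cb_arg, struct spdk_bdev_io *bdev_io)
{
	/* Only stuck writes can corrupt data if they land after a failover,
	 * so reads could be left alone (policy choice). */
	if (bdev_io->type == SPDK_BDEV_IO_TYPE_WRITE ||
	    bdev_io->type == SPDK_BDEV_IO_TYPE_WRITE_ZEROES) {
		rbd_self_blocklist(cb_arg);
	}
}

/* Registration, e.g. when the rbd bdev descriptor is opened.
 * 20 seconds matches the tolerance suggested earlier in this thread. */
static int
rbd_register_io_timeout(struct spdk_bdev_desc *desc, void *rbd_bdev_ctx)
{
	return spdk_bdev_set_timeout(desc, 20, rbd_io_timeout_cb, rbd_bdev_ctx);
}
```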
Concerns: What happens to the outstanding IOs of the QPair after the blocklist is invoked? Would the IO completion callbacks of the nvmf domain be called?
Yes, I think the rbd bdev has to complete these bdev IOs with a failure code that NVMF will translate into an NVMe status code that tells the host to retry that IO (like "transient transport error" or "command interrupted"). It's important that these bdev IOs don't result in the NVMe command failing with some permanent failure status.
It looks like a bdev can do that by calling spdk_bdev_io_complete_nvme_status(). That takes some NVMe-specific completion status info and stashes it in the bdev IO structure. NVMF will use that when constructing the NVMe CQE for that command.
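A minimal sketch of what that completion could look like from the rbd bdev side. The exact SPDK enum names for "transient transport error" and "command interrupted" are assumed here and should be double-checked against the SPDK version in use.

```c
#include "spdk/bdev_module.h"
#include "spdk/nvme_spec.h"

/* Complete a bdev IO so that NVMF turns it into a retryable NVMe status
 * rather than a permanent error. DNR must not end up set, otherwise the
 * host won't retry the command on another path. */
static void
rbd_fail_io_retryable(struct spdk_bdev_io *bdev_io)
{
	/* Generic status type, "Transient Transport Error" (NVMe 1.4+).
	 * "Command Interrupted" (SPDK_NVME_SC_COMMAND_INTERRUPTED) would be
	 * another candidate. */
	spdk_bdev_io_complete_nvme_status(bdev_io, 0,
					  SPDK_NVME_SCT_GENERIC,
					  SPDK_NVME_SC_TRANSIENT_TRANSPORT_ERROR);
}
```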
This self blocklist action will affect all in-flight operations via that bdev. If multiple hosts are connected to that NVMe namespace, they will all be affected. So it seems unavoidable that when a namespace is used by multiple hosts, a connection failure from one of them will impact all the IOs in flight from all the other hosts via that gateway. Best case they all get IO failures indicating a retry should be attempted, and that retry quickly succeeds. It's TBD whether they should try the same controller again (how long till it's ready to accept more IO for that RBD image?), or use another path.
Does the blocklist action affect a single RBD image from a specific process in a specific node, or is its effect wider? Would all the rbd bdevs sharing a single cluster context be blocked? If so, a connection failure from one host to a gateway could impact IO from many hosts to many different namespaces via that gateway.
Remote 'blocklist'. I heard @idryomov raising that idea, but I'm not sure how it should work. Ilya can you elaborate.
@caroav In your example, "remote" blocklist would refer to GW2 performing the blocklist before serving any failed over writes. The reason I raised it is that self-blocklist (i.e. GW1 performing the blocklist) is really unreliable. There are two fundamental issues:
- In order to add an entry to the blocklist table -- update the osdmap -- one needs to have connectivity to the Ceph cluster (specifically to the monitor(s)).
- Just adding an entry to the blocklist table is not enough. One also needs to ensure that all OSDs get the updated osdmap before processing any post-blocklist I/O. This is because osdmaps are distributed via a gossip protocol -- at the time the `ceph osd blocklist add` command returns, none of the OSDs may have the updated osdmap yet and would therefore happily process pre-blocklist I/O.
So in the example timeflow:
1. Initiator sends to GW1 io: write (lbaX, data0)
2. Link g1c goes down, GW1 is still holding the io (lbaX, data0) in the Rados/RBD lib
3. Link ig1 goes down, failover trigger
4. Initiator performs failover, connects to GW2 via ig2
5. Initiator, via GW2, re-queues the uncommitted (lbaX, data0) and writes new data (lbaX, data1)
6. Ceph commits both io writes, Ceph state (lbaX, data1)
7. Link g1c goes up, Rados/RBD lib on GW1 commits to Ceph io: write (lbaX, data0), which got stuck in step #2
self-blocklist would go through only at this point and very likely "behind" the problematic write, meaning that
Ceph commits the IO, Ceph state (lbaX, data0) ← data corruption
would still happen. And I'm not even considering more exotic net splits where GW1 can talk to (some) OSDs but not the monitors, etc.
Does the blocklist action affect a single RBD image from a specific process in a specific node, or is its effect wider? Would all the rbd bdevs sharing a single cluster context be blocked? If so, a connection failure from one host to a gateway could impact IO from many hosts to many different namespaces via that gateway.
@sdpeters Blocklist is not tied to a process or a node -- it affects a RADOS client instance (~ cluster context in SPDK parlance). One can have multiple RADOS clients instantiated in a single process, but it's not free (mainly in terms of the number of threads, but also file descriptors, sockets, etc).
If all RBD bdevs are sharing a single RADOS client instance, then all of them would be blocked.
I believe that RBD should guarantee the atomicity of a single write (also something we need to confirm).
@caroav RBD doesn't provide such a guarantee in the general case. Any two-sector (and larger) write can straddle a RADOS object boundary, unless it's naturally aligned. This is handled with (at least) two separate RADOS ops, potentially going to separate OSDs. In the face of multiple writers on the client side that is definitely not atomic, by design.
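For example, with the default 4 MiB RBD object size, an 8 KiB write that starts 4 KiB before an object boundary is split into two RADOS write ops, one for the tail of the first object and one for the head of the next, and those two objects may live on different OSDs; either half can become durable before the other.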
There's an NVMe technical proposal in progress that seems to address most or all of the gaps in the KATO mechanism noted here. I'll send the details by email to the NVM Express members here.
Some future version of the NVMe spec may guarantee that the gateway controller will have a given amount of time to stop processing commands after it detects KATO, and the host may not be allowed to resubmit those IOs to another controller before then.
That would mean the gateway controller would definitely have some time to act. It should at least be enough time to see if any of the incomplete librbd IOs were writes. Only if there are writes does it need to take any action.
If it can't blocklist itself, and can't ask another gateway to blocklist it, what options are left?
Just adding an entry to the blocklist table is not enough. One also needs to ensure that all OSDs get the updated osdmap before processing any post-blocklist I/O. This is because osdmaps are distributed via a gossip protocol -- at the time the `ceph osd blocklist add` command returns, none of the OSDs may have the updated osdmap yet and would therefore happily process pre-blocklist I/O.
@idryomov is there any way to ensure that via the API? If there is no way, then why does it matter whether GW1 or GW2 is doing it? If there is a way to ensure that, then for self fencing we can check, and if there is no certainty, we can ask the GW to commit suicide? That would probably kill any pending IO in librbd?
self-blocklist would go through only at this point and very likely "behind" the problematic write, meaning that
@idryomov can you explain why? It depends on host failover behavior. We will do some testing and look at the host code.
I guess we can consider another solution, which is maybe what @idryomov meant. Assuming that GW1 is reporting optimized for that NS, and GW2 is reporting non-optimized for that NS, then the host will never write through the non-optimized path unless something bad happened to the optimized one. If this is true, then on the first IO coming through the non-optimized path, we could blocklist GW1. Ugly enough, but maybe it works for failover? Then what are we going to do on failback? I guess that the first IO on failback to GW1 will need to blocklist GW2.
What about resetting the ceph context (or destroying and creating a new one) instead of blocklisting? This should be enough to abort all pending requests, but as we use a single ceph context per gateway it will abort all the requests. We can have a ceph context per namespace, but this uses lots of resources and will limit the number of namespaces per gateway we can support. Another option @caroav suggested is having a ceph context per initiator; this will solve the resources concern and allow us to simply abort all pending I/Os per initiator. This will require a change to bdev_rbd to support accessing the same namespace through different initiators (required by vSphere clusters, for example).
Another option @caroav suggested is having a ceph context per initiator; this will solve the resources concern and allow us to simply abort all pending I/Os per initiator. This will require a change to bdev_rbd to support accessing the same namespace through different initiators (required by vSphere clusters, for example).
The SPDK bdev interface doesn't provide a mechanism to pass caller-specific properties like this. Here NVMF is the caller of the rbd bdev, and we'd like the rbd bdev to know what NVMe-oF host submitted this IO to NVMF. We'd need some new SPDK plumbing to enable this in a general way.
@idryomov is there any way to ensure that via the API? If there is no way, then why does it matter whether GW1 or GW2 is doing it?
Yes, after adding an entry to the blocklist table with `rados_mon_command()`, one can call `rados_wait_for_latest_osdmap()` to ensure that that RADOS client gets the post-blocklist osdmap. From that point on, using that RADOS client for submitting I/O is safe -- if an I/O reaches an OSD which doesn't have the post-blocklist osdmap, the OSD would know that it needs to refresh its osdmap before processing the I/O.
Note the emphasis on that. If GW1 blocklists itself and calls `rados_wait_for_latest_osdmap()`, that would have no effect on GW2. If an I/O sent by GW2 reaches an OSD which doesn't have the post-blocklist osdmap, the OSD would happily process it, thus violating the barrier.
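To make the call order concrete, here is a hedged sketch of the blocklist-then-barrier sequence using the librados C API. The exact JSON fields of the mon command should be checked against the Ceph release in use, and `gw_addr` stands for the entity address of the client instance being fenced.

```c
#include <rados/librados.h>
#include <stdio.h>

/* Blocklist a client address and wait until *this* RADOS client has the
 * post-blocklist osdmap. As explained above, the barrier only covers I/O
 * submitted through 'cluster' itself, not I/O from other gateways. */
static int
blocklist_and_barrier(rados_t cluster, const char *gw_addr)
{
	char cmd[256];
	const char *cmds[1];
	char *outbuf = NULL, *outs = NULL;
	size_t outbuf_len = 0, outs_len = 0;
	int rc;

	/* Equivalent of "ceph osd blocklist add <addr>"; field names assumed. */
	snprintf(cmd, sizeof(cmd),
		 "{\"prefix\": \"osd blocklist\", \"blocklistop\": \"add\", "
		 "\"addr\": \"%s\"}", gw_addr);
	cmds[0] = cmd;

	rc = rados_mon_command(cluster, cmds, 1, "", 0,
			       &outbuf, &outbuf_len, &outs, &outs_len);
	rados_buffer_free(outbuf);
	rados_buffer_free(outs);
	if (rc < 0) {
		return rc;
	}

	/* Barrier: don't submit any post-blocklist I/O before this returns. */
	return rados_wait_for_latest_osdmap(cluster);
}
```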
If there is a way to ensure that, then for self fencing we can check, and if there is no certainty, we can ask the GW to commit suicide? That would probably kill any pending IO in librbd?
"Asking" the gateway to commit suicide would take care of anything pending in librbd or librados but wouldn't affect anything pending in the lower layers (kernel socket buffers and/or the network itself).
self-blocklist would go through only at this point and very likely "behind" the problematic write, meaning that
@idryomov can you explain why? It depends on host failover behavior. We will do some testing and look at the host code.
Why would self-blocklist go through only at that point? Because that point immediately follows the "link g1c goes up" step in your example.
Why would self-blocklist likely go "behind" the problematic write? Because it's a losing race: the write op would be going to an OSD while the blocklist command would be going to a monitor. Even if the blocklist command makes it first, the updated osdmap would still need to propagate to the OSDs. In very rough terms, it's one hop vs at least two.
What about resetting the ceph context (or destroying and creating a new one) instead of blocklisting? This should be enough to abort all pending requests, but as we use a single ceph context per gateway it will abort all the requests.
There is no way to "reset" the cluster context. Even if there was a "I want to destroy the cluster context object, make it happen no matter what" operation, it wouldn't abort ops that are already in-flight (as opposed to ops just sitting in the queue, which is of course easy to do).