
Data caching problem during node stage/unstage and publish/unpublish

Open jingxu97 opened this issue 6 years ago • 4 comments

The current CSI spec does not have any requirements related to data caching. One problem we noticed from Windows GCE PD tests is that after a pod writes data to a file and gets killed, the data might not yet have been written to the disk before the disk gets detached.

In this issue we want to start a discussion on whether it is the driver's responsibility to flush data during NodeUnpublishVolume or NodeUnstageVolume before the volume is detached from the node.

jingxu97 avatar Sep 06 '19 23:09 jingxu97

cc @msau @ddebroy @KnicKnic

jingxu97 avatar Sep 06 '19 23:09 jingxu97

I think it is the responsibility of the application running in the container to guarantee that its data is flushed, and that the volume should respect the application's flush calls. However, there are bugs that occur from relying on NodeUnpublishVolume to flush the data.

The reason I believe this is that there is no guarantee the node does not crash after the application completes and before the flush call occurs.

Scenario: I schedule a job to write some data to a PV. The job exits successfully, so I assume it has completed and my data has been fully committed. However, if that app relies on the unmount in NodeUnpublishVolume to flush the data and the node crashes before it reaches that stage, then I am out of luck.

Second issue: consider VM-isolated containers, such as Kata containers, where you take a block device, attach it to the VM, and mount it inside the VM before giving it to containers. There will be a file system cache inside the VM. If the VM fails or aborts, we cannot rely on NodeUnpublishVolume to flush this cache.

KnicKnic avatar Sep 07 '19 01:09 KnicKnic

In order to be resilient to node crashes/ungraceful volume detach scenarios, I agree with @KnicKnic that an application needs to either [1] call appropriate OS/file-system specific buffer flush APIs at suitable intervals (as done by databases) or [2] initialize file handles with appropriate parameters to make sure all writes are persisted to disk when the write API calls return successfully.
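To make [1] concrete, here is a minimal Go sketch (the file path is hypothetical) of an application flushing its own writes before reporting success; File.Sync maps to fsync(2) on Linux and to FlushFileBuffers on Windows:

```go
// Minimal sketch: the application, not the CSI driver, makes its own
// writes durable before reporting success. The path is hypothetical.
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("/mnt/data/result.txt", os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}

	if _, err := f.Write([]byte("job output\n")); err != nil {
		log.Fatal(err)
	}

	// Flush file data to the underlying device: fsync(2) on Linux,
	// FlushFileBuffers on Windows.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
	// Only at this point can the job reasonably be treated as durably
	// completed, independent of what NodeUnpublishVolume does later.
}
```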

For graceful NodeUnpublishVolume (as well as volume detach for in-tree plugins), at least today, an OS-specific dismount call (that syncs all cached data for the volume) is issued for Linux nodes:

  1. In-tree, typically through CleanupMountPoint -> mounter.Unmount, which ends up here
  2. In a CSI driver like GCE-PD, which effectively calls the above (a rough sketch of this pattern follows the list).
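For illustration, a NodeUnpublishVolume handler following this pattern might look like the sketch below. This is not the actual GCE-PD code; it assumes the CleanupMountPoint helper from k8s.io/mount-utils and omits gRPC plumbing and error wrapping:

```go
// Rough sketch of a graceful NodeUnpublishVolume along the lines above,
// assuming k8s.io/mount-utils; gRPC registration and error wrapping omitted.
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
)

type nodeServer struct {
	mounter mount.Interface // e.g. mount.New("")
}

func (ns *nodeServer) NodeUnpublishVolume(
	ctx context.Context,
	req *csi.NodeUnpublishVolumeRequest,
) (*csi.NodeUnpublishVolumeResponse, error) {
	targetPath := req.GetTargetPath()

	// CleanupMountPoint unmounts the target path (letting the kernel write
	// back any dirty pages for that filesystem) and removes the directory.
	if err := mount.CleanupMountPoint(targetPath, ns.mounter, true /* extensiveMountPointCheck */); err != nil {
		return nil, err
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}
```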

As @jingxu97 pointed out, the above is absent for Windows. I have added this to the upcoming SIG-Storage agenda as a discussion point: why Linux chose explicit dismounts (it does "feel" like the right thing in graceful scenarios) and whether Windows should align with that. I will update this issue with the discussion notes.

ddebroy avatar Sep 07 '19 23:09 ddebroy

I tend to agree that the application + OS bear most of the responsibility here. It's not clear to me that CSI should make any guarantees w/ respect to "flush to disk" on a per-workload-lifecycle basis (node publish/unpublish). CSI could make a recommendation that node-unstage should execute flush on a best-effort basis, but that's still not reliable. Plus, node-stage/unstage are optional and there's no guarantee re: how soon node-unstage will be called w/ respect to any particular workload (it could be immediate, it could be days later during some GC operation executed by the CO).
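As an illustration of what such a best-effort flush at node-unstage could look like (purely a sketch, not something the spec requires; the syncfs-before-unmount approach and the helper name are assumptions):

```go
// Purely illustrative: a best-effort flush before unmounting the staging
// path during NodeUnstageVolume. A sync failure is logged, not fatal,
// which is exactly why this cannot serve as a durability guarantee.
package driver

import (
	"log"

	"golang.org/x/sys/unix"
	mount "k8s.io/mount-utils"
)

func bestEffortFlushAndUnstage(stagingPath string, mounter mount.Interface) error {
	// syncfs(2) writes back dirty data for the filesystem containing fd.
	if fd, err := unix.Open(stagingPath, unix.O_RDONLY|unix.O_DIRECTORY, 0); err == nil {
		if err := unix.Syncfs(fd); err != nil {
			log.Printf("best-effort syncfs on %s failed: %v", stagingPath, err)
		}
		unix.Close(fd)
	}

	// The unmount itself is what actually has to happen here.
	return mount.CleanupMountPoint(stagingPath, mounter, true)
}
```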

For applications where data sync is critical, it's up to the app/OS developers to fix, e.g. https://lwn.net/Articles/752063/

jdef avatar Sep 10 '19 13:09 jdef