
kubevirt: allow-two-primaries must be set to perform live-migration

Open kvaps opened this issue 3 years ago • 7 comments

To reproduce:

Install KubeVirt and enable the LiveMigration and HotplugVolumes feature gates:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - LiveMigration
      - HotplugVolumes
  imagePullPolicy: IfNotPresent

Create VM:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  name: testvm1
  namespace: default
spec:
  running: true
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - masquerade: {}
            name: default
        machine:
          type: q35
        resources:
          requests:
            memory: 1024M
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - containerDisk:
          image: kubevirt/fedora-cloud-container-disk-demo:latest
        name: containerdisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk

Create volume:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: linstor-data-r2
  resources:
    requests:
      storage: 100Gi

Attach the volume to the VM:

virtctl addvolume testvm1 --volume-name my-pvc

Run the live migration:

virtctl migrate testvm1

You'll see that the hotplug-volume pod fails to start on the second node:

hp-volume-29d84               1/1     Running             0          2m4s
hp-volume-6lvvm               0/1     ContainerCreating   0          77s

Due to:

Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               86s               default-scheduler        Successfully assigned default/hp-volume-6lvvm to hf-kubevirt-01
  Normal   SuccessfulAttachVolume  85s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2a81bf2c-65dd-4782-b011-3c37af658575"
  Warning  FailedMapVolume         5s (x8 over 70s)  kubelet                  MapVolume.MapPodDevice failed for volume "pvc-2a81bf2c-65dd-4782-b011-3c37af658575" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-2a81bf2c-65dd-4782-b011-3c37af658575: failed to set source device readwrite: exit status 1

If you exec into the csi-node pod and run this command, you'll see the error:

# blockdev --setrw /dev/drbd1000
blockdev: cannot open /dev/drbd1000: Wrong medium type

But after adding this option:

linstor rd sp pvc-2a81bf2c-65dd-4782-b011-3c37af658575 DrbdOptions/Net/allow-two-primaries yes

everything starts working as it should.

Thus I think we need to set allow-two-primaries automatically for any volume with accessMode: ReadWriteMany and volumeMode: Block. We have two options for doing that:

  • Set it on volume creation
  • Set it on volume attachment

Which option do you prefer?
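In the meantime, the property can be toggled by hand around a migration. A minimal sketch with the linstor CLI, using the resource-definition name from this report (adjust for your PV); whether to revert to "no" afterwards is exactly the open question here:

```shell
# Hedged sketch: manually toggle dual-primary around a live migration.
# Resource-definition name taken from this report; adjust for your PV.
RD=pvc-2a81bf2c-65dd-4782-b011-3c37af658575

# Before migrating: allow both nodes to be DRBD Primary at once.
linstor resource-definition set-property "$RD" DrbdOptions/Net/allow-two-primaries yes

# ... run: virtctl migrate testvm1 ...

# After the migration has completed: back to single-primary operation.
linstor resource-definition set-property "$RD" DrbdOptions/Net/allow-two-primaries no
```

`linstor rd sp` (used elsewhere in this thread) is the short form of `linstor resource-definition set-property`; these commands require a working LINSTOR cluster, so this is a CLI fragment rather than a runnable script.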

kvaps avatar Apr 06 '22 23:04 kvaps

This was like 100 years ago, but back then I just set it on the storage class. Does this not work any more? Not for you? https://github.com/piraeusdatastore/linstor-csi/tree/master/examples/kubevirt

rck avatar Apr 07 '22 06:04 rck

First thing: I'm surprised it even lets you have a ReadWriteMany resource at all...

Secondly: I'd prefer it if the user had to explicitly opt-in (maybe through a new enable-live-migration parameter on the storage class?). Then, on volume attach, if CSI can reasonably think it is in a live-migration situation: enable allow-two-primaries. And on detach on the old node: set it back to false.
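The opt-in might look like this on the storage class; note that enable-live-migration is the hypothetical parameter name floated above, not an existing linstor-csi option:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-data-r2
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/storagePool: data
  # Hypothetical opt-in: allow-two-primaries would then be toggled
  # automatically by the CSI driver only during a live migration.
  linstor.csi.linbit.com/enable-live-migration: "true"
```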

WanzenBug avatar Apr 07 '22 06:04 WanzenBug

@rck sorry, I didn't see that doc, but unfortunately this is not working for me anyway:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-data-r2
parameters:
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/storagePool: data
  DrbdOptions/Net/allow-two-primaries: "yes"
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: linstor-data-r2
  resources:
    requests:
      storage: 100Gi

# linstor rd lp pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a
╭─────────────────────────────────────────────────────────╮
┊ Key                               ┊ Value               ┊
╞═════════════════════════════════════════════════════════╡
┊ Aux/csi-provisioning-completed-by ┊ linstor-csi/v0.18.0 ┊
┊ DrbdOptions/Resource/on-no-quorum ┊ io-error            ┊
┊ DrbdOptions/Resource/quorum       ┊ majority            ┊
┊ DrbdOptions/auto-verify-alg       ┊ crct10dif-pclmul    ┊
┊ DrbdPrimarySetOn                  ┊ HF-KUBEVIRT-01      ┊
╰─────────────────────────────────────────────────────────╯

kvaps avatar Apr 07 '22 09:04 kvaps

Have you checked the actual drbd resource (drbdsetup show --show-defaults)? Because whatever you set in the storage class is put on the resource group and inherited by the resource definition.

WanzenBug avatar Apr 07 '22 09:04 WanzenBug

I'd prefer it if the user had to explicitly opt-in (maybe through a new enable-live-migration parameter on the storage class?). Then, on volume attach, if CSI can reasonably think it is in a live-migration situation: enable allow-two-primaries. And on detach on the old node: set it back to false.

@WanzenBug thus, I think it's better to implement it the following way:

  • nodePublishVolume: add a check:

    • if the volume has volumeMode: Block and accessMode: ReadWriteMany;
    • if it is already InUse on another node;

    then enable the allow-two-primaries flag

  • nodeUnpublishVolume:

    • if allow-two-primaries is set;

    then remove the allow-two-primaries flag
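The proposed check could be sketched in shell. The embedded table stands in for `linstor resource list` output (the three columns — resource, node, state — are an assumed simplification of the real table), and `in_use_elsewhere` is a hypothetical helper, not part of linstor-csi:

```shell
#!/bin/sh
# Hedged sketch of the proposed nodePublishVolume check: if the resource is
# already InUse (DRBD Primary) on another node, enable allow-two-primaries.

# Stand-in for `linstor resource list -r <name>` output (assumed format:
# resource, node, state).
sample_table() {
cat <<'EOF'
pvc-e9e280d1 hf-kubevirt-01 InUse
pvc-e9e280d1 hf-kubevirt-02 Unused
pvc-e9e280d1 hf-kubevirt-03 Unused
EOF
}

# $1 = resource name, $2 = local node.
# Succeeds (exit 0) if some *other* node currently holds the resource InUse.
in_use_elsewhere() {
  sample_table | awk -v me="$2" '$2 != me && $3 == "InUse" { found=1 } END { exit !found }'
}

if in_use_elsewhere pvc-e9e280d1 hf-kubevirt-02; then
  echo "would run: linstor rd sp pvc-e9e280d1 DrbdOptions/Net/allow-two-primaries yes"
fi
```

On nodeUnpublishVolume the inverse would apply: if the property is set, remove it so the resource returns to single-primary operation.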

kvaps avatar Apr 07 '22 09:04 kvaps

Have you checked the actual drbd resource (drbdsetup show --show-defaults)? Because whatever you set in the storage class is put on the resource group and inherited by the resource definition

Yeah, my bad, it's there:

linstor rg lp sc-5c7dc25c-6740-5d79-b635-df965659aa11
╭─────────────────────────────────────────────╮
┊ Key                                 ┊ Value ┊
╞═════════════════════════════════════════════╡
┊ DrbdOptions/Net/allow-two-primaries ┊ yes   ┊
┊ StorPoolName                        ┊ data  ┊
╰─────────────────────────────────────────────╯
drbdsetup show --show-defaults pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a
resource "pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a" {
    options {
        cpu-mask        	""; # default
        on-no-data-accessible	io-error; # default
        auto-promote    	yes; # default
        peer-ack-window 	4096s; # bytes, default
        peer-ack-delay  	100; # milliseconds, default
        twopc-timeout   	300; # 1/10 seconds, default
        twopc-retry-timeout	1; # 1/10 seconds, default
        auto-promote-timeout	20; # 1/10 seconds, default
        max-io-depth    	8000; # default
        quorum          	majority;
        on-no-quorum    	io-error;
        quorum-minimum-redundancy	off; # default
    }
    _this_host {
        node-id			0;
        volume 0 {
            device			minor 1000;
            disk			"/dev/linstor_data/pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a_00000";
            meta-disk			internal;
            disk {
                size            	0s; # bytes, default
                on-io-error     	detach; # default
                disk-barrier    	no; # default
                disk-flushes    	yes; # default
                disk-drain      	yes; # default
                md-flushes      	yes; # default
                resync-after    	-1; # default
                al-extents      	1237; # default
                al-updates      	yes; # default
                discard-zeroes-if-aligned	yes; # default
                disable-write-same	no; # default
                disk-timeout    	0; # 1/10 seconds, default
                read-balancing  	prefer-local; # default
                rs-discard-granularity	0; # bytes, default
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 192.168.242.35:7004;
            _remote_host ipv4 192.168.242.38:7004;
        }
        net {
            transport       	""; # default
            protocol        	C; # default
            timeout         	60; # 1/10 seconds, default
            max-epoch-size  	2048; # default
            connect-int     	10; # seconds, default
            ping-int        	10; # seconds, default
            sndbuf-size     	0; # bytes, default
            rcvbuf-size     	0; # bytes, default
            ko-count        	7; # default
            allow-two-primaries	yes;
            cram-hmac-alg   	"sha1";
            shared-secret   	"2qySzH5tKuuFF65JoUdY";
            after-sb-0pri   	disconnect; # default
            after-sb-1pri   	disconnect; # default
            after-sb-2pri   	disconnect; # default
            always-asbp     	no; # default
            rr-conflict     	disconnect; # default
            ping-timeout    	5; # 1/10 seconds, default
            data-integrity-alg	""; # default
            tcp-cork        	yes; # default
            on-congestion   	block; # default
            congestion-fill 	0s; # bytes, default
            congestion-extents	1237; # default
            csums-alg       	""; # default
            csums-after-crash-only	no; # default
            verify-alg      	"crct10dif-pclmul";
            use-rle         	yes; # default
            socket-check-timeout	0; # default
            fencing         	dont-care; # default
            max-buffers     	2048; # default
            allow-remote-read	yes; # default
            _name           	"hf-kubevirt-02";
        }
        volume 0 {
            disk {
                resync-rate     	250k; # bytes/second, default
                c-plan-ahead    	20; # 1/10 seconds, default
                c-delay-target  	10; # 1/10 seconds, default
                c-fill-target   	100s; # bytes, default
                c-max-rate      	102400k; # bytes/second, default
                c-min-rate      	250k; # bytes/second, default
                bitmap          	yes; # default
            }
        }
    }
    connection {
        _peer_node_id 2;
        path {
            _this_host ipv4 192.168.242.35:7004;
            _remote_host ipv4 192.168.242.37:7004;
        }
        net {
            transport       	""; # default
            protocol        	C; # default
            timeout         	60; # 1/10 seconds, default
            max-epoch-size  	2048; # default
            connect-int     	10; # seconds, default
            ping-int        	10; # seconds, default
            sndbuf-size     	0; # bytes, default
            rcvbuf-size     	0; # bytes, default
            ko-count        	7; # default
            allow-two-primaries	yes;
            cram-hmac-alg   	"sha1";
            shared-secret   	"2qySzH5tKuuFF65JoUdY";
            after-sb-0pri   	disconnect; # default
            after-sb-1pri   	disconnect; # default
            after-sb-2pri   	disconnect; # default
            always-asbp     	no; # default
            rr-conflict     	disconnect; # default
            ping-timeout    	5; # 1/10 seconds, default
            data-integrity-alg	""; # default
            tcp-cork        	yes; # default
            on-congestion   	block; # default
            congestion-fill 	0s; # bytes, default
            congestion-extents	1237; # default
            csums-alg       	""; # default
            csums-after-crash-only	no; # default
            verify-alg      	"crct10dif-pclmul";
            use-rle         	yes; # default
            socket-check-timeout	0; # default
            fencing         	dont-care; # default
            max-buffers     	2048; # default
            allow-remote-read	yes; # default
            _name           	"hf-kubevirt-03";
        }
        volume 0 {
            disk {
                resync-rate     	250k; # bytes/second, default
                c-plan-ahead    	20; # 1/10 seconds, default
                c-delay-target  	10; # 1/10 seconds, default
                c-fill-target   	100s; # bytes, default
                c-max-rate      	102400k; # bytes/second, default
                c-min-rate      	250k; # bytes/second, default
                bitmap          	no;
            }
        }
    }
}

kvaps avatar Apr 07 '22 09:04 kvaps

  • nodePublishVolume: add a check: if the volume has volumeMode: Block and accessMode: ReadWriteMany, and it is already InUse on another node, then enable the allow-two-primaries flag

  • nodeUnpublishVolume: if allow-two-primaries is set, then remove the allow-two-primaries flag

I'd probably put it in controllerPublishVolume/controllerUnpublishVolume. The reason is that I don't want to make any unnecessary API requests from the node agents (ideally, they wouldn't need to talk to LINSTOR at all, but we are not there yet, I think).

WanzenBug avatar Apr 07 '22 10:04 WanzenBug