linstor-csi
kubevirt: allow-two-primaries must be set to perform live-migration
To reproduce:

Install KubeVirt and enable the `LiveMigration` and `HotplugVolumes` feature gates:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - LiveMigration
        - HotplugVolumes
  imagePullPolicy: IfNotPresent
```
Create VM:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1alpha3
  name: testvm1
  namespace: default
spec:
  running: true
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          disks:
            - disk:
                bus: virtio
              name: containerdisk
            - disk:
                bus: virtio
              name: cloudinitdisk
          interfaces:
            - masquerade: {}
              name: default
        machine:
          type: q35
        resources:
          requests:
            memory: 1024M
      networks:
        - name: default
          pod: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - containerDisk:
            image: kubevirt/fedora-cloud-container-disk-demo:latest
          name: containerdisk
        - cloudInitNoCloud:
            userData: |-
              #cloud-config
              password: fedora
              chpasswd: { expire: False }
          name: cloudinitdisk
```
Create volume:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: linstor-data-r2
  resources:
    requests:
      storage: 100Gi
```
Attach the volume to the VM:

```
virtctl addvolume testvm1 --volume-name my-pvc
```
Run live migration:

```
virtctl migrate testvm1
```
You'll see that the hotplug-volume pod fails to start on the second node:

```
hp-volume-29d84   1/1   Running             0   2m4s
hp-volume-6lvvm   0/1   ContainerCreating   0   77s
```
Due to:

```
Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               86s               default-scheduler        Successfully assigned default/hp-volume-6lvvm to hf-kubevirt-01
  Normal   SuccessfulAttachVolume  85s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2a81bf2c-65dd-4782-b011-3c37af658575"
  Warning  FailedMapVolume         5s (x8 over 70s)  kubelet                  MapVolume.MapPodDevice failed for volume "pvc-2a81bf2c-65dd-4782-b011-3c37af658575" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-2a81bf2c-65dd-4782-b011-3c37af658575: failed to set source device readwrite: exit status 1
```
If you exec into the csi-node pod and run this command, you'll see the error:

```
# blockdev --setrw /dev/drbd1000
blockdev: cannot open /dev/drbd1000: Wrong medium type
```
But after adding this option:

```
linstor rd sp pvc-2a81bf2c-65dd-4782-b011-3c37af658575 DrbdOptions/Net/allow-two-primaries yes
```

everything starts working as it should.
Thus I think we need to set `allow-two-primaries` automatically for any volume with `accessMode: ReadWriteMany` and `type: Block`. We have two options for doing that:

- Set it on volume creation
- Set it on volume attachment

Which option do you prefer?
This is like 100 years ago, but back then I just set it on the storage class. Does this not work any more? Not for you? https://github.com/piraeusdatastore/linstor-csi/tree/master/examples/kubevirt
First thing: I'm surprised it even lets you have a ReadWriteMany resource at all...
Secondly: I'd prefer it if the user had to explicitly opt-in (maybe through a new enable-live-migration parameter on the storage class?). Then, on volume attach, if CSI can reasonably think it is in a live-migration situation: enable allow-two-primaries. And on detach on the old node: set it back to false.
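To illustrate, the opt-in could be a single extra StorageClass parameter. Note this is only a sketch of the suggestion above: `enable-live-migration` is a hypothetical name, not an existing linstor-csi option.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-data-r2
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/storagePool: data
  # hypothetical opt-in: with this set, the driver would enable
  # allow-two-primaries on attach during a detected live migration,
  # and set it back to "no" on detach from the old node
  linstor.csi.linbit.com/enable-live-migration: "true"
```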
@rck sorry, I didn't see that doc, but unfortunately this is not working for me anyway:

```yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-data-r2
parameters:
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/storagePool: data
  DrbdOptions/Net/allow-two-primaries: "yes"
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: linstor-data-r2
  resources:
    requests:
      storage: 100Gi
```
```
# linstor rd lp pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a
╭─────────────────────────────────────────────────────────╮
┊ Key                               ┊ Value               ┊
╞═════════════════════════════════════════════════════════╡
┊ Aux/csi-provisioning-completed-by ┊ linstor-csi/v0.18.0 ┊
┊ DrbdOptions/Resource/on-no-quorum ┊ io-error            ┊
┊ DrbdOptions/Resource/quorum       ┊ majority            ┊
┊ DrbdOptions/auto-verify-alg       ┊ crct10dif-pclmul    ┊
┊ DrbdPrimarySetOn                  ┊ HF-KUBEVIRT-01      ┊
╰─────────────────────────────────────────────────────────╯
```
Have you checked the actual DRBD resource (`drbdsetup show --show-defaults`)? Because whatever you set in the storage class is put on the resource group and inherited by the resource definition.
> I'd prefer it if the user had to explicitly opt-in (maybe through a new enable-live-migration parameter on the storage class?). Then, on volume attach, if CSI can reasonably think it is in a live-migration situation: enable allow-two-primaries. And on detach on the old node: set it back to false.
@WanzenBug thus, I think it's better to implement it the next way:

- `nodePublishVolume`: add check:
  - if volume has `type: Block` and `accessMode: ReadWriteMany`;
  - if it is already `InUse` at one other node;

  then enable the `allow-two-primaries` flag
- `nodeUnpublishVolume`:
  - if `allow-two-primaries` is set;

  then remove the `allow-two-primaries` flag
> Have you checked the actual DRBD resource (`drbdsetup show --show-defaults`)? Because whatever you set in the storage class is put on the resource group and inherited by the resource definition.
Yeah, my bad, it's there:

```
# linstor rg lp sc-5c7dc25c-6740-5d79-b635-df965659aa11
╭─────────────────────────────────────────────╮
┊ Key                                 ┊ Value ┊
╞═════════════════════════════════════════════╡
┊ DrbdOptions/Net/allow-two-primaries ┊ yes   ┊
┊ StorPoolName                        ┊ data  ┊
╰─────────────────────────────────────────────╯
```
```
# drbdsetup show --show-defaults pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a
resource "pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a" {
    options {
        cpu-mask ""; # default
        on-no-data-accessible io-error; # default
        auto-promote yes; # default
        peer-ack-window 4096s; # bytes, default
        peer-ack-delay 100; # milliseconds, default
        twopc-timeout 300; # 1/10 seconds, default
        twopc-retry-timeout 1; # 1/10 seconds, default
        auto-promote-timeout 20; # 1/10 seconds, default
        max-io-depth 8000; # default
        quorum majority;
        on-no-quorum io-error;
        quorum-minimum-redundancy off; # default
    }
    _this_host {
        node-id 0;
        volume 0 {
            device minor 1000;
            disk "/dev/linstor_data/pvc-e9e280d1-b648-40d5-a4bf-35bd976fa24a_00000";
            meta-disk internal;
            disk {
                size 0s; # bytes, default
                on-io-error detach; # default
                disk-barrier no; # default
                disk-flushes yes; # default
                disk-drain yes; # default
                md-flushes yes; # default
                resync-after -1; # default
                al-extents 1237; # default
                al-updates yes; # default
                discard-zeroes-if-aligned yes; # default
                disable-write-same no; # default
                disk-timeout 0; # 1/10 seconds, default
                read-balancing prefer-local; # default
                rs-discard-granularity 0; # bytes, default
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 192.168.242.35:7004;
            _remote_host ipv4 192.168.242.38:7004;
        }
        net {
            transport ""; # default
            protocol C; # default
            timeout 60; # 1/10 seconds, default
            max-epoch-size 2048; # default
            connect-int 10; # seconds, default
            ping-int 10; # seconds, default
            sndbuf-size 0; # bytes, default
            rcvbuf-size 0; # bytes, default
            ko-count 7; # default
            allow-two-primaries yes;
            cram-hmac-alg "sha1";
            shared-secret "2qySzH5tKuuFF65JoUdY";
            after-sb-0pri disconnect; # default
            after-sb-1pri disconnect; # default
            after-sb-2pri disconnect; # default
            always-asbp no; # default
            rr-conflict disconnect; # default
            ping-timeout 5; # 1/10 seconds, default
            data-integrity-alg ""; # default
            tcp-cork yes; # default
            on-congestion block; # default
            congestion-fill 0s; # bytes, default
            congestion-extents 1237; # default
            csums-alg ""; # default
            csums-after-crash-only no; # default
            verify-alg "crct10dif-pclmul";
            use-rle yes; # default
            socket-check-timeout 0; # default
            fencing dont-care; # default
            max-buffers 2048; # default
            allow-remote-read yes; # default
            _name "hf-kubevirt-02";
        }
        volume 0 {
            disk {
                resync-rate 250k; # bytes/second, default
                c-plan-ahead 20; # 1/10 seconds, default
                c-delay-target 10; # 1/10 seconds, default
                c-fill-target 100s; # bytes, default
                c-max-rate 102400k; # bytes/second, default
                c-min-rate 250k; # bytes/second, default
                bitmap yes; # default
            }
        }
    }
    connection {
        _peer_node_id 2;
        path {
            _this_host ipv4 192.168.242.35:7004;
            _remote_host ipv4 192.168.242.37:7004;
        }
        net {
            transport ""; # default
            protocol C; # default
            timeout 60; # 1/10 seconds, default
            max-epoch-size 2048; # default
            connect-int 10; # seconds, default
            ping-int 10; # seconds, default
            sndbuf-size 0; # bytes, default
            rcvbuf-size 0; # bytes, default
            ko-count 7; # default
            allow-two-primaries yes;
            cram-hmac-alg "sha1";
            shared-secret "2qySzH5tKuuFF65JoUdY";
            after-sb-0pri disconnect; # default
            after-sb-1pri disconnect; # default
            after-sb-2pri disconnect; # default
            always-asbp no; # default
            rr-conflict disconnect; # default
            ping-timeout 5; # 1/10 seconds, default
            data-integrity-alg ""; # default
            tcp-cork yes; # default
            on-congestion block; # default
            congestion-fill 0s; # bytes, default
            congestion-extents 1237; # default
            csums-alg ""; # default
            csums-after-crash-only no; # default
            verify-alg "crct10dif-pclmul";
            use-rle yes; # default
            socket-check-timeout 0; # default
            fencing dont-care; # default
            max-buffers 2048; # default
            allow-remote-read yes; # default
            _name "hf-kubevirt-03";
        }
        volume 0 {
            disk {
                resync-rate 250k; # bytes/second, default
                c-plan-ahead 20; # 1/10 seconds, default
                c-delay-target 10; # 1/10 seconds, default
                c-fill-target 100s; # bytes, default
                c-max-rate 102400k; # bytes/second, default
                c-min-rate 250k; # bytes/second, default
                bitmap no;
            }
        }
    }
}
```
> `nodePublishVolume`: add check: if volume has `type: Block` and `accessMode: ReadWriteMany`; if it is already `InUse` at one other node; then enable the `allow-two-primaries` flag
>
> `nodeUnpublishVolume`: if `allow-two-primaries` is set; then remove the `allow-two-primaries` flag
I'd probably put it in `controllerPublishVolume`/`controllerUnpublishVolume`. The reason is that I don't want to make any unnecessary API requests from the node agents (ideally, they wouldn't need to talk to LINSTOR at all, but we are not there yet, I think).
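As a minimal sketch of the decision logic discussed above: all names here (`publishContext`, `needsTwoPrimaries`) are hypothetical; the real implementation would wire this into the driver's `ControllerPublishVolume`/`ControllerUnpublishVolume` handlers and then set or clear `DrbdOptions/Net/allow-two-primaries` via the LINSTOR API.

```go
package main

import "fmt"

// publishContext is a simplified stand-in for what the CSI controller
// knows at ControllerPublishVolume time (hypothetical type).
type publishContext struct {
	VolumeMode     string   // "Block" or "Filesystem"
	AccessModes    []string // e.g. "ReadWriteMany"
	PublishedNodes []string // nodes the volume is already published on
}

// needsTwoPrimaries reports whether allow-two-primaries should be enabled:
// only for Block/ReadWriteMany volumes that are already in use on another
// node, i.e. the situation a KubeVirt live migration creates.
func needsTwoPrimaries(ctx publishContext, targetNode string) bool {
	if ctx.VolumeMode != "Block" {
		return false
	}
	rwx := false
	for _, m := range ctx.AccessModes {
		if m == "ReadWriteMany" {
			rwx = true
		}
	}
	if !rwx {
		return false
	}
	for _, n := range ctx.PublishedNodes {
		if n != targetNode {
			// Already in use elsewhere: a migration is in flight.
			return true
		}
	}
	return false
}

func main() {
	migrating := publishContext{
		VolumeMode:     "Block",
		AccessModes:    []string{"ReadWriteMany"},
		PublishedNodes: []string{"hf-kubevirt-02"},
	}
	fmt.Println(needsTwoPrimaries(migrating, "hf-kubevirt-01")) // true

	fresh := publishContext{VolumeMode: "Block", AccessModes: []string{"ReadWriteMany"}}
	fmt.Println(needsTwoPrimaries(fresh, "hf-kubevirt-01")) // false
}
```

On unpublish from the old node, the controller would run the inverse check and set the property back to `no`, restoring protection against accidental dual writers.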