[BUG] Disaster Recovery: Export from Replica in Containerd environment fails
Describe the bug
I am simulating a disaster scenario to test disaster recovery with Longhorn:
- On a k8s cluster with Longhorn installed, I created a Pod and a ReadWriteMany PVC attached to it using my Longhorn StorageClass
- I then performed a hard shutdown of a node where a replica was placed
- I then (purposefully) deleted the Pod and PVC via kubectl. The PV was deleted as a result, and the Longhorn volume too. All the actual on-disk replicas on all other nodes were successfully deleted.
- I finally turned the "faulty" node back on and, as expected, can see its own replica still present on disk at
/var/lib/longhorn/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-3afae9ce/
I am then trying to follow the procedure explained in Export from Replica; however, it only provides a command line for Docker, and converting it to a ctr command raises a number of issues.
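For reference, the Docker command in the docs is roughly of this form (placeholders in angle brackets; quoted from memory, so check the docs for the exact wording):

docker run -v /dev:/host/dev -v /proc:/host/proc -v <replica directory>:/volume --privileged longhornio/longhorn-engine:<version> launch-simple-longhorn <volume name> <volume size>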
This is the command I eventually managed to use so that the container runs (I correctly removed the 8-character UUID suffix from the replica directory name, and used the correct size as per the volume.meta file):
ctr run --privileged \
  --mount type=bind,src=/dev,dst=/host/dev,options=rbind \
  --mount type=bind,src=/proc,dst=/host/proc,options=rbind \
  --mount type=bind,src=/var/lib/longhorn/replicas/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-3afae9ce,dst=/volume,options=rbind \
  docker.io/longhornio/longhorn-engine:v1.6.1 longhorn-recovery \
  /tini -- launch-simple-longhorn pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd 524288000
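Note that, unlike docker run, ctr run does not pull missing images automatically, so the image had to be pulled beforehand, e.g.:

ctr image pull docker.io/longhornio/longhorn-engine:v1.6.1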
But there are some warning/error messages, entangled in very cryptic and complicated log output:
+ set -e
+ mount --rbind /host/dev /dev
+ volume=pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd
+ size=524288000
+ frontend=
+ '[' -z pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd ']'
+ '[' -z 524288000 ']'
+ '[' -z ']'
+ echo Use default frontend TGT block device
+ frontend=tgt-blockdev
Use default frontend TGT block device
+ exec longhorn-instance-manager daemon
+ start
+ set +e
+ true
+ /usr/local/bin/grpc_health_probe -addr localhost:8500
time="2024-06-22T02:07:59Z" level=info msg="Storing process logs at path: /var/log/instances" func=util.SetUpLogger file="log.go:63"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="Creating gRPC server with no auth" func=cmd.start file="start.go:204"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="Debug pprof server listening on :6060" func=cmd.start.func2 file="start.go:210"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="process-manager gRPC server listening to localhost:8500" func=cmd.start.func4 file="start.go:311"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="instance gRPC server listening to localhost:8503" func=cmd.start.func4 file="start.go:311"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="proxy gRPC server listening to localhost:8501" func=cmd.start.func4 file="start.go:311"
[longhorn-instance-manager] time="2024-06-22T02:07:59Z" level=info msg="disk gRPC server listening to localhost:8502" func=cmd.start.func4 file="start.go:311"
timeout: failed to connect service "localhost:8500" within 1s
+ [[ 2 -eq 0 ]]
+ sleep 1
+ true
+ /usr/local/bin/grpc_health_probe -addr localhost:8500
status: SERVING
+ [[ 0 -eq 0 ]]
+ echo longhorn instance manager is ready
+ break
+ set -e
longhorn instance manager is ready
+ tgtd -f
+ tee /var/log/tgtd.log
+ longhorn-instance-manager process create --name pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r --binary /usr/local/bin/longhorn --port-count 15 --port-args --listen,localhost: -- replica /volume/ --size 524288000
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(393) use pthread notification
time="2024-06-22T02:08:01Z" level=info msg="Creating process" func="client.(*ProcessManagerClient).ProcessCreate" file="process_manager.go:96" args="[replica /volume/ --size 524288000]" binary=/usr/local/bin/longhorn name=pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r portArgs="[--listen,localhost:]" portCount=15
[longhorn-instance-manager] time="2024-06-22T02:08:01Z" level=info msg="Process Manager: prepare to create process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r" func="process.(*Manager).ProcessCreate" file="process_manager.go:184"
[longhorn-instance-manager] time="2024-06-22T02:08:01Z" level=info msg="Process Manager: created process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r" func="process.(*Manager).ProcessCreate" file="process_manager.go:221"
{
  "spec": {
    "name": "pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r",
    "binary": "/usr/local/bin/longhorn",
    "args": [
      "replica",
      "/volume/",
      "--size",
      "524288000",
      "--listen",
      "localhost:10000"
    ],
    "port_count": 15,
    "port_args": [
      "--listen,localhost:"
    ]
  },
  "status": {
    "state": "starting",
    "port_start": 10000,
    "port_end": 10014
  }
}
+ sleep 5
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r] time="2024-06-22T02:08:01Z" level=info msg="Listening on data server localhost:10001" func=cmd.startReplica.func2 file="replica.go:164"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r] time="2024-06-22T02:08:01Z" level=info msg="Listening on sync agent server localhost:10002" func=cmd.startReplica.func3 file="replica.go:193"
time="2024-06-22T02:08:01Z" level=info msg="Listening on gRPC Replica server localhost:10000" func=cmd.startReplica.func1 file="replica.go:156"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r] time="2024-06-22T02:08:01Z" level=info msg="Listening on sync localhost:10002" func=cmd.startSyncAgent file="sync_agent.go:89"
[longhorn-instance-manager] time="2024-06-22T02:08:01Z" level=info msg="Process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r has started at localhost:10000" func="process.(*GRPCHealthChecker).WaitForRunning" file="healthchecker.go:33"
+ longhorn-instance-manager process create --name pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e --binary /usr/local/bin/longhorn --port-count 1 --port-args --listen,localhost: -- controller pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd --frontend tgt-blockdev --size 524288000 --current-size 524288000 --replica tcp://localhost:10000
time="2024-06-22T02:08:06Z" level=info msg="Creating process" func="client.(*ProcessManagerClient).ProcessCreate" file="process_manager.go:96" args="[controller pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd --frontend tgt-blockdev --size 524288000 --current-size 524288000 --replica tcp://localhost:10000]" binary=/usr/local/bin/longhorn name=pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e portArgs="[--listen,localhost:]" portCount=1
[longhorn-instance-manager] time="2024-06-22T02:08:06Z" level=info msg="Process Manager: prepare to create process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e" func="process.(*Manager).ProcessCreate" file="process_manager.go:184"
[longhorn-instance-manager] time="2024-06-22T02:08:06Z" level=info msg="Process Manager: created process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e" func="process.(*Manager).ProcessCreate" file="process_manager.go:221"
{
  "spec": {
    "name": "pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e",
    "binary": "/usr/local/bin/longhorn",
    "args": [
      "controller",
      "pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd",
      "--frontend",
      "tgt-blockdev",
      "--size",
      "524288000",
      "--current-size",
      "524288000",
      "--replica",
      "tcp://localhost:10000",
      "--listen",
      "localhost:10015"
    ],
    "port_count": 1,
    "port_args": [
      "--listen,localhost:"
    ]
  },
  "status": {
    "state": "starting",
    "port_start": 10015,
    "port_end": 10015
  }
}
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="Creating volume pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd controller with iSCSI target request timeout 15s and engine to replica(s) timeout 8s" func=cmd.startController file="controller.go:181"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r] time="2024-06-22T02:08:06Z" level=info msg="New connection from: 127.0.0.1:41470" func="rpc.(*DataServer).listenAndServeTCP" file="dataserver.go:57"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="Starting with replicas [\"tcp://localhost:10000\"]" func=cmd.startController file="controller.go:196"
time="2024-06-22T02:08:06Z" level=info msg="Connecting to remote: localhost:10000 (tcp)" func="remote.(*Factory).Create" file="remote.go:382"
time="2024-06-22T02:08:06Z" level=info msg="Opening remote: localhost:10000" func="remote.(*Remote).open" file="remote.go:68"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-r] time="2024-06-22T02:08:06Z" level=info msg="Opening replica: dir /volume/, size 524288000, sector size 512" func="replica.(*Server).Open" file="server.go:76"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="Adding backend: tcp://localhost:10000" func="controller.(*replicator).AddBackend" file="replicator.go:58"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="Start monitoring tcp://localhost:10000" func="controller.(*Controller).monitoring" file="control.go:1208"
time="2024-06-22T02:08:06Z" level=info msg="Got backend tcp://localhost:10000 UnmapMarkSnapChainRemoved false" func="controller.(*replicator).GetUnmapMarkSnapChainRemoved" file="replicator.go:467"
time="2024-06-22T02:08:06Z" level=info msg="Controller checked and corrected flag unmapMarkSnapChainRemoved=false for backend replicas" func="controller.(*Controller).checkUnmapMarkSnapChainRemoved" file="control.go:685"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="Got backend tcp://localhost:10000 revision counter 145" func="controller.(*replicator).GetRevisionCounter" file="replicator.go:437"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:06Z" level=info msg="device pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd: iSCSI device /dev/longhorn/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd shutdown" func="longhorndev.(*LonghornDevice).shutdownFrontend" file="dev.go:209"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] go-iscsi-helper: tgtd is already running
tgtd: device_mgmt(246) sz:126 params:path=/var/run/longhorn-pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd.sock,bstype=longhorn,bsopts=size=524288000;request_timeout=15
tgtd: bs_thread_open(409) 16
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:07Z" level=info msg="go-iscsi-helper: found available target id 1" func="iscsidev.(*Device).CreateTarget" file="iscsi.go:105"
time="2024-06-22T02:08:07Z" level=info msg="New data socket connection established" func="socket.(*Socket).handleServerConnection" file="frontend.go:133"
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:07Z" level=error msg="Failed to startup frontend" func="controller.(*Controller).startFrontend" file="control.go:489" error="cannot find IP connect to the host"
read_header: fail to read header
receive_msg: fail to read header
response_process: Receive response returned error
lh_client_close_conn: Closing connection
[pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e] time="2024-06-22T02:08:07Z" level=error msg="failed to start up frontend: cannot find IP connect to the host" func=cmd.startController file="controller.go:205"
[longhorn-instance-manager] time="2024-06-22T02:08:07Z" level=info msg="Process Manager: process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e error out, error msg: exit status 1" func="process.(*Process).Start.func1" file="process.go:70"
[longhorn-instance-manager] time="2024-06-22T02:08:07Z" level=warning msg="Process update: pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e: state error: errorMsg: exit status 1" func="process.(*Process).RPCResponse" file="process.go:115"
[longhorn-instance-manager] time="2024-06-22T02:08:07Z" level=info msg="Wait for gRPC service of process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e to start at localhost:10015" func="process.(*GRPCHealthChecker).WaitForRunning" file="healthchecker.go:36"
[longhorn-instance-manager] time="2024-06-22T02:08:07Z" level=info msg="Stop waiting for gRPC service of process pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-e to start at localhost:10015" func="process.(*GRPCHealthChecker).WaitForRunning" file="healthchecker.go:29"
I first tried with the node up and healthy in Kubernetes, and then even tried draining and removing the node from the cluster entirely before running the command again (believing there were bad interactions with the running longhorn-engine instance, or some other Longhorn component installed on the cluster), with approximately the same issues. In both cases, the volume never appears in /dev/longhorn/ on the host, contrary to what is mentioned in the docs.
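One difference from docker run that may be relevant: as far as I understand, ctr run puts the container in a fresh network namespace with only a loopback interface unless --net-host is passed, which would be consistent with the "cannot find IP connect to the host" error above. An untested sketch of the variant I mean (--net-host is the only change, and is my assumption, not something from the docs):

ctr run --net-host --privileged \
  --mount type=bind,src=/dev,dst=/host/dev,options=rbind \
  --mount type=bind,src=/proc,dst=/host/proc,options=rbind \
  --mount type=bind,src=/var/lib/longhorn/replicas/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd-3afae9ce,dst=/volume,options=rbind \
  docker.io/longhornio/longhorn-engine:v1.6.1 longhorn-recovery \
  /tini -- launch-simple-longhorn pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd 524288000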
To Reproduce
Already explained above
Expected behavior
The volume device /dev/longhorn/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd appears on the host node.
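The device could then be mounted on the host to recover the data. A minimal sketch, assuming the volume contains a mountable filesystem (/mnt/recovery is just an example path):

mkdir -p /mnt/recovery
mount /dev/longhorn/pvc-4f040959-a90d-47ad-b4ae-c00db1234bcd /mnt/recovery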
Environment
- Longhorn version: 1.6.1
- Installation method: Helm
- Kubernetes distro and version: kubeadm, 1.29.3
- Node config
- OS type and version: Debian 12
- Kernel version: 6.1.0-13-arm64
- CPU of node: 4 cores
- Memory of node: 8 GB
- Disk type (e.g. SSD/NVMe/HDD): SDXC (volumes are created on the system disks)
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 1
cc @c3y1huang @shuo-wu
Thanks @derekbit for paging the relevant people. So @c3y1huang, @shuo-wu, do you have an idea what the problem is?
Similar request: https://github.com/longhorn/longhorn/issues/7711