[BUG] Broken fuse mount point when using Alluxio v2.8.0
**What is your environment (Kubernetes version, Fluid version, etc.)**
Fluid v0.8.0-518fce8 (with Alluxio v2.8.0)
**Describe the bug**
The FUSE mount point is broken after the Fuse pod launches. As a result, the application pod fails to start.
```
$ kubectl get pod
NAME                        READY   STATUS                 RESTARTS   AGE
nginx                       0/1     CreateContainerError   0          7h17m
oss-tf-dataset-fuse-dzxmf   1/1     Running                0          7h17m
oss-tf-dataset-master-0     2/2     Running                0          7h19m
oss-tf-dataset-worker-0     2/2     Running                0          7h19m
```
```
$ kubectl describe pod nginx
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   42s                kubelet  Successfully pulled image "nginx" in 2.089036387s
  Warning  Failed   42s                kubelet  Error: failed to generate container "ad05c9203d0698b8fbe842f169b6eee0ede7cdf5bf26c8028e0e7ebb50b1297d" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Normal   Pulled   39s                kubelet  Successfully pulled image "nginx" in 2.247446791s
  Warning  Failed   39s                kubelet  Error: failed to generate container "847521adc61ace5e6cf16958b0106b647f7ef62e2537fe2809f9f3866ade2034" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Warning  Failed   24s                kubelet  Error: failed to generate container "3b71f22ea3f2f1b899f05e62ed3246c7e684579bad97e8f255f03ec9b106de38" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Normal   Pulled   24s                kubelet  Successfully pulled image "nginx" in 2.164586556s
  Normal   Pulling  10s (x4 over 44s)  kubelet  Pulling image "nginx"
  Normal   Pulled   8s                 kubelet  Successfully pulled image "nginx" in 2.208901583s
  Warning  Failed   8s                 kubelet  Error: failed to generate container "c78b85c47a07f2ea0e4675083b029c87a929b1977caeeec128f05491ce79ccda" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
```
The FUSE filesystem is mounted, but the mount point does not respond:
```
$ mount | grep alluxio-fuse
alluxio-fuse on /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
alluxio-fuse on /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
```
```
$ kubectl exec -it oss-tf-dataset-fuse-dzxmf bash
bash-5.0# ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse
ls: /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse: I/O error
```
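In this state it may help to check whether the FUSE daemon process is still alive; a minimal sketch (pod and dataset names taken from this report, adjust as needed):

```bash
# Inside the Fuse pod: check whether the AlluxioFuse JVM is still running
# (ps flags may need adjusting depending on the image).
kubectl exec oss-tf-dataset-fuse-dzxmf -- ps -ef | grep -i alluxiofuse

# On the node: if the daemon has exited, the kernel keeps the stale FUSE
# mount entry around and returns I/O errors for every access.
stat /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse
```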
**What you expect to happen:**
The FUSE mount point should not be broken, so that the application pod can access data via the mount point.
**How to reproduce it**
`dataset.yaml`:
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  placement: Shared
  mounts:
    - mountPoint: https://mirrors.bit.edu.cn/apache/hbase/stable/
      name: hbase
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
```
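Before creating the application pod, it may be worth confirming that Fluid has bound the dataset and created the matching PV/PVC (a quick sanity check, assuming the default namespace):

```bash
# The Dataset should eventually report a "Bound" phase; Fluid creates a
# PVC named after the dataset ("hbase"), which the pod below claims.
kubectl get dataset hbase
kubectl get pvc hbase
```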
Application pod yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  #nodeName: cn-beijing.172.16.0.150
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: hbase-vol
  volumes:
    - name: hbase-vol
      persistentVolumeClaim:
        claimName: hbase
```
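Assuming the two manifests above are saved as `dataset.yaml` and `pod.yaml` (filenames are illustrative), the reproduction is:

```bash
# Create the Dataset and AlluxioRuntime first, wait for the runtime pods
# to reach Running, then start the application pod.
kubectl create -f dataset.yaml
kubectl create -f pod.yaml
```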
**Additional Information**
cc @cheyang @ssz1997
I can't reproduce the problem. I used the same config and recreated the cluster and the app 5 times, and all of them succeeded. Could you help provide the following information under a failure scenario?
- Could you double-check the Fluid commit string? I can't find it in the Fluid master branch history.
- Before starting the Alluxio cluster, set `alluxio.debug.enabled: true` in the AlluxioRuntime.
- Execute `ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse` and observe the I/O error. Write down the time you execute this command, to the second.
- Go into the Fuse pod and `cd` into `/runtime-mnt/alluxio/default/oss-tf-dataset`. Execute `stat alluxio-fuse`. Also log the time you execute this command.
- `cd` into `alluxio-fuse` and execute `stat hbase`. Log the time you execute this command. (A consolidated sketch of these timed commands follows after this list.)
- Use the Fluid diagnose tool to collect all the logs. (Double-check that the Alluxio logs are collected; I'm having trouble using the tool to collect logs in the default namespace.)
- Provide the logs.
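A minimal sketch of the timed checks above, run inside the Fuse pod (paths taken from this report; adjust for your dataset):

```bash
# Prefix each check with a timestamp so it can be correlated with the
# Alluxio debug logs afterwards.
date '+%F %T'; ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse

cd /runtime-mnt/alluxio/default/oss-tf-dataset
date '+%F %T'; stat alluxio-fuse

cd alluxio-fuse
date '+%F %T'; stat hbase
```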
Thank you!
Thanks @ssz1997. Here is some extra information about the environment:
- Kubernetes 1.22 + containerd 1.5.10
- Linux kernel version 4.19.91
- The Fluid commit string can be found here. It was merged to support Alluxio v2.8.0.
- The logs were collected from an AlluxioRuntime with the `alluxio.debug.enabled: true` property.
- Executed the command:

  ```
  $ time ls /runtime-mnt/alluxio/default/hbase/alluxio-fuse
  ls: /runtime-mnt/alluxio/default/hbase/alluxio-fuse: I/O error

  real    0m0.001s
  user    0m0.000s
  sys     0m0.001s
  ```

- Executed the command:

  ```
  $ time stat /runtime-mnt/alluxio/default/hbase/alluxio-fuse
  stat: can't stat '/runtime-mnt/alluxio/default/hbase/alluxio-fuse': I/O error

  real    0m0.001s
  user    0m0.000s
  sys     0m0.001s
  ```

- I can't list `alluxio-fuse/hbase` due to the I/O error.
- Here are the logs collected by the diagnose tool: diagnose_fluid_1652176518.tar.gz
I found these in the master log:
```
2022-05-10 09:53:25,442 WARN  AbstractUfsManager - Failed to perform initial connect to UFS https://mirrors.bit.edu.cn/apache/hbase/stable: java.io.IOException: Unsupported operation for WebUnderFileSystem.
2022-05-10 09:53:26,060 INFO  MountTable - Mounting ` at /hbase
2022-05-10 09:53:28,658 ERROR HttpUtils - Failed to perform HEAD request. Status code: 404
2022-05-10 09:53:28,658 ERROR HttpUtils - Failed to perform HEAD request. Status code: 404
2022-05-10 09:53:28,659 ERROR WebUnderFileSystem - Failed to get status for url: https://mirrors.bit.edu.cn/apache/hbase/stable//apache/hbase/
java.io.IOException: Failed to getStatus: https://mirrors.bit.edu.cn/apache/hbase/stable//apache/hbase/
        at alluxio.underfs.web.WebUnderFileSystem.getStatus(WebUnderFileSystem.java:161)
        at alluxio.underfs.web.WebUnderFileSystem.listStatus(WebUnderFileSystem.java:297)
        at alluxio.underfs.UnderFileSystemWithLogging$33.call(UnderFileSystemWithLogging.java:757)
        at alluxio.underfs.UnderFileSystemWithLogging$33.call(UnderFileSystemWithLogging.java:754)
        at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1237)
        at alluxio.underfs.UnderFileSystemWithLogging.listStatus(UnderFileSystemWithLogging.java:754)
        at alluxio.underfs.UfsStatusCache.getChildrenIfAbsent(UfsStatusCache.java:332)
        at alluxio.underfs.UfsStatusCache.lambda$prefetchChildren$1(UfsStatusCache.java:376)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```
It would be better to check whether https://mirrors.bit.edu.cn/apache/hbase/stable is reachable when this error occurs.
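For example, the failing HEAD request from the stack trace can be replayed by hand:

```bash
# Same kind of HEAD request Alluxio's WebUnderFileSystem issues; a 404 or
# timeout here would match the HttpUtils errors in the master log.
curl -I https://mirrors.bit.edu.cn/apache/hbase/stable/
```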