[BUG] Broken fuse mount point when using Alluxio v2.8.0

Open TrafalgarZZZ opened this issue 3 years ago • 3 comments

What is your environment (Kubernetes version, Fluid version, etc.)

v0.8.0-518fce8 (with Alluxio v2.8.0)

Describe the bug

The fuse mount point is broken after the fuse pod launches, which prevents the application pod from starting.

$ kubectl get pod
NAME                                   READY   STATUS                 RESTARTS   AGE
nginx                                    0/1     CreateContainerError   0          7h17m
oss-tf-dataset-fuse-dzxmf   1/1     Running                0          7h17m
oss-tf-dataset-master-0       2/2     Running                0          7h19m
oss-tf-dataset-worker-0       2/2     Running                0          7h19m
$ kubectl describe pod nginx
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   42s                kubelet  Successfully pulled image "nginx" in 2.089036387s
  Warning  Failed   42s                kubelet  Error: failed to generate container "ad05c9203d0698b8fbe842f169b6eee0ede7cdf5bf26c8028e0e7ebb50b1297d" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Normal   Pulled   39s                kubelet  Successfully pulled image "nginx" in 2.247446791s
  Warning  Failed   39s                kubelet  Error: failed to generate container "847521adc61ace5e6cf16958b0106b647f7ef62e2537fe2809f9f3866ade2034" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Warning  Failed   24s                kubelet  Error: failed to generate container "3b71f22ea3f2f1b899f05e62ed3246c7e684579bad97e8f255f03ec9b106de38" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error
  Normal   Pulled   24s                kubelet  Successfully pulled image "nginx" in 2.164586556s
  Normal   Pulling  10s (x4 over 44s)  kubelet  Pulling image "nginx"
  Normal   Pulled   8s                 kubelet  Successfully pulled image "nginx" in 2.208901583s
  Warning  Failed   8s                 kubelet  Error: failed to generate container "c78b85c47a07f2ea0e4675083b029c87a929b1977caeeec128f05491ce79ccda" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount": stat /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount: input/output error

The fuse file system is mounted, but accessing the mount point returns an I/O error:

$ mount | grep alluxio-fuse
alluxio-fuse on /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
alluxio-fuse on /var/lib/kubelet/pods/0391e9d8-3900-436a-a096-7f2e5e1cf262/volumes/kubernetes.io~csi/default-oss-tf-dataset/mount type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
$ kubectl exec -it oss-tf-dataset-fuse-dzxmf bash

bash-5.0# ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse
ls: /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse: I/O error
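
For anyone who wants to run the same check on a node, here is a minimal sketch that pulls the alluxio-fuse mount points out of `mount` output and probes each one with `stat`. The function names are made up for illustration and are not part of Fluid or Alluxio. It assumes mount points without spaces, and that a broken mount fails quickly (as above) rather than hanging.

```shell
# List alluxio-fuse mount points from `mount`-style output read on stdin.
# Line shape: "<source> on <target> type <fstype> (<options>)".
fuse_mount_points() {
  awk '$5 == "fuse.alluxio-fuse" { print $3 }'
}

# Probe one mount point with stat; print "ok" or "broken" plus the path.
probe_mount() {
  if stat "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "broken $1"
  fi
}

# Example (run on the node or inside the fuse pod):
#   mount | fuse_mount_points | while read -r mp; do probe_mount "$mp"; done
```

If a mount might hang instead of failing fast, wrapping the `stat` call in `timeout` would be a reasonable precaution.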

What you expect to happen:

The fuse mount point should be healthy so that the application pod can access data through it.

How to reproduce it

dataset.yaml:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  placement: Shared
  mounts:
    - mountPoint: https://mirrors.bit.edu.cn/apache/hbase/stable/
      name: hbase
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"

application pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  #nodeName: cn-beijing.172.16.0.150
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: hbase-vol
  volumes:
    - name: hbase-vol
      persistentVolumeClaim:
        claimName: hbase

Additional Information

cc @cheyang @ssz1997

TrafalgarZZZ, May 09 '22 13:05

I can't reproduce the problem. I used the same config and recreated the cluster and the app 5 times, and every attempt succeeded. Could you provide the following information from a failing run?

  1. Could you double-check the Fluid commit string? I can't find it in the Fluid master branch history.
  2. Before starting the Alluxio cluster, set alluxio.debug.enabled: true in the AlluxioRuntime.
  3. Execute ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse and confirm the I/O error. Note the time you run the command, to the second.
  4. Go into the fuse pod and cd into /runtime-mnt/alluxio/default/oss-tf-dataset. Execute stat alluxio-fuse, and again note the time you run the command.
  5. cd into alluxio-fuse and execute stat hbase. Note the time you run this command.
  6. Use fluid diagnose to collect all the logs. (Double-check that the Alluxio logs are collected; I'm having trouble using the tool to collect logs in the default namespace.)
  7. Provide the logs.
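
To make the timestamps in steps 3 to 5 easy to capture, something like the wrapper below could be used. This is only a sketch: `run_stamped` is a hypothetical helper name, and it prints UTC on the assumption that the Alluxio logs can be lined up against UTC. Adjust the `date` call if the logs use local time.

```shell
# Run a command, printing a timestamp (to the second) before it
# and the exit status after it.
run_stamped() {
  printf '[%s] $ %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')" "$*"
  "$@"
  rc=$?
  printf 'exit status: %s\n' "$rc"
  return "$rc"
}

# Example, inside the fuse pod:
#   run_stamped ls /runtime-mnt/alluxio/default/oss-tf-dataset/alluxio-fuse
#   cd /runtime-mnt/alluxio/default/oss-tf-dataset && run_stamped stat alluxio-fuse
```

The wrapper returns the wrapped command's exit status, so a failing `ls` or `stat` still shows up as a failure.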

Thank you!

ssz1997, May 09 '22 18:05

Thanks @ssz1997. Here is some extra information about the environment:

  • Kubernetes 1.22 + containerd 1.5.10
  • Linux kernel version 4.19.91
  1. The Fluid commit string can be found here. It was merged to support Alluxio v2.8.0.

  2. The logs are collected from an AlluxioRuntime with the alluxio.debug.enabled: true property.

  3. Executed the command

$ time ls /runtime-mnt/alluxio/default/hbase/alluxio-fuse
ls: /runtime-mnt/alluxio/default/hbase/alluxio-fuse: I/O error

real    0m0.001s
user    0m0.000s
sys     0m0.001s
  4. Executed the command

$ time stat /runtime-mnt/alluxio/default/hbase/alluxio-fuse
stat: can't stat '/runtime-mnt/alluxio/default/hbase/alluxio-fuse': I/O error

real    0m0.001s
user    0m0.000s
sys     0m0.001s

  5. I can't list alluxio-fuse/hbase because of the I/O error.

  6. Here are the logs collected by the diagnose tool: diagnose_fluid_1652176518.tar.gz
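
For anyone reading the archive, a quick way to surface the interesting lines is to filter for WARN and ERROR entries. This is a rough sketch: `log_problems` is a made-up helper, the timestamp pattern is assumed from the usual Alluxio log layout, and the `*.log` file name pattern inside the tarball is a guess.

```shell
# Print WARN and ERROR lines from Alluxio-style logs read on stdin.
# Expected line shape: "YYYY-MM-DD HH:MM:SS,mmm LEVEL  Logger - message".
log_problems() {
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ (WARN|ERROR) '
}

# Example (the *.log pattern is an assumption about the archive layout):
#   mkdir diagnose && tar -xzf diagnose_fluid_1652176518.tar.gz -C diagnose
#   find diagnose -type f -name '*.log' -exec cat {} + | log_problems
```

Stack trace lines are dropped by this filter, so it is only a first pass before opening the matching log file.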

TrafalgarZZZ, May 10 '22 09:05

I found these in the master log:

2022-05-10 09:53:25,442 WARN  AbstractUfsManager - Failed to perform initial connect to UFS https://mirrors.bit.edu.cn/apache/hbase/stable: java.io.IOException: Unsupported operation for WebUnderFileSystem.
2022-05-10 09:53:26,060 INFO  MountTable - Mounting ` at /hbase
2022-05-10 09:53:28,658 ERROR HttpUtils - Failed to perform HEAD request. Status code: 404
2022-05-10 09:53:28,658 ERROR HttpUtils - Failed to perform HEAD request. Status code: 404
2022-05-10 09:53:28,659 ERROR WebUnderFileSystem - Failed to get status for url: https://mirrors.bit.edu.cn/apache/hbase/stable//apache/hbase/
java.io.IOException: Failed to getStatus: https://mirrors.bit.edu.cn/apache/hbase/stable//apache/hbase/
	at alluxio.underfs.web.WebUnderFileSystem.getStatus(WebUnderFileSystem.java:161)
	at alluxio.underfs.web.WebUnderFileSystem.listStatus(WebUnderFileSystem.java:297)
	at alluxio.underfs.UnderFileSystemWithLogging$33.call(UnderFileSystemWithLogging.java:757)
	at alluxio.underfs.UnderFileSystemWithLogging$33.call(UnderFileSystemWithLogging.java:754)
	at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1237)
	at alluxio.underfs.UnderFileSystemWithLogging.listStatus(UnderFileSystemWithLogging.java:754)
	at alluxio.underfs.UfsStatusCache.getChildrenIfAbsent(UfsStatusCache.java:332)
	at alluxio.underfs.UfsStatusCache.lambda$prefetchChildren$1(UfsStatusCache.java:376)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

It would help to check whether https://mirrors.bit.edu.cn/apache/hbase/stable is reachable when this error occurs.
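
One way to run that check from a node or pod with network access is a plain HEAD request with curl, which is the same kind of request that fails with 404 in the log. A minimal sketch follows; `head_status` and `classify_status` are illustrative names, and the 10 second timeout is arbitrary.

```shell
# Print the final HTTP status code of a HEAD request to "$1".
# curl prints 000 when no HTTP response was received at all.
head_status() {
  curl --silent --head --location --max-time 10 \
       --output /dev/null --write-out '%{http_code}\n' "$1"
}

# Map a status code to a coarse verdict.
classify_status() {
  case "$1" in
    2??|3??) echo "reachable" ;;
    000)     echo "no response" ;;
    *)       echo "error" ;;
  esac
}

# Example (needs network access): check the mount point and the URL from the log.
#   for url in \
#     'https://mirrors.bit.edu.cn/apache/hbase/stable/' \
#     'https://mirrors.bit.edu.cn/apache/hbase/stable//apache/hbase/'; do
#     code=$(head_status "$url")
#     echo "$code $(classify_status "$code") $url"
#   done
```

Comparing the two URLs should show whether the mirror itself is down or only the longer path from the log is returning 404.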

ssz1997, May 10 '22 16:05