
Issues in setting up dataset jfsdemo on the Linux/ARM64 platform

Open odidev opened this issue 2 years ago • 14 comments

What is your environment (Kubernetes version, Fluid version, etc.)? Kubernetes v1.25.0, Fluid v0.8.0

Describe the bug: I am working with Fluid v0.8.0 on the Linux/ARM64 platform.

Following the docs here: https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/samples/arm64.md, I set up the JuiceFS open-source environment and created the dataset and the runtime.

The dataset jfsdemo was successfully bound to the runtime but failed after some time.

NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE    AGE  
jfsdemo   0.00B            0.00B    4.00GiB          0.0%                Failed   17m 

As a result, the pods are stuck in the Pending or ContainerCreating state:

demo-app                 0/1     ContainerCreating   0          15m  
jfsdemo-fuse-pkvgf       0/1     Pending             0          15m  
jfsdemo-worker-0         1/1     Running             0          16m 

Kindly refer to https://github.com/fluid-cloudnative/fluid/pull/2154.

What you expect to happen: The dataset should not fail.

How to reproduce it: Kindly refer to https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/samples/arm64.md
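For orientation, the linked doc sets up JuiceFS (community edition) with Redis for metadata and MinIO for data, then creates a Dataset and a JuiceFSRuntime. A rough sketch of the two resources is shown below; the field layout follows the Fluid JuiceFS samples, while the secret name, bucket, and quota values here are assumptions, so the doc itself remains the authoritative reference.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  mounts:
    - name: minio
      mountPoint: "juicefs:///"
      options:
        storage: "minio"
        bucket: "http://minio:9000/minio/test"    # data bucket (assumed)
      encryptOptions:
        - name: metaurl                            # Redis metadata URL, read from a secret
          valueFrom:
            secretKeyRef:
              name: jfs-secret                     # hypothetical secret holding metaurl and the MinIO keys
              key: metaurl
---
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi                                 # matches the 4.00GiB cache capacity reported above
        low: "0.1"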

Additional Information

odidev avatar Sep 28 '22 06:09 odidev

@zwwhdls Could you help take a look at this issue? Thanks.

cheyang avatar Sep 29 '22 02:09 cheyang

Hi @odidev, from the log it seems the JuiceFS worker is not ready; the jfsdemo-worker-0 pod keeps restarting. I guess it was OOM-killed. Can you provide its YAML via kubectl get po jfsdemo-worker-0 -oyaml and its log via kubectl logs jfsdemo-worker-0?
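To confirm the OOM guess, the worker container's last termination state can also be checked directly; these are standard kubectl invocations rather than commands taken from this thread:

$ kubectl describe po jfsdemo-worker-0 | grep -A 5 "Last State"
# shows Reason: OOMKilled if the container was killed for exceeding its memory limit
$ kubectl get po jfsdemo-worker-0 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'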

zwwhdls avatar Sep 29 '22 02:09 zwwhdls

@odidev you can also collect logs by using https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/userguide/troubleshooting.md

cheyang avatar Sep 29 '22 03:09 cheyang

Sure, kindly find jfsdemo-worker-0.yaml here: jfsdemo-worker-0-yaml.txt

Here are the logs:

$ kubectl logs jfsdemo-worker-0 

2022/09/29 05:24:58.995939 juicefs[7] <INFO>: Meta address: redis://redis:6379/0 [interface.go:402] 
2022/09/29 05:24:59.000948 juicefs[7] <WARNING>: AOF is not enabled, you may lose data if Redis is not shutdown properly. [info.go:83]
2022/09/29 05:24:59.001237 juicefs[7] <INFO>: Ping redis: 168.114µs [redis.go:2878] 
2022/09/29 05:24:59.002111 juicefs[7] <INFO>: Data use minio://minio:9000/test/minio/ [format.go:435] 
2022/09/29 05:24:59.026551 juicefs[7] <INFO>: Volume is formatted as { 
  "Name": "minio", 
  "UUID": "887385c5-91a8-43fe-b9a9-ff01f19d4f63", 
  "Storage": "minio", 
  "Bucket": "http://minio:9000/minio/test", 
  "AccessKey": "minioadmin", 
  "SecretKey": "removed", 
  "BlockSize": 4096, 
  "Compression": "none", 
  "KeyEncrypted": true, 
  "TrashDays": 1, 
  "MetaVersion": 1 
} [format.go:472] 

FYI, the dataset “jfsdemo” stayed in the Bound state for 10-15 minutes, but it failed soon after I created and deployed sample.yaml in the last step; everything was working fine before that.
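For reference, sample.yaml in that last step is essentially an application Pod mounting the dataset through its PVC; a minimal sketch under that assumption is below (the image and mount path are placeholders, and the claim name follows Fluid's convention of matching the dataset name):

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx                      # image assumed; the doc may use a different one
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: jfsdemo              # PVC created by Fluid for the jfsdemo dataset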

Below is the output after deploying sample.yaml:

$ kubectl get dataset jfsdemo 
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE    AGE 
jfsdemo   0.00B            0.00B    4.00GiB          0.0%                Failed   15m 

 
$ kubectl get juicefs jfsdemo 
NAME      WORKER PHASE   FUSE PHASE   AGE 
jfsdemo   Ready          NotReady     19m 

odidev avatar Sep 29 '22 05:09 odidev

@odidev I assume there are some taints on the node the fuse pod is going to be deployed to. You can check the reason for the Pending state. Please run kubectl describe po jfsdemo-fuse-pkvgf:

jfsdemo-fuse-pkvgf       0/1     Pending             0          15m 
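A quick way to check the scheduling reason and any node taints (standard kubectl; the node list depends on the cluster):

$ kubectl describe po jfsdemo-fuse-pkvgf
# the Events section explains why the pod cannot be scheduled
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints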

cheyang avatar Sep 30 '22 07:09 cheyang

@odidev, I've reproduced the issue. It's caused by a shortage of memory (in the doc, the runtime needs 4GiB of memory). Maybe you can reduce the runtime memory or get some servers with more memory.
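A minimal sketch of that change, assuming the memory is lowered via the tieredstore quota in the JuiceFSRuntime (field names follow the JuiceFSRuntime CRD; adjust to whatever runtime.yaml from the doc actually contains):

apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi        # reduced from 4Gi so the worker fits on a smaller node
        low: "0.1"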

wang-mask avatar Oct 12 '22 11:10 wang-mask

I tried lowering the runtime memory in runtime.yaml to 1GB, and now it seems 2 out of 3 pods are running fine.

$ kubectl get po |grep demo 

demo-app                 0/1     ContainerCreating   0          20m 
jfsdemo-fuse-dql47       1/1     Running             0          20m 
jfsdemo-worker-0         1/1     Running             0          22m 

However, the pod “demo-app” has been stuck in the ContainerCreating state for a long time because of a volume mount issue, as shown below:

Events: 
  Type     Reason       Age                 From               Message 
  ----     ------       ----                ----               ------- 
  Normal   Scheduled    19m                 default-scheduler  Successfully assigned default/demo-app to minikube 
  Warning  FailedMount  11m (x2 over 17m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[demo], unattached volumes=[kube-api-access-hpjlz demo]: timed out waiting for the condition 
  Warning  FailedMount  2m4s (x6 over 15m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[demo], unattached volumes=[demo kube-api-access-hpjlz]: timed out waiting for the condition 
  Warning  FailedMount  40s (x14 over 19m)  kubelet            MountVolume.SetUp failed for volume "default-jfsdemo" : rpc error: code = InvalidArgument desc = exit status 1 
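For a FailedMount like this, the Fluid CSI node plugin log on the same node is usually the next place to look; a sketch assuming a default Fluid install (the csi-nodeplugin-fluid pod prefix and the plugins container name may differ between versions):

$ kubectl -n fluid-system get po -o wide | grep csi-nodeplugin
# find the CSI pod running on the node hosting demo-app
$ kubectl -n fluid-system logs <csi-nodeplugin-pod> -c plugins
# look for the error behind "rpc error: code = InvalidArgument desc = exit status 1"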

odidev avatar Oct 21 '22 05:10 odidev

Hi @odidev, can you collect logs referring to https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/userguide/troubleshooting.md?

zwwhdls avatar Oct 21 '22 06:10 zwwhdls

I collected logs for pod “demo-app” as below:

$ ./diagnose-fluid-juicefs.sh collect --name demo-app --namespace default 

Below are the error logs I found in the juicefsruntime-controller and dataset-controller log files:

1. /tmp/diagnose_fluid_1666935642/pods-fluid-system/juicefsruntime-controller-7c78b44bf-lbps2-manager.log : 

ERROR   juicefsctl.JuiceFSRuntime       retry/util.go:51        Failed to check the fuse healthy        {"juicefsruntime": "default/jfsdemo"} 
ERROR   juicefsctl.JuiceFSRuntime       base/syncs.go:60        The fuse is not healthy {"juicefsruntime": "default/jfsdemo", "error": "the daemonset jfsdemo-fuse in default are not ready, the unhealthy number 1"} 

  
2. /tmp/diagnose_fluid_1666935642/pods-fluid-system/dataset-controller-69c4b6fcf9-vmzlp-manager.log

2022-10-28T13:28:55.844+0800    ERROR   controller.dataset      controller/controller.go:266    Reconciler error        {"reconciler group": "data.fluid.io", "reconciler kind": "Dataset", "name": "jfsdemo", "namespace": "default", "error": "no matched controller"}
github.com/fluid-cloudnative/fluid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-10-28T13:28:55.849+0800    ERROR   datasetctl.Dataset      controller/controller.go:114    Failed to scale out the runtime controller on demand    {"dataset": "default/jfsdemo", "RuntimeController": "default/jfsdemo", "error": "no matched controller"} 

Kindly find the log files below:

Juicefsruntime-controller: juicefsruntime-controller-error-logs.txt

Dataset-controller: dataset-controller-error-logs.txt

odidev avatar Oct 31 '22 06:10 odidev

@zwwhdls, could you please have a look at this and share your suggestion on the same?

odidev avatar Dec 06 '22 09:12 odidev

Could you please find a stable way to reproduce this case?

wang-mask avatar Jan 31 '23 01:01 wang-mask

I have followed the documentation given here, and to diagnose Fluid and collect logs I followed this.

Please find all the logs for the same here.

odidev avatar Feb 06 '23 11:02 odidev

Maybe you can uninstall Fluid and reinstall it to try again.
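If you go that route, a rough outline assuming Fluid was installed with Helm (release name, namespace, and chart source must match the original install):

$ helm status fluid -n fluid-system          # confirm the release name and namespace first
$ helm uninstall fluid -n fluid-system
$ helm install fluid <fluid-chart> -n fluid-system --create-namespace
# <fluid-chart> is a placeholder for the chart or .tgz the cluster was originally installed from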

wang-mask avatar Feb 06 '23 11:02 wang-mask

I think it is because the JuiceFS runtime controller is not running healthily. I guess the dataset controller cannot scale out the JuiceFS runtime controller due to a shortage of resources.
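One way to verify that guess is to look at the runtime controller deployment in the fluid-system namespace (the deployment name here is inferred from the juicefsruntime-controller-… pod name earlier in this thread):

$ kubectl -n fluid-system get deploy
# is juicefsruntime-controller present, and are its replicas ready?
$ kubectl -n fluid-system describe deploy juicefsruntime-controller
$ kubectl -n fluid-system get events --sort-by=.lastTimestamp | tail -n 20
# scheduling or resource-quota events would show up here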

wang-mask avatar Feb 06 '23 12:02 wang-mask