Issues in setting up dataset jfsdemo on the Linux/ARM64 platform
What is your environment (Kubernetes version, Fluid version, etc.)? Kubernetes v1.25.0, Fluid v0.8.0
Describe the bug I am working with Fluid v0.8.0 on the Linux/ARM64 platform.
Following the docs here: https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/samples/arm64.md, I set up the JuiceFS open-source environment and created the dataset and the runtime.
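For reference, the Dataset and JuiceFSRuntime I created look roughly like the following (a condensed sketch based on the linked sample and the worker log further down; the secret name, the metaurl wiring, and the exact option values are assumptions and may differ from the doc's manifests):

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  mounts:
    - name: minio                               # volume name, matching the worker log below
      mountPoint: "juicefs:///"
      options:
        bucket: "http://minio:9000/minio/test"  # object storage bucket, as reported in the worker log
        storage: "minio"
      encryptOptions:                           # metaurl and S3 credentials come from a Secret
        - name: metaurl
          valueFrom:
            secretKeyRef:
              name: jfs-secret                  # secret name is an assumption
              key: metaurl
---
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM                         # cache medium; assumed from the 4GiB memory note later in the thread
        path: /dev/shm
        quota: 4Gi                              # matches the 4.00GiB cache capacity shown below
        low: "0.1"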
The dataset jfsdemo was successfully bound to the runtime but failed after some time:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
jfsdemo 0.00B 0.00B 4.00GiB 0.0% Failed 17m
Therefore, the pods are stuck in the Pending or ContainerCreating state:
demo-app 0/1 ContainerCreating 0 15m
jfsdemo-fuse-pkvgf 0/1 Pending 0 15m
jfsdemo-worker-0 1/1 Running 0 16m
Kindly refer to https://github.com/fluid-cloudnative/fluid/pull/2154.
What you expect to happen: Dataset should not fail.
How to reproduce it: Kindly refer to https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/samples/arm64.md
Additional Information
@zwwhdls Could you help take a look at this issue? Thanks.
Hi @odidev, from the log it seems the juicefs worker is not ready; the jfsdemo-worker-0 pod keeps restarting. I guess it was OOM-killed. Can you provide its YAML with kubectl get po jfsdemo-worker-0 -oyaml and its log with kubectl logs jfsdemo-worker-0?
@odidev you can also collect logs by using https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/userguide/troubleshooting.md
Sure, kindly find the jfs-worker-0.yaml here: jfsdemo-worker-0-yaml.txt
Here are the logs:
$ kubectl logs jfsdemo-worker-0
2022/09/29 05:24:58.995939 juicefs[7] <INFO>: Meta address: redis://redis:6379/0 [interface.go:402]
2022/09/29 05:24:59.000948 juicefs[7] <WARNING>: AOF is not enabled, you may lose data if Redis is not shutdown properly. [info.go:83]
2022/09/29 05:24:59.001237 juicefs[7] <INFO>: Ping redis: 168.114µs [redis.go:2878]
2022/09/29 05:24:59.002111 juicefs[7] <INFO>: Data use minio://minio:9000/test/minio/ [format.go:435]
2022/09/29 05:24:59.026551 juicefs[7] <INFO>: Volume is formatted as {
"Name": "minio",
"UUID": "887385c5-91a8-43fe-b9a9-ff01f19d4f63",
"Storage": "minio",
"Bucket": "http://minio:9000/minio/test",
"AccessKey": "minioadmin",
"SecretKey": "removed",
"BlockSize": 4096,
"Compression": "none",
"KeyEncrypted": true,
"TrashDays": 1,
"MetaVersion": 1
} [format.go:472]
FYI, the dataset “jfsdemo” stayed in the Bound state for 10-15 minutes, but soon after I created and deployed sample.yaml, it failed. Before deploying sample.yaml in the last step, everything was working just fine.
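For context, sample.yaml from the doc is essentially a pod that mounts the dataset's PVC, roughly along these lines (a sketch; the image and mount path are assumptions, while the pod name, volume name, and claim name match the outputs and events below):

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx              # placeholder image; the doc's sample may use a different one
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo                # matches the volume name in the FailedMount events further down
      persistentVolumeClaim:
        claimName: jfsdemo      # the PVC Fluid creates for the dataset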
Below is the output after deploying sample.yaml:
$ kubectl get dataset jfsdemo
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
jfsdemo 0.00B 0.00B 4.00GiB 0.0% Failed 15m
$ kubectl get juicefs jfsdemo
NAME WORKER PHASE FUSE PHASE AGE
jfsdemo Ready NotReady 19m
@odidev I assume you have some taints on the node where the fuse pod is going to be deployed. You can check the reason for the Pending state. Please run kubectl describe po jfsdemo-fuse-pkvgf
jfsdemo-fuse-pkvgf 0/1 Pending 0 15m
@odidev, I've reproduced the issue. It's caused by a shortage of memory (per the doc, the runtime needs 4GiB of memory). Maybe you can reduce the runtime memory or use servers with more memory.
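For example, assuming the sample's runtime.yaml sizes the cache through a memory tieredstore, the footprint can be reduced roughly like this (a sketch only; adjust to whatever your actual runtime.yaml contains):

apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM   # assumed in-memory cache (/dev/shm)
        path: /dev/shm
        quota: 1Gi        # reduced from the ~4Gi the sample asks for
        low: "0.1"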
I tried lowering the runtime memory in runtime.yaml to 1GB, and now it seems 2 out of 3 pods are running fine.
$ kubectl get po |grep demo
demo-app 0/1 ContainerCreating 0 20m
jfsdemo-fuse-dql47 1/1 Running 0 20m
jfsdemo-worker-0 1/1 Running 0 22m
However, the pod “demo-app” has been stuck in the ContainerCreating state for a long time because of a volume mount issue, as shown below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned default/demo-app to minikube
Warning FailedMount 11m (x2 over 17m) kubelet Unable to attach or mount volumes: unmounted volumes=[demo], unattached volumes=[kube-api-access-hpjlz demo]: timed out waiting for the condition
Warning FailedMount 2m4s (x6 over 15m) kubelet Unable to attach or mount volumes: unmounted volumes=[demo], unattached volumes=[demo kube-api-access-hpjlz]: timed out waiting for the condition
Warning FailedMount 40s (x14 over 19m) kubelet MountVolume.SetUp failed for volume "default-jfsdemo" : rpc error: code = InvalidArgument desc = exit status 1
Hi @odidev, can you collect logs by referring to https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/userguide/troubleshooting.md ?
I collected logs for pod “demo-app” as below:
$ ./diagnose-fluid-juicefs.sh collect --name demo-app --namespace default
Below are the error logs I found in dataset-controller.log and juicefsruntime-controller.log files:
1. /tmp/diagnose_fluid_1666935642/pods-fluid-system/juicefsruntime-controller-7c78b44bf-lbps2-manager.log :
ERROR juicefsctl.JuiceFSRuntime retry/util.go:51 Failed to check the fuse healthy {"juicefsruntime": "default/jfsdemo"}
ERROR juicefsctl.JuiceFSRuntime base/syncs.go:60 The fuse is not healthy {"juicefsruntime": "default/jfsdemo", "error": "the daemonset jfsdemo-fuse in default are not ready, the unhealthy number 1"}
2. /tmp/diagnose_fluid_1666935642/pods-fluid-system/dataset-controller-69c4b6fcf9-vmzlp-manager.log
2022-10-28T13:28:55.844+0800 ERROR controller.dataset controller/controller.go:266 Reconciler error {"reconciler group": "data.fluid.io", "reconciler kind": "Dataset", "name": "jfsdemo", "namespace": "default", "error": "no matched controller"}
github.com/fluid-cloudnative/fluid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-10-28T13:28:55.849+0800 ERROR datasetctl.Dataset controller/controller.go:114 Failed to scale out the runtime controller on demand {"dataset": "default/jfsdemo", "RuntimeController": "default/jfsdemo", "error": "no matched controller"}
Kindly find below, the log files:
Juicefsruntime-controller: juicefsruntime-controller-error-logs.txt
Dataset-controller: dataset-controller-error-logs.txt
@zwwhdls, could you please have a look at this and share your suggestion on the same?
Could you please find a stable way to reproduce this case?
I have followed the documentation given here, and to diagnose Fluid and collect logs I followed this.
Please find all the logs for the same here.
Maybe you can uninstall Fluid and reinstall it to try again.
I think it is because the juicefs controller is not running healthily. I guess the dataset controller cannot scale out the juicefs runtime controller due to a shortage of resources.