
kubectl get pods: lcm is ContainerCreating; prometheus, trainer and trainingdata are in STATUS CrashLoopBackOff

Earl-chen opened this issue 6 years ago · 26 comments

After I installed FfDL according to the instructions, I checked the status and got the following output (screenshot from 2019-01-17 16-56-34).

When I run helm list, I get the following (screenshot from 2019-01-17 17-00-32). Then, for the failing pods, I used kubectl describe pods <pod name> to view the details. The results are as follows (screenshots from 2019-01-17 17-05-50, 17-05-02, 17-04-30 and 17-06-39).

Can anyone give me some advice? What am I doing wrong? Many thanks.

Earl-chen avatar Jan 17 '19 09:01 Earl-chen

Hi @Earl-chen, it looks like the volume configmap is not being created. Can you run the following scripts to generate the necessary configmaps? Thanks.

pushd bin
./create_static_volumes.sh
./create_static_volumes_config.sh
popd
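To confirm the scripts worked, a quick check along these lines should show the configmaps (a sketch; the names static-volumes and static-volumes-v2 are taken from the errors reported later in this thread):

# List the configmaps the scripts are expected to create, then inspect one of them
kubectl get configmaps | grep static-volumes
kubectl describe configmap static-volumes-v2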

Tomcli avatar Jan 17 '19 17:01 Tomcli

Thank you very much for your help, @Tomcli.
I tried the method you suggested, and also tried deleting FfDL with `helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)` and then reinstalling it with `helm install .`. The final result is still a failure. The details are as follows:

Right after I install FfDL, for the first 10 seconds or so, all the pods look fine in kubectl get pods (screenshot from 2019-01-21 17-33-55).

However, after a while, ffdl-trainer-6777dd5756-rjlfl, prometheus-67fb854b59-mxdrn and ffdl-trainingdata-696b99ff5c-2hsc5 often change to CrashLoopBackOff or Error.

screenshot from 2019-01-21 17-34-18

After about 10 minutes, ffdl-lcm-8d555c7bf-r6fj8, ffdl-trainer-6777dd5756-rjlfl, ffdl-trainingdata-696b99ff5c-2hsc5 and prometheus-67fb854b59-mxdrn still have not become Running. At the same time, `helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | grep STATUS:` shows that the release has changed from DEPLOYED to FAILED (screenshots from 2019-01-21 18-06-25, 17-00-30 and 17-46-58).

The details from kubectl describe pods <pod name> are as follows:

screenshot from 2019-01-21 18-08-17 screenshot from 2019-01-21 18-08-36 screenshot from 2019-01-21 18-08-55 screenshot from 2019-01-21 18-09-54

I can't understand why this is happening.

Earl-chen avatar Jan 21 '19 10:01 Earl-chen

Thank you for taking the time to redeploy FfDL. It looks like many of the pods are failing their liveness probes, which means those microservices might not be able to communicate with each other via the KubeDNS server on your cluster. Can you share some logs from your KubeDNS pod in the kube-system namespace?
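One way to pull those logs (a sketch; the k8s-app=kube-dns label is the conventional one for both kube-dns and CoreDNS, but treat it as an assumption for your cluster):

# Find the DNS pods and dump their recent logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --all-containers=true --tail=100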

Tomcli avatar Jan 21 '19 19:01 Tomcli

@Tomcli Thank you for your prompt reply. Unfortunately, I missed KubeDNS; it is not installed. I will install it now and then give you feedback.

Earl-chen avatar Jan 22 '19 01:01 Earl-chen

@Tomcli I also have the problem above (screenshot attached).

I ran kubectl describe pods ffdl-lcm-8d555c7bf-6pg7z --namespace kube-system, and the result is as follows (screenshots attached).

192.168.110.158 is a k8s node. I ran pushd bin; ./create_static_volumes.sh; ./create_static_volumes_config.sh; popd on both 192.168.110.25 (the k8s master) and 192.168.110.158 (the node).
How can I solve the problem above? Thank you.

Eric-Zhang1990 avatar Jan 22 '19 05:01 Eric-Zhang1990

@Eric-Zhang1990 ./create_static_volumes.sh and ./create_static_volumes_config.sh should be able to create the static-volumes and static-volumes-v2 configmaps for you. Do you still encounter the configmap "static-volumes" not found error?

Tomcli avatar Jan 22 '19 17:01 Tomcli

@Tomcli I can see the following info, which shows static-volumes and static-volumes-v2 are there (screenshot attached).

However, when I restart FfDL, I still encounter the error "SetUp failed for volume "static-volumes-config-volume-v2" : configmap "static-volumes-v2" not found" (screenshots attached).

And when I rerun ./create_static_volumes.sh and ./create_static_volumes_config.sh, I get this (screenshot attached):

How can I solve this? Thanks. Another question: what value should I set for SHARED_VOLUME_STORAGE_CLASS? Thanks.

Eric-Zhang1990 avatar Jan 23 '19 01:01 Eric-Zhang1990

@Eric-Zhang1990 It looks like you deployed the static-volumes in the default namespace while FfDL is in the kube-system namespace. You can deploy FfDL with the namespace flag (e.g. helm install . --set namespace=default) to put FfDL in your default namespace.
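A quick way to see where the configmaps actually landed versus where the FfDL pods run (a sketch, nothing FfDL-specific):

# Compare the namespaces of the static-volumes configmaps and the FfDL pods
kubectl get configmaps --all-namespaces | grep static-volumes
kubectl get pods --all-namespaces | grep ffdl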

Tomcli avatar Jan 23 '19 01:01 Tomcli

The SHARED_VOLUME_STORAGE_CLASS should be the default storageclass in your Kubernetes cluster. You can check the storageclasses with kubectl get storageclass.
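If kubectl get storageclass lists a class but none is marked (default), it can be made the default with the standard annotation; a sketch, using a hypothetical class named standard:

# List storageclasses, then mark one as the cluster default
kubectl get storageclass
kubectl patch storageclass standard -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'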

Tomcli avatar Jan 23 '19 01:01 Tomcli

@Tomcli Could the following state be causing the problem above (screenshot attached)? The status of static-volume-1 is always Pending; is something wrong?

Eric-Zhang1990 avatar Jan 23 '19 02:01 Eric-Zhang1990

The SHARED_VOLUME_STORAGE_CLASS should be the default storageclass in your Kubernetes cluster. You can check the storageclasses with kubectl get storageclass.

I ran kubectl get storageclass but got nothing (screenshot attached).

Eric-Zhang1990 avatar Jan 23 '19 02:01 Eric-Zhang1990

It looks like you deployed the static-volumes in the default namespace while FfDL is in the kube-system namespace. You can deploy FfDL with the namespace flag (e.g. helm install . --set namespace=default) to put FfDL in your default namespace.

Now I have moved the static-volumes into the kube-system namespace, and I also deploy FfDL in the kube-system namespace. The ffdl-lcm pod now runs OK, but the status of the ffdl-trainer and ffdl-trainingdata pods is not stable (screenshots attached).

What could cause this problem? Thank you.

Eric-Zhang1990 avatar Jan 23 '19 03:01 Eric-Zhang1990

@Tomcli After running for one more hour, the status of ffdl-trainingdata* is still changing: sometimes Running, sometimes CrashLoopBackOff (screenshot attached). I ran kubectl describe pods ffdl-trainingdata-74f7cdf66c-lkk2p and got the following info (screenshot attached).

And when I run kubectl logs ffdl-trainingdata-74f7cdf66c-lkk2p, the logs are:

time="2019-01-23T07:06:18Z" level=debug msg="Log level set to 'debug'" time="2019-01-23T07:06:18Z" level=debug msg="Milli CPU is: 60" time="2019-01-23T07:06:18Z" level=info msg="GetTrainingDataMemInMB() returns 300" time="2019-01-23T07:06:18Z" level=debug msg="Training Data Mem in MB is: 300" time="2019-01-23T07:06:18Z" level=debug msg="No config file 'config-dev.yml' found. Using environment variables only." {"caller_info":"metrics/main.go:36 main -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"} {"caller_info":"metrics/main.go:42 main -","level":"debug","module":"training-data-service","msg":"Port is: 8443","time":"2019-01-23T07:06:18Z"} {"caller_info":"metrics/main.go:44 main -","level":"debug","module":"training-data-service","msg":"Creating dlaas-training-metrics-service","time":"2019-01-23T07:06:18Z"} {"caller_info":"service/service_impl.go:147 NewService -","level":"debug","module":"training-data-service","msg":"es address #0: http://elasticsearch:9200","time":"2019-01-23T07:06:18Z"} {"caller_info":"service/service_impl.go:885 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"} {"caller_info":"service/service_impl.go:887 createIndexWithLogsIfDoesNotExist -","level":"info","module":"training-data-service","msg":"calling IndexExists for dlaas_learner_data","time":"2019-01-23T07:06:18Z"} {"caller_info":"service/service_impl.go:888 createIndexWithLogsIfDoesNotExist -","error":"Head http://elasticsearch:9200/dlaas_learner_data: dial tcp: lookup elasticsearch on 10.254.0.2:53: read udp 172.17.0.6:53791-\u003e10.254.0.2:53: i/o timeout","level":"error","module":"training-data-service","msg":"IndexExists for dlaas_learner_data failed","time":"2019-01-23T07:06:58Z"} {"caller_info":"elastic.v5/indices_create.go:31 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"calling CreateIndex","time":"2019-01-23T07:06:58Z"} {"caller_info":"service/service_impl.go:907 createIndexWithLogsIfDoesNotExist -","error":"no available connection: no Elasticsearch node available","level":"debug","module":"training-data-service","msg":"CreateIndex failed","time":"2019-01-23T07:06:58Z"} panic: no available connection: no Elasticsearch node available

goroutine 1 [running]:
github.com/IBM/FfDL/metrics/service.NewService(0xc420479f68, 0xe23640)
	/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/service/service_impl.go:167 +0x980
main.main()
	/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/main.go:44 +0x16c

Is the problem the "no available connection: no Elasticsearch node available" error? Thanks.

Eric-Zhang1990 avatar Jan 23 '19 06:01 Eric-Zhang1990

Thank you for taking the time to debug this. Elasticsearch should be part of the storage-0 container. It could be that the Elasticsearch service wasn't properly enabled. Can you run kubectl get svc to check whether elasticsearch is deployed? Also, you might want to run kubectl logs storage-0 to check whether there are any errors related to Elasticsearch.
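Concretely, those checks plus a quick in-cluster DNS test could look like this (a rough sketch; the elasticsearch service name comes from the logs above, and the busybox image is an assumption):

kubectl get svc | grep elasticsearch    # is the elasticsearch Service deployed?
kubectl logs storage-0 --tail=100       # any Elasticsearch errors in the storage pod?
# Test that the service name resolves from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup elasticsearch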

Thanks.

Tomcli avatar Jan 23 '19 17:01 Tomcli

@Tomcli I checked that elasticsearch is deployed, and the logs of storage-0 show "Failed to find a usable hardware address from the network interfaces; using random bytes: 64:4b:61:9d:da:79:4a:d3". What could cause this problem? Thanks. (screenshots attached)

Eric-Zhang1990 avatar Jan 24 '19 01:01 Eric-Zhang1990

@Tomcli Today I ran FfDL again and all components are running, but they all show some number of RESTARTS. Is that all right? Can I use it for training? Thank you. (screenshot attached)

Eric-Zhang1990 avatar Jan 25 '19 07:01 Eric-Zhang1990

Hi @Eric-Zhang1990, sorry for the late reply. Regarding the Elasticsearch error, you are supposed to see the following logs at the end of the storage-0 container.

[2019-01-24T01:17:28,500][WARN ][o.e.b.BootstrapChecks    ] [2cdcQJ-] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
127.0.0.1 - - [24/Jan/2019 01:17:30] "GET / HTTP/1.1" 200 -
2019-01-24T01:17:30:WARNING:infra.pyc: Service "elasticsearch" not yet available, retrying...
[2019-01-24T01:17:31,568][INFO ][o.e.c.s.ClusterService   ] [2cdcQJ-] new_master {2cdcQJ-}{2cdcQJ-PT-OgOS1lVhqU_g}{xT1sK8mWRuiaU5zsT5R0pw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2019-01-24T01:17:31,605][INFO ][o.e.h.n.Netty4HttpServerTransport] [2cdcQJ-] publish_address {127.0.0.1:4560}, bound_addresses {[::1]:4560}, {127.0.0.1:4560}
[2019-01-24T01:17:31,613][INFO ][o.e.n.Node               ] [2cdcQJ-] started
[2019-01-24T01:17:31,635][INFO ][o.e.g.GatewayService     ] [2cdcQJ-] recovered [0] indices into cluster_state
127.0.0.1 - - [24/Jan/2019 01:17:33] "GET / HTTP/1.1" 200 -
Ready.
[2019-01-24T01:17:53,424][INFO ][o.e.c.m.MetaDataCreateIndexService] [2cdcQJ-] [dlaas_learner_data] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2019-01-24T01:17:53,996][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [logline]
[2019-01-24T01:17:54,039][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [emetrics]

The above logs indicate that the Elasticsearch index and mappings have been created; the ffdl-trainingdata service pod should be functional after that.
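To double-check that the index really exists, a quick probe against the Elasticsearch service should also work (a sketch; the service name and port come from the trainingdata logs above, and the busybox image is an assumption):

# List the Elasticsearch indices from a throwaway pod inside the cluster
kubectl run es-check --rm -it --restart=Never --image=busybox:1.28 -- \
  wget -qO- http://elasticsearch:9200/_cat/indices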

Since I see all your pods are running today, you can go ahead and start using it for training. I can follow up if you encounter any further questions. Thank you.

Tomcli avatar Jan 25 '19 17:01 Tomcli

@Tomcli Thank you for your patient reply. I checked the log of the storage-0 container, and it shows the same info as yours (screenshot attached). However, the status of these pods is still not stable, like this (screenshots attached). I described the prometheus pod and found that although it is running, it shows the error "Readiness probe failed:". Does this error affect the other pods? (screenshot attached)

One more thing: I run FfDL on two servers on a local area network. Does the network affect the deployment of FfDL? Thank you.

Eric-Zhang1990 avatar Jan 29 '19 03:01 Eric-Zhang1990

Hi @Eric-Zhang1990, it looks like some internal connections are either refused or timed out. If your local area network has low bandwidth, I recommend deploying FfDL without the monitoring service to reduce network throughput, e.g.

helm install . --set prometheus.deploy=false
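If an earlier failed release is still registered, it may need to be removed first. A rough sequence based on the commands used earlier in this thread (the --purge flag is an assumption, appropriate for Helm 2):

# Remove the old FfDL release, then reinstall without the monitoring service
helm delete --purge $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
helm install . --set prometheus.deploy=false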

Tomcli avatar Jan 30 '19 17:01 Tomcli

@Tomcli I ran helm install . --set prometheus.deploy=false and found that ffdl-trainer still alternates between CrashLoopBackOff and Running, and it keeps showing "Back-off restarting failed container" (screenshot attached). I ran kubectl describe po ffdl-trainer-7b44999975-d2b7g and got this (screenshots attached).

I deleted the ffdl-trainer pod and it runs correctly for a while (screenshot attached). I also see that ffdl-lcm is running (screenshot attached), but when I run kubectl describe po ffdl-lcm-7f69876c98-lrqjj I get this (screenshot attached).

Eric-Zhang1990 avatar Jan 31 '19 00:01 Eric-Zhang1990

@Tomcli Thanks. It seems to be an issue with internal connections; FfDL runs correctly on one server, but across two servers the status is unstable.

Eric-Zhang1990 avatar Feb 01 '19 00:02 Eric-Zhang1990

@Tomcli Sorry to bother you again. I have the same problem after deploying FfDL on two other servers (192.168.110.158 and 192.168.110.76 as nodes, 192.168.110.25 as master) (screenshot from 2019-02-19 15-51-59). The log of ffdl-trainer is as follows (screenshot from 2019-02-19 15-52-22).

Is it also an internal-connection issue between pods on different servers? I don't know where the problem is. Thanks.

Eric-Zhang1990 avatar Feb 19 '19 07:02 Eric-Zhang1990

Hi @Eric-Zhang1990, it looks like some of the services are not reachable between two of your worker nodes. The earlier errors that failed the liveness probes also indicate that the gRPC calls between microservices on different nodes are not getting through.

Since FfDL uses KubeDNS to discover and communicate between each microservice, it could be that your KubeDNS wasn't set up correctly. Another possibility is that something is blocking inter-node communication (e.g. firewall settings, VLAN, etc.).
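A few generic checks for inter-node problems like this (a sketch; the label patterns and the flannel VXLAN port are assumptions about a typical setup, not anything FfDL-specific):

kubectl get pods -n kube-system -o wide | grep -E 'dns|flannel|calico|weave'   # DNS and CNI pods healthy on every node?
kubectl get endpoints kube-dns -n kube-system                                  # does the DNS Service have endpoints?
# On the nodes themselves, make sure DNS (53/TCP and 53/UDP) and the pod-network
# ports (e.g. 8472/UDP for flannel VXLAN) are not blocked by a firewall between nodes.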

Tomcli avatar Feb 19 '19 17:02 Tomcli

@Tomcli Thank you for your kind reply. I also think the issue is a communication problem. After many attempts, I deleted the k8s cluster and redeployed it with the kubeadm tool, and now it runs correctly.

Eric-Zhang1990 avatar Feb 22 '19 02:02 Eric-Zhang1990

@Tomcli Hello, I have a similar but not identical problem when I deploy FfDL. Three pods are in CrashLoopBackOff, and static-volume-1 is Pending for the following reason (screenshots attached).

And after I clean up FfDL and rebuild (make deploy-plugin), it shows: Error from server (AlreadyExists): configmaps "static-volumes-v2" already exists

ZepengW avatar Jan 18 '20 06:01 ZepengW

You can check the list of storageclasses on your cluster by running kubectl get storageclass. Then you can run export SHARED_VOLUME_STORAGE_CLASS="<storageclass>" to use your desired storageclass as FfDL's persistent storage. If you don't have any storageclass, you will need to run export SHARED_VOLUME_STORAGE_CLASS="" and create a static PV using a host path, for example:

kubectl create -f - <<EOF
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume
spec:
  storageClassName:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/pv"
EOF

Once you have completed the above steps, you can continue with make deploy-plugin and make quickstart-deploy.
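After creating the PV, it is worth confirming that the pending claim binds to it; a quick check (pv-volume is the name from the example above, and static-volume-1 is the claim mentioned earlier in this thread):

# The PV should show up, and the claim should move from Pending to Bound
kubectl get pv pv-volume
kubectl get pvc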

Tomcli avatar Jan 21 '20 17:01 Tomcli