fate-operator

FateCluster Fails reconciliation

Open fbalicchia opened this issue 3 years ago • 10 comments

Hi, after deploying the FATE operator and applying the kubefate and fatecluster samples, the operator seems to fail during the reconciliation phase, leaving the fatecluster resource stuck in status Creating. Below is the request that fails.

Here is an error log from the controller side:

2021-03-24T07:44:49.001Z DEBUG controllers.FateCluster request info {"url": "http://kubefate-kubefate-kubefate-sample.kube-fate:8080/v1/cluster/8e4c85be-4428-4f51-a55d-bac3db91816c", "type": "GET", "body": ""}

and here is the log from the service side:

2021/03/24 07:52:25 /workspace/pkg/modules/cluster_db.go:135 record not found
[0.611ms] [rows:0] SELECT * FROM clusters WHERE uuid = '8e4c85be-4428-4f51-a55d-bac3db91816c' AND clusters.deleted_at IS NULL ORDER BY clusters.id LIMIT 1
2021-03-24T07:52:25Z ERR workspace/pkg/api/cluster.go:152 > get cluster error error="record not found" uuid=8e4c85be-4428-4f51-a55d-bac3db91816c
2021-03-24T07:52:25Z ERR usr/local/go/src/net/http/server.go:1919 > Request ip=10.244.0.5 latency=1.1971 method=GET path=/v1/cluster/8e4c85be-4428-4f51-a55d-bac3db91816c status=500 user-agent=Go-http-client/1.1
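
For anyone trying to narrow this down, the failing call can be replayed outside the operator. The following is only a sketch in Python, not the operator's own code: the URL and UUID are copied from the controller log above, the service name only resolves from inside the cluster, and the `headers` and `opener` parameters are hypothetical additions (one for whatever authentication the KubeFATE service may expect, the other so the function can be exercised without a live cluster).

```python
import json
import urllib.error
import urllib.request

# Values copied from the controller log above.
BASE_URL = "http://kubefate-kubefate-kubefate-sample.kube-fate:8080/v1"
CLUSTER_UUID = "8e4c85be-4428-4f51-a55d-bac3db91816c"


def get_cluster(base_url, uuid, headers=None, opener=urllib.request.urlopen):
    """GET <base_url>/cluster/<uuid> and return (status_code, decoded_body).

    `headers` and `opener` are hypothetical parameters for this sketch:
    `headers` carries any auth the service expects, `opener` allows the
    HTTP call to be replaced by a stub.
    """
    request = urllib.request.Request(
        f"{base_url}/cluster/{uuid}", headers=headers or {}, method="GET"
    )
    try:
        with opener(request, timeout=10) as response:
            status, raw = response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the error object still carries the body.
        status, raw = err.code, err.read()
    try:
        body = json.loads(raw)
    except ValueError:
        # Not JSON (or empty): fall back to the raw text.
        body = raw.decode("utf-8", errors="replace")
    return status, body
```

Calling `get_cluster(BASE_URL, CLUSTER_UUID)` from a pod that can reach the service should show whether the 500 with "record not found" is returned independently of the controller.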

Do I need to run some init actions before using the examples?

Thanks

fbalicchia avatar Mar 24 '21 08:03 fbalicchia

It seems the FATE cluster is deploying, and the log from the controller is a debug message. Does everything work after the FATE CRD is created? Or can you describe the pod status of the FATE cluster and see if there is any error there?

LaynePeng avatar Mar 25 '21 07:03 LaynePeng

The problem seems to be that the fatecluster resource stays stuck. After applying ./config/samples/app_v1beta1_fatecluster.yaml, it remains in status Creating, probably because the controller can't complete the reconcile? Thanks for the help.

fbalicchia avatar Mar 26 '21 17:03 fbalicchia

Hi there @LaynePeng, did you manage to investigate?

fbalicchia avatar Apr 09 '21 12:04 fbalicchia

Hi there @LaynePeng, did you manage to investigate?

We still cannot reproduce this problem. Are there any other hints in the logs? @owlet42, do you have any idea about this problem?

LaynePeng avatar Apr 09 '21 12:04 LaynePeng

It may be a one-off occurrence. If there is more log information, maybe it can be solved.

owlet42 avatar Apr 10 '21 12:04 owlet42

Hi there, I don't have more logs than what you see above, but I can reproduce the problem easily with:

cat << EOF > clusterconfig-1.18.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.18.8
  extraPortMappings:
  - containerPort: 31080
    hostPort: 80
  - containerPort: 31443
    hostPort: 443
EOF


kind create cluster --config clusterconfig-1.18.yaml --name fate-operator
From the fate-operator root folder:
export IMG=federatedai/fate-controller:bc5420bbe25
make docker-build-without-test
kind load docker-image federatedai/fate-controller:bc5420bbe25  --name fate-operator
make deploy

kubectl apply -f config/samples/rbac-config.yaml
kubectl apply -f config/samples/kubefate-secret.yaml
kubectl create ns fate-9999

kubectl create -f ./config/samples/app_v1beta1_kubefate.yaml
kubectl get pods -n kube-fate
kubectl create -f ./config/samples/app_v1beta1_fatecluster.yaml
kubectl get fatecluster -A

kubectl get fatecluster -A

fate-9999   fatecluster-sample   9999      Creating

kubectl logs fate-operator-controller-manager-86b58ffc9b-666sh manager -n fate-operator-system

2021-04-11T11:02:36.886Z	DEBUG	controllers.FateCluster	retry	{"retry": 3}
2021-04-11T11:02:36.887Z	DEBUG	controllers.FateCluster	request info	{"url": "http://kubefate-kubefate-kubefate-sample.kube-fate:8080/v1/cluster/562481ab-6c84-4279-888a-ff81b5e7e965", "type": "GET", "body": ""}
2021-04-11T11:02:37.641Z	DEBUG	controllers.FateCluster	request code	{"Type": "GET", "Path": "cluster/562481ab-6c84-4279-888a-ff81b5e7e965", "respCode": 500, "respBody": "{\"error\":\"record not found\"}"}
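
With longer controller logs, the failing requests can be pulled out mechanically. This is a rough sketch in Python that assumes the log format seen in the three lines above: five tab-separated fields (timestamp, level, logger, message, JSON payload), with "request code" entries carrying `Path`, `respCode`, and `respBody`. The function names are my own, not part of the operator.

```python
import json

# The three controller log lines above, rebuilt as tab-separated records.
SAMPLE_LINES = [
    ("2021-04-11T11:02:36.886Z", "DEBUG", "controllers.FateCluster", "retry",
     r'{"retry": 3}'),
    ("2021-04-11T11:02:36.887Z", "DEBUG", "controllers.FateCluster", "request info",
     r'{"url": "http://kubefate-kubefate-kubefate-sample.kube-fate:8080/v1/cluster/562481ab-6c84-4279-888a-ff81b5e7e965", "type": "GET", "body": ""}'),
    ("2021-04-11T11:02:37.641Z", "DEBUG", "controllers.FateCluster", "request code",
     r'{"Type": "GET", "Path": "cluster/562481ab-6c84-4279-888a-ff81b5e7e965", "respCode": 500, "respBody": "{\"error\":\"record not found\"}"}'),
]
SAMPLE_LOG = "\n".join("\t".join(fields) for fields in SAMPLE_LINES)


def parse_line(line):
    """Split one log line into its five fields; return None if it does not fit."""
    parts = line.rstrip("\n").split("\t", 4)
    if len(parts) != 5:
        return None
    timestamp, level, logger, message, payload = parts
    try:
        fields = json.loads(payload)
    except ValueError:
        return None
    return {"timestamp": timestamp, "level": level, "logger": logger,
            "message": message, "fields": fields}


def failed_requests(log_text):
    """Return (path, status, error) for every 'request code' entry with status >= 400."""
    failures = []
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry is None or entry["message"] != "request code":
            continue
        fields = entry["fields"]
        if fields.get("respCode", 0) < 400:
            continue
        try:
            error = json.loads(fields.get("respBody", "")).get("error")
        except (ValueError, AttributeError):
            # Body is not a JSON object: keep it verbatim.
            error = fields.get("respBody")
        failures.append((fields.get("Path"), fields["respCode"], error))
    return failures
```

Running `failed_requests` over the output of the `kubectl logs` command above would list every cluster UUID for which KubeFATE answered with an error, which may help confirm whether it is always the same record that is missing.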

fbalicchia avatar Apr 11 '21 11:04 fbalicchia

Hi @owlet42, did you manage to investigate?

fbalicchia avatar Apr 20 '21 05:04 fbalicchia

Any new update about this issue?

LaynePeng avatar Jun 02 '21 19:06 LaynePeng

Hi @LaynePeng, not from my side. I haven't seen any relevant commits. Do I need to rerun the test?

fbalicchia avatar Jun 08 '21 09:06 fbalicchia

Any new update about this issue?

fbalicchia avatar Jul 24 '21 07:07 fbalicchia