gravity icon indicating copy to clipboard operation
gravity copied to clipboard

Gravity installation fails at post-install-hook

Open pachalk opened this issue 3 years ago • 5 comments

Description

Please check the below logs and let me know the reason why it is failing

Pod "gravity-install-5afffe-hjw7f" in namespace "kube-system", has changed state from "Running" to "Succeeded". Container "gravity-install" changed status from "running" to "terminated, exit code 0".

Sat Apr 10 16:19:36 UTC [INFO] [gravity-01] Executing postInstall hook for site:7.0.12. Created Pod "site-app-post-install-dcc7f6-dn8nq" in namespace "kube-system".

Container "post-install-hook" created, current state is "waiting, reason PodInitializing".

Pod "site-app-post-install-dcc7f6-dn8nq" in namespace "kube-system", has changed state from "Pending" to "Running". Container "post-install-hook" changed status from "waiting, reason PodInitializing" to "running".

[ERROR]: failed connecting to https://gravity-site.kube-system.svc.cluster.local:3009/healthz Get https://gravity-site.kube-system.svc.cluster.local:3009/healthz: dial tcp 100.100.244.216:3009: connect: connection refused Container "post-install-hook" changed status from "running" to "terminated, exit code 255".

Container "post-install-hook" restarted, current state is "running".

[ERROR]: failed connecting to https://gravity-site.kube-system.svc.cluster.local:3009/healthz Get https://gravity-site.kube-system.svc.cluster.local:3009/healthz: dial tcp 100.100.244.216:3009: connect: connection refused Container "post-install-hook" changed status from "running" to "terminated, exit code 255".

Container "post-install-hook" changed status from "terminated, exit code 255" to "waiting, reason CrashLoopBackOff".

Container "post-install-hook" restarted, current state is "running".

[ERROR]: failed connecting to https://gravity-site.kube-system.svc.cluster.local:3009/healthz Get https://gravity-site.kube-system.svc.cluster.local:3009/healthz: dial tcp 100.100.244.216:3009: connect: connection refused Container "post-install-hook" changed status from "running" to "terminated, exit code 255".

Container "post-install-hook" changed status from "terminated, exit code 255" to "waiting, reason CrashLoopBackOff".

Pod "site-app-post-install-dcc7f6-dn8nq" in namespace "kube-system", has changed state from "Running" to "Succeeded". Container "post-install-hook" restarted, current state is "terminated, exit code 0".

has completed, 58 seconds elapsed. Sat Apr 10 16:20:34 UTC [DEBUG] [gravity-01] Application gravitational.io/site:7.0.12 does not have status hook. Sat Apr 10 16:20:35 UTC [INFO] [gravity-01] Executing phase: /runtime/kubernetes. Sat Apr 10 16:20:35 UTC [DEBUG] [gravity-01] Application gravitational.io/kubernetes:7.0.12 does not have install hook. Sat Apr 10 16:20:35 UTC [DEBUG] [gravity-01] Application gravitational.io/kubernetes:7.0.12 does not have postInstall hook. Sat Apr 10 16:20:35 UTC [DEBUG] [gravity-01] Application gravitational.io/kubernetes:7.0.12 does not have status hook. Created Pod "install-89f805-shxct" in namespace "kube-system".

Container "install" created, current state is "waiting, reason PodInitializing".

Sat Apr 10 16:20:36 UTC [INFO] [gravity-01] Executing phase: /app/application. Sat Apr 10 16:20:36 UTC [INFO] [gravity-01] Executing install hook for application:0.2.2. Pod "install-89f805-shxct" in namespace "kube-system", has changed state from "Pending" to "Running". Container "install" changed status from "waiting, reason PodInitializing" to "running".

namespace/ingress-nginx created namespace/metallb-system created clusterrole.rbac.authorization.k8s.io/use-privileged-psp created clusterrolebinding.rbac.authorization.k8s.io/privileged-role-bind created podsecuritypolicy.extensions/privileged-new created role.rbac.authorization.k8s.io/ingress-nginx.psp-role created rolebinding.rbac.authorization.k8s.io/ingress-nginx.psp-rolebinding created

Sat Apr 10 16:53:56 UTC [ERROR] [gravity-01] Phase execution failed: gravitational.io/application:0.2.2 install hook failed Job was active longer than specified deadline.

What happened: Running gravity install command to install my application, I am getting this issue sometimes

What you expected to happen: Install all gravity and application resources

How to reproduce it (as minimally and precisely as possible): ./gravity install --advertise-addr= --token= --values=config.yaml

Environment

  • Gravity version : Edition: enterprise Version: 7.0.12 Git Commit: 8a1dbd9 Helm Version: v2.15

  • OS : 4.15.0-140-generic #144-Ubuntu SMP

  • Platform [e.g. Vmware, AWS]: VMware Browser environment

  • Browser Version (for UI-related issues):

  • Install tools:

  • Others:

Relevant Debug Logs If Applicable

pachalk avatar Apr 13 '21 14:04 pachalk

Do you set the deadline for your install hook? If not, it defaults to 20min - has the job been running for that long?

a-palchikov avatar Jun 24 '21 10:06 a-palchikov

Do you set the deadline for your install hook? If not, it defaults to 20min - has the job been running for that long?

Yes set to 2000 in deadline

pachalk avatar Oct 21 '21 01:10 pachalk

It means the hook job ran longer than that. Can you check the /var/log/gravity-install.log for any hints about why the job took that longer (and possible would've run longer)? The log file should contain the job's log if there was anything logged in the Pod.

a-palchikov avatar Oct 21 '21 11:10 a-palchikov

It means the hook job ran longer than that. Can you check the /var/log/gravity-install.log for any hints about why the job took that longer (and possible would've run longer)? The log file should contain the job's log if there was anything logged in the Pod.

The above pasted logs are from gravity-install.log, sometimes gravity fails intermittently with random errors couldn't able to figure out. Right now we are using 7.0.34 gravity version same scenario fails randomly but application is installed successfully I tried running app phase with below commands to change the gravity status of cluster ./gravity plan rollback --phase=/app/application ./gravity plan complete --phase=/app/application It changed status to complete but coming ./gravity upgrade this scenario doesn't work the cluster will be in updating status we can reset the status but cannot able to move /app phase to complete

pachalk avatar Oct 21 '21 21:10 pachalk

Sorry - I should've looked at the original description. Yes, it looks like the install hook is not logging enough to see where it's stuck at times when this fails with the deadline exceeded error. Is there any way you can add more logging to make this more visible?

As for the commands, the plan complete is not the command you want for completing a specific step - it completes the whole operation even if it hasn't run to completion itself, so it's dangerous in this regard. What you're looking for is plan set --phase=/app/applicaton --state=completed but even that is suspicious given the circumstances - the hook should run to successful completion to have no issue, if the hook might get stuck, you might want to inspect where exactly it gets stuck (i.e. by adding additional logging).

a-palchikov avatar Oct 22 '21 20:10 a-palchikov