
Add ability to wait for sidecar container

Open michaelkoro opened this issue 3 years ago • 24 comments

Summary

When deploying Kong on EKS into a Kuma-managed namespace, we noticed that the DB migration job, which runs before the pods are started, sometimes gets stuck with the following error:

Error: [PostgreSQL error] failed to retrieve server_version_num: host or service not provided, or not known

After trying a few things (changing the DB address, upgrading the Docker version), we noticed that the Envoy proxy sometimes starts after the application container (in this case, the migrations container), which I guess causes the network errors.

Once we deployed the chart in a namespace that wasn't managed by Kuma, everything worked fine. Is there a way to tell Kuma to start the Envoy proxy first, and only then the application itself?

Thanks

Kuma Chart version - 0.6.0
EKS - 1.19
Kong - 2.1.4

Additional Details & Logs

Link to a related ticket in Kong: https://github.com/Kong/kong/issues/4363

michaelkoro avatar Aug 03 '21 12:08 michaelkoro

Did anyone encounter this kind of issue ?

michaelkoro avatar Aug 08 '21 12:08 michaelkoro

xref https://github.com/kubernetes/kubernetes/issues/65502

jpeach avatar Aug 08 '21 22:08 jpeach

The problem is that until the Kuma DP is up and running, the pod has no network. On the K8s side, the sidecar lifecycle turned out to be more complicated than expected, so for now it is a waiting game.

On the other hand, if you can wrap the main application command or entrypoint, you can use this logic (install netcat on Ubuntu or Debian; Alpine has the nc command installed by default):

## Check network when service mesh is enabled
while true
do
  # Probe a known external endpoint; this fails while the sidecar is not yet routing traffic
  nc -vz www.google.com 443
  ret_code=$?
  if [ "$ret_code" -ne 0 ]; then
    echo "Network Not ready"
    sleep 3
  else
    echo "Network Ready"
    break
  fi
done

echo "starting {{.Chart.Name}} service"
MAIN COMMAND
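
The loop above retries forever, so a job whose network never comes up would hang instead of failing. A minimal sketch of a bounded variant follows, assuming a POSIX shell; the function name `wait_for` and its argument order are hypothetical, not part of Kuma or the chart.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the attempts run out.
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Network not ready after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "Network ready"
}
```

It could be used as `wait_for 20 3 nc -vz www.google.com 443 && MAIN COMMAND`, so the main command only starts once the probe passes and the container exits non-zero otherwise.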

Btw, if you have Vault integration and an init container that it runs, that container will not initialize. To overcome this, just add the annotation vault.hashicorp.com/agent-init-first: "true"

skaravad avatar Aug 13 '21 19:08 skaravad

@skaravad I remember when working with Istio that they managed to solve the issue. I think when deploying Istio you had to add a flag which basically tells the app container to wait for the proxy. Are you familiar with that ? Is there a way to implement this solution in kuma as well ?

michaelkoro avatar Aug 17 '21 09:08 michaelkoro

xref #2571

jpeach avatar Aug 17 '21 23:08 jpeach

@michaelkoro with Istio it was an annotation: https://github.com/istio/istio/issues/11130

annotations:
  proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'

But I don't think that issue reached a resolution. Unless K8s gets a way to order container startup within a pod, these are just workarounds.

In the case of Kuma, it appears that the issue is only with DNS, which starts with the DP. You can disable DNS on the DP and use DNS via the CP (@jpeach please correct me if I'm wrong), but I think that was not considered best practice.

skaravad avatar Aug 18 '21 02:08 skaravad

@skaravad I actually noticed now that when deploying kong to a kuma-managed namespace, we are getting the following error from the kuma sidecar container, which fails the kong deployment:

Error: could not read file /var/run/secrets/kubernetes.io/serviceaccount/token: stat /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

Which service account is it looking for ?

michaelkoro avatar Aug 18 '21 15:08 michaelkoro

Ok, I have my guesses. When Kuma is injected, the kuma-init init container runs first and installs transparent proxying, which by default also redirects all DNS traffic to the kuma-dp DNS server. Since that server starts together with Envoy in the kuma-sidecar container, DNS traffic won't work in the window between kuma-init finishing and the kuma-dp DNS server starting. I'm not sure yet how to fix this without disabling the kuma-dp DNS server.

bartsmykla avatar Aug 24 '21 08:08 bartsmykla

@michaelkoro we use the service account token as the authentication mechanism between kuma-dp and kuma-cp.

jakubdyszkiewicz avatar Aug 24 '21 08:08 jakubdyszkiewicz

Actually, we discussed it with @jakubdyszkiewicz and it's not even a DNS thing: all traffic is redirected at that point, so kuma-dp has to be fully running.

bartsmykla avatar Aug 24 '21 08:08 bartsmykla

@bartsmykla Yeah, what I ended up doing to avoid the problem was disabling Kuma injection on the Kong pre- and post-migration jobs, just so they could work properly. Not sure why, but the Kong pod itself managed to connect to the DB (meaning the network was set up), while the pre-migration job (which uses the same Kong image) couldn't.

michaelkoro avatar Aug 24 '21 12:08 michaelkoro

Someone also mentioned: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

lahabana avatar Nov 23 '21 10:11 lahabana

This issue was inactive for 30 days it will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

github-actions[bot] avatar Dec 25 '21 08:12 github-actions[bot]

There's some research required here as it might not be straightforward.

lahabana avatar Jan 31 '22 15:01 lahabana

This issue was inactive for 30 days it will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

github-actions[bot] avatar Mar 04 '22 08:03 github-actions[bot]

The same problem occurs on pod shutdown: the sidecar dies first, and the main container loses its network connection.

alt-dima avatar Apr 18 '22 09:04 alt-dima

@alt-dima We also started experiencing this issue. From time to time when a pod dies, Kuma receives the SIGTERM and closes all connections, which causes many "network error" logs from our application until the application container is terminated.

michaelkoro avatar Apr 20 '22 09:04 michaelkoro

This issue was inactive for 30 days it will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

github-actions[bot] avatar May 22 '22 08:05 github-actions[bot]

I believe we've fixed the shutdown issue you are mentioning in the coming release of Kuma. @jakubdyszkiewicz can confirm.

lahabana avatar May 23 '22 07:05 lahabana

Release 1.7.0 ?

michaelkoro avatar Jun 05 '22 19:06 michaelkoro

Yes releasing early next week

lahabana avatar Jun 06 '22 06:06 lahabana

This issue was inactive for 30 days it will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

github-actions[bot] avatar Jul 07 '22 08:07 github-actions[bot]

Seems like we need:

  1. Make the sidecar first in the list of containers.
  2. Add a PostStart hook on the sidecar that waits for the sidecar to be ready (this could be an HTTP call).

On each point:

  1. We're always good to make the sidecar the first container (atm it's last and there's no determinism, so switching will be fine).
  2. I don't think calling the Envoy admin is right; we probably want this to be combined with the actual DP process.
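
A PostStart hook of this kind could run a small polling script. The sketch below is only an illustration under stated assumptions: `READY_URL` is a hypothetical variable that must point at whatever readiness endpoint the sidecar ends up exposing (the thread does not settle on one), and the names `probe` and `wait_until_ready` are made up for this example. It relies on the behavior described in the article linked earlier in the thread, where the next container is not started until the previous container's PostStart hook returns.

```shell
#!/bin/sh
# probe: succeeds when the readiness endpoint answers successfully.
# READY_URL is a hypothetical setting; no default is assumed here.
probe() {
  wget -q -O /dev/null -T 1 "${READY_URL:?READY_URL must be set}"
}

# wait_until_ready [MAX_TRIES]
# Polls probe until it passes, giving up after MAX_TRIES failures (default 60).
wait_until_ready() {
  remaining=${1:-60}
  until probe; do
    remaining=$((remaining - 1))
    if [ "$remaining" -le 0 ]; then
      return 1
    fi
    # POLL_INTERVAL is a hypothetical knob; it defaults to one second.
    sleep "${POLL_INTERVAL:-1}"
  done
  return 0
}
```

The hook would then call `wait_until_ready 60` and exit with its status. Swapping the body of `probe` is enough to check the DP process itself rather than an HTTP endpoint, which is the concern raised in the second point above.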

lahabana avatar Sep 15 '22 10:09 lahabana

@johnharris85 thinks that maybe the order of containers doesn't matter.

lahabana avatar Sep 20 '22 14:09 lahabana

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] avatar Dec 20 '22 08:12 github-actions[bot]

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] avatar Mar 21 '23 07:03 github-actions[bot]

xref: https://github.com/kumahq/kuma/issues/6082

lahabana avatar Apr 17 '23 18:04 lahabana

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] avatar Jul 17 '23 07:07 github-actions[bot]