kubecf Following the docs for kind+Eirini results in a non-working cluster

Describe the bug Follow the docs here: https://kubecf.io/docs/tutorials/deploy-kind/

(take this PR into account: https://github.com/cloudfoundry-incubator/kubecf-docs/pull/37)

When all pods are up and running, try to push an app (e.g. https://github.com/scf-samples/dizzylizard)

After staging is done, the app pods don't start. They fail with Error: ErrImagePull

Events:
  Type     Reason     Age                   From                           Message
  ----     ------     ----                  ----                           -------
  Normal   Scheduled  <unknown>                                            Successfully assigned eirini/dizzy-default-0b5c24131a-3 to kubecf-control-plane
  Normal   Pulling    23m (x4 over 24m)     kubelet, kubecf-control-plane  Pulling image "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e"
  Warning  Failed     23m (x4 over 24m)     kubelet, kubecf-control-plane  Failed to pull image "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e": rpc error: code = Unknown desc = failed to pull and unpack image "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e": failed to resolve reference "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e": unexpected status code [manifests 362379573001a18c2b4671ae9fbeb9ba17be290e]: 400 Bad Request
  Warning  Failed     23m (x4 over 24m)     kubelet, kubecf-control-plane  Error: ErrImagePull
  Warning  Failed     9m58s (x64 over 24m)  kubelet, kubecf-control-plane  Error: ImagePullBackOff
  Normal   BackOff    4m57s (x86 over 24m)  kubelet, kubecf-control-plane  Back-off pulling image "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e"
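
The 400 Bad Request in the events above appears to come from the registry's manifest endpoint, which in the Docker Registry HTTP API v2 has the form /v2/<name>/manifests/<reference>. As a minimal sketch for reproducing that request by hand, the snippet below derives the URL from the image reference. The manifest_url helper is hypothetical (not part of kubecf or kind), and the https scheme is an assumption about how this registry is served.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubecf or kind): build the Registry v2
# manifest URL for an image reference of the form host:port/repo:tag.
# The scheme defaults to https, which is an assumption about this registry.
manifest_url() {
  ref="$1"
  scheme="${2:-https}"
  host="${ref%%/*}"   # everything before the first slash
  rest="${ref#*/}"    # repository path plus tag
  repo="${rest%:*}"   # strip the trailing :tag
  tag="${rest##*:}"   # keep only the tag
  printf '%s://%s/v2/%s/manifests/%s\n' "$scheme" "$host" "$repo" "$tag"
}

# Image reference copied from the failing pod's events.
manifest_url "127.0.0.1:31666/cloudfoundry/1056bd51-07bd-4e84-9a2c-f907e68d73b4:362379573001a18c2b4671ae9fbeb9ba17be290e"
```

Since 127.0.0.1:31666 is only reachable from the node, the printed URL would have to be requested from inside the kind node container, for example with curl -sk -o /dev/null -w '%{http_code}\n' followed by the URL, provided curl is available there.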

To Reproduce See above

Expected behavior The application pod should be up and running

Environment kubecf v2.6.1 on kind v0.9.0 go1.15.2 linux/amd64

Nov 04 '20 08:11 jimmykarily

I hit this same issue with kubecf v2.6.1 on kind v0.9.0 go1.13 linux/amd64. I was wondering if it had something to do with the cert instructions. The command does succeed but differs from the one at the bottom of https://kubecf.io/docs/tutorials/deploy-k3s (the cert location comes from a bits-service-ssl secret rather than the node's system). Could be a red herring, kind vs k3s, but thought it worth mentioning.

Nov 04 '20 09:11 richard-cox

I replaced the opi image in the eirini pod with jimmykarily/opi, an image I built from the same code but with the part that deletes the staging job after it's done disabled (more here: https://github.com/cloudfoundry-incubator/kubecf/issues/1323#issuecomment-692530753). It seems that the uploader init container never succeeds:

Containers(eirini/dizzylizard-default-6cmpb)[3]
NAME↑                PF  IMAGE                                                        READY  STATE      INIT   RESTARTS  PROBES(L:R)  PORTS  AGE
opi-task-downloader  ●   registry.suse.com/cap-staging/recipe-downloader:1.8.0-24.56  true   Completed  true   0         off:off             5m27s
opi-task-executor    ●   registry.suse.com/cap-staging/recipe-executor:1.8.0-24.56    true   Completed  true   0         off:off             5m27s
opi-task-uploader    ●   registry.suse.com/cap-staging/recipe-uploader:1.8.0-24.56    false  Completed  false  0         off:off             5m27s

but it doesn't print an error either.

Nov 04 '20 12:11 jimmykarily

Can't tell for sure if the staging container worked or not. What I do see though is an error in the singleton blobstore pod:

~/dizzylizard (master)*$ kubectl  logs -n kubecf singleton-blobstore-0  -c blobstore-nginx
nginx: [alert] could not open error log file: open() "/var/vcap/packages/nginx_webdav/logs/error.log" failed (13: Permission denied)

Exec-ing into the pod shows that the nginx process is started by user vcap, but that directory is owned by root. Not sure if it's relevant to the failed staging, but it's something to look at for sure (it may have to do with changes in the stemcell).

Also this may be relevant:

https://github.com/cloudfoundry-incubator/kubecf/commit/31dc889eaf7b71b02e396d59f959275f9927756b#diff-6cef92220d2f63c1a73bbbeca21e2d1a7c08d210fbf109a08ee08bae91e723a1

Nov 04 '20 13:11 jimmykarily

I tried deploying v2.6.1 using the make targets in kubecf:

$ git checkout v2.6.1
$ make kind-start
$ make all

and after all pods are up and running, pushing the example app (dizzylizard) works. So I realized it must have something to do with how kind is set up, because make kind-start does some preparation of the cluster beyond simply calling kind create cluster: https://github.com/cloudfoundry-incubator/kubecf/blob/master/scripts/kind-start.sh

To verify, I created a fresh cluster with make kind-start and then followed the docs to deploy kubecf as in the description (so the only difference from the reproduction steps was the way I created the kind cluster). Pushing the app works in this case.

So it seems that something make kind-start does is necessary to make things work. We need to find out what that is and add it to the documentation page.

Nov 05 '20 19:11 jimmykarily

OK, new data: simply using kind create cluster --image "kindest/node:v1.17.5" and then following the docs makes it work. The problem is the k8s version. The make target pulls 1.17.5, while the command from the docs simply pulls latest (which doesn't work). Is 1.19 already supported by kubecf, @viovanov? If not, I will simply update the docs to use a supported version (preferably 1.17.5, which is known to work).
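
The workaround above can be sketched as a small script that pins the node image and then checks which version the node reports. The cluster name kubecf is an assumption (it matches the kubecf-control-plane node in the events), and the KIND_NODE_IMAGE variable and image_version helper are illustrative names, not part of kind or kubecf.

```shell
#!/bin/sh
# Node image reported to work in this thread; other versions are untested here.
KIND_NODE_IMAGE="${KIND_NODE_IMAGE:-kindest/node:v1.17.5}"

# Illustrative helper: print the tag (the Kubernetes version) of an image reference.
image_version() {
  printf '%s\n' "${1##*:}"
}

# Create the cluster with the pinned node image instead of kind's default.
# The cluster name "kubecf" is an assumption.
create_cluster() {
  kind create cluster --name kubecf --image "$KIND_NODE_IMAGE"
}

# Only touch the cluster when invoked as "script create", so that sourcing
# the file or running it without arguments has no side effects.
if [ "${1:-}" = "create" ]; then
  create_cluster
  echo "Expecting nodes to report $(image_version "$KIND_NODE_IMAGE"):"
  kubectl get nodes
fi
```

The VERSION column printed by kubectl get nodes should then match the pinned tag before continuing with the deployment steps from the docs.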

Nov 06 '20 10:11 jimmykarily