istio-proxy CrashLoopBackOff cf-blobstore-minio, cf-db-postgresql, log-cache
Describe the bug
Three pods (cf-blobstore-minio, cf-db-postgresql, log-cache) remain in CrashLoopBackOff during the installation; they hang due to an Istio error. The error is the same on the main branch and on the Istio 1.7.4 branch. Shown here using cf-db-postgresql-0 as an example:
NAMESPACE NAME READY STATUS RESTARTS AGE
cf-blobstore cf-blobstore-minio-7467c77cbf-jqk66 0/2 CrashLoopBackOff 8 9m28s
cf-db cf-db-postgresql-0 1/2 CrashLoopBackOff 4 9m28s
cf-system log-cache-7789cbc896-gmwzf 3/5 CrashLoopBackOff 4 9m28s
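For reference, a listing like the one above can presumably be reproduced with something along these lines (the pod and namespace names are taken from the output above):

# Show the affected pods across all namespaces
kubectl get pods --all-namespaces | grep -E 'cf-blobstore-minio|cf-db-postgresql|log-cache'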
To Reproduce
Steps to reproduce the behavior:
- kapp installation of cf-for-k8s as described here: https://github.com/cloudfoundry/cf-for-k8s/blob/develop/docs/getting-started-tutorial.md (see the Deploy instructions below)
- Observe the pod errors
Expected behavior
A running cf-for-k8s installation.
Additional context
kubectl describe pod cf-db-postgresql-0 -n cf-db
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned cf-db/cf-db-postgresql-0 to kw-bb79ae50-0000
Normal Pulling 4m17s kubelet, kw-bb79ae50-0000 Pulling image "gcr.io/cf-routing/proxyv2:1.7.4"
Normal Created 4m16s kubelet, kw-bb79ae50-0000 Created container istio-init
Normal Started 4m16s kubelet, kw-bb79ae50-0000 Started container istio-init
Normal Pulled 4m16s kubelet, kw-bb79ae50-0000 Successfully pulled image "gcr.io/cf-routing/proxyv2:1.7.4"
Warning FailedPostStartHook 3m15s kubelet, kw-bb79ae50-0000 Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "cf-db-postgresql-0_cf-db(cbe24c2a-4f41-4944-9523-2ed483cd7cc8)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp 127.0.0.1:15021: connect: connection refused
, message: "2020-11-12T18:48:38.803425Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2020-11-12T18:49:38.874103Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\n"
Normal Pulled 3m9s kubelet, kw-bb79ae50-0000 Container image "index.docker.io/bitnami/postgresql@sha256:0f76a419cfd9996036e3a53672f50cf69ed7699f1241cbf8e20af17bbbdf0683" already present on machine
Normal Started 3m9s kubelet, kw-bb79ae50-0000 Started container cf-db-postgresql
Normal Created 3m9s kubelet, kw-bb79ae50-0000 Created container cf-db-postgresql
Warning FailedPostStartHook 2m8s kubelet, kw-bb79ae50-0000 Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "cf-db-postgresql-0_cf-db(cbe24c2a-4f41-4944-9523-2ed483cd7cc8)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp 127.0.0.1:15021: connect: connection refused
, message: "2020-11-12T18:49:45.307327Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2020-11-12T18:50:45.381055Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\n"
Normal Killing 2m8s (x2 over 3m15s) kubelet, kw-bb79ae50-0000 FailedPostStartHook
Warning BackOff 2m1s (x2 over 2m2s) kubelet, kw-bb79ae50-0000 Back-off restarting failed container
Normal Created 109s (x3 over 4m15s) kubelet, kw-bb79ae50-0000 Created container istio-proxy
Normal Pulled 109s (x3 over 4m15s) kubelet, kw-bb79ae50-0000 Successfully pulled image "gcr.io/cf-routing/proxyv2:1.7.4"
Normal Pulling 109s (x3 over 4m15s) kubelet, kw-bb79ae50-0000 Pulling image "gcr.io/cf-routing/proxyv2:1.7.4"
Normal Started 109s (x3 over 4m15s) kubelet, kw-bb79ae50-0000 Started container istio-proxy
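To dig into the failing post-start hook, one could probe the same readiness endpoint the hook waits on from inside the sidecar, and pull the logs of the previously crashed container. A rough sketch (pod and namespace taken from the events above):

# Probe the readiness endpoint that the post-start hook waits on
kubectl -n cf-db exec cf-db-postgresql-0 -c istio-proxy -- curl -sS http://localhost:15021/healthz/ready

# Logs of the previously crashed istio-proxy container
kubectl -n cf-db logs cf-db-postgresql-0 -c istio-proxy --previous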
kubectl logs cf-db-postgresql-0 istio-proxy -n cf-db
2020-11-12T19:01:13.718995Z info FLAG: --concurrency="2"
2020-11-12T19:01:13.719028Z info FLAG: --disableInternalTelemetry="false"
2020-11-12T19:01:13.719032Z info FLAG: --domain="cf-db.svc.cluster.local"
2020-11-12T19:01:13.719034Z info FLAG: --help="false"
2020-11-12T19:01:13.719035Z info FLAG: --id=""
2020-11-12T19:01:13.719037Z info FLAG: --ip=""
2020-11-12T19:01:13.719040Z info FLAG: --log_as_json="false"
2020-11-12T19:01:13.719042Z info FLAG: --log_caller=""
2020-11-12T19:01:13.719045Z info FLAG: --log_output_level="default:info"
2020-11-12T19:01:13.719047Z info FLAG: --log_rotate=""
2020-11-12T19:01:13.719050Z info FLAG: --log_rotate_max_age="30"
2020-11-12T19:01:13.719052Z info FLAG: --log_rotate_max_backups="1000"
2020-11-12T19:01:13.719055Z info FLAG: --log_rotate_max_size="104857600"
2020-11-12T19:01:13.719058Z info FLAG: --log_stacktrace_level="default:none"
2020-11-12T19:01:13.719067Z info FLAG: --log_target="[stdout]"
2020-11-12T19:01:13.719070Z info FLAG: --meshConfig="./etc/istio/config/mesh"
2020-11-12T19:01:13.719072Z info FLAG: --mixerIdentity=""
2020-11-12T19:01:13.719074Z info FLAG: --outlierLogPath=""
2020-11-12T19:01:13.719075Z info FLAG: --proxyComponentLogLevel="misc:error"
2020-11-12T19:01:13.719077Z info FLAG: --proxyLogLevel="warning"
2020-11-12T19:01:13.719079Z info FLAG: --serviceCluster="postgresql.cf-db"
2020-11-12T19:01:13.719080Z info FLAG: --serviceregistry="Kubernetes"
2020-11-12T19:01:13.719082Z info FLAG: --stsPort="0"
2020-11-12T19:01:13.719084Z info FLAG: --templateFile=""
2020-11-12T19:01:13.719085Z info FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2020-11-12T19:01:13.719087Z info FLAG: --trust-domain="cluster.local"
2020-11-12T19:01:13.719115Z info Version 1.7.4-4ce531ff1823a3abb9f42fa9d35523b0436e2d04-Clean
2020-11-12T19:01:13.719257Z info Obtained private IP [10.233.85.101]
2020-11-12T19:01:13.719327Z info Apply proxy config from env {"proxyMetadata":{"DNS_AGENT":""}}
2020-11-12T19:01:13.720167Z info Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
envoyAccessLogService: {}
envoyMetricsService: {}
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
DNS_AGENT: ""
serviceCluster: postgresql.cf-db
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
zipkin:
address: zipkin.istio-system:9411
2020-11-12T19:01:13.720198Z info Proxy role: &model.Proxy{Type:"sidecar", IPAddresses:[]string{"10.233.85.101"}, ID:"cf-db-postgresql-0.cf-db", Locality:(*envoy_config_core_v3.Locality)(nil), DNSDomain:"cf-db.svc.cluster.local", ConfigNamespace:"", Metadata:(*model.NodeMetadata)(nil), SidecarScope:(*model.SidecarScope)(nil), PrevSidecarScope:(*model.SidecarScope)(nil), MergedGateway:(*model.MergedGateway)(nil), ServiceInstances:[]*model.ServiceInstance(nil), IstioVersion:(*model.IstioVersion)(nil), ipv6Support:false, ipv4Support:false, GlobalUnicastIP:"", XdsResourceGenerator:model.XdsResourceGenerator(nil), Active:map[string]*model.WatchedResource(nil), ActiveExperimental:map[string]*model.WatchedResource(nil), RequestedTypes:struct { CDS string; EDS string; RDS string; LDS string }{CDS:"", EDS:"", RDS:"", LDS:""}}
2020-11-12T19:01:13.720201Z info JWT policy is first-party-jwt
2020-11-12T19:01:13.720234Z info PilotSAN []string{"istiod.istio-system.svc"}
2020-11-12T19:01:13.720237Z info MixerSAN []string{"spiffe://cluster.local/ns/istio-system/sa/istio-mixer-service-account"}
2020-11-12T19:01:13.720269Z info sa.serverOptions.CAEndpoint == istiod.istio-system.svc:15012
2020-11-12T19:01:13.720273Z info Using user-configured CA istiod.istio-system.svc:15012
2020-11-12T19:01:13.720275Z info istiod uses self-issued certificate
2020-11-12T19:01:13.720312Z info the CA cert of istiod is: -----BEGIN CERTIFICATE-----
MIIC3TCCAcWgAwIBAgIQapJx+9diCFILU4GdGIwEOTANBgkqhkiG9w0BAQsFADAY
MRYwFAYDVQQKEw1jbHVzdGVyLmxvY2FsMB4XDTIwMTExMjE4NDgyNFoXDTMwMTEx
MDE4NDgyNFowGDEWMBQGA1UEChMNY2x1c3Rlci5sb2NhbDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAN/DXVHDimEqNIJrIyw8WPB87RRFEbvtpWxGxtK+
nTQsPTXosCZqelJUdE6sCUD/fCa1AGFa2hbTvSdEGQwi2LeW7U3V7h4RrZkFE/JF
o8uhNeh//AFx3iGlzHrQnu7n5S/DAKrVL/BxOZZIC33Q8gorNeuQoHcb0HkdegTO
vThDNo0iJlbfhUjiGavRuOfC1pSJkXUZWfz++HtpnWb5XX+KI2kf6amd1JB0VROk
AJagGlgXnCyBU7ZwZGVq/uWuOBR1jRIVYx2iTH2d7k2Qm9oC3r/lO4KaxUEStzIF
tL2IzDZrWlTWPfrJAAc/BnEqNw2xdr+T7fw6D5Zf+Ko/DO8CAwEAAaMjMCEwDgYD
VR0PAQH/BAQDAgIEMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
AGfEzOhlfwZcVTjcVDBnNbj2VGLYQX5ca/VTwWxN4paFHdyKdht+ksiKWl3sYfU9
jypwoKn1E3MYPGpP5KQ9/h/7xm2ugZCUcH+2F9KtfPdpOwinUgZIUR2R2a2M/hAe
YpAzyPLgv+H2AlLhKohCztMhqTjKjPwWzuIyPtnCrHZrFlkVBRpSZWTrkWS8fO21
e44zvxrS3JBWBvsICvE6+BNBhDoa/j3KhsO2eCL70pWhb6OwpwnFyGcD+e9udX4q
ifwbe/RJU0fCKJVvrbI6aF3uov+/3KSB2rPTLllGz38zIHXgne/rltshbGM0sBMZ
dCjVQJyHmGbxrK9x4RQX9sQ=
-----END CERTIFICATE-----
2020-11-12T19:01:13.757302Z info sds SDS gRPC server for workload UDS starts, listening on "./etc/istio/proxy/SDS"
2020-11-12T19:01:13.757417Z info Starting proxy agent
2020-11-12T19:01:13.757477Z info Opening status port 15020
2020-11-12T19:01:13.757543Z info sds Start SDS grpc server
2020-11-12T19:01:13.757569Z info Received new config, creating new Envoy epoch 0
2020-11-12T19:01:13.757631Z info Epoch 0 starting
2020-11-12T19:01:15.760254Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster postgresql.cf-db --service-node sidecar~10.233.85.101~cf-db-postgresql-0.cf-db~cf-db.svc.cluster.local --local-address-ip-version v4 --log-format-prefix-with-location 0 --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l warning --component-log-level misc:error --concurrency 2]
2020-11-12T19:01:15.797727Z warning envoy runtime Unable to use runtime singleton for feature envoy.reloadable_features.activate_fds_next_event_loop
2020-11-12T19:01:15.828047Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:15.828075Z warning envoy config Unable to establish new stream
2020-11-12T19:01:16.156522Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:16.156556Z warning envoy config Unable to establish new stream
2020-11-12T19:01:17.989351Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:17.989387Z warning envoy config Unable to establish new stream
2020-11-12T19:01:27.533580Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:27.533742Z warning envoy config Unable to establish new stream
2020-11-12T19:01:56.955688Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:56.955727Z warning envoy config Unable to establish new stream
2020-11-12T19:02:13.468761Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:02:13.468793Z warning envoy config Unable to establish new stream
2020-11-12T19:02:13.928066Z info Watcher has successfully terminated
2020-11-12T19:02:13.928062Z info Agent draining Proxy
2020-11-12T19:02:13.928053Z info Status server has successfully terminated
2020-11-12T19:02:13.928258Z error accept tcp [::]:15020: use of closed network connection
2020-11-12T19:02:13.928906Z info Graceful termination period is 5s, starting...
2020-11-12T19:02:18.929006Z info Graceful termination period complete, terminating remaining proxies.
2020-11-12T19:02:18.929046Z warn Aborting epoch 0...
2020-11-12T19:02:18.929050Z warn Aborted all epochs
2020-11-12T19:02:18.929052Z info Agent has successfully terminated
2020-11-12T19:02:18.929339Z warning envoy main caught SIGTERM
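The repeated "no healthy upstream" warnings indicate the sidecar cannot reach istiod at istiod.istio-system.svc:15012. A sketch of checks one might run to confirm the control plane is up and reachable (whether istioctl is installed on the workstation is an assumption):

# Is istiod running, and does its service expose the XDS port 15012?
kubectl -n istio-system get pods
kubectl -n istio-system get svc istiod

# If istioctl is available, check whether sidecars are synced with the control plane
istioctl proxy-status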
Deploy instructions
$ ./hack/generate-values.sh -d vcap.me > ${TMP_DIR}/cf-values.yml
$ cat << EOF >> ${TMP_DIR}/cf-values.yml
app_registry:
  hostname: https://index.docker.io/v1/
  repository_prefix: "<my_username>"
  username: "<my_username>"
  password: "<my_password>"
use_first_party_jwt_tokens: true
EOF
$ kapp deploy -a cf -f <(ytt -f config -f ${TMP_DIR}/cf-values.yml)
Cluster information
kubespray 2.14
CLI versions
ytt --version: 0.30.0
kapp --version: 0.34.0
kubectl version: Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
cf version: 7.1.0+4c3168f9a.2020-09-09
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/175707579
The labels on this github issue will be updated when the story is started.
Okay, as far as I can see it's a problem with the network policies; after we (@nrekretep) changed them a bit, all pods start. We are investigating further.
Hello @macevil, your issue seems suspiciously similar to this istio issue. I am curious to know what changes you've made to the network policies that made the pods start? Thanks
cc: @astrieanna
Hello @kauana and @astrieanna, we believe that the standard set of network policies blocks some necessary communication. We deleted network policies until all pods started.
We have now reinstalled cf and are investigating the iptables rules for violations.
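For illustration only (the thread does not say which specific policies were removed), the troubleshooting step described above could look roughly like this; as noted later in the thread, deleting network policies is not recommended for production:

# List all NetworkPolicies installed by cf-for-k8s
kubectl get networkpolicies --all-namespaces

# Delete a specific policy while troubleshooting (placeholder name and namespace)
kubectl delete networkpolicy <policy-name> -n <namespace>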
Hey @macevil, thanks for submitting this. We're wondering what's different about your environment that's causing this problem. We're currently testing across GKE, AKS, EKS, minikube, and kind and not seeing this issue in any of those environments.
We suspect that in addition to use_first_party_jwt_tokens: true, you may also want to try enable_automount_service_account_token: true
Could you please try that out and let us know if you could use additional help from us?
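A sketch of how both flags might be applied, following the heredoc pattern from the Deploy instructions above (the exact placement of these keys in cf-values.yml is an assumption):

$ cat << EOF >> ${TMP_DIR}/cf-values.yml
use_first_party_jwt_tokens: true
enable_automount_service_account_token: true
EOF
$ kapp deploy -a cf -f <(ytt -f config -f ${TMP_DIR}/cf-values.yml)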
Good morning @jamespollard8, I have additionally set the flag enable_automount_service_account_token: true, but no luck, the problem remains the same. The application only starts when I delete the network policies.
Hi @macevil,
Thanks for giving that a try! With that, we're out of ideas for next troubleshooting steps short of actually digging into kubespray ourselves. Given the prioritization of IaaS-managed clusters, we don't expect to be able to prioritize investigating kubespray further. In the meantime, we're leaving this issue open.
We do want to state that while deleting the network policies may get the installation to complete now, we recommend keeping the network policies enabled for any production environments.
Thanks, Andrew and @jamespollard8
Do you have any updates on this issue? I have tried on an Azure VM and experienced the same issue.
6:56:42AM: L ok: waiting on replicaset/cf-blobstore-minio-65cc549448 (apps/v1) namespace: cf-blobstore
6:56:42AM: L ongoing: waiting on pod/cf-blobstore-minio-65cc549448-shqn9 (v1) namespace: cf-blobstore
6:56:42AM:    ^ Condition Ready is not True (False)
6:56:49AM: fail: reconcile deployment/kpack-controller (apps/v1) namespace: kpack
6:56:49AM:  ^ Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "kpack-controller-6c555dc47f" has timed out progressing.)

kapp: Error: waiting on reconcile deployment/kpack-controller (apps/v1) namespace: kpack: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "kpack-controller-6c555dc47f" has timed out progressing.))
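A few generic commands that could help narrow down why kpack-controller is not progressing (a sketch; the exact labels and containers are not known from the report):

# Inspect the stuck deployment and its pods
kubectl -n kpack describe deployment kpack-controller
kubectl -n kpack get pods

# Logs from the controller (all containers, in case a sidecar is injected)
kubectl -n kpack logs deployment/kpack-controller --all-containers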
For me the solution was kind create cluster --config=./deploy/kind/cluster.yml --image kindest/node:v1.20.2 instead of kind create cluster --config=./deploy/kind/cluster.yml --image kindest/node:v1.21.1.
In other words, the v1.20.2 node image is working.
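To confirm which node image and Kubernetes version the kind cluster actually ended up running, something like the following could be used:

# Shows the kubelet version and container runtime per node
kubectl get nodes -o wide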