flink-on-k8s-operator

Helm Chart install uses self-signed cert

Open technomage opened this issue 5 years ago • 10 comments

The Helm chart creates a self-signed cert which is being rejected by kubectl apply when I try to create a job cluster.

michael@michael:~$ sudo kubectl apply -f cog/authoring/test.yaml
[sudo] password for michael:
Error from server (InternalError): error when creating "cog/authoring/test.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.flink-operator-system.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: x509: certificate signed by unknown authority

technomage avatar Jul 08 '20 15:07 technomage

Is this a helm chart specific problem? Does it manifest with make deploy?

functicons avatar Jul 09 '20 15:07 functicons

What does cog/authoring/test.yaml do exactly? Can you attach the exact install commands you've run in CLI?

hongyegong avatar Jul 10 '20 17:07 hongyegong

I am also seeing this issue. Is it possible that the webhook certs are being overwritten after a helm upgrade? I'm using Helm 3. See below: clientConfig.caBundle had a value and was then set to the blank value Cg==

e.g.:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"admissionregistration.k8s.io/v1beta1","kind":"MutatingWebhookConfiguration","metadata":{"annotations":{},"creationTimestamp":null,"name":"flink-operator-mutating-webhook-configuration"},"webhooks":[{"clientConfig":{"caBundle":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM2ekNDQWRNQ0ZCL0ZDekZOWk1Za2hIQmhjeGszZzJmMTB3UnpNQTBHQ1NxR1NJYjNEUUVCQ3dVQU1Db3gKS0RBbUJnTlZCQU1NSDBGa2JXbHpjMmx2YmlCRGIyNTBjbTlzYkdWeUlGZGxZbWh2YjJzZ1EwRXdIaGNOTWpBdwpOekUyTURBek5UUXlXaGNOTWpBd09ERTFNREF6TlRReVdqQTZNVGd3TmdZRFZRUUREQzltYkdsdWF5MXZjR1Z5CllYUnZjaTEzWldKb2IyOXJMWE5sY25acFkyVXVabXhwYm1zdGMzbHpkR1Z0TG5OMll6Q0NBU0l3RFFZSktvWkkKaHZjTkFRRUJCUUFEZ2dFUEFEQ0NBUW9DZ2dFQkFMN1g2Znk5YnZwZHRkYlYrVm5pdVZWSFNCVzhsbmM1VmYxMgpMMUpMcGFzYjhSc0pyak91eXJ4SkkxSGFmMVczczRWM2tqTE84ZnptQ2FPUVJFeWxaRElpaXc2S3dDdTgwSmV2CmdWc2RGd0twY1ZhOW1JWUJJVHZZTGpqdDNSbHBTZ3U3ZStURzgzcUUxYkhlV2lCa1IyQVdvb3ZVbVYvakU0dUMKREFhUHJzOTJKMG1Xbm9QMWErODh4Z2g2eE5zZ2xYRjlZQmk3RzBGL04ybFZnNXJnczNvTXpEdXU0cWlmV0d6bwoxNENpSFowWGwrbnAwdDRuY3pJRk4rck5yN3RMd3B0SDJzL0pTaUYwSlk3eUhMVVZBeFhPSWh4d2RoelFjQ0FiClU1UTRRWVU1Z0IvL1RjamNIRWVtUkUzYUQwTFY3TDlkRkplRVFIZ25ydjR6VDU3VUZLRUNBd0VBQVRBTkJna3EKaGtpRzl3MEJBUXNGQUFPQ0FRRUFhYmhxaG1wcWdOd2pIQmJPa1BwSC8zeGhVVERYaGVzQXd2Yi8wdDhCaVRILwpJWU1xcGpNRWRVR2hKclZ1cGVWNGVGRkxFNG51VUd6SmVxcmY4NFBtdUZaN0EweVkyd3czV2RKZ3gvN0xFcmJYCi90MEhMMWVHMzhyR3FFenFvZk5mWFUvSytnclVkWW8rWGdqWFluZnY5WXBKbHJnYzRIeDZqN0ZjN2thVmJKR0cKcnY3bHF4M0pOYlNIZkI0b2JtYnc1dFpLODRRbEhuN215aFoxdHowbDNsblF4TGVHdUdTdXN0STBxaVFtcU5SQQpiN0tVSGkwMUxxN0l1MkgvdXJ4OXJwamJzYjNPd2xEYUlaS2pOTUYySmxoYS8xT0FCc0x5T1B1NFJtaExiUCt0CnIvUHFWcnR3WTRNYUZpd2Vjc0VaR1BLWHZmaE1pRW1DdVZYVXRUenk0dz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K","service":{"name":"flink-operator-webhook-service","namespace":"flink-system","path":"/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster"}},"failurePolicy":"Fail","name":"mflinkcluster.flinkoperator.k8s.io","rules":[{"apiGroups":["flinkoperator.k8s.io"],"apiVersions":["v1beta1"],"operations":["CREATE","UPDATE"],"resou
rces":["flinkclusters"]}]}]}
  creationTimestamp: "2020-07-16T00:35:30Z"
  generation: 3
  name: flink-operator-mutating-webhook-configuration
  resourceVersion: "162635766"
  selfLink: /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/flink-operator-mutating-webhook-configuration
  uid: f1fb4a9c-b6ae-47c1-8cf4-f502d1ddf9b6
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: Cg==
    service:
      name: flink-operator-webhook-service
      namespace: flink-system
      path: /mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
      port: 443
  failurePolicy: Fail
  matchPolicy: Exact
  name: mflinkcluster.flinkoperator.k8s.io
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - flinkoperator.k8s.io
    apiVersions:
    - v1beta1
    operations:
    - CREATE
    - UPDATE
    resources:
    - flinkclusters
    scope: '*'
  sideEffects: Unknown
  timeoutSeconds: 30

jaredstehler avatar Jul 16 '20 13:07 jaredstehler

The Helm chart generates self-signed certs in a task within the chart. It should at least use the cluster CA to generate the cert.

technomage avatar Jul 17 '20 15:07 technomage

The Cg== you are seeing is supposed to be replaced by a self-generated cert; this was a move to decouple the Flink operator from cert-manager. A job cluster could be created successfully at the time the last chart release was made. Some components in the Helm chart may have become outdated since then and need to be synced with the most up-to-date operator code base. I'll try to make a new chart and see if that solves the problem.

hongyegong avatar Jul 20 '20 21:07 hongyegong

I'm running into this same issue. Initially the certificate is correct, then if you update anything it becomes unset in both the MutatingWebhookConfiguration and ValidatingWebhookConfiguration.

Removing the operator chart and reinstalling it sets the correct certificate.

I believe the issue is that the Job runs on every update. Helm 3 has hooks where you can specify the condition for when a Job should run.

By adding the annotation below to the cert-job, this issue should be resolved for fresh installs.

annotations:
    "helm.sh/hook": post-install

KamalAman avatar Aug 26 '20 20:08 KamalAman

If you're running this from the Helm repo, the released version of the chart is unable to create secrets and will fail; the version on master is fine.

stevehipwell avatar Sep 01 '20 09:09 stevehipwell

I'm still seeing this behavior (version from master). Uninstalling and then reinstalling, as @KamalAman said, did solve it, but I think it's important to figure out why this keeps happening. Does anyone have a suggestion? @functicons, @hongyegong maybe?

shashken avatar Jan 18 '21 15:01 shashken

The two WebhookConfigurations are first created by the Helm template:

  • https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/helm-chart/flink-operator/templates/flink-operator.yaml#L11
  • https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/helm-chart/flink-operator/templates/flink-operator.yaml#L386

The cert-job is then started, and it applies a version of the webhookConfig with the caBundle set, using the "envsubst templates" stored in the webhook-configMap (https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/helm-chart/flink-operator/templates/generate-cert.yaml#L67).

If you run helm again, the cert-job will not run since it's still in the cluster (in state completed), but the Helm template versions with the \n caBundle will be applied, since they differ from what's in the cluster.
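
For reference, Cg== is the base64 encoding of a single newline, which is why the caBundle looks blank. A minimal shell sketch for spotting that state is below. The function name ca_bundle_is_blank is made up for illustration, and the sketch assumes a base64 command that accepts -d (as in GNU coreutils). Input that fails to decode is also treated as blank.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when a base64-encoded caBundle
# decodes to nothing but whitespace, e.g. the "Cg==" placeholder.
ca_bundle_is_blank() {
  decoded=$(printf '%s' "$1" | base64 -d 2>/dev/null | tr -d '[:space:]')
  [ -z "$decoded" ]
}

# Example against a live cluster (resource name taken from the manifest
# quoted earlier in this thread):
#   bundle=$(kubectl get mutatingwebhookconfiguration \
#     flink-operator-mutating-webhook-configuration \
#     -o jsonpath='{.webhooks[0].clientConfig.caBundle}')
#   ca_bundle_is_blank "$bundle" && echo "caBundle is blank"
```

Running the commented example after a helm upgrade should show whether the template version has replaced the value written by the cert-job.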

If you delete the Job before rerunning helm, the job will be re-created and overwrite the caBundle with a proper value, and everything works again.
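
The delete-then-upgrade workaround could be scripted roughly as follows. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and the namespace, Job name, release name and chart path are all placeholders that must be replaced with the names used in your cluster.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: upgrade_with_fresh_cert_job NAMESPACE CERT_JOB RELEASE CHART
# All four arguments are placeholders (assumptions), not names taken
# from the chart itself.
upgrade_with_fresh_cert_job() {
  namespace=$1
  cert_job=$2
  release=$3
  chart=$4
  # Remove the completed Job so that helm re-creates it and it runs again.
  run kubectl delete job "$cert_job" --namespace "$namespace" --ignore-not-found || return 1
  # Re-apply the chart; the re-created Job should then rewrite the caBundle.
  run helm upgrade "$release" "$chart" --namespace "$namespace"
}
```

Setting DRY_RUN=1 before calling the function prints the two commands without touching the cluster, which is a cheap way to check the names first.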

A possible solution is to remove the two WebhookConfigs from flink-operator.yaml and leave it up to the cert job to create them. A pre-delete hook is then needed to remove them on uninstall, as helm is no longer maintaining them directly.

jalkjaer avatar Jan 29 '21 01:01 jalkjaer

After the first deployment of the Flink operator, I can't apply a job manifest either, due to a certificate validation error. I extracted the generated secret with the crt and key and imported it into the trust store; after that I was able to create a job with mflinkcluster.flinkoperator.k8s.io. What did I do wrong?

pashtet04 avatar Jun 22 '21 12:06 pashtet04