faas-netes
Gateway Pod crashes when Profiles CRD is deleted
My actions before raising this issue
- [ ] Followed the troubleshooting guide
- [ ] Read/searched the docs
- [ ] Searched past issues
@alexellis Posting here and planning to edit and follow up, so that this issue doesn't get lost.
Expected Behaviour
I don't think that this should crash the gateway, but I do think an error should be logged.
Current Behaviour
Gateway crashes if you try to create a function with a profile; this initially seems related to the operator and CRD flags being set to `true` in the faas-netes/openfaas helm chart.
Are you a GitHub Sponsor (Yes/No?)
Check at: https://github.com/sponsors/openfaas
- [ ] Yes
- [ ] No
- [x] No, but I sponsor Alex
List All Possible Solutions and Workarounds
Which Solution Do You Recommend?
Steps to Reproduce (for bugs)
Context
Crashing our gateway instances rather than simply not creating functions.
- FaaS-CLI version (full output from `faas-cli version`): 0.13.13
- Docker version (`docker version`, e.g. Docker 17.0.05): 20.10.8
- Which deployment method do you use?:
  - [x] OpenFaaS on Kubernetes (server v1.21.3, client v1.22.2)
  - [ ] faasd
- Operating System and version (e.g. Linux, Windows, MacOS): MacOS
- Code example or link to GitHub repo or gist to reproduce problem:
- Other diagnostic information / logs from troubleshooting guide
Next steps
You may join Slack for community support.
The gateway itself won't crash due to profiles since it doesn't have anything to do with K8s; this is likely the operator container in the same pod. I'm moving this over to the faas-netes repo so someone can look into it.
@LucasRoesler please could you take a look into this?
If anyone else gets here first: the chart has to be installed with `operator.create` to match Kevin's setup. Here's how to define a profile -> https://docs.openfaas.com/reference/profiles/
The K8s version Kevin's using is 1.21, but this may also be an issue with changes in 1.22?
Install with helm or arkade.
@alexellis taking a look now
I see this warning, and it looks like the pod restarted but did not crash:
✦ $ k logs -n openfaas gateway-f4dc6c956-w8hbk faas-netes -f
2021/10/30 13:03:24 Version: 0.13.8 commit: c0a8c3cba4156fac9847953a83cea03bf54e42ef
W1030 13:03:24.564036 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021/10/30 13:03:24 HTTP Read Timeout: 1m0s
2021/10/30 13:03:24 HTTP Write Timeout: 1m0s
2021/10/30 13:03:24 ImagePullPolicy: Always
2021/10/30 13:03:24 DefaultFunctionNamespace: openfaas-fn
2021/10/30 13:03:24 Starting controller
I1030 13:03:24.587189 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:deployments
I1030 13:03:24.699939 1 shared_informer.go:230] Caches are synced for faas-netes:deployments
I1030 13:03:24.700054 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:endpoints
I1030 13:03:24.800314 1 shared_informer.go:230] Caches are synced for faas-netes:endpoints
I1030 13:03:24.800468 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:profiles
I1030 13:03:24.901292 1 shared_informer.go:230] Caches are synced for faas-netes:profiles
W1030 13:09:27.636346 1 reflector.go:404] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
using k8s 1.22.1 running on a Kind cluster
@alexellis I cannot reproduce the issue as described. I can, however, reproduce the original bug that causes the gateway pod to crash when the Profile CRD is missing.
Regarding the bug I can reproduce: I can ensure that the gateway starts by checking for the Profile CRD during startup. However, we now need to discuss the error edge cases.
When using the controller (i.e. classic faas-netes), I can also add errors to the API when it sees Profiles being used in a cluster that has not enabled them. This seems fine, but we still have error cases that can happen when the CRD is deleted after startup. If we want to be very safe, I could check for the CRD on every function deploy/update request, but those deploys will fail with an error anyway, so I am not sure the extra check is really needed. We should probably change some of the logic so that the profile client is only used when the current deployment or the current request references Profiles.
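For that kind of check, here is a minimal sketch of what it could look like, assuming the standard client-go discovery API; the helper name `profilesCRDAvailable` is hypothetical and not existing faas-netes code:

```go
package main

import (
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// profilesCRDAvailable reports whether the openfaas.com/v1 Profile kind is
// served by the API server. It could be called once at startup, or again on
// each deploy/update request that references Profiles.
func profilesCRDAvailable(clientset kubernetes.Interface) (bool, error) {
	resources, err := clientset.Discovery().ServerResourcesForGroupVersion("openfaas.com/v1")
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The group/version is not served at all: the CRD is missing.
			return false, nil
		}
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == "Profile" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("unable to load in-cluster config: %s", err)
	}

	ok, err := profilesCRDAvailable(kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		log.Fatalf("unable to check for the Profile CRD: %s", err)
	}
	if !ok {
		// Fail fast with a clear message instead of crashing later in the watch.
		log.Fatal("the Profile CRD (profiles.openfaas.com) is not installed, see https://docs.openfaas.com/reference/profiles/")
	}
}
```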
The operator has the same problem, but with a twist: we don't currently have any kind of validation webhook/controller, so we need to handle Function objects that reference Profiles even though the cluster doesn't support them.
I see four options:
- Allow all functions to be deployed, completely ignoring Profiles, and simply log a warning when we see functions requesting profiles. I am not a fan of this because it could lead to bugs that are very hard for people to debug; I am only including the option for completeness.
- Create a validation webhook endpoint in the controller so that we can return validation messages for CRD users as well; we can then mimic the same behavior as the classic faas-netes.
- Make the Profile CRD a hard requirement: the operator should crash with a clear message telling the admin to deploy both the Function and Profile CRDs, and we can add this to the helm chart install flow as well. But this approach doesn't really solve the problem in the case where someone removes the CRD after the operator starts; we would then have error cases that are not surfaced to the developer/client.
- If Profiles is disabled, modify the Function deployment so that the function is not schedulable. One interesting way to do this is to use a Profile that we know will cause the function to be broken. We have two options: (a) apply a bad RuntimeClass or (b) add a toleration for a taint that is unlikely to exist.
The way that this works is that the `GetProfiles` method would return the `DisablePodProfile` if the profiles feature is disabled (because we can't find the CRD). Conversely, `GetProfilesToRemove` would include the `DisablePodProfile` when the profiles feature is enabled. This means you have the following possibilities:

- profiles disabled: `GetProfiles` returns `DisablePodProfile`
- profiles disabled: `GetProfilesToRemove` returns `nil` or the empty list
- profiles enabled: `GetProfiles` returns the list of profiles (as it would behave today), which may be empty
- profiles enabled: `GetProfilesToRemove` returns the list of profiles (or an empty value) and we always append the `DisablePodProfile`

This combination of behaviors ensures that we disable profile-dependent functions when the CRD is missing, and that we re-enable these functions once the CRD exists.
Option (a) looks like this:

```go
// DisablePodProfile is used when the profiles feature is disabled.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
var DisablePodProfile = Profile{
	RuntimeClassName: "of-profiles-disabled",
}
```
Option (b) looks like this:

```go
// DisablePodProfile is used when the profiles feature is disabled. The toleration `openfaas-profiles=disabled`
// will be added to the Function so that it is not schedulable by default.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
//
// Alternatively, the cluster admin can override this by adding the taint
// `openfaas-profiles=disabled:NoSchedule`
// to the cluster nodes.
var DisablePodProfile = Profile{
	Tolerations: []corev1.Toleration{
		{
			Key:      "openfaas-profiles",
			Value:    "disabled",
			Operator: corev1.TolerationOpEqual,
		},
	},
}
```
I can then check if the profiles client is configured and include this profile in the Add/Remove checks. The benefit of using a Toleration is that a cluster admin could decide (for some reason) to ignore the profiles completely and allow functions to be scheduled. Additionally, they can also just fix the cluster by deploying the Profile CRD and restarting the controller/operator; it will then remove this profile the next time the functions are updated/redeployed.
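To make that concrete, here is a rough, self-contained sketch of how the Add/Remove checks could fold in `DisablePodProfile`; the types and method signatures below are simplified stand-ins, not the real faas-netes API:

```go
package profiles

// Profile is a simplified stand-in for the faas-netes Profile type.
type Profile struct {
	Name             string
	RuntimeClassName string
}

// DisablePodProfile marks functions that request Profiles while the feature
// is unavailable, e.g. option (a) above.
var DisablePodProfile = Profile{
	Name:             "of-profiles-disabled",
	RuntimeClassName: "of-profiles-disabled",
}

// Client looks up Profiles for a function; enabled is false when the
// Profile CRD could not be found.
type Client struct {
	enabled bool
	lookup  func(names []string) []Profile
}

// GetProfiles returns the profiles to apply to a function. When the feature
// is disabled, only DisablePodProfile is returned, so the function is
// blocked rather than deployed with its profiles silently ignored.
func (c *Client) GetProfiles(requested []string) []Profile {
	if !c.enabled {
		return []Profile{DisablePodProfile}
	}
	return c.lookup(requested)
}

// GetProfilesToRemove returns profiles to strip from an existing Deployment.
// When the feature is enabled, DisablePodProfile is always appended so that
// previously blocked functions are re-enabled on the next update/redeploy.
func (c *Client) GetProfilesToRemove(existing []string) []Profile {
	if !c.enabled {
		return nil
	}
	return append(c.lookup(existing), DisablePodProfile)
}
```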
For reference, I reproduced the crash using this:

```bash
kind create cluster --image kindest/node:v1.22.1 --config=cluster.yaml
arkade install openfaas -a=false --operator
kubectl -n openfaas rollout status deploy/gateway
kubectl delete crd profiles.openfaas.com
kubectl -n openfaas rollout restart deploy/gateway
kubectl -n openfaas get po -w
```
where cluster.yaml is this file:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
  - containerPort: 31112 # this is the NodePort created by the helm chart
    hostPort: 8080 # this is your port on localhost
    protocol: TCP
```
Thank you for this detailed analysis Lucas.
Can you confirm what the steps are to reproduce this problem?
@kevin-lindsay-1 I am a little lost with this issue. The original issue mentioned that the Profile CRD is missing and this caused the crash. Here you mention that the CRD exists, but there is not enough detail to actually verify that, and I can't reproduce it in my own cluster.
Can you provide more details about how everything was installed, along with verification steps? Also, what kind of end result are we expecting? As you can see in https://github.com/openfaas/faas-netes/issues/868#issuecomment-955221629, there are several things to consider around edge cases and error handling, especially in the operator.
@LucasRoesler yeah, I'm kinda surprised at the triage speed of this issue, considering I basically threw it in here last night with more or less a post-it note saying "please don't close me; reminder to self to actually write this ticket when I have time".
It'd probably be better to sit on things that don't have a full repro, since several of us have now spent limited time investigating this issue.
One thing that @kevin-lindsay-1 mentioned was that they sometimes delete OpenFaaS and install it again in their staging environment, perhaps there's an ordering problem?
Lucas, I think that if the CRD has been removed, and the gateway pod crashes, that would be working as expected.
It's a mandatory part of the project, even if it's marked as disabled or not used by the installation.
https://github.com/openfaas/faas-netes/issues/868#issuecomment-955237988
I plan on doing a full repro and updating this issue with whatever I identify to be the suspected problem; I do not intend to let this go stale, I'm just juggling over here.
@alexellis and @kevin-lindsay-1 I created a proposed change to faas-netes that will have it check for the Profiles CRD during deploy/update and block the function with an explicit error if the CRD is missing and the function requires the profiles feature. It will also crash at startup with an explicit message about the missing CRD. This should cover both cases (can't find the CRD at startup, and the CRD is deleted after startup) and make it easier to debug.
Let me know what you think
@LucasRoesler sounds fine with me, maybe our deployment had `profiles: false` and we didn't realize because of the error in the gateway.
This sounds like a good feature, as it would potentially let a developer know that devops accidentally forgot to turn on a feature.
I doubt I originally set `profiles: true` in the helm chart, because I didn't really notice the feature, and the `values.yaml` doesn't have a comment, IIRC.
I took a quick look at #872 and added comments.
Is this fixed?
@kevin-lindsay-1 the PR exists, it just needs design approval from @alexellis
I'm trying to implement profiles for tolerations and affinity, and I can confirm that I have the CRD enabled, the profiles are being created and do exist, and I am receiving this error. The gateway pod is not crashing.
Logs:
I1004 19:53:37.791376 1 shared_informer.go:247] Caches are synced for faas-netes:deployments
I1004 19:53:37.791506 1 shared_informer.go:240] Waiting for caches to sync for faas-netes:endpoints
I1004 19:53:37.891907 1 shared_informer.go:247] Caches are synced for faas-netes:endpoints
I1004 19:53:37.891983 1 shared_informer.go:240] Waiting for caches to sync for faas-netes:profiles
I1004 19:53:37.992824 1 shared_informer.go:247] Caches are synced for faas-netes:profiles
W1005 04:09:34.856603 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:09:50.249779 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:10:40.397332 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:10:52.872676 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:11:51.809675 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
2022/10/05 04:11:52 failed create Deployment spec: profile.openfaas.com "solution-python38-profile-live" not found
2022/10/05 04:11:53 failed create Deployment spec: profile.openfaas.com "solution-python38-profile-sandbox" not found
Output from function creation job:
profile.openfaas.com/solution-python38-profile-live created
+ kubectl wait --timeout=60s '--for=jsonpath={.metadata.name}=solution-python38-profile-live' profile/solution-python38-profile-live -n openfaas
profile.openfaas.com/solution-python38-profile-live condition met
+ faas-cli deploy -f ./solution-python38.yml --namespace openfaas-fn-live
Deploying: solution-python38.
Unexpected status: 400, message: unable update Deployment: solution-python38.openfaas-fn-live, error: profile.openfaas.com "solution-python38-profile-live" not found
Function 'solution-python38' failed to deploy with status code: 400
This appears to happen after creating a number of functions. I was running a script to recreate 79 different functions, one each for our live and sandbox environments for a total of 158, with two being created approximately every 45 seconds -- this is to get around limitations of our docker registry and is not related to OpenFaaS specifically. The deployment started failing after about 90 functions were created, 45 in each namespace. I also saw the error earlier in the day while experimenting, but restarted the pods after that.
The functions are being created in two separate namespaces, but the profiles are in the openfaas namespace. I didn't see a way to separate them in the documentation. Each function has its own profile due to pod anti-affinity rules.
Profile:

```yaml
apiVersion: openfaas.com/v1
kind: Profile
metadata:
  name: "solution-python38-profile-live"
  namespace: "openfaas"
spec:
  tolerations:
    - effect: NoSchedule
      key: company.com/node-group
      operator: Equal
      value: actionnodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: company.com/app
                  operator: In
                  values:
                    - "solution-python38"
            topologyKey: kubernetes.io/hostname
```
Hi @pype-leila thanks for your interest in OpenFaaS
You will need to raise your own issue with all the repro instructions. If you delete any of the template, unfortunately, we will close your issue as invalid.
Alex
/lock: stale issue