faas-netes
Gateway Pod crashes when Profiles CRD is deleted
My actions before raising this issue
- [ ] Followed the troubleshooting guide
- [ ] Read/searched the docs
- [ ] Searched past issues
@alexellis Posting here and planning to edit and follow up, so that this issue doesn't get lost.
Expected Behaviour
I don't think that this should crash the gateway, but I do think an error should be logged.
Current Behaviour
Gateway crashes if you try to create a function with a profile; this initially seems related to the operator and CRD flags being set to `true` in the faas-netes/openfaas helm chart.
Are you a GitHub Sponsor (Yes/No?)
Check at: https://github.com/sponsors/openfaas
- [ ] Yes
- [ ] No
- [x] No, but I sponsor Alex
List All Possible Solutions and Workarounds
Which Solution Do You Recommend?
Steps to Reproduce (for bugs)
Context
Crashing our gateway instances rather than simply not creating functions.
- FaaS-CLI version (full output from `faas-cli version`): 0.13.13
- Docker version (`docker version`, e.g. Docker 17.0.05): 20.10.8
- Which deployment method do you use?:
  - [x] OpenFaaS on Kubernetes (server v1.21.3, client v1.22.2)
  - [ ] faasd
- Operating System and version (e.g. Linux, Windows, MacOS): MacOS
- Code example or link to GitHub repo or gist to reproduce problem:
- Other diagnostic information / logs from troubleshooting guide
Next steps
You may join Slack for community support.
The gateway itself won't crash due to profiles since it doesn't have anything to do with K8s; this is likely the operator container in the same pod. I'm moving this over to the faas-netes repo so someone can look into it.
@LucasRoesler please could you take a look into this?
If anyone else gets here first: the chart has to be installed with `operator.create` to match Kevin's setup. Here's how to define a profile -> https://docs.openfaas.com/reference/profiles/
The K8s version Kevin's using is 1.21, but this may also be an issue with changes in 1.22?
Install with helm or arkade.
@alexellis taking a look now
I see this warning, and it looks like the pod restarted but did not crash:
✦ $ k logs -n openfaas gateway-f4dc6c956-w8hbk faas-netes -f
2021/10/30 13:03:24 Version: 0.13.8 commit: c0a8c3cba4156fac9847953a83cea03bf54e42ef
W1030 13:03:24.564036 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021/10/30 13:03:24 HTTP Read Timeout: 1m0s
2021/10/30 13:03:24 HTTP Write Timeout: 1m0s
2021/10/30 13:03:24 ImagePullPolicy: Always
2021/10/30 13:03:24 DefaultFunctionNamespace: openfaas-fn
2021/10/30 13:03:24 Starting controller
I1030 13:03:24.587189 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:deployments
I1030 13:03:24.699939 1 shared_informer.go:230] Caches are synced for faas-netes:deployments
I1030 13:03:24.700054 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:endpoints
I1030 13:03:24.800314 1 shared_informer.go:230] Caches are synced for faas-netes:endpoints
I1030 13:03:24.800468 1 shared_informer.go:223] Waiting for caches to sync for faas-netes:profiles
I1030 13:03:24.901292 1 shared_informer.go:230] Caches are synced for faas-netes:profiles
W1030 13:09:27.636346 1 reflector.go:404] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
using k8s 1.22.1 running on a Kind cluster
@alexellis I cannot reproduce the issue as described. I can, however, reproduce the original bug that causes the gateway pod to crash when the Profile CRD is missing.
Regarding the bug I can reproduce: I can ensure that the gateway starts by checking for the Profile CRD during startup. However, we now need to discuss the error edge cases.
When using the controller (i.e. classic faas-netes), I can also add errors to the API when it sees Profiles being used in a cluster that has not enabled them. This seems fine, but we still have error cases that can happen when the CRD is deleted after startup. If we want to be very safe, I could check for the CRD on every function deploy/update request, but those deploys will fail with an error anyway, so I am not sure the extra check is really needed. We should probably change some of the logic so that the profile client is only used when the current deployment or the current request references Profiles.
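For that kind of check, here is a minimal sketch of what it could look like, assuming the standard client-go discovery API; the helper name `profilesCRDAvailable` is hypothetical and not existing faas-netes code:

```go
package main

import (
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// profilesCRDAvailable reports whether the openfaas.com/v1 Profile kind is
// served by the API server. It could be called once at startup, or again on
// each deploy/update request that references Profiles.
func profilesCRDAvailable(clientset kubernetes.Interface) (bool, error) {
	resources, err := clientset.Discovery().ServerResourcesForGroupVersion("openfaas.com/v1")
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The group/version is not served at all: the CRD is missing.
			return false, nil
		}
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == "Profile" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("unable to load in-cluster config: %s", err)
	}

	ok, err := profilesCRDAvailable(kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		log.Fatalf("unable to check for the Profile CRD: %s", err)
	}
	if !ok {
		// Fail fast with a clear message instead of crashing later in the watch.
		log.Fatal("the Profile CRD (profiles.openfaas.com) is not installed, see https://docs.openfaas.com/reference/profiles/")
	}
}
```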
The operator has the same problem, but with a twist: we don't currently have any kind of validation webhook/controller, so we need to handle Function objects that reference Profiles even though the cluster doesn't support them.
I see four options:
- Allow all functions to be deployed, completely ignoring Profiles, and simply log a warning when we see functions requesting profiles. I am not a fan of this because it could lead to bugs that are very hard for people to debug; I am only including the option for completeness.
- Create a validation webhook endpoint in the controller so that we can return validation messages for CRD users as well; we can then mimic the same behavior as the classic faas-netes.
- Make the Profile CRD a hard requirement: the operator should crash with a clear message telling the admin to deploy both the Function and Profile CRDs, and we can add this to the helm chart install flow as well. But this approach doesn't really solve the problem in the case where someone removes the CRD after the operator starts; we would then have error cases that are not surfaced to the developer/client.
- If Profiles is disabled, modify the Function deployment so that the function is not schedulable. One interesting way to do this is to use a Profile that we know will cause the function to be broken. We have two options: (a) apply a bad RuntimeClass or (b) add a toleration for a taint that is unlikely to exist.
The way that this works is that the `GetProfiles` method would return the `DisablePodProfile` if the profiles feature is disabled (because we can't find the CRD). Conversely, `GetProfilesToRemove` would include the `DisablePodProfile` when the profiles feature is enabled. This means you have the following possibilities:

- profiles disabled: `GetProfiles` returns `DisablePodProfile`
- profiles disabled: `GetProfilesToRemove` returns `nil` or the empty list
- profiles enabled: `GetProfiles` returns the list of profiles (as it would behave today), which may be empty
- profiles enabled: `GetProfilesToRemove` returns the list of profiles (or an empty value) and we always append the `DisablePodProfile`

This combination of behaviors ensures that we disable profile-dependent functions when the CRD is missing, and that we re-enable these functions once the CRD exists.
Option (a) looks like this:

```go
// DisablePodProfile is used when the profiles feature is disabled.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
var DisablePodProfile = Profile{
	RuntimeClassName: "of-profiles-disabled",
}
```
Option (b) looks like this:

```go
// DisablePodProfile is used when the profiles feature is disabled. The toleration `openfaas-profiles=disabled`
// will be added to the Function so that it is not schedulable by default.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
//
// Alternatively, the cluster admin can override this by adding the taint
// `openfaas-profiles=disabled:NoSchedule`
// to the cluster nodes.
var DisablePodProfile = Profile{
	Tolerations: []corev1.Toleration{
		{
			Key:      "openfaas-profiles",
			Value:    "disabled",
			Operator: corev1.TolerationOpEqual,
		},
	},
}
```
I can then check if the profiles client is configured and include this profile in the Add/Remove checks. The benefit of using a Toleration is that a cluster admin could decide (for some reason) to ignore the profiles completely and allow functions to be scheduled. Additionally, they can also just fix the cluster by deploying the Profile CRD and restarting the controller/operator; it will then remove this profile the next time the functions are updated/redeployed.
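To make that concrete, here is a rough, self-contained sketch of how the Add/Remove checks could fold in `DisablePodProfile`; the types and method signatures below are simplified stand-ins, not the real faas-netes API:

```go
package profiles

// Profile is a simplified stand-in for the faas-netes Profile type.
type Profile struct {
	Name             string
	RuntimeClassName string
}

// DisablePodProfile marks functions that request Profiles while the feature
// is unavailable, e.g. option (a) above.
var DisablePodProfile = Profile{
	Name:             "of-profiles-disabled",
	RuntimeClassName: "of-profiles-disabled",
}

// Client looks up Profiles for a function; enabled is false when the
// Profile CRD could not be found.
type Client struct {
	enabled bool
	lookup  func(names []string) []Profile
}

// GetProfiles returns the profiles to apply to a function. When the feature
// is disabled, only DisablePodProfile is returned, so the function is
// blocked rather than deployed with its profiles silently ignored.
func (c *Client) GetProfiles(requested []string) []Profile {
	if !c.enabled {
		return []Profile{DisablePodProfile}
	}
	return c.lookup(requested)
}

// GetProfilesToRemove returns profiles to strip from an existing Deployment.
// When the feature is enabled, DisablePodProfile is always appended so that
// previously blocked functions are re-enabled on the next update/redeploy.
func (c *Client) GetProfilesToRemove(existing []string) []Profile {
	if !c.enabled {
		return nil
	}
	return append(c.lookup(existing), DisablePodProfile)
}
```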
For reference, I reproduced the crash using this:

```bash
kind create cluster --image kindest/node:v1.22.1 --config=cluster.yaml
arkade install openfaas -a=false --operator
kubectl -n openfaas rollout status deploy/gateway
kubectl delete crd profiles.openfaas.com
kubectl -n openfaas rollout restart deploy/gateway
kubectl -n openfaas get po -w
```
where cluster.yaml is this file:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
  - containerPort: 31112 # this is the NodePort created by the helm chart
    hostPort: 8080 # this is your port on localhost
    protocol: TCP
```
Thank you for this detailed analysis Lucas.
Can you confirm what the steps are to reproduce this problem?
@kevin-lindsay-1 I am a little lost with this issue. The original issue mentioned that the Profile CRD is missing and this caused the crash. Here you mention that the CRD exists, but there is not enough detail to actually verify that, and I can't reproduce it in my own cluster.
Can you provide more details about how everything was installed, along with verification steps? Also, what kind of end result are we expecting? As you can see in https://github.com/openfaas/faas-netes/issues/868#issuecomment-955221629, there are several things to consider around edge cases and error handling, especially in the operator.
@LucasRoesler yeah, I'm kinda surprised at the triage speed of this issue, considering I basically threw it in here last night with more or less a post-it note saying "please don't close me; reminder to self to actually write this ticket when I have time".
It'd probably be better to sit on things that don't have a full repro, since several of us have now spent limited time investigating this issue.
One thing that @kevin-lindsay-1 mentioned was that they sometimes delete OpenFaaS and install it again in their staging environment, perhaps there's an ordering problem?
Lucas, I think that if the CRD has been removed, and the gateway pod crashes, that would be working as expected.
It's a mandatory part of the project, even if it's marked as disabled or not used by the installation.
https://github.com/openfaas/faas-netes/issues/868#issuecomment-955237988
I plan on doing a full repro and updating this issue with whatever I identify to be the suspected problem; I do not intend to let this go stale, I'm just juggling over here.
@alexellis and @kevin-lindsay-1 I created a proposed change to faas-netes that will have it check for the Profiles CRD during deploy/update and block the function with an explicit error if the CRD is missing and the function requires the profiles feature. It will also crash at startup with an explicit message about the missing CRD. This should cover both cases (can't find the CRD at startup, and the CRD is deleted after startup) and make it easier to debug.
Let me know what you think
@LucasRoesler sounds fine with me, maybe our deployment had `profiles: false` and we didn't realize because of the error in the gateway.
This sounds like a good feature, as it would potentially let a developer know that devops accidentally forgot to turn on a feature.
I doubt I originally set `profiles: true` in the helm chart, because I didn't really notice the feature, and the `values.yaml` doesn't have a comment, IIRC.
I took a quick look at #872 and added comments.
Is this fixed?
@kevin-lindsay-1 the PR exists, it just needs design approval from @alexellis
I'm trying to implement profiles for tolerations and affinity, and I can confirm that I have the CRD enabled, the profiles are being created and do exist, and I am receiving this error. The gateway pod is not crashing.
Logs:
I1004 19:53:37.791376 1 shared_informer.go:247] Caches are synced for faas-netes:deployments
I1004 19:53:37.791506 1 shared_informer.go:240] Waiting for caches to sync for faas-netes:endpoints
I1004 19:53:37.891907 1 shared_informer.go:247] Caches are synced for faas-netes:endpoints
I1004 19:53:37.891983 1 shared_informer.go:240] Waiting for caches to sync for faas-netes:profiles
I1004 19:53:37.992824 1 shared_informer.go:247] Caches are synced for faas-netes:profiles
W1005 04:09:34.856603 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:09:50.249779 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:10:40.397332 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:10:52.872676 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
W1005 04:11:51.809675 1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding
2022/10/05 04:11:52 failed create Deployment spec: profile.openfaas.com "solution-python38-profile-live" not found
2022/10/05 04:11:53 failed create Deployment spec: profile.openfaas.com "solution-python38-profile-sandbox" not found
Output from function creation job:
profile.openfaas.com/solution-python38-profile-live created
+ kubectl wait --timeout=60s '--for=jsonpath={.metadata.name}=solution-python38-profile-live' profile/solution-python38-profile-live -n openfaas
profile.openfaas.com/solution-python38-profile-live condition met
+ faas-cli deploy -f ./solution-python38.yml --namespace openfaas-fn-live
Deploying: solution-python38.
Unexpected status: 400, message: unable update Deployment: solution-python38.openfaas-fn-live, error: profile.openfaas.com "solution-python38-profile-live" not found
Function 'solution-python38' failed to deploy with status code: 400
This appears to happen after creating a number of functions. I was running a script to recreate 79 different functions, one each for our live and sandbox environments for a total of 158, with two being created approximately every 45 seconds -- this is to get around limitations of our docker registry and is not related to OpenFaaS specifically. The deployment started failing after about 90 functions were created, 45 in each namespace. I also saw the error earlier in the day while experimenting, but restarted the pods after that.
The functions are being created in two separate namespaces, but the profiles are in the openfaas namespace. I didn't see a way to separate them in the documentation. Each function has its own profile due to pod anti-affinity rules.
Profile:

```yaml
apiVersion: openfaas.com/v1
kind: Profile
metadata:
  name: "solution-python38-profile-live"
  namespace: "openfaas"
spec:
  tolerations:
    - effect: NoSchedule
      key: company.com/node-group
      operator: Equal
      value: actionnodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: company.com/app
                  operator: In
                  values:
                    - "solution-python38"
            topologyKey: kubernetes.io/hostname
```
Hi @pype-leila thanks for your interest in OpenFaaS
You will need to raise your own issue with all the repro instructions. If you delete any of the template, unfortunately, we will close your issue as invalid.
Alex
/lock: stale issue