containerd-wasm-shims
Using `limits` with the shim makes the pod fail.
The livenessProbe reports failure continuously. I'm not sure whether the pod is being restarted because of that, whether it ever actually runs, or what the underlying problem is.
Repro using k3d
k3d cluster create wasm-cluster \
  --image ghcr.io/deislabs/containerd-wasm-shims/examples/k3d:v0.10.0 \
  -p "8081:80@loadbalancer" \
  --agents 0

kubectl apply -f https://raw.githubusercontent.com/deislabs/containerd-wasm-shims/main/deployments/workloads/runtime.yaml
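Before applying the workloads, it may be worth confirming that the shim's RuntimeClass was registered; these are standard kubectl commands, and `wasmtime-spin` is the name referenced by the deployments below.

# Check that the RuntimeClass applied from runtime.yaml exists
kubectl get runtimeclass wasmtime-spin
# Confirm the single-node cluster is ready before scheduling the test pods
kubectl get nodes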
Then apply the following workloads for comparison:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fails
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fails
  template:
    metadata:
      labels:
        app: fails
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: fails
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:latest
          command: ["/"]
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /.well-known/spin/health
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: works
spec:
  replicas: 1
  selector:
    matchLabels:
      app: works
  template:
    metadata:
      labels:
        app: works
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: works
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:latest
          command: ["/"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /.well-known/spin/health
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 3
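To observe the difference between the two, standard kubectl commands are enough; the label selectors match the deployments above.

# Watch the failing pod get restarted by the kubelet
kubectl get pods -l app=fails -w
# Events should show the repeated liveness probe failures
kubectl describe pod -l app=fails
# The deployment without limits should stay Running
kubectl get pods -l app=works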
Just wanted to add my findings here as well. It seems there is a CPU spike during startup that gets throttled by the resource limits. This might not be specific to the shim but a general issue with tight CPU limits in Kubernetes. For example, I used the following two deployments to check how long it takes for Spin's port to open; with higher limits (or none) on the pod, the port opens in a relatively short time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-slow-start
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-slow-start
  template:
    metadata:
      labels:
        app: spin-slow-start
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: spin-hello
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:v0.10.0
          command: ["/"]
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
        - image: alpine:latest
          name: debug-alpine
          command: ["/bin/sh", "-c"]
          args:
            - |
              TARGET_HOST='127.0.0.1'
              echo "START: waiting for $TARGET_HOST:80"
              timeout 60 sh -c 'until nc -z $0 $1; do sleep 1; done' $TARGET_HOST 80
              echo "END: waiting for $TARGET_HOST:80"
              sleep 100000000
          resources: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-faster-start
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-faster-start
  template:
    metadata:
      labels:
        app: spin-faster-start
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: spin-hello
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:v0.10.0
          command: ["/"]
          resources:
            limits:
              cpu: 400m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
        - image: alpine:latest
          name: debug-alpine
          command: ["/bin/sh", "-c"]
          args:
            - |
              TARGET_HOST='127.0.0.1'
              echo "START: waiting for $TARGET_HOST:80"
              timeout 60 sh -c 'until nc -z $0 $1; do sleep 1; done' $TARGET_HOST 80
              echo "END: waiting for $TARGET_HOST:80"
              sleep 100000000
          resources: {}
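The time until the port opens can be read from the sidecar's log timestamps using standard kubectl flags; the container name debug-alpine is taken from the deployments above.

# Difference between the START and END lines ≈ time until Spin's port opened
kubectl logs deploy/spin-slow-start -c debug-alpine --timestamps
kubectl logs deploy/spin-faster-start -c debug-alpine --timestamps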
Maybe the fix here is to just remove the limits from the example deployments, or bump them up? We could also evaluate adding a CPU value under `overhead.podFixed` in the runtime class so the limits are more tolerant of startup spikes, though that might impact the ability to schedule the pods on smaller nodes.
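A rough sketch of what that could look like, assuming the RuntimeClass shape from runtime.yaml; the handler value has to match whatever handler name the shim registers with containerd, and the overhead quantities below are placeholders rather than measured values.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmtime-spin
handler: spin          # must match the shim's handler name in the containerd config
overhead:
  podFixed:
    cpu: 100m          # placeholder: extra CPU accounted to every pod using this class
    memory: 20Mi       # placeholder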
@mikkelhegn could you check whether higher limits help?
You might also play with the livenessProbe settings. Delaying the first probe a few more seconds, past the CPU spike of the initial boot, might help too:
initialDelaySeconds: 10
periodSeconds: 3
I had to bump `initialDelaySeconds` to 45 seconds to keep the livenessProbe from failing.
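Instead of a 45-second initialDelaySeconds, a plain Kubernetes startupProbe might tolerate the slow boot more cleanly, since it holds off the liveness probe until the first successful check. This is an untested sketch; the thresholds are guesses, and the health path is the one used in the deployments above.

startupProbe:
  httpGet:
    path: /.well-known/spin/health
    port: 80
  periodSeconds: 2
  failureThreshold: 30   # allows up to ~60s of startup before the container is killed
livenessProbe:
  httpGet:
    path: /.well-known/spin/health
    port: 80
  periodSeconds: 3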