Broken namespace after too many pods
Summary
I have a microk8s setup with three master nodes and one worker node:
lukas@ckf-ha-1:~$ sudo microk8s kubectl get no
NAME            STATUS   ROLES    AGE   VERSION
ckf-ha-1        Ready    <none>   47h   v1.29.15
ckf-ha-2        Ready    <none>   47h   v1.29.15
ckf-ha-3        Ready    <none>   47h   v1.29.15
ckf-ha-worker   Ready    <none>   47h   v1.29.15
I have also set up microceph on the same nodes with one vdisk each as an OSD. On this cluster I installed CKF 1.9 and created a lot of pipeline runs in the default user's profile.
After a while, I suspect some limit within dqlite is reached and the namespace becomes unusable:
lukas@ckf-ha-1:~$ sudo microk8s kubectl get pods -n lukas
Error from server: rpc error: code = Unknown desc = (
SELECT MAX(rkv.id) AS id
FROM kine AS rkv)
I assume there are too many failed or completed pods within the namespace; other namespaces are still fine. Deleting this namespace also behaves weirdly: it reports successful deletion, but running get pods on the deleted namespace still returns the same error.
Recreating the cluster and running a manual garbage collection job, scheduled every 3 hours to delete all failed and completed pods, seems to mitigate this issue.
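A minimal sketch of the filtering logic such a garbage-collection job would apply (pods are mocked as dicts here; a real job would list and delete them through the Kubernetes API):

```python
# Sketch of the pod garbage-collection filter described above: pick out
# pods in a terminal phase (Succeeded or Failed) so they can be deleted.
# Pod objects are mocked as plain dicts; a real job would fetch them via
# the Kubernetes API and delete each match.

DELETABLE_PHASES = {"Succeeded", "Failed"}

def pods_to_delete(pods):
    """Return the names of pods whose phase marks them as deletable."""
    return [p["name"] for p in pods if p["phase"] in DELETABLE_PHASES]

pods = [
    {"name": "run-1", "phase": "Succeeded"},
    {"name": "run-2", "phase": "Running"},
    {"name": "run-3", "phase": "Failed"},
]
print(pods_to_delete(pods))  # → ['run-1', 'run-3']
```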
What Should Happen Instead?
Running thousands to hundreds of thousands of kfp runs should not corrupt the namespace or dqlite.
Reproduction Steps
- Set up microceph, microk8s, and CKF
- Increase the kfp-db-0 PV size: microk8s kubectl edit pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow
- Create a large number of kfp runs
- Here we are
Introspection Report
I need to recreate this again to provide the introspection report. I can also check whether I can reproduce it on a single-node setup.
Can you suggest a fix?
Are you interested in contributing with a fix?
Possibly
Hi @Naegionn!
I hope you're doing well. Thank you for filing your issue.
Could you please send a microk8s inspection report to help us debug the issue? I'm curious about the workload you are running on your cluster; would you be happy to share it?
Best regards, Louise
@louiseschmidtgen Hi here is the inspection report inspection-report-20250325_234441.tar.gz
Currently I am just testing Kubeflow Pipelines Stability to make sure it can handle thousands of jobs per day.
Hi @Naegionn,
The error messages for k8s-dqlite from your inspection report point to an issue in the after query used for Kubernetes watches. This issue will require further investigation on our end.
I will keep you posted on our findings. Louise
Thanks a lot
Hi @Naegionn,
I would like to gather further information on this error case. Would you be able to run your kubeflow pipelines again?
- For this case, could you please refrain from using the garbage collector?
- Please test the Microk8s snap from the channel latest/edge/test-snap.
- Please add debug logging to dqlite and k8s-dqlite: edit /var/snap/k8s/common/args/k8s-dqlite-env or /var/snap/microk8s/current/args/k8s-dqlite-env and uncomment LIBDQLITE_TRACE=1 and LIBRAFT_TRACE=1. Then restart the k8s-dqlite service and check the k8s-dqlite logs. To enable k8s-dqlite debug logging, add --debug to /var/snap/k8s/common/args/k8s-dqlite.
- Upon getting into the bad state, could you please check for db corruption by running:
ubuntu@louise-dev:~$ sudo /snap/k8s/current/bin/dqlite -s file:///var/snap/k8s/common/var/lib/k8s-dqlite/cluster.yaml -c /var/snap/k8s/common/var/lib/k8s-dqlite/cluster.crt -k /var/snap/k8s/common/var/lib/k8s-dqlite/cluster.key k8s
dqlite> pragma integrity_check;
integrity_check
ok
- Finally, please send the inspection report again.
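The debug-logging step above can be sketched as follows. The snap paths are taken from the instructions; this demo edits a throwaway copy under /tmp so it is safe to run anywhere:

```shell
# Demo of uncommenting the dqlite trace variables. On a real node the
# file would be /var/snap/microk8s/current/args/k8s-dqlite-env (or the
# k8s snap path); here we operate on a copy in /tmp.
ENV_FILE=/tmp/k8s-dqlite-env
printf '#LIBDQLITE_TRACE=1\n#LIBRAFT_TRACE=1\n' > "$ENV_FILE"

# Strip the leading '#' from both trace variables.
sed -i 's/^#LIB/LIB/' "$ENV_FILE"
cat "$ENV_FILE"
# LIBDQLITE_TRACE=1
# LIBRAFT_TRACE=1
```

After editing the real file, the k8s-dqlite service needs a restart (e.g. sudo snap restart microk8s.daemon-k8s-dqlite) before the trace output appears in its logs.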
Are you able to share the kubeflow pipelines with us? I would like to reproduce the issue on our end.
Best regards, Louise
Hey, thanks, I'll try and set it up on a VM with the latest/edge/test-snap.
Hi @Naegionn,
thank you for taking the time to help us improve Microk8s!
The pipeline I am using for testing is quite simple:
from kfp import dsl, Client
from kfp.dsl import pipeline, container_component, ContainerSpec
import time

# Define the container component with CPU limit
@container_component
def test_task():
    return ContainerSpec(
        image="ubuntu:latest",
        command=["bash", "-c"],
        args=[
            "echo 'Utilizing CPU for 1 minute...'; "
            "timeout 60s sha256sum /dev/zero &>/dev/null; "
            "echo 'Success'; exit 0; "
        ],
    )

# Define the pipeline
@pipeline(name="test-pipeline")
def test_pipeline():
    test_task().set_cpu_request("10m").set_cpu_limit("100m")

# Run the pipeline with multiple instances if executed as main
if __name__ == "__main__":
    client = Client()
    for i in range(2000):
        client.create_run_from_pipeline_func(
            test_pipeline,
            arguments={},
            run_name=f"test_run_{i}",
            enable_caching=False,
        )
        time.sleep(25)
Once the script completes I will run it again. My goal is to make sure that KFP will hold up running thousands of small jobs per day.
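For reference, the submission rate implied by the script above (2000 runs, one every 25 seconds, both numbers taken directly from the script) works out as:

```python
# Rough arithmetic for the test script: how long one batch takes and
# what sustained daily rate the 25-second submission interval implies.
runs = 2000
interval_s = 25
batch_hours = runs * interval_s / 3600   # duration of one full batch
runs_per_day = 24 * 3600 // interval_s   # sustained submission rate
print(f"{batch_hours:.1f} h per batch, ~{runs_per_day} runs/day")
# → 13.9 h per batch, ~3456 runs/day
```

So restarting the script back-to-back does indeed exercise the "thousands of jobs per day" target.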