
Broken namespace after too many pods

Open Naegionn opened this issue 9 months ago • 8 comments

Summary

I have a microk8s setup with three master nodes and one worker node.

lukas@ckf-ha-1:~$ sudo microk8s kubectl get no
NAME            STATUS   ROLES    AGE   VERSION
ckf-ha-1        Ready    <none>   47h   v1.29.15
ckf-ha-2        Ready    <none>   47h   v1.29.15
ckf-ha-3        Ready    <none>   47h   v1.29.15
ckf-ha-worker   Ready    <none>   47h   v1.29.15

I have also set up microceph on the same nodes with one vdisk each as an OSD. On this cluster I installed CKF 1.9 and created a lot of pipeline runs in the default user's profile.

After a while, I suspect some limit within dqlite is reached and the namespace becomes unusable:

lukas@ckf-ha-1:~$ sudo microk8s kubectl get pods -n lukas
Error from server: rpc error: code = Unknown desc = (
                SELECT MAX(rkv.id) AS id
                FROM kine AS rkv)

I assume there are too many failed or completed pods within the namespace; other namespaces are still fine. Deleting this namespace also behaves weirdly: it reports successful deletion, but running get pods on the deleted namespace still returns the same error.

Recreating the cluster and running a manual garbage collection job, scheduled every 3 hours, to delete all failed and completed pods seems to mitigate this issue.
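
A minimal sketch of such a cleanup (the namespace and the 3-hour schedule are from my setup; adjust as needed) is simply deleting pods by phase:

# delete completed and failed pods in the affected namespace,
# scheduled e.g. every 3 hours via cron or a Kubernetes CronJob
sudo microk8s kubectl delete pods -n lukas --field-selector=status.phase=Succeeded
sudo microk8s kubectl delete pods -n lukas --field-selector=status.phase=Failed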

What Should Happen Instead?

Running thousands to hundreds of thousands of KFP runs should not corrupt the namespace or dqlite.

Reproduction Steps

  1. Set up microceph, microk8s and CKF.
  2. Increase the kfp-db-0 PV size: microk8s kubectl edit pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow (a non-interactive alternative is sketched after this list).
  3. Create a large number of KFP runs.
  4. Here we are.
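
For step 2, a non-interactive alternative (the 20Gi value is just an example size, and XXX stands for the generated suffix) is to patch the PVC directly, provided the storage class allows volume expansion:

sudo microk8s kubectl patch pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'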

Introspection Report

I need to recreate this again to provide the introspection report. I can also check whether I can reproduce it on a single-node setup.

Can you suggest a fix?

Are you interested in contributing with a fix?

Possibly

Naegionn avatar Apr 01 '25 22:04 Naegionn

Hi @Naegionn!

I hope you're doing well. Thank you for filing your issue.

Could you please send a microk8s inspection report to help us debug the issue? I'm also curious about the workload you are running on your cluster; would you be happy to share it?

Best regards, Louise

louiseschmidtgen avatar Apr 02 '25 13:04 louiseschmidtgen

@louiseschmidtgen Hi, here is the inspection report: inspection-report-20250325_234441.tar.gz

Currently I am just testing Kubeflow Pipelines stability to make sure it can handle thousands of jobs per day.

Naegionn avatar Apr 07 '25 12:04 Naegionn

Hi @Naegionn,

The error messages for k8s-dqlite from your inspection report point to an issue in the after query used for Kubernetes watches. This issue will require further investigation on our end.

I will keep you posted on our findings. Louise

louiseschmidtgen avatar Apr 09 '25 08:04 louiseschmidtgen

Thanks a lot

Naegionn avatar Apr 09 '25 23:04 Naegionn

Hi @Naegionn,

I would like to gather further information on this error case. Would you be able to run your kubeflow pipelines again?

  1. For this case, could you please refrain from using the garbage collector?

  2. Please test the Microk8s snap from the channel: latest/edge/test-snap.

  3. Please add logging to dqlite and k8s-dqlite:

    Please enable debug logs by editing /var/snap/k8s/common/args/k8s-dqlite-env or /var/snap/microk8s/current/args/k8s-dqlite-env and uncommenting LIBDQLITE_TRACE=1 and LIBRAFT_TRACE=1. Then restart the k8s-dqlite service and check the k8s-dqlite logs. To enable k8s-dqlite debug logging, add --debug to /var/snap/k8s/common/args/k8s-dqlite. (A sketch of the equivalent commands on a MicroK8s node follows after this list.)

  4. Upon getting into the bad state, could you please check for DB corruption by running:

ubuntu@louise-dev:~$ sudo /snap/k8s/current/bin/dqlite -s file:///var/snap/k8s/common/var/lib/k8s-dqlite/cluster.yaml -c /var/snap/k8s/common/var/lib/k8s-dqlite/cluster.crt -k /var/snap/k8s/common/var/lib/k8s-dqlite/cluster.key k8s
dqlite> pragma integrity_check;
integrity_check	
ok
  5. Finally, please send the inspection report again.
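
For reference, on a MicroK8s node steps 2 to 4 would look roughly like the following; the paths and service names are the usual MicroK8s locations, so please adjust them if your installation differs:

# refresh the snap to the requested test channel
sudo snap refresh microk8s --channel=latest/edge/test-snap
# enable dqlite/raft tracing (assumes the variables are present but commented out)
sudo sed -i 's/^#\s*LIBDQLITE_TRACE=1/LIBDQLITE_TRACE=1/; s/^#\s*LIBRAFT_TRACE=1/LIBRAFT_TRACE=1/' /var/snap/microk8s/current/args/k8s-dqlite-env
# enable k8s-dqlite debug logging
echo '--debug' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite
# restart the datastore and follow its logs
sudo snap restart microk8s.daemon-k8s-dqlite
sudo journalctl -u snap.microk8s.daemon-k8s-dqlite -f
# once the namespace breaks, open the dqlite shell with the MicroK8s paths and run: pragma integrity_check;
sudo /snap/microk8s/current/bin/dqlite -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s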

Are you able to share the kubeflow pipelines with us? I would like to reproduce the issue on our end.

Best regards, Louise

louiseschmidtgen avatar Apr 23 '25 14:04 louiseschmidtgen

Hey, thanks, I'll try to set it up on a VM with the latest/edge/test-snap.

Naegionn avatar Apr 23 '25 22:04 Naegionn

Hi @Naegionn,

Thank you for taking the time to help us improve Microk8s!

louiseschmidtgen avatar Apr 24 '25 07:04 louiseschmidtgen

The pipeline I am using for testing is quite simple.

from kfp import Client
from kfp.dsl import pipeline, container_component, ContainerSpec
import time

# Define the container component with CPU limit
@container_component
def test_task():
    return ContainerSpec(
        image="ubuntu:latest",
        command=["bash", "-c"],
        args=[
            "echo 'Utilizing CPU for 1 minutes...'; "
            "timeout 60s sha256sum /dev/zero &>/dev/null; "
            "echo 'Success'; exit 0; "
        ]
    )

# Define the pipeline
@pipeline(name="test-pipeline")
def test_pipeline():
    test_task().set_cpu_request("10m").set_cpu_limit("100m")

# Run the pipeline with multiple instances if executed as main
if __name__ == "__main__":
    client = Client()
    for i in range(2000): 
        client.create_run_from_pipeline_func(
            test_pipeline,
            arguments={},
            run_name=f"test_run_{i}",
            enable_caching=False
        )
        time.sleep(25)

Once the script completes I will run it again. My goal is to make sure that KFP will hold up running thousands of small jobs per day.

Naegionn avatar Apr 24 '25 14:04 Naegionn