Pulp performance issues
Version
Deployed via Pulp Operator v1.0.0-beta.4 on K8s 1.26.
$ pulp status
{
  "versions": [
    {
      "component": "core",
      "version": "3.49.1",
      "package": "pulpcore",
      "module": "pulpcore.app",
      "domain_compatible": true
    },
    <snip>
  ]
  <snip>
Describe the bug
We've seen (increasingly) poor performance from our Pulp instance lately, and it's not entirely clear to us why. Some of the behaviors we've seen are:
- Workers failing liveness checks (but not so badly that K8s restarts the pods)
2024-07-25T12:07:07.934721504Z pulp [92902ed0f2db4a62a53dfdaa49f34a24]: pulpcore.tasking.tasks:INFO: Task completed 0190e9c3-30b6-760c-8e26-deb015569ad7
2024-07-25T12:07:50.658526639Z pulp [69a42f792d4f42728d6f9d49155a93e4]: pulpcore.tasking.worker:INFO: Worker '1@pulp-worker-746fc4f5cb-nl62g' is back online.
- File uploads taking significant amounts of time (>8 hours for 800 10-ish MB files)
- CLI operations seem to take a while:
$ time pulp -v task list --state running
tasks_list : get https://pulp.<domain>/pulp/api/v3/tasks/?state=running&offset=0&limit=25
Response: 200
[]

real    0m16.427s
user    0m0.258s
sys     0m0.032s
- Potentially (?) the issue in https://github.com/pulp/pulp_container/issues/1716.
The Pulp pods use an NFS-based storageClass, but we have pretty much ruled out NFS congestion/slowness as a cause. We have 10 API pods, 5 content pods, and 10 worker pods (most of which are always idle), which seems like it should be enough to handle our use case, and none of them seem to be consuming unusual amounts of CPU or memory. We've identified some potential performance tuning that could be done on the DB, but we're not seeing deadlocks or similar indications of congestion, so I'm not confident that will necessarily solve things. I guess we're just wondering if there's some undocumented tuning/configuration you could point us to.
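For context, the "no congestion" claim above is based on spot checks against pg_stat_activity like the one below; the pod name, user, and database are placeholders for our operator-managed instance:
$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp \
    -c "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state != 'idle' GROUP BY 1, 2 ORDER BY 3 DESC;"
Lock waits or a backlog of non-idle sessions would show up here.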
To Reproduce
Steps to reproduce the behavior: It's unclear... it seems like, as our Pulp instance gets larger, it just slows down.
Expected behavior
Pulp should remain performant as we scale the infrastructure to support our use case.
Additional context
N/A
Is it expected that workers would not show online status if they're not processing a job? The API and Content apps seem to return the correct number of processes available, but not all 10 workers show as online (although their pods are healthy and there's nothing in the logs to indicate an issue).
$ pulp status | jq -s '.[] | .online_content_apps | length'
10
$ pulp status | jq -s '.[] | .online_api_apps | length'
20
$ pulp status | jq -s '.[] | .online_workers | length'
2
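To sanity-check which worker records the API still considers live, I've also been comparing heartbeats directly (this assumes a pulp-cli version that has the worker subcommand; the field names come from the /pulp/api/v3/workers/ endpoint):
$ pulp worker list | jq '.[] | {name, last_heartbeat}'
As far as I understand it, workers drop out of online_workers once their last_heartbeat is older than the TTL.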
For the workers going offline, I did see this in the API pod startup:
2024-07-25T22:25:15.016075012Z pulp [None]: pulpcore.app.entrypoint:WARNING: API_APP_TTL (120) is smaller than double the gunicorn timeout (900.0). You may experience workers wrongly reporting as missing
I mentioned this previously (https://github.com/pulp/pulp_container/issues/1592), but it appears that this value is hard-coded. Is the only way to override it to overwrite settings.py via a volumeMount?
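(One thing I may try in the meantime: since pulpcore's settings are Dynaconf-backed, a PULP_-prefixed environment variable on the API deployment should also override API_APP_TTL. The deployment name below is a placeholder, and I'm not sure whether the operator reconciles away env edits made this way.)
$ kubectl set env deployment/<pulp-api-deployment> PULP_API_APP_TTL=1800
1800 would put the TTL at double the 900-second gunicorn timeout from the warning above.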
I saw at one point that one of the failing pods (I think it was the API, but I'm not sure; it could've been a worker) had a message from PostgreSQL complaining about "too many clients already". I scaled our API pods down to 5 but increased gunicorn_workers to 4 to compensate, and reduced our worker pods from 10 to 5 (and, interestingly, all 5 now show online).
Things seem to be running a little more smoothly, but I'll keep an eye on it tomorrow during business hours. The Gunicorn docs make specific mention of too many workers thrashing your system, which makes sense...
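For anyone else who hits "too many clients already", this is roughly how I checked which components were holding connections (placeholders again for pod/user/db; application_name may be empty for some clients):
$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp \
    -c "SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;"
Depending on how many connections each gunicorn/content/worker process holds, 10 API pods plus the content apps and task workers could plausibly approach PostgreSQL's default max_connections of 100.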
@grzleadams It seems to me that you are being constrained by the resources available to the database. Is the database being managed by the operator also? What resource constraints does your database have?
We are managing the PostgreSQL DB with the operator but didn't make any changes to the configuration (i.e., we use the defaults). There are no CPU/memory limits associated with the DB. I've thought about making changes to max_connections, work_mem, etc., but I haven't actually seen any evidence of congestion on the DB.
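To answer the resource-constraint question more concretely, the main thing we've checked so far is the DB pod's live usage (requires metrics-server; pod name is a placeholder):
$ kubectl top pod <postgres-pod> --containers
Nothing there has looked out of line so far.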
@dkliban How can I modify the PostgreSQL configuration (e.g., max_connections) via the Operator?
Nevermind, got it:
postgres_extra_args:
  - -c
  - max_connections=1000
  - -c
  - shared_buffers=512MB
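And to confirm the new values actually took effect inside the container (pod name is a placeholder):
$ kubectl exec -it <postgres-pod> -- psql -U pulp -c "SHOW max_connections;" -c "SHOW shared_buffers;"
These should now report 1000 and 512MB.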
We can close this issue; I haven't seen any of the deadlocks, dead workers, etc., since tuning PostgreSQL and gunicorn. That said, if there's ever a Pulp performance tuning doc, I'd be happy to contribute the lessons I've learned the hard way. :)