percona-server-mongodb-operator
Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1
Report
- we tested the upgrade in our DEV environment and did not see any issues with performance
- after upgrading the operator in our PROD environment we noticed a significant slowdown
- the PROD environment is significantly larger, with ~90 PSMDBs and a retention period of 30 days, resulting in about 2700 psmdb-backup objects
- creating a new psmdb database took about 5 minutes with operator version 1.14.0, but ~6 hours with operator version 1.16.1
More about the problem
Analysis
- we identified an unusually high number of calls to the backup API (/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups) as the main contributor to this behaviour
  - we have a lot of psmdb-backup resources (about 2700)
  - the response time of one request is ~1s and the API is called roughly 55 times a minute
  - the API request is performed in pkg/controller/perconaservermongodb/backup.go#L211
  - we also noticed that, instead of making 1 request per reconcile, the reconcile function calls the API 90 times each time (equivalent to the number of deployed PSMDBs)
- the actual bug seems to be in pkg/controller/perconaservermongodb/backup.go#L145
  - the for loop iterates over all cronjobs and compares their names to the custom resource backup tasks
  - as each db has a backup job with the same name (called 'daily'), this condition matches for all 90 cron.backupjobs, and thus the subsequent call of oldScheduledBackups() also happens 90 times
- we are unsure why it didn't happen before version 1.16.1, as the backup code is mostly unchanged
  - we suspect the caching behaviour changed and therefore this bug is now more visible
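The name collision described in the analysis can be illustrated with a simplified model. This is only a sketch under assumptions, not the operator's code: `CronJob`, `list_backups`, `reconcile_by_name` and `reconcile_by_owner` are hypothetical names, and the model assumes a single registry of scheduled jobs shared by all databases, matched against the task names of the database being reconciled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CronJob:
    cluster: str  # database (PSMDB) the scheduled job belongs to
    task: str     # backup task name from the custom resource, e.g. 'daily'


def list_backups(calls, cluster, task):
    """Stand-in for the expensive LIST request against the backup API."""
    calls.append((cluster, task))


def reconcile_by_name(jobs, cluster, task_names, calls):
    """Matches registered jobs by task name only (the suspected behaviour)."""
    for job in jobs:
        if job.task in task_names:
            list_backups(calls, cluster, job.task)


def reconcile_by_owner(jobs, cluster, task_names, calls):
    """Additionally requires the job to belong to the reconciled database."""
    for job in jobs:
        if job.cluster == cluster and job.task in task_names:
            list_backups(calls, cluster, job.task)


# 90 databases, each with a backup task called 'daily'
clusters = [f"db-{i}" for i in range(90)]
jobs = [CronJob(cluster=c, task="daily") for c in clusters]

by_name, by_owner = [], []
reconcile_by_name(jobs, "db-0", {"daily"}, by_name)
reconcile_by_owner(jobs, "db-0", {"daily"}, by_owner)
print(len(by_name), len(by_owner))  # 90 1
```

In this model, one reconcile of a single database issues 90 list requests when jobs are matched by name alone, and 1 when the owning database is also checked. Giving every task a unique name has the same effect on the name-only match, since only one registered job can then match.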
Workaround
- rename each backup task to have a unique identifier
  - e.g. for a database xyz -> 'daily-xyz' instead of 'daily'
  - this ensures only 1 call is made to the API per reconcile request
  - caveat: the info log line pkg/controller/perconaservermongodb/backup.go#L163 will now be printed 89 times per reconcile call
- disable the cleanup for each backup task by setting keep=0 and write a custom k8s cronjob that deletes any psmdb-backup older than 30 days
  - works if there is no need for individual retention periods per db
  - eliminates API requests altogether, speeding up the reconcile calls significantly
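A minimal sketch of what the custom cleanup job for the second workaround could run. It is not our production script: the function names `expired_backups` and `cleanup` are hypothetical, and it assumes `kubectl` is available in the job's container with permission to list and delete psmdb-backup objects in the mongodb namespace.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def expired_backups(items, now, max_age_days=30):
    """Return the names of backup objects created more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    names = []
    for item in items:
        meta = item["metadata"]
        # creationTimestamp is serialised in UTC, e.g. 2024-01-01T02:00:00Z
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if created < cutoff:
            names.append(meta["name"])
    return names


def cleanup(namespace="mongodb"):
    """List psmdb-backup objects via kubectl and delete the expired ones."""
    out = subprocess.run(
        ["kubectl", "get", "psmdb-backup", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    now = datetime.now(timezone.utc)
    for name in expired_backups(json.loads(out)["items"], now):
        subprocess.run(
            ["kubectl", "delete", "psmdb-backup", name, "-n", namespace],
            check=True,
        )


# Selection logic on hand-written sample objects (no cluster access needed)
sample = [
    {"metadata": {"name": "old", "creationTimestamp": "2024-01-01T02:00:00Z"}},
    {"metadata": {"name": "new", "creationTimestamp": "2024-02-10T02:00:00Z"}},
]
now = datetime(2024, 2, 15, tzinfo=timezone.utc)
print(expired_backups(sample, now))  # ['old']
```

Calling `cleanup()` once a day from a k8s cronjob would apply the same 30-day retention to every database, which matches the limitation noted above: there are no individual retention periods per db.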
Steps to reproduce
- create 5 databases, enable backups for each, create a backup task named 'daily' and set the keep attribute to something above 0
- monitor the kubernetes API calls for psmdb-backup resources
- for each reconcile call of the psmdb object there should be 5 requests to the API: /apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups?labelSelector=ancestor%3Ddaily%2Ccluster%3D<db-name>
With just 5 databases and a limited number of backups this will of course not result in a slowdown, but you will be able to see the repeated calls to the API endpoint.
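To make the repeated calls easier to spot, the request URIs can be tallied per label selector. The snippet below is a sketch that assumes the URIs have already been collected somewhere (for example from the requestURI field of API server audit events); `selector_counts` and the sample database name db-1 are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def selector_counts(request_uris):
    """Count requests per decoded labelSelector value."""
    counts = Counter()
    for uri in request_uris:
        query = parse_qs(urlsplit(uri).query)
        for selector in query.get("labelSelector", []):
            counts[selector] += 1
    return counts


base = "/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups"
# Five identical requests, as expected for one reconcile with 5 databases
uris = [f"{base}?labelSelector=ancestor%3Ddaily%2Ccluster%3Ddb-1"] * 5

counts = selector_counts(uris)
print(counts["ancestor=daily,cluster=db-1"])  # 5
```

The decoded selector `ancestor=daily,cluster=db-1` shows that every request asks for the same set of backups, so a count above 1 per reconcile indicates the duplicated calls.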
Alternatively
- create 5 databases, enable backups for each and use unique names this time. Set the keep attribute to something above 0
- in the logs you should see a lot of 'deleting outdated backup job' events (4 log lines per psmdb reconcile call)
Versions
- Kubernetes: AWS-EKS 1.24
- Operator: 1.16.1
- Database: 5.0.23-20
Anything else?
- feel free to ask if anything is unclear