Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1

Open MatzeScandio opened this issue 5 months ago • 4 comments

Report

Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1

  • we tested the upgrade in our DEV environment and did not see any issues with performance
  • After upgrading the operator in our PROD environment we noticed a significant slowdown
  • the PROD environment is significantly larger with ~90 PSMDBs and a retention period of 30 days resulting in 2700 psmdb-backup objects
  • while creating a new psmdb database took about 5 minutes with operator version 1.14.0, it took ~6 hours with operator version 1.16.1

More about the problem

Analysis

  1. we identified an unusually high number of calls to the backup API (/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups) as the main contributor to this behaviour

    • we have a lot of psmdb-backup resources (about 2700)
    • the response time of one request is ~1s and the API is called roughly 55 times a minute
    • the API request is performed in pkg/controller/perconaservermongodb/backup.go#L211
    • we also noticed that, instead of issuing 1 request per reconcile, the reconcile function calls the API 90 times per run (equal to the number of deployed PSMDBs)
  2. the actual bug seems to be in pkg/controller/perconaservermongodb/backup.go#L145

    • the for loop iterates over all cronjobs and compares their names to the custom resource backup tasks
    • as each db has a backup job with the same name (called 'daily'), this condition matches all 90 cron.backupjobs, so the subsequent call of oldScheduledBackups() also happens 90 times
  3. we are unsure why it didn't happen before version 1.16.1 as the backup code is mostly unchanged

    • we suspect the caching behaviour changed and therefore this bug is now more visible
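The effect described in point 2 can be illustrated with a small model. This is only a sketch of the behaviour as reported above, not the operator's actual code: `CronEntry` and `list_calls_per_reconcile` are hypothetical names, and each counted match stands in for one oldScheduledBackups() call, i.e. one list request against the backup API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CronEntry:
    """Hypothetical stand-in for one registered scheduled backup job."""
    cluster: str  # PSMDB cluster that owns the job
    task: str     # backup task name from the custom resource, e.g. "daily"


def list_calls_per_reconcile(entries, task_names):
    """Count how many backup list requests one reconcile would trigger
    if registered jobs are matched against the CR's tasks by name only."""
    calls = 0
    for entry in entries:
        if entry.task in task_names:
            calls += 1  # models one oldScheduledBackups() call
    return calls


clusters = [f"db-{i}" for i in range(90)]

# Every database uses the same task name, as in the affected PROD setup.
shared = [CronEntry(c, "daily") for c in clusters]
# Every database uses a unique task name, as in workaround 1.
unique = [CronEntry(c, f"daily-{c}") for c in clusters]

print(list_calls_per_reconcile(shared, {"daily"}))       # 90
print(list_calls_per_reconcile(unique, {"daily-db-0"}))  # 1
```

With a response time of ~1s per request, as measured above, the difference between 90 calls and 1 call per reconcile adds up quickly across 90 databases.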

Workaround

  1. rename each backup task to have a unique identifier

    • e.g. for a database xyz -> 'daily-xyz' instead of 'daily'
    • this ensures only 1 call is made to the API per reconcile request
    • caveat: the info log line pkg/controller/perconaservermongodb/backup.go#L163 will now be printed 89 times per reconcile call
  2. disable the cleanup for each backup task by setting keep=0 and write a custom k8s cronjob that deletes any psmdb-backup older than 30 days

    • works if there is no need for individual retention periods per db

    • eliminates the API requests altogether, speeding up the reconcile calls significantly
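For workaround 2, the selection logic of such a cleanup job could look roughly like the sketch below. It only decides which psmdb-backup objects are older than the retention period; fetching and deleting the objects is left out. The assumptions: each backup is available as a dict with `metadata.name` and `metadata.creationTimestamp` in the usual Kubernetes UTC format, and the object names in the example are made up.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes UTC timestamp such as '2024-08-30T09:08:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def expired_backup_names(backups, now, max_age_days=30):
    """Return the names of all backups created before now - max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        b["metadata"]["name"]
        for b in backups
        if parse_k8s_time(b["metadata"]["creationTimestamp"]) < cutoff
    ]


# Hypothetical example objects, reduced to the fields used above.
now = datetime(2024, 8, 30, 9, 0, tzinfo=timezone.utc)
backups = [
    {"metadata": {"name": "xyz-old", "creationTimestamp": "2024-07-01T02:00:00Z"}},
    {"metadata": {"name": "xyz-new", "creationTimestamp": "2024-08-29T02:00:00Z"}},
]

print(expired_backup_names(backups, now))  # ['xyz-old']
```

Because the job applies one cutoff to every object, it fits the case noted above where no individual retention periods per db are needed.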

Steps to reproduce

  1. create 5 databases, enable backups for each and create a backup task named 'daily' and set the keep attribute to something above 0
  2. monitor the kubernetes API calls for psmdb-backup resources
  3. for each reconcile call of the psmdb object there should be 5 requests to the API: /apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups?labelSelector=ancestor%3Ddaily%2Ccluster%3D<db-name>

With just 5 databases and a limited number of backups this will of course not result in a slowdown, but you will be able to see the repeated calls to the API endpoint.
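To make the repeated calls easier to spot in step 2, one option is to build the exact request path from step 3 and count how often it appears in whatever request log is available (for example API server audit output). The helper names and the idea of scanning plain log lines are assumptions for illustration; only the endpoint and the label selector come from this report.

```python
from urllib.parse import quote

# Backup list endpoint from this report (namespace "mongodb").
BASE = "/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups"


def backup_list_path(task, cluster):
    """Build the list request path with a URL-encoded label selector."""
    selector = f"ancestor={task},cluster={cluster}"
    return f"{BASE}?labelSelector={quote(selector, safe='')}"


def count_requests(log_lines, task, cluster):
    """Count the log lines that contain the list request for one task/cluster."""
    needle = backup_list_path(task, cluster)
    return sum(1 for line in log_lines if needle in line)


print(backup_list_path("daily", "db-1"))
```

With 5 databases that share the task name 'daily', `count_requests` should grow by 5 per reconcile of a given psmdb object if the behaviour described in this report is present.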

Alternatively

  1. create 5 databases, enable backups for each and use unique names this time. set the keep attribute to something above 0
  2. in the logs you should see a lot of 'deleting outdated backup job' events (4 log lines per psmdb reconcile call)

Versions

  • Kubernetes: AWS-EKS 1.24
  • Operator: 1.16.1
  • Database: 5.0.23-20

Anything else?

  • feel free to ask if anything is unclear

MatzeScandio, Aug 30 '24 09:08