flagsmith icon indicating copy to clipboard operation
flagsmith copied to clipboard

Encountered high database lock time during trait updates

Open cat5e-inv opened this issue 1 year ago • 0 comments

How are you running Flagsmith

  • [ ] Self Hosted with Docker
  • [X] Self Hosted with Kubernetes
  • [ ] SaaS at flagsmith.com
  • [ ] Some other way (add details in description below)

Describe the bug

Setup:

Flagsmith deployed in Kubernetes with the following configuration:

  • 55 flagsmith-api replicas each with a limit of 1 CPU and 1GB RAM
  • 3 pgbouncer replicas each with a limit of 0.5 CPU and 0.5GB RAM
  • A PostgreSQL instance with 12 CPUs and 16GB RAM

The load pattern is around 500-1000 RPS for updating traits (api/v1/traits/) from the Android client (https://github.com/Flagsmith/flagsmith-kotlin-android-client). There are over 1.5 million devices in the system.

I observed a performance issue in this setup. Requests made to update traits from mobile clients resulted in extensive row locks in the database.

The problematic SQL statement is:

SELECT "environments_trait"."id", "environments_trait"."identity_id", "environments_trait"."trait_key", "environments_trait"."value_type", "environments_trait"."boolean_value", "environments_trait"."integer_value", "environments_trait"."string_value", "environments_trait"."float_value", "environments_trait"."created_date" FROM "environments_trait" WHERE ("environments_trait"."identity_id" = $1 AND "environments_trait"."trait_key" = $2) LIMIT $3 FOR UPDATE

The statistics for this statement under the given load are:

  • AVG execution time: 125.41 ms
  • 99th percentile Query latency: Approaching 10 seconds
  • Lock wait time: Extending up to 180 CPU-seconds

Given that an overwhelming majority (99.99%) of the execution time comprises waiting for rows to unlock, and with incoming updates from apps, the total query execution time increases to 30-45-60 seconds. As expected, the deployment experiences crashes. Essentially, the current architecture isn't prepared for this level of load.

The relevant code at: https://github.com/Flagsmith/flagsmith/blob/90e10cf656dba13c2a137e2c5630b325b8290776/api/environments/sdk/serializers.py#L49 uses the update_or_create method, which according to Django documentation, tries to fetch an object from the database based on provided kwargs and updates the fields in the defaults dictionary if a match is found.

In my view, executing a lock-update operation every time, regardless of whether the data has changed, isn't the most optimal solution. It leads to additional database work and potential locking issues(that I see), particularly under heavy load.

I've tried to fix problem - instead of lock-update every time, it reads the data, compares the existing and incoming data, and only issues an update when the data actually changes. This strategy drastically reduces unnecessary UPDATE operations and consequently, the chances of lock contention.

Please see my PR for the detailed changes. With these changes, I observed:

  • Reduced DB CPU utilization (max 8% instead of 30%)
  • Lower RAM consumption (50% free instead of 25%)
  • Reduction in write/update queries (0.07k/s as opposed to a peak of 1.5k/s)
  • Most importantly, a significant decrease in lock time, I don't see it at all now.

Note that our use-case is quite rare, we quite often updating traits concurrently.

I'm interested in hearing your thoughts on this. Is it acceptable to make these types of changes?

Steps To Reproduce

.

Expected behavior

.

Screenshots

Screenshot at Mar 01 20-52-56

cat5e-inv avatar Mar 11 '24 13:03 cat5e-inv