Unexpectedly High CPU Usage during Load test
Environment
- PostgreSQL version: 13.7
- PostgREST version: v9.0.0
- Operating system: AWS EKS, Linux (5.10.x+ kernel)
Description of issue
We are doing some load tests on PostgREST. PostgREST runs on Kubernetes with 2 replicas and connects to an AWS-managed PostgreSQL RDS instance.
During the load test we see that, of the 2 pods running, one pod's CPU spikes very high. In the picture below, the CPU usage of one pod goes to almost 6 cores.
Initially we suspected that one pod might be serving more requests than the other, but the metrics show both pods serving almost the same number of requests per second (QPS):
This happens every time we run the load test. Even if we scale to more than 2 pods, say 5, we see the same behaviour: 1-2 pods spike while the rest keep using normal CPU.
Expected Behaviour
Since both pods are serving the same number of requests per second, the CPU usage difference should not be this huge.
Actual Behaviour
CPU usage for a few pods is quite high compared to the others, despite serving almost the same amount of traffic.
Pod config for PostgREST:
requests:
  cpu: 1.5
  mem: 1Gi
limits:
  cpu: 3
  mem: 2Gi
replicas: 2
PostgREST configuration:
PGRST_DB_POOL: "10"
GHCRTS: "-qg -I0" # default in v9.0.1
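For context, the two RTS flags passed via GHCRTS mean the following (a brief annotation, per the GHC runtime-system documentation):

# -qg : disable the parallel garbage collector (GC runs single-threaded)
# -I0 : disable the periodic idle-time garbage collection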
Sample Deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgrest
spec:
  replicas: 2
  selector:
    matchLabels:
      test: test
  template:
    metadata:
      labels:
        test: test
    spec:
      containers:
        - env:
            - name: PGRST_DB_URI
              value: "<REDACTED>"
            - name: PGRST_DB_SCHEMA
              value: public
            - name: PGRST_JWT_SECRET
              value: <REDACTED>
            - name: GHCRTS
              value: '-qg -I0'
            - name: PGRST_DB_ANON_ROLE
              value: <REDACTED>
            - name: PGRST_DB_POOL
              value: '10'
            - name: PGRST_DB_POOL_TIMEOUT
              value: '300'
          image: postgrest/postgrest:v9.0.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /postgrest_health
              port: http-postgrest
          name: postgrest
          ports:
            - containerPort: 3000
              name: http-postgrest
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /postgrest_health
              port: http-postgrest
          resources:
            limits:
              cpu: 3
              memory: 2Gi
            requests:
              cpu: 1.5
              memory: 1Gi
Any suggestions from the community on what to check to find the root cause would be appreciated. Please let me know if I need to share any other info.
It seems like this only happens a few minutes after the loadtest starts?
What happens if you run the loadtest longer, let's say twice as long? Is CPU usage going down again or only when the loadtest stops?
Did you run the same test against v10? We did some work on connection handling, especially when running out of file descriptors and other resources... maybe the backlog of unserved requests is just growing huge after a certain amount of time, because the load generator creates more requests than are actually handled?
Also: Where is the throughput in requests measured? At the load generating side? Are you sure the load is actually balanced between the two pods properly?
I'd also suggest running a newer version. Even in v9.0.1 we added some performance fixes (like https://github.com/PostgREST/postgrest/issues/2294).
First of all thank you for the quick response.
It seems like this only happens a few minutes after the loadtest starts?
Not always at the start, but it almost always happens during the load test: more often at the start, and sometimes a few minutes in.
Is CPU usage going down again or only when the load test stops?
The CPU spike lasts only for some time, 3-4 minutes; then the HPA scales up replicas based on CPU, new pods come up, and CPU usage returns to normal.
Did you run the same test against v10?
No, we have not. Will try that as well.
Where is the throughput in requests measured?
The QPS in the screenshot I attached is measured at the PostgREST pods, so we are 100% sure it's not a load-balancing issue.
@steve-chavez Regarding the perf fixes in v9.0.1: we have already disabled the parallel & idle GC on our current version v9.0.0 by setting the env var GHCRTS: "-qg -I0" in PostgREST. It reduced overall CPU by almost 50% but didn't help with the CPU spikes.
Any suggestions here? Thank you
Any suggestions here? Thank you
@bhupixb Try the latest pre-release and set the new db-pool-acquisition-timeout = 10 (or a shorter value); I can confirm that it lowers CPU usage under heavy traffic.
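As a sketch, this can be set on the deployment above via the corresponding environment variable (assuming the usual PGRST_* env-var mapping for db-pool-acquisition-timeout; the value is illustrative):

- name: PGRST_DB_POOL_ACQUISITION_TIMEOUT
  value: '10'  # seconds a request may wait for a free pool connection before being rejected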
@bhupixb Did the new timeout help? Do you have new results you can share?
Hi @steve-chavez, no, unfortunately not. I did 2 tests with acquisition timeout 2 and PostgREST 10.1.0. The tests run the same query on an indexed column of a table (SELECT * FROM table WHERE column=1;), with 2 PostgREST pods running.
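(In PostgREST terms that query corresponds to a request like the one below, assuming table and column are the actual relation and column names.)

GET /table?column=eq.1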
Test 1:
DB Pool size: 10
I tested at 500 RPS and then increased the RPS to 600 to trigger 504s. As soon as the first 504 was received (at 20:00, see the bottom-right panel "5XX QPS by service"), I decreased the RPS from 600 to ~300 (see the QPS panel). The CPU spiked suddenly and remained high for some time even though the RPS was reduced.
Out of a total of 234092 requests: 200 OK -> 222071, 504 Gateway Timeout -> 12021.
It's good that we now explicitly get 504s, so we know we need to increase the DB connection pool size to handle our load, but the tremendous increase in CPU usage is a bit concerning.
The CPU usage, RPS & Mem usage graphs are attached below:
Test 2:
DB Pool size: 50 (per pod)
Didn't see a single 504 Gateway Timeout. The resource usage image is attached below:
Update: we found a workaround by giving each pod a sufficiently large DB connection pool (and setting the idle connection timeout to 5 minutes) so that PostgREST doesn't run out of DB connections under spiky load; after that, this error didn't occur.
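In terms of the deployment env vars above, the workaround amounts to something like the following (pool size taken from Test 2, idle timeout of 5 minutes; tune both to your own load and to the server's max_connections):

- name: PGRST_DB_POOL
  value: '50'   # large enough that the pool is not exhausted by spiky load
- name: PGRST_DB_POOL_TIMEOUT
  value: '300'  # close idle pool connections after 5 minutes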
My theory: one thing we noticed consistently is that CPU usage goes unexpectedly high when there are not enough DB connections available to serve the current requests. I am not sure whether it's a bug on the HTTP server side not handling queued requests correctly, or expected behaviour.
@bhupixb Sorry for the late reply, thanks for the detailed report.
One thing we noticed consistently is that CPU usage goes unexpectedly high when there are not enough DB connections available to serve the current requests. I am not sure whether it's a bug on the HTTP server side not handling queued requests correctly, or expected behaviour.
I believe this happens because we're still doing work (parsing the querystring, searching the PostgREST schema cache, etc.) when it doesn't make sense to, because there are no available connections in the pool.
Not sure if this is the correct way to handle it, but one option to avoid unnecessary work could be: once a number of 504s happen, enter a temporary state where we quickly reject new requests.
pgbouncer has server_login_retry, which kinda looks related in intention:
If login to the server failed, because of failure to connect or from authentication, the pooler waits this much before retrying to connect. During the waiting interval, new clients trying to connect to the failing server will get an error immediately without another connection attempt. [seconds]
The purpose of this behavior is that clients don’t unnecessarily queue up waiting for a server connection to become available if the server is not working. However, it also means that if a server is momentarily failing, for example during a restart or if the configuration was erroneous, then it will take at least this long until the pooler will consider connecting to it again
Default: 15.0
Having similar issues with the latest postgrest Docker image. I am running a Gatling load test on Kubuntu 22.0, Intel i7, 16 GB RAM.
After ~1 minute of load test, I get a "too many connections" error.
Maybe try setting https://postgrest.org/en/stable/references/configuration.html#jwt-cache-max-lifetime to 3600 and see if it reduces CPU usage.
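As a sketch, assuming the usual PGRST_* environment-variable mapping for that option, this would be:

- name: PGRST_JWT_CACHE_MAX_LIFETIME
  value: '3600'  # cache validated JWTs for up to an hour instead of re-validating on every request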
After ~1 minute of load test, I get a "too many connections" error.
That should only happen if you set db-pool to a value that surpasses postgres max_connections.
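As a rough sizing sketch (the pod count and pool size here are illustrative): the combined pool across all PostgREST replicas has to stay below the server's max_connections, which defaults to 100 on a stock PostgreSQL install and is derived from the instance class memory on RDS.

- name: PGRST_DB_POOL
  value: '40'  # e.g. 2 replicas x 40 pool connections = 80, below a max_connections of 100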