High CPU load caused by commonly executed queries
Current Behavior
All profiling data below comes from GCP Query Insights, which I've been using to track down the causes of high CPU utilization in our environment. Apologies for the length. :) I'm truncating some query details (the individual column lists) since this is already quite long.
High CPU utilization is being caused by this query:
SELECT
FROM
"DEPENDENCYMETRICS" "A0"
WHERE
"A0"."COMPONENT_ID" = $1
The current indexes do not seem to cover this particular access pattern, and this query is executed very frequently. For our environment:
We are running in GCP on CloudSQL, currently on Postgres 17, though we also experienced this on 15.
Adding an index to cover this particular query drastically reduces CPU load; you can see the exact point at which the additional index was added (no changes were made to any of the other indexes):
I believe this is (partially) caused by the query often finding no results (which is expected much of the time), but because there is no close match on an existing index, each execution is quite expensive.
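For illustration, the kind of index that covers this lookup is a plain single-column btree on the filter column. The name and the CONCURRENTLY option below are my own choices, not necessarily the exact index we deployed:

```sql
-- Illustrative sketch only: a single-column btree on the WHERE-clause column.
-- The index name is made up; CONCURRENTLY avoids blocking writes while it builds
-- (note it cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY IF NOT EXISTS "DEPENDENCYMETRICS_COMPONENT_ID_IDX"
    ON "DEPENDENCYMETRICS" ("COMPONENT_ID");
```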
Here's the query plan before adding the additional index:
And after the index:
These are the next two worst offenders (at least in our environment):
First one:
SELECT
FROM
"COMPONENT" "A0"
WHERE
"A0"."PROJECT_ID" = $1
The second query, and a number of others that fall under the same index, also show high utilization.
When I zoom out and look at our total utilization over roughly 24 hours, these are the biggest causes:
Some are simple:
SELECT
FROM
"PROJECTMETRICS" "A0"
WHERE
"A0"."PROJECT_ID" = $1
Most are complex queries, but they largely hit the same indexes.
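For completeness, here is a sketch of the same kind of single-column indexes for the other two access paths shown above. The names are illustrative and may overlap with indexes already present in the schema, so check before creating anything:

```sql
-- Illustrative sketch only: single-column btrees on the PROJECT_ID filter columns.
-- Names are made up; verify against the existing schema for overlapping indexes first.
CREATE INDEX CONCURRENTLY IF NOT EXISTS "COMPONENT_PROJECT_ID_IDX"
    ON "COMPONENT" ("PROJECT_ID");

CREATE INDEX CONCURRENTLY IF NOT EXISTS "PROJECTMETRICS_PROJECT_ID_IDX"
    ON "PROJECTMETRICS" ("PROJECT_ID");
```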
Steps to Reproduce
1. No steps to reproduce as this is being seen in a production environment.
Expected Behavior
Adding strategic indexes for high-cost queries reduces the time it takes to run them, lowering load on the DB and improving the user experience.
Dependency-Track Version
4.13.0
Dependency-Track Distribution
Container Image
Database Server
PostgreSQL
Database Server Version
17_4.R20250302.00_07
Browser
N/A
Checklist
- [x] I have read and understand the contributing guidelines
- [x] I have checked the existing issues for whether this defect was already reported
We also see constant CPU load. Are this issue and its MR still alive?
Sometimes we see huge disk I/O spikes, and sometimes the queries don't touch the disk at all.
@nscuro did you see any benefits with the indices?
We even got notifications from our storage team that deptrack is going crazy, yet we only have 2181 projects in deptrack. The latest 4.13.6 update didn't bring any improvements; we still have idle Postgres processes with very high CPU load.
The query in question is this one:
SELECT 'org.dependencytrack.model.VulnerableSoftware' AS "DN_TYPE","A0"."CPE22","A0"."CPE23","A0"."EDITION","A0"."ID","A0"."LANGUAGE","A0"."OTHER","A0"."PART","A0"."PRODUCT","A0"."PURL","A0"."PURL_NAME","A0"."PURL_NAMESPACE","A0"."PURL_QUALIFIERS","A0"."PURL_SUBPATH","A0"."PURL_TYPE","A0"."PURL_VERSION","A0"."SWEDITION","A0"."TARGETHW","A0"."TARGETSW","A0"."UPDATE","A0"."UUID","A0"."VENDOR","A0"."VERSION","A0"."VERSIONENDEXCLUDING","A0"."VERSIONENDINCLUDING","A0"."VERSIONSTARTEXCLUDING","A0"."VERSIONSTARTINCLUDING","A0"."VULNERABLE" FROM "VULNERABLESOFTWARE" "A0" WHERE "A0"."PURL_TYPE" = $1 AND "A0"."PURL_NAME" = $2 AND "A0"."PURL_NAMESPACE" = $3 AND "A0"."VERSION" = $4 AND "A0"."VERSIONENDEXCLUDING" IS NULL AND "A0"."VERSIONENDINCLUDING" IS NULL AND "A0"."VERSIONSTARTEXCLUDING" IS NULL AND "A0"."VERSIONSTARTINCLUDING" IS NULL FETCH NEXT ROW ONLY
It runs in a loop: as soon as it's done, it instantly spawns again.
@stiwa-wrre None of the proposed indexes are for that table.
Would it be possible for you to run the query you shared with EXPLAIN ANALYZE and share the result?
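For reference, one way to do that is to replace the bind parameters with representative literals and prefix the statement with EXPLAIN (ANALYZE, BUFFERS). The literal values and the shortened column list below are just examples:

```sql
-- Illustrative sketch only: bind parameters replaced with example literals and the
-- long column list trimmed for readability.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "A0"."ID", "A0"."PURL", "A0"."VERSION"
FROM "VULNERABLESOFTWARE" "A0"
WHERE "A0"."PURL_TYPE" = 'deb'
  AND "A0"."PURL_NAME" = 'linux-aws'
  AND "A0"."PURL_NAMESPACE" = 'ubuntu'
  AND "A0"."VERSION" = '4.4.0-1149.155'
  AND "A0"."VERSIONENDEXCLUDING" IS NULL
  AND "A0"."VERSIONENDINCLUDING" IS NULL
  AND "A0"."VERSIONSTARTEXCLUDING" IS NULL
  AND "A0"."VERSIONSTARTINCLUDING" IS NULL
FETCH NEXT ROW ONLY;
```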
@nscuro The explain:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..66533.73 rows=1 width=2402) (actual time=4712.947..4712.949 rows=0 loops=1)
-> Index Scan using "VULNERABLESOFTWARE_PURL_TYPE_NS_NAME_IDX" on "VULNERABLESOFTWARE" "A0" (cost=0.56..798398.60 rows=12 width=2402) (actual time=4712.944..4712.945 rows=0 loops=1)
Index Cond: ((("PURL_TYPE")::text = 'deb'::text) AND (("PURL_NAMESPACE")::text = 'ubuntu'::text) AND (("PURL_NAME")::text = 'linux-aws'::text))
Filter: (("VERSIONENDEXCLUDING" IS NULL) AND ("VERSIONENDINCLUDING" IS NULL) AND ("VERSIONSTARTEXCLUDING" IS NULL) AND ("VERSIONSTARTINCLUDING" IS NULL) AND (("VERSION")::text = '4.4.0-1149.155'::text))
Rows Removed by Filter: 670010
Planning Time: 0.911 ms
Execution Time: 4713.001 ms
(7 rows)
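The plan shows the existing index only matches on PURL_TYPE/PURL_NAMESPACE/PURL_NAME and then discards roughly 670k rows via the filter on VERSION. As a sketch (not one of the indexes proposed above, and not a vetted schema change), an index that also covers VERSION would let that predicate be resolved inside the index:

```sql
-- Illustrative sketch only: widen the purl lookup path with VERSION so the equality
-- predicate is matched by the index instead of being filtered row by row.
-- Index name and column order are assumptions.
CREATE INDEX CONCURRENTLY IF NOT EXISTS "VULNERABLESOFTWARE_PURL_TYPE_NS_NAME_VERSION_IDX"
    ON "VULNERABLESOFTWARE" ("PURL_TYPE", "PURL_NAMESPACE", "PURL_NAME", "VERSION");
```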
OK, we dug a little deeper with Glowroot now; these are the queries:
jdbc query: SELECT 'org.dependencytrack.model.VulnerableSoftware' AS "DN_TYPE","A0"."CPE22","A0"."CPE23","A0"."EDITION","A0"."ID","A0"."LANGUAGE","A0"."OTHER","A0"."PART","A0"."PRODUCT","A0"."PURL","A0"."PURL_NAME","A0"."PURL_NAMESPACE","A0"."PURL_QUALIFIERS","A0"."PURL_SUBPATH","A0"."PURL_TYPE","A0"."PURL_VERSION","A0"."SWEDITION","A0"."TARGETHW","A0"."TARGETSW","A0"."UPDATE","A0"."UUID","A0"."VENDOR","A0"."VERSION","A0"."VERSIONENDEXCLUDING","A0"."VERSIONENDINCLUDING","A0"."VERSIONSTARTEXCLUDING","A0"."VERSIONSTARTINCLUDING","A0"."VULNERABLE" FROM "VULNERABLESOFTWARE" "A0" WHERE "A0"."PURL_TYPE" = ? AND "A0"."PURL_NAME" = ? AND "A0"."PURL_NAMESPACE" = ? AND "A0"."VERSION" = ? AND "A0"."VERSIONENDEXCLUDING" IS NULL AND "A0"."VERSIONENDINCLUDING" IS NULL AND "A0"."VERSIONSTARTEXCLUDING" IS NULL AND "A0"."VERSIONSTARTINCLUDING" IS NULL FETCH NEXT ROW ONLY ['deb', 'linux-aws', 'ubuntu', '5.4.0-1151.161'] => 0 rows
We saw in the stack trace that OSV was causing the queries. See the trace file below.
This also explains the issue: the queries don't finish within 12 hours, then OSV triggers another round, and so on. After a few days we end up running the same work all the time; you can see it in the graph. The queries should finish much faster, and no concurrent run should start while another is still in progress.
We disabled OSV for now and restarted the API server; the DB queries stopped and the CPU has calmed down.