High CPU usage on Postgres during Tag Retention (v2.13.1)
Harbor Version: v2.13.1
Database: Postgres on AWS RDS (db.m7g.xlarge, 4 vCPU)
Summary:
When running Tag Retention, our Postgres CPU usage spikes to 90–100% until the job finishes. Normally the DB runs at ~4% CPU, so keeping a large RDS instance only for these jobs is overprovisioning IMHO.
Example repository:
- ~3,403 artifacts
- ~138 GB size
Retention run results look like:
Retained/Total
199/424
199/420
196/413
197/421
205/430
205/432
205/432
205/431
But during this, Postgres is fully saturated. Scaling from 2 → 4 vCPUs helped only slightly.
Culprit SQL
From DB insights we can see the following SQL query during the retention run:
SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
AND a.project_id = $1
AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12);
This query shows up repeatedly and drives CPU usage.
References
- Related issue: https://github.com/goharbor/harbor/issues/14708
- Related PR: https://github.com/goharbor/harbor/pull/17296
- Related issue: https://github.com/goharbor/harbor/issues/17653
Even though these changes are already merged, we still observe the same CPU pressure in v2.13.1.
Questions
- Do you have any thoughts or recommendations to mitigate this?
- Are there indexes or config tweaks that could help with this specific query pattern?
- Any planned optimizations for retention/GC to avoid repeated heavy scans?
@dmitryrode Hi, could you run EXPLAIN ANALYZE on one actual SQL query to help locate the bottleneck? You may need to replace the project_id and digest_blob values with your real ones:
EXPLAIN ANALYZE
SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
AND a.project_id = 39
AND b.digest_blob IN ('sha256:9cd2faa50d753db1692cdfa3a73a6bb4703deae4111ab046aebbb61132254251',
                      'sha256:8c10ba5eade8d7675607cf83b1afd1b4978a2d17559dde20a3e40344a0c18c54',
                      'sha256:0406ad49b96b48455e773cf8e3c03febb8dca35244244868f760ab9cf92e79c5');
I did some analysis on this today. The issue is that b.digest_blob can appear in artifact_blob multiple times, and the query returns a duplicate row for every occurrence of that blob in the table. For heavily shared blobs, this can result in thousands of unnecessary rows being returned.
I compared two approaches to fix this, one with DISTINCT and another with GROUP BY. According to my tests using 10 of the most duplicated blobs in my database, they are essentially equivalent in timing and efficiency (and both MUCH faster than selecting duplicates).
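A quick duplicate count along these lines (a sketch, using only the artifact_blob table and digest_blob column from the queries above) can surface the most duplicated blobs to test with:

-- Sketch: list the blob digests with the most rows in artifact_blob,
-- i.e. the blobs most likely to inflate the result set with duplicates.
SELECT digest_blob, COUNT(*) AS occurrences
FROM artifact_blob
GROUP BY digest_blob
ORDER BY occurrences DESC
LIMIT 10;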
Distinct
SELECT DISTINCT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
AND a.project_id = $1
AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12);
Explain Stats:
| Relation name | Scan type | Scan count | Total time | % of query |
|---|---|---|---|---|
| artifact | Index Only Scan | 1 | 0.001 ms | 0.01% |
| artifact_blob | Seq Scan | 1 | 240.826 ms | 12.11% |
Group By
SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
AND a.project_id = $1
AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12)
GROUP BY b.digest_blob;
Explain Stats:
| Relation name | Scan type | Scan count | Total time | % of query |
|---|---|---|---|---|
| public.artifact | Index Only Scan | 1 | 0.001 ms | 0.01% |
| public.artifact_blob | Seq Scan | 1 | 237.254 ms | 11.94% |
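In both plans the time is dominated by the Seq Scan on artifact_blob. Assuming that table has no index covering digest_blob (I haven't verified Harbor's schema here), an index along these lines might let the planner use an index scan instead; the index name is made up:

-- Hypothetical index, assuming artifact_blob lacks one covering digest_blob.
-- Including digest_af makes the scan index-only for this query's join.
-- CONCURRENTLY builds the index without blocking writes.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_artifact_blob_digest_blob
    ON artifact_blob (digest_blob, digest_af);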