
High CPU usage on Postgres during Tag Retention (v2.13.1)

Open · dmitryrode opened this issue 3 months ago · 3 comments

Harbor Version: v2.13.1
Database: Postgres (AWS RDS, 4 vCPU, db.m7g.xlarge)

Summary:

When running Tag Retention, our Postgres CPU usage spikes to 90–100% until the job finishes. Normally the DB runs at ~4% CPU, so keeping a large RDS instance around only for these jobs is overprovisioning, IMHO.

Example repository:

  • ~3,403 artifacts
  • ~138 GB size

Retention run results look like:

Retained/Total
199/424
199/420
196/413
197/421
205/430
205/432
205/432
205/431

But during this, Postgres is fully saturated. Scaling from 2 → 4 vCPUs helped only slightly.

Culprit SQL:

From DB insights we can see the following SQL query during the retention run:

SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
  AND a.project_id = $1
  AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12);

This query shows up repeatedly and drives CPU usage.
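
For anyone without access to the RDS console, the same hotspot should be visible from Postgres itself. A sketch, assuming the pg_stat_statements extension is enabled (on RDS it must be in shared_preload_libraries) and using Postgres 13+ column names:

-- Sketch, not Harbor-specific: top statements by total execution time.
-- Assumes pg_stat_statements is enabled; columns are Postgres 13+ names.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 120)                   AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;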

References

  • Related issue: https://github.com/goharbor/harbor/issues/14708
  • Related PR: https://github.com/goharbor/harbor/pull/17296
  • Related issue: https://github.com/goharbor/harbor/issues/17653

Even with these changes already merged, in v2.13.1 we still observe the same CPU pressure.

Questions

  • Do you have any thoughts or recommendations to mitigate this?
  • Are there indexes or config tweaks that could help with this specific query pattern?
  • Any planned optimizations for retention/GC to avoid repeated heavy scans?

— dmitryrode, Sep 03, 2025

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

— github-actions[bot], Nov 04, 2025

@dmitryrode Hi, could you run EXPLAIN ANALYZE on one actual SQL query to help locate the bottleneck? You will need to replace the project_id and digest_blob values with real values from your database.

EXPLAIN ANALYZE
SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
  AND a.project_id = 39
  AND b.digest_blob IN (
    'sha256:9cd2faa50d753db1692cdfa3a73a6bb4703deae4111ab046aebbb61132254251',
    'sha256:8c10ba5eade8d7675607cf83b1afd1b4978a2d17559dde20a3e40344a0c18c54',
    'sha256:0406ad49b96b48455e773cf8e3c03febb8dca35244244868f760ab9cf92e79c5');
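
If the plan shows a sequential scan on artifact_blob, it may also be worth checking whether digest_blob is indexed as a leading column. This is an assumption on my side, not a confirmed gap in Harbor's schema, and the index name below is made up for illustration:

-- First check which indexes actually exist on the table:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'artifact_blob';

-- If no index leads with digest_blob, this sketch *might* remove the
-- sequential scan (hypothetical index name; CONCURRENTLY avoids
-- blocking writes while the index builds):
CREATE INDEX CONCURRENTLY idx_artifact_blob_digest_blob
    ON artifact_blob (digest_blob);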

— chlins, Nov 04, 2025

I did some analysis on this today. The issue is that the same b.digest_blob value can appear in artifact_blob multiple times, so the query returns a duplicate row for every occurrence of that blob in the table. For blobs that are shared by many artifacts, this can mean thousands of unnecessary rows returned per call.
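
For context, this is roughly how I found the most duplicated blobs (a plain diagnostic query, nothing Harbor-specific):

-- The ten most duplicated blob digests in artifact_blob.
SELECT digest_blob, count(*) AS occurrences
FROM artifact_blob
GROUP BY digest_blob
ORDER BY occurrences DESC
LIMIT 10;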

I compared two approaches to fix this: one with DISTINCT and another with GROUP BY. According to my tests using 10 of the most duplicated blobs in my database, they are essentially equivalent in timing and efficiency (and both MUCH faster than selecting duplicates).

Distinct

SELECT distinct b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
  AND a.project_id = $1
  AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12);

Explain Stats:

Relation name       Scan count  Total time   % of query
artifact            1           0.001 ms     0.01%
  Index Only Scan   1           0.001 ms     100%
artifact_blob       1           240.826 ms   12.11%
  Seq Scan          1           240.826 ms   100%

Group By

SELECT b.digest_blob
FROM artifact a, artifact_blob b
WHERE a.digest = b.digest_af
  AND a.project_id = $1
  AND b.digest_blob IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12)
GROUP BY b.digest_blob;

Explain Stats:

Relation name          Scan count  Total time   % of query
public.artifact        1           0.001 ms     0.01%
  Index Only Scan      1           0.001 ms     100%
public.artifact_blob   1           237.254 ms   11.94%
  Seq Scan             1           237.254 ms   100%
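
A third variant that might be worth benchmarking (an untested sketch on my part, assuming the intent of the query is "which of these candidate digests are still referenced by an artifact in this project"): turn the IN list into a VALUES set and use EXISTS, so each candidate digest is emitted at most once and Postgres can stop scanning at the first match per digest.

-- Untested sketch: semi-join form of the same lookup. Placeholders
-- mirror the original query; extend the VALUES list per candidate digest.
SELECT v.digest_blob
FROM (VALUES ($2), ($3), ($4)) AS v(digest_blob)
WHERE EXISTS (
    SELECT 1
    FROM artifact_blob b
    JOIN artifact a ON a.digest = b.digest_af
    WHERE b.digest_blob = v.digest_blob
      AND a.project_id = $1
);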

— bbytheway-rq, Dec 10, 2025