Clair Postgres database storage space issue
Description of Problem / Feature Request
Postgres storage: 1800 GiB
The Clair Postgres database fills its storage over the span of about a week, and then roughly 1 TB is suddenly freed all at once. This cycle repeats every week: storage fills up, then is freed in one go.
Garbage collection runs daily.
We need to understand why storage keeps filling up and why GC is not able to reclaim the space.
Environment: clair:4.7.2 Combo Mode
config.yaml:

```yaml
introspection_addr: :8089
http_listen_addr: :8080
log_level: debug
indexer:
  connstring: XXX
  scanlock_retry: 10
  layer_scan_concurrency: 5
  migrations: true
matcher:
  connstring: XXX
  migrations: true
  update_retention: 2
  period:
matchers:
  names:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - python
    - rhel
    - suse
    - ubuntu
updaters:
  sets:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - pyupio
    - rhel
    - suse
    - ubuntu
notifier:
  connstring: XXX
  delivery_interval: 1m
  poll_interval: 5m
  migrations: true
auth:
  psk:
    key: XXX
    iss: ["quay"]
```

tracing and metrics:

```yaml
trace:
  name: "jaeger"
  probability: 1
  jaeger:
    agent:
      endpoint: "localhost:6831"
    service_name: "clair"
metrics:
  name: "prometheus"
```
Storage for "deleted" rows in Postgres only becomes available for reuse after a Postgres VACUUM has run. Maybe check your database's VACUUM / AUTOVACUUM settings.
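As a starting point for that check, the queries below are a diagnostic sketch (assuming you can connect to the Clair database with `psql`): the first lists the tables with the most dead tuples and when they were last vacuumed, the second shows the current autovacuum settings.

```sql
-- Tables with the most dead (deleted-but-unreclaimed) tuples, plus
-- when manual VACUUM and autovacuum last ran on each.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Current autovacuum configuration.
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';
```

A large, steadily growing `n_dead_tup` on tables like `layer` or `vuln`, with stale `last_autovacuum` timestamps, would point at vacuum not keeping up.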
What does this mean in the log?

```
2024-11-11 11:19:26 UTC::@:[26410]:LOG: automatic vacuum of table "ClairBackend.public.vuln": index scans: 1
    pages: 0 removed, 2801169 remain, 13106 skipped due to pins, 545696 skipped frozen
    tuples: 96092 removed, 12632497 remain, 831982 are dead but not yet removable, oldest xmin: 934978812
    buffer usage: 6392796 hits, 3018733 misses, 1438233 dirtied
    avg read rate: 41.457 MB/s, avg write rate: 19.751 MB/s
    system usage: CPU: user: 79.75 s, system: 24.19 s, elapsed: 568.88 s
    WAL usage: 2086644 records, 1408460 full page images, 1823645535 bytes
```
Does the `automatic vacuum of table "ClairBackend.public.vuln": index scans: 1` entry relate to GC starting in Clair?
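One detail worth noting in that vacuum log: "831982 are dead but not yet removable, oldest xmin: 934978812" means vacuum found dead rows it could not reclaim because some session still holds a snapshot (an `xmin`) old enough to see them. A hedged diagnostic sketch for finding such sessions:

```sql
-- Sessions whose snapshot (backend_xmin) may be preventing vacuum from
-- reclaiming dead rows, oldest snapshots first.
SELECT pid,
       state,
       backend_xmin,
       xact_start,
       now() - xact_start AS xact_age,
       left(query, 80)   AS current_query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
```

Long-running transactions near the top of this list (for example, DELETEs that run for many minutes) would explain why autovacuum runs but space is only freed later, all at once.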
Also, does the statement below mean the DELETE is not happening?

```
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:ERROR: canceling statement due to user request
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:STATEMENT: DELETE FROM layer WHERE NOT EXISTS (SELECT FROM manifest_layer WHERE manifest_layer.layer_id = layer.id);
```
The concept of VACUUM / AUTOVACUUM in Postgres is important here (and has nothing to do with Clair). Take a look at the docs, maybe starting at https://www.postgresql.org/docs/current/routine-vacuuming.html
Do we have any setting related to timeouts? I see errors in the log when the method is DELETE:

```
{"level":"debug","request_id":"da40e9d64c5747d9","component":"httptransport/New","code":500,"error":"unable to delete manifest: timeout: context canceled","time":"2024-11-13T17:49:34Z","message":"http error response"}
{"level":"info","component":"httptransport/New","request_id":"da40e9d64c5747d9","remote_addr":"100.123.110.151:55158","method":"DELETE","request_uri":"/indexer/api/v1/index_report/sha256:542c27dfe5f967b5bde01c11918871d309d30642a47848b9e13829647814b930","status":500,"duration":600097.463138,"time":"2024-11-13T17:49:34Z","message":"handled HTTP request"}
```
So, excessive latency when deleting manifests was identified as an issue and dealt with here, I believe the earliest version it is available in is v4.7.3.
This change includes the creation of an index which, if you have a larger instance, you can choose to pre-create concurrently using the clairctl admin sub-command: https://github.com/quay/clair/blob/main/cmd/clairctl/admin.go#L42.
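For illustration only, pre-creating such an index concurrently would look like the sketch below. The index name and column here are assumptions based on the slow DELETE's join condition (`manifest_layer.layer_id = layer.id`); check the linked migration and the clairctl admin sub-command for the exact definition.

```sql
-- Hypothetical example: build an index on manifest_layer.layer_id without
-- taking a write lock, so the anti-join in the layer GC DELETE can use it.
-- The actual index name/definition comes from the v4.7.3 migration.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_manifest_layer_layer_id
    ON manifest_layer (layer_id);
```

Note that `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block and takes longer than a normal index build, but it avoids blocking Clair's writes on a large instance.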
I'm not sure whether this is related to the DB storage issue (i.e. whether VACUUM can't reclaim space because these DELETEs are in-flight for so long), but I echo what @frostmar said: checking how your DB is configured to (AUTO)VACUUM is likely going to be very useful.