
Clair Postgres database storage space issue

Open Mkori1 opened this issue 1 year ago • 7 comments

Description of Problem / Feature Request

Postgres storage: 1800 GiB. The Clair Postgres database storage fills up over the span of a week, and then roughly 1 TB of space is suddenly freed. This happens every week: storage fills up and is then freed all at once. (Screenshot of the storage usage graph, taken 2024-11-12, attached.)

Garbage collection is running daily.

Need to understand why the storage keeps filling up and why GC is not able to reclaim the space.
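
For reference, per-table sizes can be checked with a standard Postgres catalog query along these lines (nothing Clair-specific, just the built-in statistics views):

```sql
-- Largest relations in the Clair database; total size includes indexes and TOAST
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```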

Environment: clair:4.7.2 Combo Mode

config.yaml:

```yaml
introspection_addr: :8089
http_listen_addr: :8080
log_level: debug
indexer:
  connstring: XXX
  scanlock_retry: 10
  layer_scan_concurrency: 5
  migrations: true
matcher:
  connstring: XXX
  migrations: true
  update_retention: 2
  period:
matchers:
  names:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - python
    - rhel
    - suse
    - ubuntu
updaters:
  sets:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - pyupio
    - rhel
    - suse
    - ubuntu
notifier:
  connstring: XXX
  delivery_interval: 1m
  poll_interval: 5m
  migrations: true
auth:
  psk:
    key: XXX
    iss: ["quay"]

# tracing and metrics
trace:
  name: "jaeger"
  probability: 1
  jaeger:
    agent:
      endpoint: "localhost:6831"
    service_name: "clair"
metrics:
  name: "prometheus"
```

Mkori1 avatar Nov 12 '24 07:11 Mkori1

Storage for Postgres "deleted" rows is only available for reuse after a Postgres VACUUM has run. Maybe check your database's VACUUM / AUTOVACUUM settings.
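
For example, dead-tuple counts and the last (auto)vacuum times per table can be checked with the standard statistics views, and the relevant server settings with SHOW (a generic sketch, not Clair-specific):

```sql
-- Tables with the most dead tuples and when they were last vacuumed
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Server-wide autovacuum settings
SHOW autovacuum;
SHOW autovacuum_vacuum_scale_factor;
SHOW autovacuum_vacuum_threshold;
```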

frostmar avatar Nov 13 '24 09:11 frostmar

What does this mean in the log?

```
2024-11-11 11:19:26 UTC::@:[26410]:LOG: automatic vacuum of table "ClairBackend.public.vuln": index scans: 1
  pages: 0 removed, 2801169 remain, 13106 skipped due to pins, 545696 skipped frozen
  tuples: 96092 removed, 12632497 remain, 831982 are dead but not yet removable, oldest xmin: 934978812
  buffer usage: 6392796 hits, 3018733 misses, 1438233 dirtied
  avg read rate: 41.457 MB/s, avg write rate: 19.751 MB/s
  system usage: CPU: user: 79.75 s, system: 24.19 s, elapsed: 568.88 s
  WAL usage: 2086644 records, 1408460 full page images, 1823645535 bytes
```

Mkori1 avatar Nov 13 '24 16:11 Mkori1

Does automatic vacuum of table "ClairBackend.public.vuln": index scans: 1 relate to GC starting in Clair?

Also, does the statement below mean the DELETE is not happening?

```
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:ERROR: canceling statement due to user request
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:STATEMENT: DELETE FROM layer WHERE NOT EXISTS (SELECT FROM manifest_layer WHERE manifest_layer.layer_id = layer.id);
```

Mkori1 avatar Nov 13 '24 16:11 Mkori1

The concept of VACUUM / AUTOVACUUM in Postgres is important here (and has nothing to do with Clair). Take a look at the docs, maybe starting at https://www.postgresql.org/docs/current/routine-vacuuming.html.
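
One detail from your log worth noting: "831982 are dead but not yet removable, oldest xmin" usually means an old open transaction is holding back the xmin horizon, so VACUUM cannot reclaim those rows yet. A query along these lines (standard pg_stat_activity, not Clair-specific) shows long-running transactions:

```sql
-- Sessions with old open transactions that can keep dead tuples "not yet removable"
SELECT pid, state, xact_start,
       now() - xact_start AS xact_age,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```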

frostmar avatar Nov 13 '24 17:11 frostmar

Do we have any setting related to a timeout? I see errors in the log when the method is DELETE:

```
{"level":"debug","request_id":"da40e9d64c5747d9","component":"httptransport/New","code":500,"error":"unable to delete manifest: timeout: context canceled","time":"2024-11-13T17:49:34Z","message":"http error response"}
{"level":"info","component":"httptransport/New","request_id":"da40e9d64c5747d9","remote_addr":"100.123.110.151:55158","method":"DELETE","request_uri":"/indexer/api/v1/index_report/sha256:542c27dfe5f967b5bde01c11918871d309d30642a47848b9e13829647814b930","status":500,"duration":600097.463138,"time":"2024-11-13T17:49:34Z","message":"handled HTTP request"}
```

Mkori1 avatar Nov 13 '24 18:11 Mkori1

So, excessive latency when deleting manifests was identified as an issue and dealt with here; I believe the earliest version it is available in is v4.7.3.

This change includes an index creation; if you have a larger instance, you can choose to pre-create it concurrently using the clairctl admin sub-command: https://github.com/quay/clair/blob/main/cmd/clairctl/admin.go#L42.
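
If you prefer to pre-create the index by hand, the general pattern is CREATE INDEX CONCURRENTLY, which builds the index without blocking writes; the name and columns below are placeholders only, the real definition should come from the v4.7.3 migration or the clairctl admin sub-command above:

```sql
-- Placeholder example: substitute the actual index definition from the migration.
-- CONCURRENTLY avoids a long write lock on a large table, but cannot run inside
-- a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS example_manifest_fk_idx
    ON example_indexer_table (manifest_id);
```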

crozzy avatar Nov 13 '24 18:11 crozzy

I'm not sure if this is related to the DB storage issue (i.e., the VACUUM isn't reclaiming space for some reason because these DELETEs are in flight for a long time), but I echo what @frostmar said: checking how your DB is configured to (AUTO)VACUUM is likely going to be very useful.
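
If autovacuum turns out to be falling behind on the big tables, Postgres also allows per-table tuning; the values here are purely illustrative, not a recommendation:

```sql
-- Illustrative only: make autovacuum trigger earlier and allow it to do more
-- work per round on the large vuln table.
ALTER TABLE public.vuln SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_cost_limit   = 2000
);
```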

crozzy avatar Nov 13 '24 18:11 crozzy