[Catalog] Move GC functionality into Nessie Catalog

Open snazy opened this issue 1 year ago • 2 comments

Having to configure all the Iceberg and potentially Hadoop configuration options for Nessie GC is not particularly convenient. Nessie Catalog has all the object storage configurations and has access to the credentials.

Nessie GC is not particularly memory-hungry; it is rather "just" a time-consuming process that requires a lot of object storage I/O.

Moving Nessie GC into Nessie Catalog feels like a natural follow-up, which eliminates a lot of configuration headaches.

It needs to be explored whether this change is a feasible option in multi-tenant scenarios.

snazy avatar Jun 05 '24 09:06 snazy

For the record, I've been playing with a different approach using the Kubernetes Operator for Nessie: a new CRD called NessieGc that is reconciled into a CronJob (if recurring) or a Job (if one-shot).

Creating a NessieGc CRD manually creates a standalone GC job, either recurring or one-shot.

But more importantly, the main Nessie CRD has two new fields: gc.enabled and gc.schedule. If enabled, GC is then automatically started following the cron schedule, using the properties already defined in the Nessie CRD to configure the GC invocation. In this scenario, a NessieGc CRD is generated by the reconciler, and is a dependent resource whose lifecycle is tied to the parent Nessie CRD lifecycle.
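To make the decision logic concrete, here is a minimal Java sketch of how such a reconciler might map the settings to a workload kind. It is not taken from the operator's code: the class, record, and enum names are hypothetical, and only the `gc.enabled` / `gc.schedule` fields and the CronJob-vs-Job rule come from the description above.

```java
import java.util.Optional;

public class NessieGcSketch {

    // Hypothetical model of the `gc` section of the main Nessie resource
    // (fields `gc.enabled` and `gc.schedule` as described in the comment).
    record GcSettings(boolean enabled, String schedule) {}

    // Hypothetical model of a NessieGc resource: a schedule means "recurring",
    // an empty Optional means "one-shot".
    record NessieGcSpec(Optional<String> schedule) {}

    enum Workload { JOB, CRON_JOB }

    // The parent Nessie resource only yields a dependent NessieGc when GC is enabled.
    static Optional<NessieGcSpec> dependentGc(GcSettings gc) {
        if (!gc.enabled()) {
            return Optional.empty();
        }
        return Optional.of(new NessieGcSpec(Optional.ofNullable(gc.schedule())));
    }

    // A NessieGc is reconciled into a CronJob if recurring, otherwise a one-shot Job.
    static Workload workloadFor(NessieGcSpec spec) {
        boolean recurring = spec.schedule().filter(s -> !s.isBlank()).isPresent();
        return recurring ? Workload.CRON_JOB : Workload.JOB;
    }

    public static void main(String[] args) {
        GcSettings nightly = new GcSettings(true, "0 3 * * *");
        dependentGc(nightly).map(NessieGcSketch::workloadFor).ifPresent(System.out::println);

        // A manually created NessieGc without a schedule runs once.
        System.out.println(workloadFor(new NessieGcSpec(Optional.empty())));
    }
}
```

In this sketch a disabled `gc` section produces no dependent resource at all, which mirrors the idea that the generated NessieGc exists only as long as the parent Nessie resource asks for it.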

adutra avatar Jul 15 '24 12:07 adutra

Hi @snazy, after moving GC into the Nessie Catalog, we should support SQL syntax for GC, for example: VACUUM

nqvuong1998 avatar Aug 09 '24 10:08 nqvuong1998