Implement expire snapshots maintenance operation
Feature Request / Improvement
Hi,
Implementing a snapshot expiration maintenance operation would be really valuable. As of now, metadata (and data) grows without bound because nothing is ever expired.
Reference: https://iceberg.apache.org/javadoc/1.8.1/org/apache/iceberg/ExpireSnapshots.html
After digging a bit deeper, I noticed that most of the features I need have already been implemented.
I can see this in the code:
MetadataDeleteAfterCommitEnabledKey = "write.metadata.delete-after-commit.enabled"
MetadataDeleteAfterCommitEnabledDefault = false
MetadataPreviousVersionsMaxKey = "write.metadata.previous-versions-max"
MetadataPreviousVersionsMaxDefault = 100
These constants are actually used in the code, so in theory I should be fine.
But they seem to be ignored, at least by AddFiles.
Test
I added 5 Parquet files one by one using tx.AddFiles with properties = {"write.metadata.delete-after-commit.enabled": true, "write.metadata.previous-versions-max": 2}.
The resulting metadata file looks like:
{
"last-sequence-number": 5,
"format-version": 2,
"table-uuid": "0195fcd3-07fb-7daf-865a-d7de528a7966",
"location": "s3://test01/table01",
"last-updated-ms": 1743703467541,
"last-column-id": 9,
"schemas": [
{
"type": "struct",
"fields": [
{
"type": "int",
"id": 1,
"name": "file",
"required": true
},
{
"type": "date",
"id": 2,
"name": "date",
"required": true
},
{
"type": "string",
"id": 3,
"name": "name",
"required": true
},
{
"type": "double",
"id": 4,
"name": "value",
"required": true
},
{
"type": {
"type": "list",
"element-id": 5,
"element-required": true,
"element": "long"
},
"id": 6,
"name": "values",
"required": true
},
{
"type": {
"type": "map",
"key-id": 7,
"value-id": 8,
"value-required": true,
"key": "string",
"value": "string"
},
"id": 9,
"name": "metadata",
"required": true
}
],
"schema-id": 0,
"identifier-field-ids": []
}
],
"current-schema-id": 0,
"partition-specs": [
{
"spec-id": 0,
"fields": []
}
],
"default-spec-id": 0,
"last-partition-id": 999,
"properties": {
"schema.name-mapping.default": "[{\"names\":[\"file\"],\"field-id\":1},{\"names\":[\"date\"],\"field-id\":2},{\"names\":[\"name\"],\"field-id\":3},{\"names\":[\"value\"],\"field-id\":4},{\"names\":[\"values\"],\"field-id\":6,\"fields\":[{\"names\":[\"element\"],\"field-id\":5}]},{\"names\":[\"metadata\"],\"field-id\":9,\"fields\":[{\"names\":[\"key\"],\"field-id\":7},{\"names\":[\"value\"],\"field-id\":8}]}]"
},
"snapshots": [
{
"snapshot-id": 5878344995334384993,
"sequence-number": 1,
"timestamp-ms": 1743703443468,
"manifest-list": "s3://test01/table01/metadata/snap-5878344995334384993-0-18273e34-f9c7-4b32-a6ae-29635439d33b.avro",
"summary": {
"added-data-files": "1",
"added-files-size": "1339914",
"added-records": "10000",
"operation": "append",
"total-data-files": "1",
"total-delete-files": "0",
"total-equality-deletes": "0",
"total-files-size": "1339914",
"total-position-deletes": "0",
"total-records": "10000",
"write.metadata.delete-after-commit.enabled": "true",
"write.metadata.previous-versions-max": "2"
},
"schema-id": 0
},
{
"snapshot-id": 3188376151428634650,
"parent-snapshot-id": 5878344995334384993,
"sequence-number": 2,
"timestamp-ms": 1743703451302,
"manifest-list": "s3://test01/table01/metadata/snap-3188376151428634650-0-94264ad1-a9ed-48c5-9bbf-524f870ce861.avro",
"summary": {
"added-data-files": "1",
"added-files-size": "1269857",
"added-records": "10000",
"operation": "append",
"total-data-files": "2",
"total-delete-files": "0",
"total-equality-deletes": "0",
"total-files-size": "2609771",
"total-position-deletes": "0",
"total-records": "20000",
"write.metadata.delete-after-commit.enabled": "true",
"write.metadata.previous-versions-max": "2"
},
"schema-id": 0
},
{
"snapshot-id": 7288953860945251631,
"parent-snapshot-id": 3188376151428634650,
"sequence-number": 3,
"timestamp-ms": 1743703456988,
"manifest-list": "s3://test01/table01/metadata/snap-7288953860945251631-0-80167e08-9e14-4a6d-a07b-7c6960ccc576.avro",
"summary": {
"added-data-files": "1",
"added-files-size": "1273758",
"added-records": "10000",
"operation": "append",
"total-data-files": "3",
"total-delete-files": "0",
"total-equality-deletes": "0",
"total-files-size": "3883529",
"total-position-deletes": "0",
"total-records": "30000",
"write.metadata.delete-after-commit.enabled": "true",
"write.metadata.previous-versions-max": "2"
},
"schema-id": 0
},
{
"snapshot-id": 7065159741632588651,
"parent-snapshot-id": 7288953860945251631,
"sequence-number": 4,
"timestamp-ms": 1743703462379,
"manifest-list": "s3://test01/table01/metadata/snap-7065159741632588651-0-4c11661a-ed3b-4070-a78c-7dd6a796c7c0.avro",
"summary": {
"added-data-files": "1",
"added-files-size": "1338830",
"added-records": "10000",
"operation": "append",
"total-data-files": "4",
"total-delete-files": "0",
"total-equality-deletes": "0",
"total-files-size": "5222359",
"total-position-deletes": "0",
"total-records": "40000",
"write.metadata.delete-after-commit.enabled": "true",
"write.metadata.previous-versions-max": "2"
},
"schema-id": 0
},
{
"snapshot-id": 838362928873899020,
"parent-snapshot-id": 7065159741632588651,
"sequence-number": 5,
"timestamp-ms": 1743703467541,
"manifest-list": "s3://test01/table01/metadata/snap-838362928873899020-0-e86763aa-d1dd-480f-acdf-ae5b8f839b08.avro",
"summary": {
"added-data-files": "1",
"added-files-size": "1266393",
"added-records": "10000",
"operation": "append",
"total-data-files": "5",
"total-delete-files": "0",
"total-equality-deletes": "0",
"total-files-size": "6488752",
"total-position-deletes": "0",
"total-records": "50000",
"write.metadata.delete-after-commit.enabled": "true",
"write.metadata.previous-versions-max": "2"
},
"schema-id": 0
}
],
"current-snapshot-id": 838362928873899020,
"snapshot-log": [
{
"snapshot-id": 5878344995334384993,
"timestamp-ms": 1743703443468
},
{
"snapshot-id": 3188376151428634650,
"timestamp-ms": 1743703451302
},
{
"snapshot-id": 7288953860945251631,
"timestamp-ms": 1743703456988
},
{
"snapshot-id": 7065159741632588651,
"timestamp-ms": 1743703462379
},
{
"snapshot-id": 838362928873899020,
"timestamp-ms": 1743703467541
}
],
"sort-orders": [
{
"order-id": 0,
"fields": []
}
],
"default-sort-order-id": 0,
"refs": {
"main": {
"snapshot-id": 838362928873899020,
"type": "branch"
}
}
}
Every snapshot is still in the snapshot log, even though I specified that I only want to keep 2. Also, no properties are set in the metadata file.
Am I doing something wrong?
OK, so after digging a bit more, it seems my comment above is completely off: I confused the metadata log with the snapshot log.
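To make the distinction concrete: `write.metadata.previous-versions-max` bounds the list of previous metadata files (the metadata log), not the list of snapshots. Here is a minimal, self-contained sketch of that trimming. The `MetadataLogEntry` type and `trimMetadataLog` function are hypothetical names made up for illustration and are not the iceberg-go API; treating a negative limit as zero is also an assumption.

```go
package main

import "fmt"

// MetadataLogEntry is a hypothetical, simplified stand-in for one entry of
// the table's metadata log (a pointer to a previous metadata file).
type MetadataLogEntry struct {
	MetadataFile string
	TimestampMs  int64
}

// trimMetadataLog keeps at most maxPrevious of the most recent entries.
// It returns the entries that stay in the log and the ones that dropped
// out of the window (candidates for deletion when delete-after-commit is
// enabled). The input is assumed to be ordered oldest first.
func trimMetadataLog(entries []MetadataLogEntry, maxPrevious int) (kept, removed []MetadataLogEntry) {
	if maxPrevious < 0 {
		maxPrevious = 0 // assumption: negative limits are treated as zero
	}
	if len(entries) <= maxPrevious {
		return entries, nil
	}
	cut := len(entries) - maxPrevious
	return entries[cut:], entries[:cut]
}

func main() {
	entries := []MetadataLogEntry{
		{MetadataFile: "00000.metadata.json", TimestampMs: 1},
		{MetadataFile: "00001.metadata.json", TimestampMs: 2},
		{MetadataFile: "00002.metadata.json", TimestampMs: 3},
		{MetadataFile: "00003.metadata.json", TimestampMs: 4},
	}

	kept, removed := trimMetadataLog(entries, 2)
	for _, e := range kept {
		fmt.Println("kept:   ", e.MetadataFile)
	}
	for _, e := range removed {
		fmt.Println("removed:", e.MetadataFile)
	}
}
```

Nothing in this sketch touches the snapshots or the snapshot log, which is why all 5 snapshots are still listed in the metadata file above. Removing those is the job of snapshot expiration.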
Quick remark, though: while implementing CommitTable for my custom CatalogIO, I can't really append to the metadata log.
The path for the metadata file is derived from the sequence number, but the sequence number of the new snapshot is only available once the build() method has been called on the builder.
Maybe the builder could expose a CurrentSnapshot() method?
Hi @zeroshade !
I'm working on implementing snapshot expiration.
Looking at how MetadataBuilder and Update work, I'm a bit confused about how things are supposed to fit together.
For example:
func (b *MetadataBuilder) SetDefaultSpecID(defaultSpecID int) (*MetadataBuilder, error) {
...
b.updates = append(b.updates, NewSetDefaultSpecUpdate(defaultSpecID))
b.defaultSpecID = defaultSpecID
return b, nil
}
And
func (u *setDefaultSpecUpdate) Apply(builder *MetadataBuilder) error {
_, err := builder.SetDefaultSpecID(u.SpecID)
return err
}
It looks like there is a circular pattern here: calling MetadataBuilder.SetDefaultSpecID creates a setDefaultSpecUpdate struct whose Apply method in turn calls MetadataBuilder.SetDefaultSpecID.
Yes, it's a little circular, because the builder tracks which updates are performed on the metadata. This is required so that the exact updates made to the metadata can be specified for table updates and commits.
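A stripped-down sketch of that pattern may help show why it does not loop forever. The setter records an update but never calls Apply, and Apply calls the setter exactly once. The `Builder` and `Update` types below are simplified, hypothetical stand-ins rather than the real iceberg-go definitions, and the negative-ID check is only there to give the setter an error path.

```go
package main

import "fmt"

// Update is a hypothetical, simplified table update: a recorded change
// that can be replayed against a builder.
type Update interface {
	Apply(b *Builder) error
}

// Builder is a simplified stand-in for MetadataBuilder. It holds the
// state being built and the list of updates that produced it.
type Builder struct {
	defaultSpecID int
	updates       []Update
}

type setDefaultSpecUpdate struct{ SpecID int }

// Apply replays the update by calling the setter once. The setter does
// not call Apply, so there is no infinite recursion.
func (u *setDefaultSpecUpdate) Apply(b *Builder) error {
	_, err := b.SetDefaultSpecID(u.SpecID)
	return err
}

// SetDefaultSpecID changes the builder state and records the change as
// an update so it can be sent along with a commit later.
func (b *Builder) SetDefaultSpecID(id int) (*Builder, error) {
	if id < 0 {
		return nil, fmt.Errorf("invalid spec id: %d", id)
	}
	b.updates = append(b.updates, &setDefaultSpecUpdate{SpecID: id})
	b.defaultSpecID = id
	return b, nil
}

func main() {
	// Direct use: the caller mutates the builder, and the builder
	// remembers what was done.
	src := &Builder{}
	if _, err := src.SetDefaultSpecID(3); err != nil {
		panic(err)
	}

	// Replay: the recorded updates are applied to a fresh builder, the
	// way a catalog could apply a list of updates during a commit.
	dst := &Builder{}
	for _, u := range src.updates {
		if err := u.Apply(dst); err != nil {
			panic(err)
		}
	}
	fmt.Println(dst.defaultSpecID, len(dst.updates))
}
```

Both entry points end up in the same setter, so the resulting state and the recorded update list are the same whether the change came from a direct call or from a replayed update.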
Closed by https://github.com/apache/iceberg-go/pull/401