iceberg-go

Implement expire snapshots maintenance operation

Open arnaudbriche opened this issue 9 months ago • 4 comments

Feature Request / Improvement

Hi,

Implementing a snapshot expiration maintenance operation would be really valuable. As of now, metadata (and data) grows without bound because no expiration is performed.

Reference: https://iceberg.apache.org/javadoc/1.8.1/org/apache/iceberg/ExpireSnapshots.html

arnaudbriche, Mar 31 '25 09:03
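For readers unfamiliar with the operation requested above, here is a minimal sketch of what snapshot expiration does conceptually. It is not the iceberg-go API: the `Snapshot` type, the `expireSnapshots` function, and its parameters are hypothetical names made up for illustration. The two knobs loosely mirror `expireOlderThan` and `retainLast` from the Java `ExpireSnapshots` interface linked above, with one simplification: "most recent" is decided by timestamp here, not by walking the ancestors of the current snapshot.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical, trimmed-down stand-in for a table snapshot.
type Snapshot struct {
	ID          int64
	TimestampMs int64
}

// expireSnapshots returns the snapshots to keep. A snapshot is kept if it is
// referenced by a branch or tag (protected), if it is among the retainLast
// most recent snapshots by timestamp, or if it is not older than olderThanMs.
// Everything else is considered expired.
func expireSnapshots(snaps []Snapshot, protected map[int64]bool, olderThanMs int64, retainLast int) []Snapshot {
	// Work on a copy so the caller's slice order is left untouched.
	sorted := append([]Snapshot(nil), snaps...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].TimestampMs < sorted[j].TimestampMs
	})

	kept := make([]Snapshot, 0, len(sorted))
	for i, s := range sorted {
		recent := len(sorted)-i <= retainLast
		if protected[s.ID] || recent || s.TimestampMs >= olderThanMs {
			kept = append(kept, s)
		}
	}

	return kept
}

func main() {
	// Five appends, as in a table that received five commits.
	snaps := []Snapshot{
		{ID: 1, TimestampMs: 1000},
		{ID: 2, TimestampMs: 2000},
		{ID: 3, TimestampMs: 3000},
		{ID: 4, TimestampMs: 4000},
		{ID: 5, TimestampMs: 5000},
	}
	// Snapshot 5 is the head of the "main" branch in this example.
	protected := map[int64]bool{5: true}

	for _, s := range expireSnapshots(snaps, protected, 6000, 2) {
		fmt.Println("kept snapshot", s.ID)
	}
}
```

A real implementation would also have to remove the manifest lists, manifests, and data files that become unreachable once the snapshots are dropped; the sketch only covers selecting which snapshots survive.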

After digging a bit deeper, I noticed that most of the features I need have already been implemented.

I can see this in the code:

	MetadataDeleteAfterCommitEnabledKey     = "write.metadata.delete-after-commit.enabled"
	MetadataDeleteAfterCommitEnabledDefault = false

	MetadataPreviousVersionsMaxKey     = "write.metadata.previous-versions-max"
	MetadataPreviousVersionsMaxDefault = 100

These constants are actually used in the code, so in theory I should be fine.

But they seem to be ignored, at least by AddFiles.

Test

I added 5 Parquet files one by one using tx.AddFiles with properties = {"write.metadata.delete-after-commit.enabled": true, "write.metadata.previous-versions-max": 2}.

The resulting metadata file looks like:

{
    "last-sequence-number": 5,
    "format-version": 2,
    "table-uuid": "0195fcd3-07fb-7daf-865a-d7de528a7966",
    "location": "s3://test01/table01",
    "last-updated-ms": 1743703467541,
    "last-column-id": 9,
    "schemas": [
        {
            "type": "struct",
            "fields": [
                {
                    "type": "int",
                    "id": 1,
                    "name": "file",
                    "required": true
                },
                {
                    "type": "date",
                    "id": 2,
                    "name": "date",
                    "required": true
                },
                {
                    "type": "string",
                    "id": 3,
                    "name": "name",
                    "required": true
                },
                {
                    "type": "double",
                    "id": 4,
                    "name": "value",
                    "required": true
                },
                {
                    "type": {
                        "type": "list",
                        "element-id": 5,
                        "element-required": true,
                        "element": "long"
                    },
                    "id": 6,
                    "name": "values",
                    "required": true
                },
                {
                    "type": {
                        "type": "map",
                        "key-id": 7,
                        "value-id": 8,
                        "value-required": true,
                        "key": "string",
                        "value": "string"
                    },
                    "id": 9,
                    "name": "metadata",
                    "required": true
                }
            ],
            "schema-id": 0,
            "identifier-field-ids": []
        }
    ],
    "current-schema-id": 0,
    "partition-specs": [
        {
            "spec-id": 0,
            "fields": []
        }
    ],
    "default-spec-id": 0,
    "last-partition-id": 999,
    "properties": {
        "schema.name-mapping.default": "[{\"names\":[\"file\"],\"field-id\":1},{\"names\":[\"date\"],\"field-id\":2},{\"names\":[\"name\"],\"field-id\":3},{\"names\":[\"value\"],\"field-id\":4},{\"names\":[\"values\"],\"field-id\":6,\"fields\":[{\"names\":[\"element\"],\"field-id\":5}]},{\"names\":[\"metadata\"],\"field-id\":9,\"fields\":[{\"names\":[\"key\"],\"field-id\":7},{\"names\":[\"value\"],\"field-id\":8}]}]"
    },
    "snapshots": [
        {
            "snapshot-id": 5878344995334384993,
            "sequence-number": 1,
            "timestamp-ms": 1743703443468,
            "manifest-list": "s3://test01/table01/metadata/snap-5878344995334384993-0-18273e34-f9c7-4b32-a6ae-29635439d33b.avro",
            "summary": {
                "added-data-files": "1",
                "added-files-size": "1339914",
                "added-records": "10000",
                "operation": "append",
                "total-data-files": "1",
                "total-delete-files": "0",
                "total-equality-deletes": "0",
                "total-files-size": "1339914",
                "total-position-deletes": "0",
                "total-records": "10000",
                "write.metadata.delete-after-commit.enabled": "true",
                "write.metadata.previous-versions-max": "2"
            },
            "schema-id": 0
        },
        {
            "snapshot-id": 3188376151428634650,
            "parent-snapshot-id": 5878344995334384993,
            "sequence-number": 2,
            "timestamp-ms": 1743703451302,
            "manifest-list": "s3://test01/table01/metadata/snap-3188376151428634650-0-94264ad1-a9ed-48c5-9bbf-524f870ce861.avro",
            "summary": {
                "added-data-files": "1",
                "added-files-size": "1269857",
                "added-records": "10000",
                "operation": "append",
                "total-data-files": "2",
                "total-delete-files": "0",
                "total-equality-deletes": "0",
                "total-files-size": "2609771",
                "total-position-deletes": "0",
                "total-records": "20000",
                "write.metadata.delete-after-commit.enabled": "true",
                "write.metadata.previous-versions-max": "2"
            },
            "schema-id": 0
        },
        {
            "snapshot-id": 7288953860945251631,
            "parent-snapshot-id": 3188376151428634650,
            "sequence-number": 3,
            "timestamp-ms": 1743703456988,
            "manifest-list": "s3://test01/table01/metadata/snap-7288953860945251631-0-80167e08-9e14-4a6d-a07b-7c6960ccc576.avro",
            "summary": {
                "added-data-files": "1",
                "added-files-size": "1273758",
                "added-records": "10000",
                "operation": "append",
                "total-data-files": "3",
                "total-delete-files": "0",
                "total-equality-deletes": "0",
                "total-files-size": "3883529",
                "total-position-deletes": "0",
                "total-records": "30000",
                "write.metadata.delete-after-commit.enabled": "true",
                "write.metadata.previous-versions-max": "2"
            },
            "schema-id": 0
        },
        {
            "snapshot-id": 7065159741632588651,
            "parent-snapshot-id": 7288953860945251631,
            "sequence-number": 4,
            "timestamp-ms": 1743703462379,
            "manifest-list": "s3://test01/table01/metadata/snap-7065159741632588651-0-4c11661a-ed3b-4070-a78c-7dd6a796c7c0.avro",
            "summary": {
                "added-data-files": "1",
                "added-files-size": "1338830",
                "added-records": "10000",
                "operation": "append",
                "total-data-files": "4",
                "total-delete-files": "0",
                "total-equality-deletes": "0",
                "total-files-size": "5222359",
                "total-position-deletes": "0",
                "total-records": "40000",
                "write.metadata.delete-after-commit.enabled": "true",
                "write.metadata.previous-versions-max": "2"
            },
            "schema-id": 0
        },
        {
            "snapshot-id": 838362928873899020,
            "parent-snapshot-id": 7065159741632588651,
            "sequence-number": 5,
            "timestamp-ms": 1743703467541,
            "manifest-list": "s3://test01/table01/metadata/snap-838362928873899020-0-e86763aa-d1dd-480f-acdf-ae5b8f839b08.avro",
            "summary": {
                "added-data-files": "1",
                "added-files-size": "1266393",
                "added-records": "10000",
                "operation": "append",
                "total-data-files": "5",
                "total-delete-files": "0",
                "total-equality-deletes": "0",
                "total-files-size": "6488752",
                "total-position-deletes": "0",
                "total-records": "50000",
                "write.metadata.delete-after-commit.enabled": "true",
                "write.metadata.previous-versions-max": "2"
            },
            "schema-id": 0
        }
    ],
    "current-snapshot-id": 838362928873899020,
    "snapshot-log": [
        {
            "snapshot-id": 5878344995334384993,
            "timestamp-ms": 1743703443468
        },
        {
            "snapshot-id": 3188376151428634650,
            "timestamp-ms": 1743703451302
        },
        {
            "snapshot-id": 7288953860945251631,
            "timestamp-ms": 1743703456988
        },
        {
            "snapshot-id": 7065159741632588651,
            "timestamp-ms": 1743703462379
        },
        {
            "snapshot-id": 838362928873899020,
            "timestamp-ms": 1743703467541
        }
    ],
    "sort-orders": [
        {
            "order-id": 0,
            "fields": []
        }
    ],
    "default-sort-order-id": 0,
    "refs": {
        "main": {
            "snapshot-id": 838362928873899020,
            "type": "branch"
        }
    }
}

Every snapshot is still in the snapshot log, even though I specified that I only want to keep 2. Also, no properties are set in the metadata file.

Am I doing something wrong?

arnaudbriche, Apr 03 '25 18:04

OK, so after digging a bit more, it seems my comment above is completely off: I confused the metadata log with the snapshot log.

Quick remark though: while implementing CommitTable for my custom CatalogIO, I can't really append to the metadata log. The path of the metadata file is derived from the sequence number, but the sequence number of the new snapshot is only available once the build() method has been called on the builder.

Maybe the builder could expose a CurrentSnapshot() method?

arnaudbriche, Apr 07 '25 15:04
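To make the metadata log versus snapshot log distinction from the comment above concrete, here is a small sketch of what a "previous versions max" setting acts on. The `MetadataLogEntry` type, the `trimMetadataLog` function, and the file names are hypothetical and are not taken from iceberg-go. The assumption illustrated is that `write.metadata.previous-versions-max` bounds the list of earlier metadata files, and that the trimmed files are the ones eligible for deletion when `write.metadata.delete-after-commit.enabled` is true. Snapshots are not touched by either property.

```go
package main

import "fmt"

// MetadataLogEntry is a hypothetical stand-in for one entry of the metadata
// log: a pointer to an earlier metadata file, not to a snapshot.
type MetadataLogEntry struct {
	MetadataFile string
	TimestampMs  int64
}

// trimMetadataLog keeps at most maxPrevious of the newest entries and returns
// the kept entries together with the removed ones. Values below 1 are clamped
// to 1 here as a simplifying assumption.
func trimMetadataLog(log []MetadataLogEntry, maxPrevious int) (kept, removed []MetadataLogEntry) {
	if maxPrevious < 1 {
		maxPrevious = 1
	}
	if len(log) <= maxPrevious {
		return log, nil
	}
	cut := len(log) - maxPrevious

	return log[cut:], log[:cut]
}

func main() {
	// One metadata file per commit, oldest first (made-up file names).
	log := []MetadataLogEntry{
		{MetadataFile: "v1.metadata.json", TimestampMs: 1000},
		{MetadataFile: "v2.metadata.json", TimestampMs: 2000},
		{MetadataFile: "v3.metadata.json", TimestampMs: 3000},
		{MetadataFile: "v4.metadata.json", TimestampMs: 4000},
		{MetadataFile: "v5.metadata.json", TimestampMs: 5000},
	}

	kept, removed := trimMetadataLog(log, 2)
	for _, e := range kept {
		fmt.Println("kept in metadata log:", e.MetadataFile)
	}
	for _, e := range removed {
		fmt.Println("eligible for deletion:", e.MetadataFile)
	}
}
```

Under this reading, five commits with a maximum of 2 would still leave five snapshots and five snapshot-log entries in the current metadata file, which matches the JSON shown earlier in the thread.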

Hi @zeroshade!

I'm working on implementing snapshot expiration.

Looking at how MetadataBuilder and Update work, I'm a bit confused about how things are supposed to fit together.

For example:

func (b *MetadataBuilder) SetDefaultSpecID(defaultSpecID int) (*MetadataBuilder, error) {
	...

	b.updates = append(b.updates, NewSetDefaultSpecUpdate(defaultSpecID))
	b.defaultSpecID = defaultSpecID

	return b, nil
}

And

func (u *setDefaultSpecUpdate) Apply(builder *MetadataBuilder) error {
	_, err := builder.SetDefaultSpecID(u.SpecID)

	return err
}

It looks like a circular pattern: calling MetadataBuilder.SetDefaultSpecID ends up creating a setDefaultSpecUpdate struct, whose Apply method in turn calls MetadataBuilder.SetDefaultSpecID.

arnaudbriche, Apr 07 '25 17:04

Yes, it's a little circular, because the builder tracks which updates are performed on the metadata. This is required in order to specify the exact updates made to the metadata for use with table updates and commits.

zeroshade, Apr 07 '25 17:04
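One way to read the exchange above is as a record-and-replay pattern: each setter on the builder both mutates state and records an update, and each update's Apply replays itself by calling the same setter on another builder. The sketch below is a stripped-down illustration with hypothetical `Builder` and `Update` types. It is not the real MetadataBuilder, which carries far more state and validation. It shows that the calls do not recurse, because Apply is only invoked when a recorded list of updates is replayed.

```go
package main

import "fmt"

// Update is a hypothetical, simplified version of a metadata update: it can
// name itself and replay itself against a builder.
type Update interface {
	Action() string
	Apply(b *Builder) error
}

// Builder is a hypothetical, simplified metadata builder that records every
// change it performs.
type Builder struct {
	defaultSpecID int
	updates       []Update
}

// SetDefaultSpecID changes the state and records the change as an Update.
func (b *Builder) SetDefaultSpecID(id int) (*Builder, error) {
	b.updates = append(b.updates, &setDefaultSpecUpdate{SpecID: id})
	b.defaultSpecID = id

	return b, nil
}

type setDefaultSpecUpdate struct {
	SpecID int
}

func (u *setDefaultSpecUpdate) Action() string { return "set-default-spec" }

// Apply replays the update by calling the same setter on the given builder.
func (u *setDefaultSpecUpdate) Apply(b *Builder) error {
	_, err := b.SetDefaultSpecID(u.SpecID)

	return err
}

func main() {
	// Side that makes the change: the builder records one update.
	source := &Builder{}
	if _, err := source.SetDefaultSpecID(3); err != nil {
		panic(err)
	}

	// Side that receives the list of updates: replay them on a fresh builder.
	target := &Builder{}
	for _, u := range source.updates {
		if err := u.Apply(target); err != nil {
			panic(err)
		}
	}

	fmt.Println(target.defaultSpecID, len(target.updates))
}
```

After the replay, the target builder holds the same state and the same recorded list of updates as the source.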

Closed by https://github.com/apache/iceberg-go/pull/401

arnaudbriche, Aug 07 '25 08:08