iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Implement caching of manifest-files

Open Fokko opened this issue 1 year ago • 2 comments

Feature Request / Improvement

We currently loop over the manifests of a snapshot often just once. But now when we're compounding the operations (DELETE+APPEND), there is a fair chance that read a manifest more than once. The spec states that manifest files are immutable, this means that we can cache is locally using a method annotated with the lru_cache.

Fokko avatar Apr 10 '24 08:04 Fokko

I am trying to working on this, is it possible to assign it to me?

swapdewalkar avatar Apr 10 '24 14:04 swapdewalkar

@swapdewalkar Thanks for picking this up! I've just assigned it to you

Fokko avatar Apr 10 '24 14:04 Fokko

Hi @swapdewalkar I wanted to check in and see if you have any updates on this task. If you need any assistance or if there are any obstacles, please let me know—I will be happy to help!

MehulBatra avatar May 18 '24 15:05 MehulBatra

Hi, can we increase the scope of this issue to cache/store all_manifests, data_manifests & delete_manifests? Or do I create a new issue for this? This feature would be useful for tasks like Incremental Scans (Append, Changelog, etc) where we frequently access manifest files. I imagine we would like this feature to be similar to the java implementation.

Also, since @swapdewalkar hasn't responded yet and if they do not have the time/bandwidth for the issue, I'm happy to give this a shot! :)

chinmay-bhat avatar May 26 '24 18:05 chinmay-bhat

@chinmay-bhat I think we can generalize this quite easily, since from the spec:

Once written, data and metadata files are immutable until they are deleted.

I think we could go as easy to have a lru-cache based on the path to the metadata to cache it :)

Fokko avatar May 26 '24 18:05 Fokko

Thanks @Fokko for the quick response! Really appreciate it!

based on the path to the metadata to cache it

I'm not clear on this. Are you saying we can simply add lru_cache to def manifests(self, io: FileIO) in class Snapshot? And then whenever we need data manifests or delete manifests, we iterate over the cached manifests? Wouldn't it be better to cache those too, since as you said, the files are immutable?

For ex:

@lru_cache
def manifests(self, io: FileIO):
......


@lru_cache
def data_manifests(self):
   return [manifest_file for manifest_file in self.manifests(self.io) if manifest_file.content == ManifestContent.DATA]

chinmay-bhat avatar May 26 '24 19:05 chinmay-bhat

@chinmay-bhat I don't think it is as easy as that. We should ensure that the manifest_list path is part of the cache. We could share the cache between calls, since if you do subsequent queries, and the snapshot hasn't been updated, this would speed up the call quite a bit.

We could also make the FileIO part of the caching key. I don't think that's stricktly required, but if something changed in the FileIO we might want to invalidate the cache, but I'm open to arguments here.

Fokko avatar May 26 '24 20:05 Fokko

Thank you for clarifying! Here's how I imagine manifests() would look like :)

@lru_cache()
def manifests(self, io: FileIO, manifest_list_location: str) -> List[ManifestFile]:
    if manifest_list_location is not None:
        file = io.new_input(manifest_list_location)
        return list(read_manifest_list(file))
    return []

When we call snapshot.manifests(self.io, snapshot.manifest_list), if manifest_list is the same, we simply query the cached files. But if the snapshot is updated, manifest_list is also updated, and calling manifests() triggers a re-read of manifest files. Is this similar to what you have in mind?

chinmay-bhat avatar May 27 '24 06:05 chinmay-bhat