Cache Manifest files
Closes #595.
One test is failing in CI (test_duckdb_url_import).
FAILED tests/integration/test_writes/test_writes.py::test_duckdb_url_import - duckdb.duckdb.IOException: IO Error: Failed to download extension "iceberg" at URL "http://extensions.duckdb.org/v0.10.3/linux_amd64_gcc4/iceberg.duckdb_extension.gz"
Extension "iceberg" is an existing extension.
(ERROR Connection)
= 1 failed, 797 passed, 5 skipped, 2735 deselected, 160 warnings in 261.13s (0:04:21) =
make: *** [Makefile:46: test-integration] Error 1
Error: Process completed with exit code 2.
Seems to be a download error and not due to changes in this PR. Also, the same test passes locally. Can we re-trigger the tests?
Hmm, it's my first time seeing this error. I've merged a PR that bumps duckdb to 1.0.0: #793. Hopefully that fixes the issue here.
BTW, thanks for working on this!
@Fokko @HonahX Thanks for the review!!
@chinmay-bhat @Fokko @HonahX Sorry for the last-minute feedback, but wouldn't it be better to explicitly pass maxsize=128 to the lru_cache decorator? When a new call comes in, the decorator evicts the least recently used of the existing 128 entries to make room for the new item. Spelling the size out would make the behavior clearer to the end user and improve code readability. In the near future we could also take the size as a user config via pyiceberg.yaml.
@lru_cache(maxsize=128)
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
"""Return the manifests from the manifest list."""
file = io.new_input(manifest_list)
return list(read_manifest_list(file))
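To illustrate the eviction behavior described above (a standalone toy example, not part of the PR; `square` is a made-up function with a deliberately tiny cache so eviction is easy to observe):

```python
from functools import lru_cache

@lru_cache(maxsize=2)  # tiny cache so eviction is easy to observe
def square(n: int) -> int:
    return n * n

square(1)  # miss -> cached
square(2)  # miss -> cached (cache now full)
square(1)  # hit  -> entry for 1 becomes most recently used
square(3)  # miss -> evicts 2, the least recently used entry
info = square.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 3 2
```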
I'm not sure why test_duckdb_url_import continues to fail with the same error. I've rebased onto latest main branch, and the test runs locally. I also don't see this issue in other open PyIceberg PRs which have recently cleared all tests.
Is this a duck db problem? Or do I need to open a new PR (from main branch + my changes) to resolve it?
Being able to configure (and also disable) the caching would be a very nice touch
In pyiceberg/table/snapshots.py https://github.com/apache/iceberg-python/pull/787#discussion_r1632758206 :
@@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool:
@lru_cache
As the default is already 128 (https://docs.python.org/3/library/functools.html#functools.lru_cache), I don't think we need to explicitly define maxsize=128.
Also, _manifests() is not a public API, so we would probably not set the cache size through user config via pyiceberg.yaml. Is there a use-case where the user would want to set the cache size?
@chinmay-bhat Looks like the CI is still a bit sad :(
Being able to configure (and also disable) the caching would be a very nice touch
Hi @Fokko, I'm curious to know why we would want to support custom-sized manifest caches. I agree the user should be able to disable the cache.
Here's my initial implementation for configuring and disabling cache:
<top of file>
MANIFEST_CACHE_SIZE = "manifest-cache-size"

config = Config()
# fall back to lru_cache's default of 128 when no config value is set
manifest_cache_size = config.get_int(MANIFEST_CACHE_SIZE) or 128
....
@lru_cache(maxsize=manifest_cache_size)
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
    """Return the manifests from the manifest list."""
    file = io.new_input(manifest_list)
    return list(read_manifest_list(file))

class Snapshot:
    ....
    def manifests(self, io: FileIO, use_cache: bool = True) -> List[ManifestFile]:
        """Return the manifests for the given snapshot."""
        if self.manifest_list:
            if use_cache:
                return _manifests(io, self.manifest_list)
            else:
                # bypass the cache: clear any cached entries, then call the undecorated function
                _manifests.cache_clear()
                return _manifests.__wrapped__(io, self.manifest_list)
        return []
Is this in the right direction? For reference: how to disable lru_cache and lru_cache doc
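For reference, the mechanics of cache_clear() and __wrapped__ used above can be sketched with a toy function (`load` here is hypothetical, standing in for `_manifests`):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def load(key: str) -> str:
    return f"loaded:{key}"

load("a")                                  # populates the cache
assert load.cache_info().currsize == 1

load.cache_clear()                         # drops all cached entries
assert load.cache_info().currsize == 0

# __wrapped__ is the undecorated function: calling it bypasses the
# cache entirely and does not insert a new entry
assert load.__wrapped__("a") == "loaded:a"
assert load.cache_info().currsize == 0
```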
Looks like the CI is still a bit sad :(
@Fokko @HonahX I need help resolving this error. Error seems unrelated to the changes in this PR. I've added more info in this comment.
Is this a duck db problem? Or do I need to open a new PR (from main branch + my changes) to resolve it?
I tried this PR locally and did not observe this issue. My testing platforms are
- MacOS 14.5 - Apple Silicon
- Ubuntu 24.04 - Intel x86/64
It seems this issue can only be reproduced in GitHub Actions 😱:
- with lru_cache, test failed: https://github.com/HonahX/iceberg-python/pull/4
- without lru_cache, test passed: https://github.com/HonahX/iceberg-python/pull/5
I created a new PR against my fork, and once the GitHub actions failed, I manually re-tried them. https://github.com/chinmay-bhat/iceberg-python/pull/1/checks?sha=8c2e79a9c62f98c51eb56ae65369a7bf3f6d49f4
With lru_cache:
- Python Integration / integration-test (make test-integration) passes.
- Python CI / lint-and-test (make test-coverage) fails.
Also, with lru_cache, test-integration and test-coverage both pass locally.
My environment is MacOS 14.4.1 - Apple Silicon M2
FYI this is caused by transient networking failures when downloading the duckdb iceberg extension. Using the pre-release duckdb library ([email protected]) resolves this issue because retry was added in duckdb#13122.
See duckdb#13808 for more info. Might need to wait for duckdb v1.1.0 release
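The retry itself lives inside duckdb's extension installer, but the general pattern it adds (retrying a transient failure with exponential backoff) looks roughly like this hypothetical helper:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retry(fn: Callable[[], T], attempts: int = 3, backoff: float = 1.0) -> T:
    """Call fn(), retrying transient ConnectionErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original error
            time.sleep(backoff * 2 ** attempt)  # e.g. 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```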
#1149 bumps duckdb to v1.1.0, which includes the retry for extension installation
@chinmay-bhat do you mind rebasing off main to re-run CI?
rebased off main!