iceberg_scan issues RPCs even when fully cached
On DuckDB 1.2.2:
SELECT ... FROM iceberg_scan('s3://.../metadata/00007-1c0ec89f-bbfd-45d5-acb1-0adcaf8a684f.metadata.json')
Re-run the same query and check the profile - everything is cached:
cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)
metadata cache hit count = 106
metadata cache miss count = 0
data cache hit count = 40
data cache miss count = 0
file handle cache hit count = 27
file handle cache miss count = 0
glob cache hit count = 0
glob cache miss count = 0
Turn off network. Re-run the same query:
Invalid Input Error:
Avro file s3://.../metadata/snap-432837495865380287-1-fb802b6f-730c-4fcd-bd1d-cab1870347c7.avro not found
Pure parquet seems to work fine in offline mode (SELECT * FROM read_parquet(...)).
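For reference, a condensed sketch of the repro above in one script; the S3 paths and credentials are placeholders, and it assumes the cache_httpfs extension is installed from the community extension repository and loaded alongside avro and iceberg:
-- Placeholder credentials and paths; substitute your own bucket and metadata file.
INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;
INSTALL avro; LOAD avro;
INSTALL iceberg; LOAD iceberg;
SET s3_access_key_id='xxx';
SET s3_secret_access_key='xxx';
SET s3_region='us-west-1';
SET cache_httpfs_profile_type='temp';
SELECT cache_httpfs_clear_cache();
-- First run populates the on-disk cache.
SELECT COUNT(*) FROM iceberg_scan('s3://<bucket>/<table>/metadata/<version>.metadata.json');
-- Second run in the same instance: the profile should report only hits.
SELECT COUNT(*) FROM iceberg_scan('s3://<bucket>/<table>/metadata/<version>.metadata.json');
SELECT cache_httpfs_get_profile();
-- With the network disabled, the same iceberg_scan fails on the snapshot .avro file,
-- while a direct parquet read is still served from the cache:
SELECT * FROM read_parquet('s3://<bucket>/<table>/data/<file>.parquet');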
Hi @serge-melnik, thank you for trying out this extension!
I haven't executed the query on my end yet. When you say "re-run the same query", do you mean shut down duckdb and re-execute, or execute it in the same duckdb instance? I'm asking because metadata is cached in-memory.
I will take a closer look over the weekend.
Hi @serge-melnik, thanks for your report! I just checked and I don't see a cache issue; I'm copying my response from another thread here. I'm wondering what your config is?
The commands I executed:
-- Setup commands.
D SET s3_access_key_id='xxx';
D SET s3_secret_access_key='xxx';
D SET s3_region='us-west-1';
D INSTALL avro;
D LOAD avro;
D FORCE INSTALL iceberg FROM core_nightly;
D LOAD iceberg;
D SET cache_httpfs_profile_type='temp';
D SELECT cache_httpfs_clear_cache();
┌────────────────────────────┐
│ cache_httpfs_clear_cache() │
│ boolean │
├────────────────────────────┤
│ true │
└────────────────────────────┘
-- First access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642');
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 125000 │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-1.txt';
-- Second access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642');
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 125000 │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-2.txt';
-- Third access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642/metadata/v2.metadata.json');
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 125000 │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-3.txt';
First access stats output is:
cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)
metadata cache hit count = 66
metadata cache miss count = 22
data cache hit count = 0
data cache miss count = 22
file handle cache hit count = 0
file handle cache miss count = 30
glob cache hit count = 0
glob cache miss count = 0
<emit latency>
Second access stats output:
cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)
metadata cache hit count = 154
metadata cache miss count = 22
data cache hit count = 22
data cache miss count = 22
file handle cache hit count = 30
file handle cache miss count = 30
glob cache hit count = 0
glob cache miss count = 0
Third access stats output:
cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)
metadata cache hit count = 240
metadata cache miss count = 22
data cache hit count = 43
data cache miss count = 22
file handle cache hit count = 59
file handle cache miss count = 30
glob cache hit count = 0
glob cache miss count = 0
@dentiny it seems to be an issue if you ATTACH an iceberg REST catalog rather than use iceberg_scan
Hi @nicosuave, thanks for the feedback! Could you please provide a minimal reproduction script?
ATTACH goes through a DuckDB storage extension, which can basically do anything it wants.
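For context, a minimal sketch of the ATTACH path being discussed, as opposed to a direct iceberg_scan; the warehouse name, endpoint, schema, and table are placeholders, and the exact auth options depend on the iceberg extension version and catalog setup:
-- Hypothetical REST catalog attach; names and endpoint are placeholders.
ATTACH 'my_warehouse' AS rest_catalog (
    TYPE ICEBERG,
    ENDPOINT 'https://rest-catalog.example.com'
);
-- Reads now go through the iceberg storage extension rather than iceberg_scan():
SELECT COUNT(*) FROM rest_catalog.my_schema.my_table;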
Hi @dentiny, I was incorrect; I have things working now. It was just a misconfiguration on my side.