duck-read-cache-fs

iceberg_scan issues RPCs even when fully cached

serge-melnik opened this issue 7 months ago · 5 comments

On DuckDB 1.2.2:

SELECT ... FROM iceberg_scan('s3://.../metadata/00007-1c0ec89f-bbfd-45d5-acb1-0adcaf8a684f.metadata.json')

Re-run the same query and check the profile; everything is cached:
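(For reference, the profile dump below can be produced with a query along these lines; this assumes cache_httpfs_profile_type has been set to 'temp', as in the maintainer's session later in this thread.)

D SET cache_httpfs_profile_type='temp';
D SELECT cache_httpfs_get_profile();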

cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)

metadata cache hit count = 106
metadata cache miss count = 0

data cache hit count = 40
data cache miss count = 0

file handle cache hit count = 27
file handle cache miss count = 0

glob cache hit count = 0
glob cache miss count = 0

Turn off the network and re-run the same query:

Invalid Input Error:
Avro file s3://.../metadata/snap-432837495865380287-1-fb802b6f-730c-4fcd-bd1d-cab1870347c7.avro not found

Pure parquet seems to work fine in offline mode (SELECT * FROM read_parquet(...)).
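A minimal offline check would look something like the following; the bucket and path here are hypothetical stand-ins for the elided one:

-- Network off; the file was read (and cached) earlier in the same setup.
D SELECT * FROM read_parquet('s3://my-bucket/data/part-0.parquet');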

serge-melnik commented May 23 '25 00:05

Hi @serge-melnik, thank you for trying out this extension!

I haven't executed the query on my end yet. When you say "re-run the same query", do you mean shut down DuckDB and re-execute, or execute in the same DuckDB instance? I ask because metadata is cached in-memory.
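To make the distinction concrete, the two scenarios look roughly like this sketch (the paths are placeholders, and the LOAD line stands in for however the extension is set up):

-- Same DuckDB instance: the in-memory metadata cache is still warm.
D SELECT COUNT(*) FROM iceberg_scan('s3://bucket/table/metadata/v1.metadata.json');
D SELECT COUNT(*) FROM iceberg_scan('s3://bucket/table/metadata/v1.metadata.json');

-- Fresh instance: only the on-disk cache survives the restart.
$ duckdb
D LOAD cache_httpfs;
D SELECT COUNT(*) FROM iceberg_scan('s3://bucket/table/metadata/v1.metadata.json');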

I will take a further look over the weekend.

dentiny commented May 23 '25 11:05

Hi @serge-melnik, thanks for your report! I just checked and I don't see a cache issue; I'm copying my response from another thread here. What's your config?

The commands I executed:

-- Setup commands.
D SET s3_access_key_id='xxx';
D SET s3_secret_access_key='xxx';
D SET s3_region='us-west-1';
D INSTALL avro;
D LOAD avro;
D FORCE INSTALL iceberg FROM core_nightly;
D LOAD iceberg;
D SET cache_httpfs_profile_type='temp';
D SELECT cache_httpfs_clear_cache();
┌────────────────────────────┐
│ cache_httpfs_clear_cache() │
│          boolean           │
├────────────────────────────┤
│ true                       │
└────────────────────────────┘

-- First access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642');
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    125000    │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-1.txt';

-- Second access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642');
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    125000    │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-2.txt';

-- Third access.
D SELECT COUNT(*) FROM iceberg_scan('s3://moonlink-test-s3-hao/public/16384.98642/metadata/v2.metadata.json');
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    125000    │
└──────────────┘
D COPY (SELECT cache_httpfs_get_profile()) TO '/tmp/output-3.txt';

First access stats output is:

cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)

metadata cache hit count = 66
metadata cache miss count = 22

data cache hit count = 0
data cache miss count = 22

file handle cache hit count = 0
file handle cache miss count = 30

glob cache hit count = 0
glob cache miss count = 0

<emit latency>

Second access stats output:

cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)

metadata cache hit count = 154
metadata cache miss count = 22

data cache hit count = 22
data cache miss count = 22

file handle cache hit count = 30
file handle cache miss count = 30

glob cache hit count = 0
glob cache miss count = 0

Third access stats output:

cache_httpfs_get_profile()
"For temp profile collector and stats for on_disk_cache_reader (unit in milliseconds)

metadata cache hit count = 240
metadata cache miss count = 22

data cache hit count = 43
data cache miss count = 22

file handle cache hit count = 59
file handle cache miss count = 30

glob cache hit count = 0
glob cache miss count = 0
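Note that the miss counts stay flat across all three accesses (metadata 22, data 22, file handle 30) while the hit counts keep growing, i.e. the re-runs are served entirely from cache.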

dentiny commented Jul 27 '25 04:07

@dentiny it seems to be an issue if you ATTACH an Iceberg REST catalog rather than use iceberg_scan.

nicosuave commented Nov 17 '25 00:11

> @dentiny it seems to be an issue if you ATTACH an Iceberg REST catalog rather than use iceberg_scan.

Hi @nicosuave, thanks for the feedback! Could you please provide a minimal reproduction script? ATTACH goes through DuckDB's storage extension interface, which can basically do anything.
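For reference, a minimal ATTACH-based repro might look something like the following sketch; the endpoint, token, catalog, namespace, and table names are placeholders, not taken from this thread:

-- Attach an Iceberg REST catalog instead of calling iceberg_scan directly.
D INSTALL iceberg; LOAD iceberg;
D CREATE SECRET iceberg_secret (TYPE ICEBERG, TOKEN 'xxx');
D ATTACH 'warehouse' AS rest_catalog (TYPE ICEBERG, ENDPOINT 'http://127.0.0.1:8181');
D SELECT COUNT(*) FROM rest_catalog.my_namespace.my_table;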

dentiny commented Nov 17 '25 00:11

Hi @dentiny, I was incorrect; I have things working now. It was just a misconfiguration on my side.

nicosuave commented Nov 17 '25 02:11