duck-read-cache-fs icon indicating copy to clipboard operation
duck-read-cache-fs copied to clipboard

0 byte cache files

Open nicosuave opened this issue 1 month ago • 7 comments

Describe the bug In certain environments, thousands of 0 byte cache files (with .httpfs_local_cache suffix) are written and no data is ever properly cached.

Initially I thought this might have been my config, specifically something around cache_httpfs_min_disk_bytes_for_cache but disabling that had no impact.

cache_httpfs_type is on_disk if helpful

To Reproduce I attempted to build one locally, containerizing a subset of this environment, but all worked as expected (data properly written to cache files). Will keep iterating on this...

Expected behavior Cache files should not be empty

nicosuave avatar Nov 17 '25 16:11 nicosuave

This seems to be isolated to OCI VMs; very weird.

You can run the below script on an Oracle VM with uv run filename.py to reproduce

# /// script
# dependencies = ["duckdb"]
# ///

import duckdb
import os
import tempfile

# Create cache directory
cache_dir = os.path.join(tempfile.gettempdir(), 'duckdb_cache_test')
os.makedirs(cache_dir, exist_ok=True)

# Connect and configure
conn = duckdb.connect(':memory:')

# Install extensions
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")
conn.execute("INSTALL iceberg;")
conn.execute("LOAD iceberg;")
conn.execute("INSTALL cache_httpfs FROM community;")
conn.execute("LOAD cache_httpfs;")

# Configure cache_httpfs
conn.execute("SET cache_httpfs_type='on_disk';")
conn.execute(f"SET cache_httpfs_cache_directory='{cache_dir}/';")
conn.execute("SET cache_httpfs_cache_block_size=4194304;")
conn.execute("SET cache_httpfs_min_disk_bytes_for_cache=1000000000;")

# Create Iceberg secret and attach catalog
conn.execute("""
    CREATE SECRET r2_secret (
        TYPE ICEBERG,
        TOKEN ''
    );
""")

conn.execute("""
    ATTACH '' AS ib (
        TYPE ICEBERG,
        ENDPOINT ''
    );
""")

# Run query
result = conn.execute("""
""").fetchall()

print(f"Query returned {len(result)} rows")

# Check cache files
cache_files = [f for f in os.listdir(cache_dir) if os.path.isfile(os.path.join(cache_dir, f))]
print(f"Cache files created: {len(cache_files)}")

if cache_files:
    # Show first 5 file sizes
    for f in cache_files[:5]:
        filepath = os.path.join(cache_dir, f)
        size = os.path.getsize(filepath)
        print(f"  {f}: {size:,} bytes")

conn.close()

nicosuave avatar Nov 17 '25 19:11 nicosuave

Hi @nicosuave thanks for the feedback! I added a unit test to "read and cache" empty files, and it seems to be fine: https://github.com/dentiny/duck-read-cache-fs/pull/310 I actually want to know more things:

  • Does the iceberg queries succeed or not?
  • Does the "empty cache file" phenomenon apply to all cache files, or only a portion of it?

~I think it's worth adding a cached block query function as external file cache does, for easier cache inspection and debugging.~

dentiny avatar Nov 18 '25 03:11 dentiny

And you should be able to query cache block information via SELECT cache_httpfs_get_profile();

dentiny avatar Nov 18 '25 03:11 dentiny

Hi @dentiny the extension works 100% fine querying and caching properly on my local macOS env & in an ubuntu container running locally but doesn't seem to write cache file contents in an Oracle Cloud VM for some reason. Signing up for an Oracle account is probably the last thing you want to do but they have a nice free tier.

Querying seems to work fine when empty cache files are created, although with a performance hit. This seems to affect all cache files; almost feels like something filesystem related but I've not uncovered anything definitive.

I'll check that function, thank you!

nicosuave avatar Nov 18 '25 04:11 nicosuave

Hi @dentiny the extension works 100% fine querying and caching properly on my local macOS env & in an ubuntu container running locally but doesn't seem to write cache file contents in an Oracle Cloud VM for some reason.

If that's the case, I think an easier observability mechanism is logging. For example, we could log persistent files.

BTW, there're two things I would like more information:

  • Does in-memory caching works for you?
  • Does oracle cloud VM apply write permission for certain directories? (i.e., file creation succeeds, but write/sync fails)

dentiny avatar Nov 18 '25 05:11 dentiny

  • Does in-memory caching works for you?
  • Does oracle cloud VM apply write permission for certain directories? (i.e., file creation succeeds, but write/sync fails)

In memory caching seems to be working fine. I couldn't find any issues with the permissions on the Oracle VM but will keep itreating

nicosuave avatar Nov 25 '25 21:11 nicosuave

Hi @nicosuave I just upgraded the extension, which should be available to use after half a day or a day. In the latest release, I added logging for local cache file deletion and creation, which could be enabled as description at https://github.com/dentiny/duck-read-cache-fs/pull/314. Let me know if it helps. FYI: I check linkedin frequently, feel free to drop a message.

dentiny avatar Nov 26 '25 09:11 dentiny