
How to use cloudpathlib without any cache or disk I/O?

Open: cdeil opened this issue 5 months ago • 3 comments

I'm using cloudpathlib for file write, read, delete as part of a web app.

Running it on AWS App Runner with S3, we are seeing intermittent issues where delete doesn't take effect: cloudpathlib claims to have deleted the file, but the next GET request shows it's still there.

Could this be a cloudpathlib caching issue?

I would like to turn caching off completely and avoid using a local disk or temp files. It should be just a stateless, in-memory HTTP application server.

Is this possible? How do I configure cloudpathlib to NOT have caching and NOT have any disk I/O?

This is the utils s3.py module I'm using at the moment:

"""File storage utilities for handling both S3 and local filesystem operations."""

import base64
import logging
from cloudpathlib import S3Client, S3Path
from fastapi import UploadFile
from stocadro.config import settings
from stocadro.core.schemas import MemFile

logger = logging.getLogger(__name__)

# S3 client configuration
s3_client = S3Client(
    aws_access_key_id=settings.S3_ACCESS_KEY_ID,
    aws_secret_access_key=settings.S3_ACCESS_KEY,
)
s3_client.set_as_default_client()

s3_image_path = S3Path(settings.s3_image_path)


class S3:
    """File storage on S3 or local filesystem with uniform interface."""

    @staticmethod
    def write(input_file: UploadFile, filename: str) -> None:
        path = s3_image_path / filename
        logger.info(f"S3 write {path=}")
        path.parent.mkdir(exist_ok=True, parents=True)
        path.write_bytes(input_file.file.read())

    @staticmethod
    def read(filename: str) -> MemFile:
        path = s3_image_path / filename
        logger.info(f"S3 read {path=}")
        return MemFile(content=path.read_bytes(), name=filename)

    @staticmethod
    def delete(filename: str) -> None:
        path = s3_image_path / filename
        logger.info(f"S3 delete: {path=}")
        path.unlink()

cdeil commented on Jul 01 '25 09:07

The current cache architecture is designed around on-disk storage, since that works best for the most common workloads. The caching docs cover the details.

Here are a few things that might help for your scenario:

  • If you're on Unix, you can mount a RAM disk and use it as the cache path (see the sketch after this list)
  • Use the "close_file" cache mode so files drop out of the cache as soon as they are closed
  • Look into using something like smart-open for the reading/writing operations.
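
A minimal sketch combining the first two suggestions; the RAM-disk path is an assumption (a tmpfs mount such as /dev/shm on Linux hosts), and the credential settings follow the snippet above:

from cloudpathlib import S3Client

s3_client = S3Client(
    aws_access_key_id=settings.S3_ACCESS_KEY_ID,
    aws_secret_access_key=settings.S3_ACCESS_KEY,
    # assumption: a tmpfs/RAM-backed directory is available here (Linux only)
    local_cache_dir="/dev/shm/cloudpathlib-cache",
    # evict the cached copy as soon as the file handle is closed
    file_cache_mode="close_file",
)
s3_client.set_as_default_client()

With "close_file", cached bytes are removed when the handle closes, so they don't outlive a single request.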

pjbull commented on Jul 01 '25 20:07

@pjbull - thanks for the advice and pointers!

I did change to file_cache_mode="close_file" today:

s3_client = S3Client(
    aws_access_key_id=settings.S3_ACCESS_KEY_ID,
    aws_secret_access_key=settings.S3_ACCESS_KEY,
    file_cache_mode="close_file",
)

It did resolve the issue that GET after DELETE was still showing the file.

I don't want to look into RAM disk mounting, partly because I want our solution to just work on macOS as well as on Linux (we run on AWS App Runner). I guess I'll take the performance hit of doing unnecessary disk I/O for now.

I do feel it would be worth stating explicitly somewhere that this package isn't really designed for stateless, in-memory, non-cached usage; it's very common for web apps to not want any caching in their API layer.

Also I noticed that the TOC for the different caching options here isn't working properly:

[screenshot of the caching docs table of contents]

But thanks for making this package! It nicely solved our wish to keep PROD S3 and local testing as similar as possible by just using LocalS3Client/LocalS3Path in the tests and switching that over in one pytest fixture.
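
For anyone finding this later, a rough sketch of that kind of fixture; the module path stocadro.utils.s3 is an assumption based on the snippet earlier in this thread:

import pytest
from cloudpathlib.local import LocalS3Client, LocalS3Path

import stocadro.utils.s3 as s3_utils  # hypothetical module path for the snippet above


@pytest.fixture
def local_s3(monkeypatch):
    # Register a local, on-disk stand-in as the default client for S3-style paths
    client = LocalS3Client()
    client.set_as_default_client()
    # Point the module-level path at a LocalS3Path so S3.write/read/delete
    # operate against the local stand-in instead of real S3
    monkeypatch.setattr(s3_utils, "s3_image_path", LocalS3Path("s3://test-bucket/images/"))
    return client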

cdeil commented on Jul 01 '25 21:07

Great, thanks for the details and for flagging the bug in the docs.

You likely already know this, but it's worth noting that it is also a common pattern to use presigned URLs to pass data between the object store and a web application's clients, rather than transferring any data through your server directly.
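
As a rough illustration of that pattern with boto3 (bucket and key names are placeholders, not from this thread):

import boto3

s3 = boto3.client("s3")

# URL a browser can GET for the next 15 minutes, fetching straight from S3
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "images/example.png"},
    ExpiresIn=900,
)

# URL a browser can PUT to upload directly, bypassing the app server entirely
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "images/example.png"},
    ExpiresIn=900,
)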

pjbull commented on Jul 02 '25 04:07