
Add an S3 backend

Open shaypal5 opened this issue 4 years ago • 4 comments

For me, the most probable use case in the near future is an S3-backed persistent cache.

shaypal5 avatar Dec 01 '20 08:12 shaypal5

I'm looking for this feature and could possibly create a PR. I think the _MongoCore class would be a good starting point, no? Where do you see the complexity? Locking objects during evaluation?

pumelo avatar Jan 08 '22 20:01 pumelo

Hey,

A PR would be great! And indeed the _MongoCore class would be the best starting point. I guess entry locking would be a challenge. I think the largest amount of work is hidden in developing a flexible entry format. Perhaps the best solution is to imitate the MongoDB entry structure and just have a .json file corresponding to each serialized binary file, with the JSON containing all the entry data and the two together comprising a cache entry.
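To make that concrete, here is a minimal sketch of such a paired entry, assuming boto3 and a placeholder bucket; none of these names exist in cachier, it's just an illustration of the json-plus-binary idea:

```python
import json
import pickle
from datetime import datetime, timezone

import boto3  # assumed dependency for an S3 core

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # placeholder bucket name


def _set_entry(func_name: str, key: str, value) -> None:
    """Write a cache entry as a .json metadata object plus a .bin pickle."""
    meta = {
        "key": key,
        "time": datetime.now(timezone.utc).isoformat(),
        "being_calculated": False,  # mirrors the MongoDB entry fields
    }
    s3.put_object(Bucket=BUCKET, Key=f"{func_name}/{key}.json",
                  Body=json.dumps(meta).encode())
    s3.put_object(Bucket=BUCKET, Key=f"{func_name}/{key}.bin",
                  Body=pickle.dumps(value))
```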

Also, maybe the implementation of a search, though nothing too sophisticated, I guess. It would be linear in the number of entries in some way, and you would probably have to use the S3 functionality of listing all objects with a certain key prefix: either to get all cache entries for a certain function, or just to get the one with a specific function-key combo. This would mean naming objects in a certain way, like func/key.json and func/key.bin, and then searching for func/key. I think.
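For illustration, the prefix-based lookup could look roughly like this with boto3 (again, bucket and helper names are placeholders, not an existing cachier API):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # placeholder bucket name


def _list_entries(func_name: str):
    """Yield all cache entry keys for one function via a key-prefix scan."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{func_name}/"):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def _get_entry_blob(func_name: str, key: str) -> bytes:
    """Fetch the serialized value for one specific function/key combo."""
    resp = s3.get_object(Bucket=BUCKET, Key=f"{func_name}/{key}.bin")
    return resp["Body"].read()
```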

Let me know what you think.

shaypal5 avatar Jan 09 '22 10:01 shaypal5

S3 supports adding metadata as HTTP headers to each object. I think this could serve the purpose of the additional .json. When doing a HEAD request only the headers are returned, so this could be a cheap way to check whether the object is still valid.
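Roughly what that could look like with boto3 (bucket name and the staleness policy are placeholders, just to show the HEAD-only check):

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # placeholder bucket name
STALE_AFTER = timedelta(hours=1)  # placeholder expiry policy


def _put_with_metadata(func_name: str, key: str, blob: bytes) -> None:
    # User metadata is stored and returned as x-amz-meta-* HTTP headers.
    s3.put_object(Bucket=BUCKET, Key=f"{func_name}/{key}.bin", Body=blob,
                  Metadata={"cachier-key": key, "being-calculated": "false"})


def _is_entry_fresh(func_name: str, key: str) -> bool:
    # HEAD returns only the headers (including the metadata), not the object
    # body, so this check never downloads the cached value itself.
    head = s3.head_object(Bucket=BUCKET, Key=f"{func_name}/{key}.bin")
    age = datetime.now(timezone.utc) - head["LastModified"]
    return (age < STALE_AFTER
            and head["Metadata"].get("being-calculated") == "false")
```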

Automatic expiration directly inside S3 would be nice too, but that seems to be supported only as per-bucket configuration ( https://stackoverflow.com/questions/12185879/s3-per-object-expiry ); however, up to 100 different expiry rules could be added. It looks like this is really an advanced feature.
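For what it's worth, such a rule could be scoped to one function by reusing the func/ key prefix from above; a rough sketch with placeholder names (note that, as far as I know, lifecycle filters match key prefixes or object tags, not user metadata):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # placeholder bucket name


def _set_expiry_for_function(func_name: str, days: int) -> None:
    # Sketch only: put_bucket_lifecycle_configuration replaces the bucket's
    # whole lifecycle configuration, so a real core would first read the
    # existing rules and merge them, staying within the per-bucket rule limit.
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": f"cachier-expire-{func_name}",
                    "Filter": {"Prefix": f"{func_name}/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": days},
                }
            ]
        },
    )
```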

If the caching is for an asset served over HTTP, S3 would even make it possible to offload delivery of the object to S3 and use pre-signed URLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html
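Generating such a URL is basically a one-liner with boto3; a rough sketch, with placeholder names and expiry:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # placeholder bucket name


def _presigned_entry_url(func_name: str, key: str, expires_in: int = 3600) -> str:
    """Return a time-limited URL so clients can fetch the cached asset straight from S3."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": f"{func_name}/{key}.bin"},
        ExpiresIn=expires_in,
    )
```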

Thinking about locking: is it really required? Do you want to guarantee that each object is only generated once? If yes, that is not possible with S3 alone. Without locking it could happen that multiple processes generate the object at the same time and upload it to S3, and S3 will happily just serve the last object uploaded. So the locking is probably not required ...

What are your thoughts?

pumelo avatar Jan 09 '22 11:01 pumelo

  1. Metadata sounds great.

  2. Automatic expiration per object would have been amazing, but 100 different rules can mean we support up to 100 different functions per bucket, each with a different lifecycle setting! That sounds like enough for most use cases to me. I think that if a rule can condition that all objects with a certain metadata attribute have a lifecycle of X hours/days, that's enough for an awesome implementation that supports up to 100 different functions, and we can always warn users about this limitation. The MongoDB core anyway has the issue that clean-up needs to be done manually, as I didn't want to write any daemon process to take care of that (it feels a bit out of scope for the package).

  3. No, we don't have to guarantee it, but note that the implementation must prevent such duplicate computations for the vast majority of function calls, otherwise the package does essentially nothing; the whole point is to save redundant calculations. It's fine not to be able to guarantee it for two calls that are very close in time (and obviously this also depends on the function's computation duration). We can start with no locking and keep it as an open issue for a future enhancement/feature.

shaypal5 avatar Jan 10 '22 12:01 shaypal5