Refactor client cache to use httpx transport
By using a custom httpx Transport we can remove significant complexity from the caching logic.
There is a lot about how this is factored that makes sense to me: https://github.com/obendidi/httpx-cache/tree/main/httpx_cache
I like:
- Use of a Transport to inject into Client at the right layer (a rough sketch of what that looks like follows this list)
- Cache Controller (which status codes, methods, etc)
- Pluggable serializer
- Avoiding the complexity of separate URL -> ETag and ETag -> content mappings. That optimizes for a pretty rare situation (the same content at different URLs) and adds lookup overhead as well as complexity. I should strip that out.
- A default cache location in ~/.cache/tiled? Maybe? Once this is robust, and as long as NFS does not actually make this slower.
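For concreteness, this is roughly what injecting a cache at the Transport layer looks like with httpx's public API. This is a hand-rolled sketch with a toy dict cache, not httpx-cache's actual classes, and it ignores streaming and validation:

```python
import httpx


class DictCache:
    """Toy in-memory cache keyed by method and URL (no expiry or revalidation logic)."""

    def __init__(self):
        self._data = {}

    def get(self, request):
        return self._data.get((request.method, str(request.url)))

    def set(self, request, response):
        self._data[(request.method, str(request.url))] = response


class CacheTransport(httpx.BaseTransport):
    """Serve responses from a cache when possible; otherwise delegate to the wrapped transport."""

    def __init__(self, transport, cache):
        self._transport = transport  # the "real" transport, e.g. httpx.HTTPTransport()
        self._cache = cache

    def handle_request(self, request):
        cached = self._cache.get(request)
        if cached is not None:
            return cached
        response = self._transport.handle_request(request)
        response.read()  # load the body so it can be stored; streaming is ignored in this sketch
        self._cache.set(request, response)
        return response


# The injection point: the Client is constructed with the caching transport.
client = httpx.Client(transport=CacheTransport(httpx.HTTPTransport(), DictCache()))
```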
I question:
- There is no mechanism for bounding the cache that I can find. We want to keep our scoring mechanism.
- Serializing with msgpack forces us to deserialize the whole (maybe large) body before we know whether the cache entry is expired. Maybe we can pack the content and/or stream in a second, separate msgpack chunk such that we do not read it if we know we won't use it (see the sketch after this list).
- The RW locks are correct (and better than my messy “reservation” idea) but do not work for multiprocess clients. I think I need a RW lock on top of locket.
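One way to do the "second, separate msgpack chunk" idea from the list above: write the headers/metadata and the body as two consecutive msgpack objects, so a reader can unpack the first, check freshness, and never parse the body of an expired entry. A rough sketch (not httpx-cache's actual serializer):

```python
import msgpack


def write_entry(file, metadata, body):
    # Two consecutive msgpack objects: small metadata first, (possibly large) body second.
    file.write(msgpack.packb(metadata))
    file.write(msgpack.packb(body))


def read_metadata(file):
    # Deserialize only the first object; the body chunk is left untouched unless we need it.
    return msgpack.Unpacker(file, raw=False).unpack()
```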
Our current implementation makes a half-hearted attempt at multi-process support, but it's flawed in a couple of ways. I think it would be better to explicitly make the file-based cache single-process, maybe even adding a pidfile lock. If we do that, it opens up the possibility of using SQLite to store header info, size, etc.
I think we should change the internal cache layout from many-small-separate-files to a single SQLite database file, referencing some external files only for large data (MB-scale image chunks, not KB-scale metadata). This will enable better read and write performance:
https://www.sqlite.org/fasterthanfs.html https://www.sqlite.org/intern-v-extern-blob.html
SQLite supports concurrent reads from multiple processes. It manages concurrent writes with an internal locking mechanism, though I have questions about how that performs on distributed file systems (e.g. NFS, Lustre), both in terms of speed and risk of corruption.
I think it makes sense to steer multiprocess writing to a dedicated bulk download utility. It will manage multiple processes, perhaps writing to separate SQLite files and merging them at the end. The merged file will be safe for multiprocess reading.
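If the bulk download utility does go the one-SQLite-file-per-worker route, the merge step can be plain SQLite. A sketch, assuming a hypothetical `responses` table shared by all the files:

```python
import sqlite3

# Merge per-worker databases into the main cache file.
main = sqlite3.connect("cache.db")
for worker_file in ["worker-0.db", "worker-1.db"]:
    main.execute("ATTACH DATABASE ? AS worker", (worker_file,))
    main.execute("INSERT OR IGNORE INTO responses SELECT * FROM worker.responses")
    main.commit()  # must commit before detaching
    main.execute("DETACH DATABASE worker")
main.close()
```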
Notes from a design conversation with @tacaswell, working from the top down...
In the client download() method, let the user optionally specify which cache to download into. (Otherwise, use the first persistent cache configured, and raise if there isn't one.) Also add an evict() method that removes the node. I'm not sure these names are good.
class BaseClient:
    def download(self, cache=None):
        # If cache is None, use first persistent cache.
        ...

    def evict(self, cache=None):
        # If cache is None, use first persistent cache.
        ...
The offline state is a concern of the Transport. It is just aware of one "cache" which may fan out to multiple caches internally.
class Transport:
    def __init__(self, cache, offline=False):
        ...
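A fuller version of that stub, showing how the offline flag might short-circuit the network. The error type and the online path are placeholders, and the subclassing of httpx.BaseTransport is assumed from the sketch earlier:

```python
import httpx


class CacheMissInOfflineMode(Exception):
    """Hypothetical error: offline mode was requested but the response is not cached."""


class Transport(httpx.BaseTransport):
    def __init__(self, cache, offline=False):
        self.cache = cache
        self.offline = offline

    def handle_request(self, request):
        cached = self.cache.get(request)
        if self.offline:
            if cached is None:
                # Offline mode can only serve requests it has already stored.
                raise CacheMissInOfflineMode(str(request.url))
            return cached
        # Online path: go to the network; on a miss, invite the cache to store the
        # response (which, per the design below, only the in-memory layer accepts).
        ...
```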
The cache always has one in-memory dict-like transient cache (might be an LFUCache or the exponential-decay cache we use now). The default configuration of the Tiled client should include this so that latency-eating metadata responses get cached in memory by default. There should be an upper bound on the size of a given item, so that larger data payloads do not eat the whole budget. The default capacity of the in-memory cache should be 100 MB or so.
It may have additional persistent caches. On read, each will be checked in turn. If the request is refreshed, the cache it came from will be invited to update its contents. (If it is read-only it may ignore this invitation.)
When there is a cache miss, only the in-memory cache will be invited to store the response. Unlike the current implementation, the persistent caches will not implicitly stash responses. They will operate more like repositories than web-browser caches, with data only saved when the user explicitly calls download(). This is in line with how they are typically used in practice: not for perusing but for deliberately downloading data for faster local access or offline access. Perhaps we could later add a switch to opt in to auto-syncing everything that hits the in-memory cache into one of the persistent caches.
class Cache:
    def __init__(self, in_memory_cache, persistent_caches=None):
        ...


class SQLiteCache:
    def __init__(self, path, read_only=None):
        ...

    @classmethod
    def new(cls, path, capacity=None, read_only=False):
        # CREATE TABLE ...
        # Set default read_only boolean in the file.
        return cls(path, read_only)

    def set(self, request, response, force_write=False):
        ...

    def get(self, request):
        ...

    def set_read_only(self, bool_):
        # Update the file's default setting and the mode on this instance.
        ...

    def set_capacity(self, capacity, target_fraction=None):
        # If this is an increase, it always succeeds.
        # If this is a decrease and the cache is already larger than the new capacity,
        # and target_fraction is None, raise.
        # If target_fraction is not None, prune_largest until the cache usage is
        # below target_fraction of the new capacity.
        ...

    def prune_largest(self, fraction):
        ...

    def prune_least_recently_used(self, fraction):
        ...

    def prune_random(self, fraction):
        ...

    def prune_not_touched_since(self, datetime):
        ...
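To make the read, refresh, and miss rules above concrete, here is one hypothetical way the Cache object could fan out. Method names and the refresh bookkeeping are placeholders, not settled API:

```python
class Cache:
    def __init__(self, in_memory_cache, persistent_caches=None):
        self.in_memory_cache = in_memory_cache
        self.persistent_caches = persistent_caches or []

    def get(self, request):
        # Check the transient in-memory cache first, then each persistent cache in turn.
        for cache in [self.in_memory_cache, *self.persistent_caches]:
            response = cache.get(request)
            if response is not None:
                return response
        return None

    def on_miss(self, request, response):
        # Plain cache misses only populate the in-memory cache; persistent caches
        # store data only when the user explicitly calls download().
        self.in_memory_cache.set(request, response)

    def on_refresh(self, request, response, source_cache):
        # The cache the stale entry came from is invited to update its contents.
        # A read-only cache may decline.
        source_cache.set(request, response)
```

And a hypothetical end-to-end usage, assuming the client entry point grows a cache= argument (not existing API) and using cachetools for the in-memory layer:

```python
from cachetools import LFUCache
from tiled.client import from_uri

# Path and capacity per the notes above; ~ expansion is not handled in this sketch.
persistent = SQLiteCache.new("~/.cache/tiled/cache.db", capacity=500_000_000)
cache = Cache(
    in_memory_cache=LFUCache(maxsize=1024),  # a real implementation would bound by bytes
    persistent_caches=[persistent],
)
client = from_uri("https://tiled.example.com/api", cache=cache)
client["experiment"]["image"].download()  # explicitly stash into the persistent cache
client["experiment"]["image"].evict()     # and remove it again
```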
An interesting future possibility: run a tiled server backed by these caches.
The above proposal was trying to combine two use cases:
- Deliberate download for offline use
- Passive automatic caching (like a web browser cache).
This became very complex. Also, "offline mode" implemented this way could only handle requests exactly like ones it had seen before. It demoed well, but in practice it was easy to bump into its limitations.
Now, we have a clear path for handling the first of these (deliberate download for offline use) with a local SQLite database and files. This would give a full-featured offline mode: #473. The user experience is stronger in every respect.
And so it seems we can make the cache way, way simpler and opinionated. It will be untangled from "offline mode". It can implement an LRU cache in SQLite, with each row storing
- a cache key
- headers, as JSON
- the raw (not decompressed) response body
- the response body size in bytes (Or: just put an index on the `Content-Length` key in the header column? Something tells me we want a separate column though…)
- time created
- time last accessed
We can rely on SQLite itself to tier the storage between an in-memory page cache and disk. The only configuration we need to expose is capacity (max size in bytes) and filepath. If users want to do fancier things (cull by age, cull randomly) they can do it directly with SQLite, and we can always add Python API later if it is really needed.
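As a concrete sketch of that layout (table and column names are not final), the schema and an LRU-style eviction pass might look like this:

```python
import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS responses (
        cache_key TEXT PRIMARY KEY,
        headers TEXT,              -- JSON
        body BLOB,                 -- raw (still-compressed) response body
        size INTEGER,              -- body size in bytes
        time_created REAL,
        time_last_accessed REAL
    )
    """
)

# LRU eviction: when total size exceeds the capacity, delete least-recently-used
# rows until we are back under budget.
capacity = 500_000_000
(total,) = conn.execute("SELECT COALESCE(SUM(size), 0) FROM responses").fetchone()
while total > capacity:
    key, size = conn.execute(
        "SELECT cache_key, size FROM responses ORDER BY time_last_accessed ASC LIMIT 1"
    ).fetchone()
    conn.execute("DELETE FROM responses WHERE cache_key = ?", (key,))
    total -= size
conn.commit()
```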
It probably also makes sense to make it possible to attach the cache in a read-only mode where it would:
- not refresh or add any content
- not update the time last accessed
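SQLite itself makes the first part easy: open the file with a read-only URI, and skip the `time_last_accessed` bookkeeping. A minimal sketch, reusing the hypothetical `responses` table from above:

```python
import sqlite3


def connect(path, read_only=False):
    if read_only:
        # SQLite enforces read-only at the connection level; writes raise OperationalError.
        return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    return sqlite3.connect(path)


def get(conn, cache_key, read_only=False):
    row = conn.execute(
        "SELECT headers, body FROM responses WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    if row is not None and not read_only:
        # Only touch the LRU bookkeeping when the cache is writable.
        conn.execute(
            "UPDATE responses SET time_last_accessed = strftime('%s','now') WHERE cache_key = ?",
            (cache_key,),
        )
        conn.commit()
    return row
```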
Thoughts from chat with @tacaswell:
- 500 MB is probably a good, conservative default capacity
- Max 500 KB per item (large items are not cached)