HTTP caching
Issue type:
- :heavy_plus_sign: Feature request
Description:
We should add support for proper client-side caching of HTTP resources.
We should check both the Expires, Cache-Control, ETag and Last-Modified headers, and appropriately send requests with If-None-Match/If-Modified-Since if needed.
We should make the cache actor work on the http bus.
There may be npm packages available that can help us out.
This article contains some useful information on caching: https://link.springer.com/chapter/10.1007/978-3-319-18818-8_18
More relevant papers:
- Bahn, Hyokyung, et al. "Efficient replacement of nonuniform objects in web caches." Computer 35.6 (2002): 65-73.
- Jin, Shudong, and Azer Bestavros. "GreedyDual∗ Web caching algorithm: exploiting the two sources of temporal locality in Web request streams." Computer Communications 24.2 (2001): 174-183.
- Jin, Shudong, and Azer Bestavros. "Popularity-aware greedy dual-size web proxy caching algorithms." Proceedings 20th IEEE International Conference on Distributed Computing Systems. IEEE, 2000.
Implementation suggestions:
- A new package:
actor-http-cacheon the http bus - This new actor should be able to use a
Cachefrom the context (interface is available in TS when using dom types). - A simple default LRUCache-based
Cacheimplementation should be set here if the user provided none in the context: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L110 It can be implemented using https://www.npmjs.com/package/http-cache-semantics - This actor should invoke the HTTP invalidation bus when cache entries become invalidation, to ensure that other parts of the query engine that cache things will be notified. This can be done by adding a
mediatorHttpInvalidateto this actor, and invoking it like here: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L301 - This actor should also listen to manual HTTP invalidations via an HTTP invalidate observer, just like here: https://github.com/comunica/comunica/blob/master/packages/actor-rdf-resolve-quad-pattern-hypermedia/lib/ActorRdfResolveQuadPatternHypermedia.ts#L42-L44
Things that become possible after this issue has been resolved:
- The hardcoded cache store and cache semantics can be moved to a bus, with corresponding default actors. We should still allow values to be set for them via the context, in which case the bus should not be invoked.
Bounty
A bounty has been placed on this issue by:
| €544 |
Click here to learn more if you're interested in claiming this bounty by resolving this issue.
I believe we should rely on https://www.npmjs.com/package/http-cache-semantics
The cache probably should not be mediated, but it should be statically configured: e.g., if you have a redis cache, you will need a config file to define how to connect to this, or maybe you’d rather want to use something like LevelDB or cacache. As long as somethin adheres to https://developer.mozilla.org/en-US/docs/Web/API/Cache, I would allow it through a kind of strategy pattern that reads a config. What do you think?
That package looks like it might do the trick.
The cache probably should not be mediated, but it should be statically configured
If with "not mediated", you mean that it should not be on a separate "caching" bus, then I agree :-) But it should still listen on the HTTP bus IMO.
We could have this caching actor observe all HTTP calls passively (which will cache all performed HTTP requests), and also actively be registered on the HTTP bus (for replying with results from cache).
a kind of strategy pattern that reads a config
Yes, that sounds good. The caching storage strategy (redis, LevelDB, ...) could be defined in the config file via a parameter. We'd then need separate caching storage components (not necessarily actors (I think)) as well.
To consider: https://github.com/solid/web-access-control-spec/pull/67
As discussed with @rubensworks, I will work on this issue via the Comunica Association.
I've encountered a few problems with this task:
A simple default LRUCache-based Cache implementation should be set here if the user provided none in the context: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L110 It can be implemented using https://www.npmjs.com/package/http-cache-semantics
The http-cache-semantics library is primarily concerned with the "cache-control" header and the web-cache interface doesn't care about that (for example see https://w3c.github.io/ServiceWorker/#cache-addAll the only thing it cares about is the status of the response and whether or not its http)
Instead of using the http-cache-semantics library, I could just program the Cache interface the way the specification tells me to, but I don't think that's what we're going for.
Maybe abandoning the goal to build a node version of the Cache interface is the best way to proceed. Instead we can build a cache around http-cache-semantics that doesn't follow the web-cache interface.
So, perhaps the best solution is to write a generic interface that does consider these things. I could keep with the Web-Cache interface, but in that case, we shouldn't call it a "Web-Cache" since it doesn't follow the spec. Maybe we could call it http-cache-lru
Ok, let's get rid of add and addAll. But I still think reusing (part of) the Cache interface is valuable, even if we only implement parts of it.
My suggestion is to only implement match, (perhaps matchAll?) put, delete, and keys, and throw a NotImplementedError on all other methods.
And we don't have to follow the exact steps that are prescribed in the Web Cache spec.
Within Comunica, this cache would probably only even use match (for reading from cache), put (for caching requests that have been done by another HTTP actor) and delete (for manually invalidating requests, possibly with the help of keys).
So the general flow of the caching actor would be:
const cachedResponse = await cache.match(originalRequestAction);
if (cachedResponse)
return cachedResponse;
const response = await mediatorHttp.mediate(originalRequestAction);
const [ clonedResponse1, clonedResponse2 ] = someMethodToCloneResponse(response);
await cache.put(originalRequestAction, cloneResponse1);
return clonedResponse2;
Within the match and put, we would still be free to implement support E-Tags and anything we like.
So, perhaps the best solution is to write a generic interface that does consider these things. I could keep with the Web-Cache interface, but in that case, we shouldn't call it a "Web-Cache" since it doesn't follow the spec. Maybe we could call it http-cache-lru
👍
We won't say we follow the spec, but we can keep using the same interface for convenience.
Okay awesome, @rubensworks! That was easy. I've completed the library here: https://github.com/o-development/web-cache-lru/settings. What are the next steps to getting it under Comunica umbrella?
@jaxoncreed Nice!
What are the next steps to getting it under Comunica umbrella?
I would suggest just putting it inside the Comunica monorepo as a new package under packages/.
This would make it easier to tweak things to it in case we notice new requirements when using it from within @comunica/actor-http-cache.
So starting your own fork, and creating a PR with it in it might the the easiest path forward.
Sorry to keep coming back to this point. But I think there's a problem with this:
A simple default LRUCache-based Cache implementation should be set here if the user provided none in the context:
I don't think we should be asking users to provide a Cache in context if our library is expecting a Cache that doesn't perform in a spec compliant way.
Instead I think we should have a simple key value interface that gets set in context. Then we'll have comunica take care of the cache semantics. That way you could pass in a redis, or localstore, or any kind of key value interface as a cache without having to worry about re-implementing cache semantics.
@jaxoncreed Indeed, good point.
I would suggest creating a minimal copy of the Cache interface (with just match, (perhaps matchAll?) put, delete, and keys), and add this to @comunica/types.
This way, we stay aligned with the Web cache interface, even though we don't strictly implement it.
Does this sound ok to you?
Not sure yet about the new interface name, perhaps just WebCache or FetchCache?
Edit: I think you were also referring to decoupling the cache store from the cache semantics. So maybe we want to have two new interfaces: CacheStore and CacheSemantics
I implemented the first suggestion, but now I think we shouldn't be aligned with Web Cache interface at all. What's the point of aligning with something at an interface level if you're not going to implement it at a spec level? If someone passes in a spec-compliant Cache, comunica will break because the spec-compliant Cache doesn't care about http cache semantics.
By allowing developers to pass in a WebCache, we are communicating to them that it's fine of pass in a spec-compliant WebCache, when it is not fine. It's confusing to developers and causes code like this:

I really think the best way forward is to not use the Cache interface at all. The CacheStore should just be a simple key value store.
The CacheStore should just be a simple key value store.
Ok, go ahead!
Will you also be decoupling the cache store from the cache semantics?
Yep, I think decoupling them makes sense.
@jaxoncreed I'll unassign you from this bounty. Please let me know if you'd like to come back to it.
Thanks, sorry this was never completed.