comunica icon indicating copy to clipboard operation
comunica copied to clipboard

HTTP caching

Open rubensworks opened this issue 7 years ago • 25 comments

Issue type:

  • :heavy_plus_sign: Feature request

Description:

We should add support for proper client-side caching of HTTP resources. We should check both the Expires, Cache-Control, ETag and Last-Modified headers, and appropriately send requests with If-None-Match/If-Modified-Since if needed.

We should make the cache actor work on the http bus.

There may be npm packages available that can help us out.

This article contains some useful information on caching: https://link.springer.com/chapter/10.1007/978-3-319-18818-8_18

More relevant papers:

  • Bahn, Hyokyung, et al. "Efficient replacement of nonuniform objects in web caches." Computer 35.6 (2002): 65-73.
  • Jin, Shudong, and Azer Bestavros. "GreedyDual∗ Web caching algorithm: exploiting the two sources of temporal locality in Web request streams." Computer Communications 24.2 (2001): 174-183.
  • Jin, Shudong, and Azer Bestavros. "Popularity-aware greedy dual-size web proxy caching algorithms." Proceedings 20th IEEE International Conference on Distributed Computing Systems. IEEE, 2000.

Implementation suggestions:

  • A new package: actor-http-cache on the http bus
  • This new actor should be able to use a Cache from the context (interface is available in TS when using dom types).
  • A simple default LRUCache-based Cache implementation should be set here if the user provided none in the context: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L110 It can be implemented using https://www.npmjs.com/package/http-cache-semantics
  • This actor should invoke the HTTP invalidation bus when cache entries become invalidation, to ensure that other parts of the query engine that cache things will be notified. This can be done by adding a mediatorHttpInvalidate to this actor, and invoking it like here: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L301
  • This actor should also listen to manual HTTP invalidations via an HTTP invalidate observer, just like here: https://github.com/comunica/comunica/blob/master/packages/actor-rdf-resolve-quad-pattern-hypermedia/lib/ActorRdfResolveQuadPatternHypermedia.ts#L42-L44

Things that become possible after this issue has been resolved:

  • The hardcoded cache store and cache semantics can be moved to a bus, with corresponding default actors. We should still allow values to be set for them via the context, in which case the bus should not be invoked.

Bounty

A bounty has been placed on this issue by:

Comunica Association
€544

Click here to learn more if you're interested in claiming this bounty by resolving this issue.

rubensworks avatar Jul 03 '18 10:07 rubensworks

I believe we should rely on https://www.npmjs.com/package/http-cache-semantics

The cache probably should not be mediated, but it should be statically configured: e.g., if you have a redis cache, you will need a config file to define how to connect to this, or maybe you’d rather want to use something like LevelDB or cacache. As long as somethin adheres to https://developer.mozilla.org/en-US/docs/Web/API/Cache, I would allow it through a kind of strategy pattern that reads a config. What do you think?

pietercolpaert avatar Oct 07 '18 11:10 pietercolpaert

That package looks like it might do the trick.

The cache probably should not be mediated, but it should be statically configured

If with "not mediated", you mean that it should not be on a separate "caching" bus, then I agree :-) But it should still listen on the HTTP bus IMO.

We could have this caching actor observe all HTTP calls passively (which will cache all performed HTTP requests), and also actively be registered on the HTTP bus (for replying with results from cache).

a kind of strategy pattern that reads a config

Yes, that sounds good. The caching storage strategy (redis, LevelDB, ...) could be defined in the config file via a parameter. We'd then need separate caching storage components (not necessarily actors (I think)) as well.

rubensworks avatar Oct 08 '18 14:10 rubensworks

To consider: https://github.com/solid/web-access-control-spec/pull/67

rubensworks avatar Aug 28 '19 12:08 rubensworks

As discussed with @rubensworks, I will work on this issue via the Comunica Association.

jaxoncreed avatar May 23 '22 06:05 jaxoncreed

I've encountered a few problems with this task:

A simple default LRUCache-based Cache implementation should be set here if the user provided none in the context: https://github.com/comunica/comunica/blob/master/packages/actor-init-query/lib/QueryEngineBase.ts#L110 It can be implemented using https://www.npmjs.com/package/http-cache-semantics

The http-cache-semantics library is primarily concerned with the "cache-control" header and the web-cache interface doesn't care about that (for example see https://w3c.github.io/ServiceWorker/#cache-addAll the only thing it cares about is the status of the response and whether or not its http)

Instead of using the http-cache-semantics library, I could just program the Cache interface the way the specification tells me to, but I don't think that's what we're going for.

Maybe abandoning the goal to build a node version of the Cache interface is the best way to proceed. Instead we can build a cache around http-cache-semantics that doesn't follow the web-cache interface.

So, perhaps the best solution is to write a generic interface that does consider these things. I could keep with the Web-Cache interface, but in that case, we shouldn't call it a "Web-Cache" since it doesn't follow the spec. Maybe we could call it http-cache-lru

jaxoncreed avatar Jun 02 '22 15:06 jaxoncreed

Ok, let's get rid of add and addAll. But I still think reusing (part of) the Cache interface is valuable, even if we only implement parts of it.

My suggestion is to only implement match, (perhaps matchAll?) put, delete, and keys, and throw a NotImplementedError on all other methods. And we don't have to follow the exact steps that are prescribed in the Web Cache spec. Within Comunica, this cache would probably only even use match (for reading from cache), put (for caching requests that have been done by another HTTP actor) and delete (for manually invalidating requests, possibly with the help of keys).

So the general flow of the caching actor would be:

const cachedResponse = await cache.match(originalRequestAction);
if (cachedResponse)
  return cachedResponse;

const response = await mediatorHttp.mediate(originalRequestAction);
const [ clonedResponse1, clonedResponse2 ] = someMethodToCloneResponse(response);
await cache.put(originalRequestAction, cloneResponse1);
return clonedResponse2;

Within the match and put, we would still be free to implement support E-Tags and anything we like.

So, perhaps the best solution is to write a generic interface that does consider these things. I could keep with the Web-Cache interface, but in that case, we shouldn't call it a "Web-Cache" since it doesn't follow the spec. Maybe we could call it http-cache-lru

👍

We won't say we follow the spec, but we can keep using the same interface for convenience.

rubensworks avatar Jun 03 '22 06:06 rubensworks

Okay awesome, @rubensworks! That was easy. I've completed the library here: https://github.com/o-development/web-cache-lru/settings. What are the next steps to getting it under Comunica umbrella?

jaxoncreed avatar Jun 04 '22 20:06 jaxoncreed

@jaxoncreed Nice!

What are the next steps to getting it under Comunica umbrella?

I would suggest just putting it inside the Comunica monorepo as a new package under packages/. This would make it easier to tweak things to it in case we notice new requirements when using it from within @comunica/actor-http-cache. So starting your own fork, and creating a PR with it in it might the the easiest path forward.

rubensworks avatar Jun 07 '22 06:06 rubensworks

Sorry to keep coming back to this point. But I think there's a problem with this:

A simple default LRUCache-based Cache implementation should be set here if the user provided none in the context:

I don't think we should be asking users to provide a Cache in context if our library is expecting a Cache that doesn't perform in a spec compliant way.

Instead I think we should have a simple key value interface that gets set in context. Then we'll have comunica take care of the cache semantics. That way you could pass in a redis, or localstore, or any kind of key value interface as a cache without having to worry about re-implementing cache semantics.

jaxoncreed avatar Jun 09 '22 17:06 jaxoncreed

@jaxoncreed Indeed, good point.

I would suggest creating a minimal copy of the Cache interface (with just match, (perhaps matchAll?) put, delete, and keys), and add this to @comunica/types. This way, we stay aligned with the Web cache interface, even though we don't strictly implement it. Does this sound ok to you?

Not sure yet about the new interface name, perhaps just WebCache or FetchCache?

Edit: I think you were also referring to decoupling the cache store from the cache semantics. So maybe we want to have two new interfaces: CacheStore and CacheSemantics

rubensworks avatar Jun 09 '22 17:06 rubensworks

I implemented the first suggestion, but now I think we shouldn't be aligned with Web Cache interface at all. What's the point of aligning with something at an interface level if you're not going to implement it at a spec level? If someone passes in a spec-compliant Cache, comunica will break because the spec-compliant Cache doesn't care about http cache semantics.

By allowing developers to pass in a WebCache, we are communicating to them that it's fine of pass in a spec-compliant WebCache, when it is not fine. It's confusing to developers and causes code like this:

image

I really think the best way forward is to not use the Cache interface at all. The CacheStore should just be a simple key value store.

jaxoncreed avatar Jun 09 '22 18:06 jaxoncreed

The CacheStore should just be a simple key value store.

Ok, go ahead!

Will you also be decoupling the cache store from the cache semantics?

rubensworks avatar Jun 10 '22 05:06 rubensworks

Yep, I think decoupling them makes sense.

jaxoncreed avatar Jun 14 '22 01:06 jaxoncreed

@jaxoncreed I'll unassign you from this bounty. Please let me know if you'd like to come back to it.

rubensworks avatar Apr 19 '24 13:04 rubensworks

Thanks, sorry this was never completed.

jaxoncreed avatar May 03 '24 07:05 jaxoncreed