undici icon indicating copy to clipboard operation
undici copied to clipboard

Implement HTTP caching

Open jeswr opened this issue 1 year ago • 11 comments

This would solve...

Performance and robustness issues associated with fetching the same HTTP resource multiple times.

The implementation should look like...

Use an interceptor(?) to implement http caching using the Cache-Control header. I'm inclined to think that the Expires header should not be implemented as http 1.0 has extremely limited use.

Additional context

Originally raised in https://github.com/nodejs/undici/issues/3221.

jeswr avatar May 09 '24 12:05 jeswr

I've looked into implementing this as an interceptor, and need some information in order to proceed:

  • Is there existing code for caching and cloning responses in the code base that can be used (for this I was expecting to be able to use logic from the web cache which seems to be implementing the Service Worker Spec)
  • Is rfc7234 implemented anywhere in the codebase (I cannot find any indication that it is). http-cache-semantics seems to be a good reference implementation of it if not, however it is under the BSD-2-Clause so @kornelski would need to give permission in order for code to be re-used in this library without copying the BSD licence over.

jeswr avatar May 09 '24 14:05 jeswr

Is there existing code for caching and cloning responses in the code base that can be used (for this I was expecting to be able to use logic from the web cache which seems to be implementing the Service Worker Spec)

Can you elaborate on what you have in mind?

Is rfc7234 implemented anywhere in the codebase (I cannot find any indication that it is). http-cache-semantics seems to be a good reference implementation of it if not, however, it is under the BSD-2-Clause so @kornelski would need to permit for code to be re-used in this library without copying the BSD licence over.

The closest implementation will be the fetch one but I don't believe it implements its full RFC-7234 spec, so this will be a new implementation for the interceptor.

metcoder95 avatar May 09 '24 20:05 metcoder95

I have somehow interest to implement it...

Uzlopak avatar May 09 '24 20:05 Uzlopak

Can you elaborate on what you have in mind?

I was trying to work out the easiest way of caching the response object in-memory would be given that I'm assuming it would be coming through with the body as a readable-stream that needs to be cloned - I was assuming that the Web Cache would already need to implement such behavior. I'll step out of discussions of implementation details now if someone else is willing to take over.

I have somehow interest to implement it...

@Uzlopak I'm very time poor so I'm more than happy to step back on my side! Note you'll probably want to implement rfc9111 which appears to obsolete rfc7234.

jeswr avatar May 09 '24 21:05 jeswr

I am also not very time "rich", but I already investigated some hours in reading into rfc 9111.

The only thing which was not clear to me, is how we ensure that we dont open a security issue :D

What I mean is:

In a browser context you have one user and fetch calls can be cached, as the cache is user scoped. In node we dont have users. So a http cache would be globally activated (?) in nodejs.

So we need to have something like setGlobalDispatcher but for caching? setGlobalHttpCache?

Uzlopak avatar May 09 '24 22:05 Uzlopak

I would start having an interceptor or Agent that implement the caching protocol of HTTP, and then worry about how we enable it all the time.

I think it's possible to have it always enabled and not cache any private data: that's how cdn works after all. Before worrying about this, we need the implementation ;).

mcollina avatar May 10 '24 07:05 mcollina

Ok, i will start tonight. ;)

Uzlopak avatar May 10 '24 12:05 Uzlopak

I've implemented http-cache-semantics by reading the RFC start to finish, and translating each paragraph into code.

I highly recommend that approach, because it results in good comments with references to their RFC sections (I wish I kept more of them in my code), and helps ensure that unobvious behaviors are not missed.

kornelski avatar May 10 '24 15:05 kornelski

Usually when I implement an RFC or any other spec I print it out. Then like you say, use the text for commenting the code. When I implemented a paragraph I mark it in the print out that it was implemented, too. Yellow marker for implemented, pink marker for has an issue and should be checked again. When the whole text is colored, then I know the spec is implemented completely ;).

Uzlopak avatar May 10 '24 20:05 Uzlopak

Any updates on this?

jeswr avatar Jun 26 '24 14:06 jeswr

@Uzlopak?

metcoder95 avatar Jun 30 '24 09:06 metcoder95

ping

metcoder95 avatar Aug 16 '24 09:08 metcoder95

Hey all, I can start taking a look at implementing this this week. Need to get up to speed on rfc9111 first, but ideally would be able to get a conversation started on some implementation details by end of next week

Also, related: #2760 #2256 #1146

cc @mcollina

flakey5 avatar Aug 18 '24 22:08 flakey5

We have already been working a bit on this. https://github.com/nxtedition/nxt-undici/pull/3

Maybe should sync our efforts?

ronag avatar Aug 19 '24 18:08 ronag

Maybe should sync our efforts?

That sgtm, how much progress have y'all made on it?

flakey5 avatar Aug 19 '24 21:08 flakey5

Basing off of some ideas discussed here as well as ones in nxtedition/nxt-undici, what do y'all think of this api?

interface CacheInterceptorGlobalOptions {
  // Function for creating the cache key
  makeKey?: (opts: Dispatcher.RequestOptions) => string;
  store?: CacheStore;
}

// Interface responsible for storing the cached responses.
interface CacheStore {
  get(key: string) => Promise<string>;
  put(opts: CachePutOptions) => Promise<void>;
}

type CachePutOptions = {
  key: string;
  value: string;
  size: number;
  vary?: string;
  expires?: number;
  // Subject to change depending on implementation specifics, but the idea is to
  //  provide the cache store with all the info we're given in the cache control
  //  directives
};

Example usage

const { Client } = require('undici')
const cacheInterceptor = require('../lib/interceptor/cache')

const client = new Client('https://google.com')
  .compose(cacheInterceptor({
    makeKey: (opts) => `${opts.origin}:${opts.method}:${opts.path}`,
    store: {
      get: async (key) => {/* ... */},
      set: async (opts) => {/* ... */}
    }
  }))

client.request({ path: '/', method: 'GET' }).then(/* ... */);

Defaults

makeKey and store are optional since I think we should have builtin defaults.

The spec recommends that the cache key is at least the request's uri and method in Section 2.

For the default cache store, I think could use node:sqlite like in nxtedition/nxt-undici#4 with a schema like

CREATE TABLE IF NOT EXISTS cacheInterceptor(
  key TEXT PRIMARY KEY NOT NULL,
  value TEXT NOT NULL,
  vary TEXT,
  size INTEGER,
  expires INTEGER
  -- Subject to change depending on implementation specifics
) STRICT;

CREATE INDEX IF NOT EXISTS idxCacheInterceptorExpires ON cacheInterceptor(expires);

For purging old responses, we can go with the same approach in nxtedition/nxt-undici#4 and purge them in the set call (see here).

flakey5 avatar Aug 24 '24 18:08 flakey5

Why can't this be simplified for something like:

interface CacheInterceptorGlobalOptions {
  // Function for creating the cache key
  store?: CacheStore;
}

// Interface responsible for storing the cached responses.
interface CacheStore {
  get(key: Dispatcher.RequestOptions) => Promise<??>;
  put(key: Dispatcher.RequestOptions, opts: CachePutOptions) => Promise<void>;
}

type CachePutOptions = {
  value: string;
  size: number;
  vary?: string;
  expires?: number;
  // Subject to change depending on implementation specifics, but the idea is to
  //  provide the cache store with all the info we're given in the cache control
  //  directives
};

I'm unsure we could always streamline the key to a string without overhead

mcollina avatar Aug 26 '24 05:08 mcollina

I think key can always be {method}:{path}? Unless interacting with a buggy server that doesn't set proper vary headers.

ronag avatar Aug 26 '24 05:08 ronag

I think key can always be {method}:{path}? Unless interacting with a buggy server that doesn't set proper vary headers.

It probably will be, but imo it'd still be good to allow it to be overrided if there's some use case that calls for it. It's definitely not needed right now though and can always be added later

flakey5 avatar Aug 26 '24 06:08 flakey5

Depending on the scope of the Cache, we might need to also add origin to the mix (as recommended by spec as well).

metcoder95 avatar Aug 27 '24 06:08 metcoder95

Depending on the scope of the Cache, we might need to also add origin to the mix (as recommended by spec as well).

I thought in this case the server should respond with a vary header containing host. Adding origin to the key will cause problems (at least for me). Where in the spec does it say this?

ronag avatar Aug 27 '24 07:08 ronag

I still didn't get an answer why we need to have a cache key as a string. I think we should keep it maximally flexible and pass in the request options.

mcollina avatar Aug 27 '24 08:08 mcollina

I think we should keep it maximally flexible and pass in the request options.

That won't work though? You can have different request options that have the same result. That's why the spec says you should set the key to method + path and then use the vary header to determine what in the request headers is relevant.

ronag avatar Aug 27 '24 09:08 ronag

RFC 9111 Section 2 (https://www.rfc-editor.org/rfc/rfc9111.html#section-2) says

The "cache key" is the information a cache uses to choose a response and is composed from, at a minimum, the request method and target URI used to retrieve the stored response

I can see no mention of ‘method+path’; I would infer the URI must be absolute, as different hosts could provide unrelated responses with the same path. Am I missing something?

I thought the ‘vary’ field cannot have a value which includes the host? According to https://httpwg.org/specs/rfc9110.html#field.vary,

The "Vary" header field in a response describes what parts of a request message, aside from the method and target URI, might have influenced the origin server's process for selecting the content of this response.

chrisveness avatar Aug 27 '24 09:08 chrisveness

method+path === "the request method and target URI"

?!

Uzlopak avatar Aug 27 '24 09:08 Uzlopak

https://www.ibm.com/docs/en/cics-ts/6.x?topic=concepts-components-url path = the specific resource in the host

chrisveness avatar Aug 27 '24 10:08 chrisveness

as different hosts could provide unrelated responses with the same path. Am I missing something?

If different host can provide unrelated responses then they should reply with Vary: host.

I thought the ‘vary’ field cannot have a value which includes the host? According to https://httpwg.org/specs/rfc9110.html#field.vary,

That doesn't say anything about not including host header.

ronag avatar Aug 27 '24 10:08 ronag

Cache is always split per URL, and Vary only adds additional granularity within the single URL.

(Cache is technically also split per method, but PUT/POST invalidates cached GET, so in practice you can just cache GET)

Varying by Host is a basic guarantee since RFC2616, and never needs any additional reinforcement. I'm not sure if it's forbidden in Vary, but it's certainly useless there.

kornelski avatar Aug 27 '24 10:08 kornelski

Cache is always split per URL, and Vary only adds additional granularity within the single URL.

(Cache is technically also split per method, but PUT/POST invalidates cached GET, so in practice you can just cache GET)

Varying by Host is a basic guarantee since RFC2616, and never needs any additional reinforcement. I'm not sure if it's forbidden in Vary, but it's certainly useless there.

Any chance you could refer to the relevant parts in the spec?

ronag avatar Aug 27 '24 11:08 ronag

I still didn't get an answer why we need to have a cache key as a string. I think we should keep it maximally flexible and pass in the request options.

+1

That won't work though? You can have different request options that have the same result. That's why the spec says you should set the key to method + path and then use the vary header to determine what in the request headers is relevant.

It allows for each cache store to have its own makeKey (or whatever works best for it). The store gets the full context for a request, can make its own cache key based off of it using what it cares about, and then use it for lookups/puts

flakey5 avatar Aug 27 '24 19:08 flakey5