http-extensions

Alt-Svc and multi-CDN

Open mnot opened this issue 4 years ago • 13 comments

An Alternative Service is bound to a location for a specified period of time; with the header field, this is the ma parameter.

That works reasonably well when the alternative service is relatively stable. However, it causes a problem when the origin uses DNS to distribute traffic to many servers -- e.g., when a server is using more than one CDN for its traffic (so-called 'multi-CDN', which has become common for large sites).

In these cases, there's a strong motivation to give the alternative a long-ish lifetime, especially when alt-svc is being used to advertise QUIC. However, multi-CDN typically uses fairly short DNS TTLs, which leads to a client who has been balanced to a new CDN applying alt-svc policy for the old one.

In the best case here, multi-CDN doesn't distribute traffic in the manner that the site wishes. In the worst, requests go to the wrong node and they 421 or even show a user-visible error.

While the HTTPS RR might address some of these issues, it would be good to clean up the Alt-Svc mechanism too. One possible way forward would be to advise that alt-svc information is scoped by the resolution of the origin name -- when it changes, the cache should be dropped.
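A minimal sketch of that scoping idea, assuming a client that can see the addresses the origin name currently resolves to. The class and method names are hypothetical, and treating any difference in the address set as "changed" is an assumption; the proposal does not say how a change would be detected.

```python
import time
from dataclasses import dataclass


@dataclass
class AltSvcEntry:
    alternative: str      # e.g. 'h3=":443"'
    expires_at: float     # absolute expiry derived from ma
    resolved: frozenset   # addresses the origin name resolved to when learned


class ScopedAltSvcCache:
    """Hypothetical Alt-Svc cache scoped by the origin name's resolution."""

    def __init__(self):
        self._entries = {}

    def store(self, origin, alternative, ma, resolved, now=None):
        now = time.time() if now is None else now
        self._entries[origin] = AltSvcEntry(
            alternative, now + ma, frozenset(resolved)
        )

    def lookup(self, origin, current_resolution, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(origin)
        if entry is None:
            return None
        expired = now >= entry.expires_at
        # Assumption: any change in the resolved address set counts as a change.
        moved = frozenset(current_resolution) != entry.resolved
        if expired or moved:
            del self._entries[origin]
            return None
        return entry.alternative
```

With this shape, an entry learned with a long ma is still discarded as soon as DNS steers the client to a different set of addresses.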

mnot avatar Sep 15 '21 05:09 mnot

So you would perform a DNS query for the origin name and only proceed if it were the same as it was on the connection on which you learned Alt-Svc? Or the last time you connected to that origin (under any service)?

Is this just an extension of the rule that says if your environment changes, drop the cache?

martinthomson avatar Sep 15 '21 07:09 martinthomson

Yes, this is a problem; I'm not sure this is the path to solve it. The resolution could easily wind up with a different IP later simply due to mapping changes or load balancing even within the same CDN.

MikeBishop avatar Sep 15 '21 13:09 MikeBishop

Is this something that can be addressed by refining the "persist" flag (or really something that is nearly the opposite of it)? On the "persist" end of things, changes in all aspects of topology/configuration are ignored. This is asking clients to be more sensitive to changes.

martinthomson avatar Sep 16 '21 00:09 martinthomson

That's an interesting thought, particularly for the clients who say they can't realistically implement "persist". At the least, this should be mentioned as a way the flag could be implemented if you don't have visibility into topology changes.

MikeBishop avatar Sep 16 '21 15:09 MikeBishop

I am not sure how clients can detect topology change. I am worried that it is hard for a client to detect it. Would something like this be possible:

  • Look at the cache and choose an Alt-Svc entry. The entry has a parameter "valid-for-ip-range".
  • Do a DNS lookup for the Alt-Svc origin to find its IP address. If the IP address is not in the range, drop the Alt-Svc entry.
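The two steps above could be sketched roughly as follows. "valid-for-ip-range" is only the parameter floated here, not an existing Alt-Svc parameter; the function name is hypothetical, and requiring every resolved address to fall inside a range is an assumption of this sketch.

```python
import ipaddress


def altsvc_still_valid(valid_ranges, resolved_addresses):
    """Return True if every resolved address falls inside an allowed range.

    valid_ranges: CIDR strings taken from the hypothetical
        "valid-for-ip-range" parameter of the cached entry.
    resolved_addresses: address strings from a fresh DNS lookup.
    """
    networks = [ipaddress.ip_network(r) for r in valid_ranges]
    addrs = [ipaddress.ip_address(a) for a in resolved_addresses]
    # An empty answer gives no evidence the entry is still valid: drop it.
    return bool(addrs) and all(
        any(addr in net for net in networks) for addr in addrs
    )
```

A caller would drop the cached entry whenever this returns False.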

I know that IP address ranges are not the best solution. Would CNAME maybe work?

Note: I do not like the idea of doing a DNS query before deciding to use an Alt-Svc record, but that does not work according to this comment

ddragana avatar Oct 20 '21 11:10 ddragana

I agree that this is a problem. I think the case can be worse than @mnot describes when it comes to QUIC. Under certain circumstances, depending on client implementations, Alt-Svc caching can lead to clients attempting QUIC handshakes that might get blackholed, which could cause delays until fallbacks kick in. This wouldn't manifest directly as a user visible error but an annoying intermittent delay that could cause some diagnosis headaches. This could be treated as a failure from a performance perspective.

LPardue avatar Nov 22 '21 18:11 LPardue

Pinterest is still interested in a solution to this as it would allow more adoption without the latency regression. Referenced in our blog post.

We have made great progress in pushing CDN vendors to allow modifications to alt-svc response headers but it is pretty awkward given we really only should be allowed to modify ma since CDN vendors want to control the protocol ids (as they should).

Note: we were able to launch by setting ma to 10m but would love to be able to set for as long as a month or year.

sc0ttbeardsley avatar Feb 25 '23 03:02 sc0ttbeardsley

@sc0ttbeardsley just in case you're not tracking the HTTP WG too closely, we've been actively discussing an "Alt-Svc plan B" that might provide a solution, the latest update was given at IETF 115; see https://github.com/httpwg/wg-materials/blob/gh-pages/ietf115/alt-svc.pdf.

LPardue avatar Feb 25 '23 03:02 LPardue

Thanks for the pointer @LPardue! I will try to get more hooked in to what is happening with plan B. It looks reasonable so far, but I have some concerns about some of the details, like CNAME chains and the willingness/ability of endpoint owners to provide the necessary features to domain owners.

sc0ttbeardsley avatar Mar 01 '23 05:03 sc0ttbeardsley

@sc0ttbeardsley, where you say

endpoint owners to provide the necessary features to domain owners.

could you elucidate a bit more? It would help the WG's collective understanding if you can share your requirements or needs openly.

LPardue avatar Mar 01 '23 17:03 LPardue

@LPardue sure! wasn't sure if we should have the conversation here or elsewhere...

My assumption and understanding of "alt-svc plan b" is that endpoint operators control the dns records which are consulted by clients... for instance assuming this dns chain:

www.pinterest.com -CNAME-> pinterest.examplecdn.com -A-> [10.1.1.1, 10.2.2.2]

Assuming a config of alt-svc: h3=":443"; ma=3600, this DNS record would be consulted:

_443._https.pinterest.examplecdn.com
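To make the chain concrete, here is a small sketch that follows the CNAME and builds the name shown above. The helper names and the in-memory CNAME table are hypothetical, and the sketch only mirrors the naming in this comment; it makes no claim about which names a real client would query.

```python
def follow_cnames(name, cnames, limit=8):
    """Follow a CNAME chain in a name -> target mapping, allowing `limit` hops."""
    for _ in range(limit + 1):
        if name not in cnames:
            return name
        name = cnames[name]
    raise ValueError("CNAME chain too long")


def endpoint_record_name(target, port, scheme="https"):
    """Build a port-prefixed name such as _443._https.<target>."""
    return f"_{port}._{scheme}.{target.rstrip('.')}"


# The chain from the example: the domain owner's name points at the CDN's name.
cnames = {"www.pinterest.com": "pinterest.examplecdn.com"}
endpoint = follow_cnames("www.pinterest.com", cnames)
record = endpoint_record_name(endpoint, 443)
```

The point of the sketch is that the resulting name sits under the endpoint operator's zone, not the domain owner's.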

Endpoint operators (aka CDNs): care about things like the protocol ids and ports. Endpoint operators might also care about maximum possible value for the ma param so that they have a path to retire old protocol ids and introduce new protocol ids.

Domain owners (aka me and other CDN users): care about ma and perhaps the persist values. Domain owners need to roll out new protocols (like H3) and typically want to experiment with the ma duration to prevent clients getting into a bad state for too long. Eventually, once adoption of new protocols is complete, they might want this to be some large value to allow it the best opportunity to be used for return clients. Again, this is all in the interest of the domain owner and the endpoint owner couldn't care less.

Does "alt-svc plan b" assign control of these components of alt-svc to their respective owners? If we make everything just move to a DNS record controlled by the endpoint operators, this solves the multi-endpoint problem by allowing each endpoint operator to have distinct values independent of other endpoints, which solves the problem of load balancing across multiple vendors. However, the problem of control of the ma param is not solved and is left to endpoint operators to figure out how best to expose to the domain owners. Every CDN would need to agree to implement some configuration option to allow me (domain owner) to modify only the ma portion of the DNS answer at _443._https.pinterest.examplecdn.com. We've had trouble getting CDNs to allow us to modify the alt-svc response headers (rightfully so! we can potentially cause havoc through protocol id mismatches), so this separation of interests and outlining exactly how endpoint operators should implement this is important to us.

sc0ttbeardsley avatar Mar 02 '23 01:03 sc0ttbeardsley

Thanks Scott, having some concrete use cases from you domain owners really helps. I'm sure vendors are happy to continue talking to customers in more confidential channels where it's needed, but this seems broad enough not to require it.

To oversimplify "Alt-Svc plan B" (spec: https://www.ietf.org/archive/id/draft-thomson-httpbis-alt-svcb-00.html), roughly what we have in mind is to deprecate the Alt-Svc header and push people towards using the DNS as a better source of truth. A new header, WIP name alt-svcb, provides a prompt to the best place in the DNS to start looking.

Part of the outcome of this proposal is to get rid of caching based on ma, and instead let it be handled by DNS aspects like TTL. For example, a client could rely on the SVCB HTTPS RR to discover that HTTP/3 is available, which aligns well with DNS-based steering strategies.

Alt-Svc persist is an odd parameter in practice, so we have deprecated that. Even without it, clients will still be able to use their own local knowledge together with the advertised configuration(s) to determine if a protocol is a good selection or not.

With this design, your question of configurability probably then comes down to DNS configurability.

LPardue avatar Mar 02 '23 02:03 LPardue

Thanks, deprecating ma and relying on DNS TTL makes perfect sense. I agree this comes down to DNS configurability of the endpoint-owned records. I worry about CDN vendors providing this configurability consistently, but if there is some recommendation and everyone gets on board, my concerns mostly go away.

sc0ttbeardsley avatar Mar 02 '23 02:03 sc0ttbeardsley