Subdomain support for CIDs longer than 63

Open lidel opened this issue 5 years ago • 26 comments

I hoped to punt this until we need to switch away from sha256 in CIDs, but we may need to solve this problem sooner than expected because ED25519 keys will soon become the new default (https://github.com/ipfs/go-ipfs/issues/6916)

Problem: DNS label limit of 63

RFC 1034: "each node has a label, which is zero to 63 octets in length"

The default CIDv1 Base32 with multihash of sha256 and RSA libp2p-key fits:

  • Default CID: http://bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi.ipfs.dweb.link
  • RSA: https://bafzbeih7uocwo6vmbjusf4wzw6h5cruhw3gf4jxhevmxssnz5vkgryk2za.ipns.dweb.link

but if we use ED25519 libp2p-key then we are 2 characters over the limit:

  • ED25519 libp2p-key: https://bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipns.dweb.link
  • CID created with --hash sha2-512 will be even longer: https://bafkrgqe3ohjcjplc6n4f3fwunlj6upltggn7xqujbsvnvyw764srszz4u4rshq6ztos4chl4plgg4ffyyxnayrtdi5oc4xb2332g645433aeg.ipfs.dweb.link

Label longer than 63 characters means the hostname can't resolve:

$ ping bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipns.dweb.link
ping: bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipns.dweb.link: Name or service not known
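
The limit can be checked before a URL is ever generated. A minimal sketch in Python, assuming only the limits quoted in this issue (63 octets per label, 253 characters for the full name); `dns_length_problems` is a hypothetical helper for illustration, not something go-ipfs ships:

```python
MAX_LABEL = 63   # RFC 1034: a label is zero to 63 octets long
MAX_NAME = 253   # maximum length of the full domain name, including dots

def dns_length_problems(hostname: str) -> list[str]:
    """Return the reasons why hostname violates the DNS length limits."""
    problems = []
    if len(hostname) > MAX_NAME:
        problems.append(f"name is {len(hostname)} chars (max {MAX_NAME})")
    for label in hostname.split("."):
        if len(label) > MAX_LABEL:
            problems.append(f"label {label[:8]}... is {len(label)} chars (max {MAX_LABEL})")
    return problems

sha256_host = "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi.ipfs.dweb.link"
ed25519_host = "bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipns.dweb.link"

print(dns_length_problems(sha256_host))   # empty list: the label is 59 chars
print(dns_length_problems(ed25519_host))  # one entry: the label is 65 chars
```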

And links are not picked up by tools like Slack:

[screenshot: oops-2020-05-14--17-59-27]

Note: I used ED25519 as an example, but the problem is not limited to that single type of CID. Even if we find a way to fit ED25519 in a single label, the problem remains for CIDs with a multihash created with longer hash functions.

Solved: IPNS-specific fix for ED25519 keys

In parallel to the generic fix, we could represent ED25519 keys in a way that fits under 63 characters, solving the UX issue for IPNS websites loaded from public gateways.

Done: https://github.com/ipfs/go-ipfs/pull/7441 – we support {cidv1base36}.ipns.dweb.link which perfectly fits

Open Problem: generic solution for long CIDs

I am happy to open a PR with a fix, but I am unsure if I have the best fix in mind, so I would love to gather feedback first.

:question: (A) support split CIDs (but have broken TLS)

The first idea I have is to split the label when the max is reached. To maximize entropy for Origin isolation, the remainder should be on the left side:

  • https://ba.fzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipns.dweb.link

Pros:

  • :+1: each long CID gets own Origin – we keep isolation
  • :+1: path redirect provided by subdomain gateway can take care of splitting
  • :+1: future-proof solution for longer hashes such as sha2-512
    • the next limit is pretty far away: the maximum length of full domain name: 253 characters, including dots
    • sha2-512 on dweb.link is 121 characters

Cons:

  • :anger: decreased entropy in security guarantees provided by origin isolation
  • :anger: wildcard TLS certificate does not pass validation for more than a single level of labels
    • this will produce annoying UX on public gateways such as dweb.link: a TLS warning when opening an IPFS website on IPNS. We get the same problem as the ENS gateway at *.eth.link (https://blog.almonit.eth.link vs https://almonit.eth.link)
  • :anger: copying & pasting CID as-is no longer works on public gateways (user needs to put . in the middle etc)
    • Note: to make it easier UX-wise, we should allow . anywhere inside of CID, but internally merge labels, and return a redirect to canonical version that splits at deterministic position (enforcing maximum label for Origin).
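
To make option (A) concrete, here is a rough Python sketch of the splitting and the "accept dots anywhere, redirect to canonical" idea. It assumes the split is done right-to-left so the remainder ends up in the leftmost label; `merge_labels`, `split_cid` and `canonical` are hypothetical names, and this is not necessarily how a go-ipfs PR would implement it:

```python
MAX_LABEL = 63  # DNS label limit

def merge_labels(dotted: str) -> str:
    """Accept '.' anywhere inside the CID and recover the original string."""
    return dotted.replace(".", "")

def split_cid(cid: str) -> str:
    """Split right-to-left: every label except the leftmost is exactly 63 chars."""
    labels = []
    while len(cid) > MAX_LABEL:
        labels.insert(0, cid[-MAX_LABEL:])
        cid = cid[:-MAX_LABEL]
    labels.insert(0, cid)  # the remainder goes on the left side
    return ".".join(labels)

def canonical(dotted: str) -> str:
    """Normalize any user-supplied split to the deterministic one."""
    return split_cid(merge_labels(dotted))

cid = "bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk"
print(split_cid(cid))  # the "ba." + 63-char form shown in the example above
```

A gateway following this sketch would compare the requested host with `canonical(...)` and redirect when they differ, so only one spelling of each CID ends up as an Origin.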

:question: (B) redirect long CIDs to an "insecure" subdomain

This would make it possible for content to load, but longer CIDs would not get Origin isolation per CID.

To make this a bit clearer and more idiomatic, we could present this as a "cross origin resource sharing" endpoint that allows CORS requests, supports loading everything from a single origin, and has paths locked down in browsers as noted in https://github.com/ipfs/in-web-browsers/issues/157.

Think in terms of

  • https://dweb.link/ipfs/superlongcid redirecting to https://cors.dweb.link/superlongcid

Pros:

  • :+1: does not break TLS wildcard certs (easy setup for gateway operators)
  • :+1: useful outside this problem: provides idiomatic way for exposing path gateway on subdomain gateways (for use when origin isolation is not needed)

Cons:

  • :anger: long CIDs don't get Origin isolation
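
A sketch of the redirect decision in option (B), in Python. The `cors.` subdomain and the URL shapes are taken from the example above; `redirect_target` is a hypothetical helper and it assumes a well-formed `/ipfs/...` or `/ipns/...` path:

```python
MAX_LABEL = 63  # DNS label limit

def redirect_target(path: str, gateway: str = "dweb.link") -> str:
    """Map /<ns>/<id>[/rest] on a path gateway to the URL it would redirect to."""
    _, ns, ident, *rest = path.split("/")
    tail = "/".join(rest)
    if len(ident) <= MAX_LABEL:
        # Fits in one DNS label: the identifier gets its own Origin.
        return f"https://{ident}.{ns}.{gateway}/{tail}"
    # Too long for a label: fall back to the shared, non-isolated origin.
    suffix = f"/{tail}" if tail else ""
    return f"https://cors.{gateway}/{ident}{suffix}"

short = "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"
print(redirect_target(f"/ipfs/{short}/index.html"))
print(redirect_target("/ipfs/" + "b" * 110))  # stand-in for a sha2-512 CID
```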

:question: (C) swap DAG root with CID that uses shorter hash function

Pros:

  • :+1: "just works"

Cons:

  • :anger: decreased entropy
  • :anger: newly created root blocks need to be persisted somehow: if I bookmark the page loaded via shortened CID and then the root block gets garbage-collected, the address is dead.
    • potential fix: we could always create redundant sha256 root block for every DAG that uses longer hash function for interop

:question: (D) leverage HTTP proxy mode (on localhost)

When the Gateway port is used as an HTTP proxy, the local client does not perform a DNS lookup; instead, the original URL is sent in the HTTP request to the proxy for processing.

Because the HTTP proxy IS the go-ipfs node in that scenario, it does not do a DNS lookup either: it extracts the original (long) CID and resolves it without any involvement of DNS.

As long as user agents are not overzealous in validating URLs, this would allow for long (>63) CIDs on subdomains.

This is important, because it enables localhost gateway (used by Brave) to resolve long CIDs correctly without any additional hacks.

UX details tbd. This could be the solution for localhost gateway, but for public ones we still need something else.
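
To illustrate why proxy mode sidesteps the limit: the proxy only has to parse the hostname it receives, and plain string parsing does not care about label length. A Python sketch, assuming a `<id>.<ns>.localhost` host layout and port 8080; `parse_subdomain_url` is a hypothetical helper, not go-ipfs code:

```python
from urllib.parse import urlsplit

def parse_subdomain_url(url: str, gateway_host: str = "localhost"):
    """Return (namespace, identifier) for http://<id>.<ns>.<gateway_host>/..., else None."""
    host = urlsplit(url).hostname or ""
    suffix = "." + gateway_host
    if not host.endswith(suffix):
        return None
    # Everything left of the gateway host is "<id>.<ns>"; no DNS lookup is involved.
    ident, _, ns = host[: -len(suffix)].rpartition(".")
    if ns not in ("ipfs", "ipns") or not ident:
        return None
    return ns, ident

long_id = "bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk"
print(parse_subdomain_url(f"http://{long_id}.ipns.localhost:8080/index.html"))
```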

Other ideas?

Would love to find a better way to work around this

cc @aschmahmann @Stebalien https://github.com/ipfs/in-web-browsers/issues/89

lidel avatar May 14 '20 16:05 lidel

It's a bit unfortunate that keys are so overly verbose: https://cid.ipfs.io/#bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk

It looks like we have an actual protobuf construct inside the raw bytes. Is this... something we need to do?

If we shave off 2 bytes, nothing extra needs to be done...

ribasushi avatar May 14 '20 16:05 ribasushi

I am afraid even if we find a hacky workaround for libp2p-keys in ED25519, the problem remains for CID that use longer hash functions than sha256.

lidel avatar May 14 '20 17:05 lidel

For context, we're trying to encode 40 bytes into 62 characters (with one character for the multibase prefix).

I believe base36 would work, if that's an option. That should give us exactly 63 characters.
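
The arithmetic behind this can be checked with a few lines of Python. This is only a sketch: it assumes unpadded base32, and for base36 it treats the bytes as one big integer and takes the worst case; the function names are made up for illustration.

```python
def base32_len(n_bytes: int) -> int:
    """Characters needed by unpadded base32 (5 bits per character)."""
    return (n_bytes * 8 + 4) // 5

def base36_len(n_bytes: int) -> int:
    """Worst-case base36 digits when n_bytes are read as one big integer."""
    value, digits = 256 ** n_bytes - 1, 0
    while value:
        value //= 36
        digits += 1
    return digits

# +1 accounts for the single-character multibase prefix
print(1 + base32_len(40))  # 65: two over the 63 limit
print(1 + base36_len(40))  # 63: exactly at the limit
```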

We could change how we encode these peer IDs in text and use an ed25519 specific codec (<cidv1>-<ed25519>-<multihash>). That would still be a reasonable encoding of an ed25519 CID but I'd prefer to avoid it.

Stebalien avatar May 14 '20 17:05 Stebalien

But I agree we should support longer keys regardless. But will this be a problem for TLS certs? Can we get a double-star cert?

Stebalien avatar May 14 '20 17:05 Stebalien

  • I am not aware of any CA that provides double wildcard certs. That is why ENS gateway still has the TLS warning (example: https://blog.almonit.eth.link).

  • Switching the default text representation of PeerID to Base36 would introduce work across the ecosystem to bubble up support (it is missing from multibase.csv atm), and it's not as popular as the RFC version of Base32. Not sure what's the lesser evil: that, or a new codec.

lidel avatar May 14 '20 17:05 lidel

@aschmahmann and I discussed this and it is possible to shrink ed25519 pids, but it's painful and requires coordination with all libp2p implementations.

To shrink ed25519 keys, we need to:

  1. Encode them as <cidv1>-<ed25519>-<multihash> in text. This will reduce the id size to 36 bytes (from 40).
  2. Ideally, migrate to CIDs on the wire in libp2p. That would save us 10% on the wire for ed25519 keys and make it easier to interoperate with other p2p networks (because we could use their native key formats instead of wrapping them in protobufs before hashing).

Unfortunately, if we want to get 1 in the near future, we'd make it significantly harder to get 2. Basically, if we start using the new ed25519 pid encoding now, we'd have to convert back to the normal pid binary format (raw multihash) when decoding. However, if/when we decide to use CIDs as the binary pid format, we'd have trouble round-tripping.

That is, in the ideal world, if we encounter a text-based PID as a CID:

  • If it uses the libp2p-key multicodec, it's a legacy peer ID. Encode it as a multihash on the wire.
  • If it uses any other multicodec, it's a new peer ID. Encode it as a CID on the wire.

However, if we implement 1 before 2, we'd have to encode legacy keys in this new CID format. When converting back, we'd end up with the wrong "on the wire" format.
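
The round-trip problem can be shown with a tiny Python sketch of the "ideal world" rule above. The multicodec codes are the registered ones for libp2p-key (0x72) and ed25519-pub (0xed); `wire_format` is a hypothetical function, not a libp2p API:

```python
LIBP2P_KEY = 0x72   # multicodec code for libp2p-key
ED25519_PUB = 0xED  # multicodec code for ed25519-pub

def wire_format(text_cid_codec: int) -> str:
    """Pick the on-the-wire encoding from the codec of a text peer ID."""
    if text_cid_codec == LIBP2P_KEY:
        return "multihash"  # legacy peer ID
    return "cid"            # new-style peer ID

# A legacy key printed with the shorter ed25519 encoding (step 1 done before
# step 2) would decode as a new-style peer ID, which is the wrong wire format.
print(wire_format(LIBP2P_KEY))   # multihash
print(wire_format(ED25519_PUB))  # cid
```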

Stebalien avatar May 14 '20 19:05 Stebalien

* I am not aware of any CA that provides double wildcard certs.

This seems to not be possible: https://serverfault.com/a/946120

MichaelMure avatar May 20 '20 09:05 MichaelMure

^ might have been closed a bit eagerly by github.

So am I correct to assume that multi-subdomain is not considered anymore? That'd be nice, as it would be a pain to host with TLS due to the certificate limitation.

MichaelMure avatar May 22 '20 10:05 MichaelMure

@MichaelMure yeah, github is too eager indeed. Yes, this is precisely why we went with b36 - to keep TLS possible for the time being.

ribasushi avatar May 22 '20 10:05 ribasushi

We met yesterday and came up with next steps to always resolve CIDs over DNS and have no TLS errors when current defaults/ED25519 keys are used: (1) solve the TLS problem for IPNS with ED25519 keys, (2) make it possible to load longer CIDs

Notes at: https://github.com/ipfs/team-mgmt/pull/1159 – early feedback / questions appreciated!

lidel avatar May 22 '20 12:05 lidel

Could you explain what (2) is in more detail? This document mainly discusses IPNS.

MichaelMure avatar May 22 '20 12:05 MichaelMure

@MichaelMure see https://github.com/ipfs/team-mgmt/pull/1159#discussion_r429208550 Note: it won't be needed for defaults, but will make it possible to load custom CIDs if someone has to use longer hashes for some reason.

lidel avatar May 22 '20 12:05 lidel

Alright. Due to the TLS problem, Infura is unlikely to support that, but I suppose that's sort of OK as it should be a very rare use case.

MichaelMure avatar May 22 '20 12:05 MichaelMure

Well, the hope is that use of companion and/or native IPFS support is wide-spread before that ever becomes an issue...

Stebalien avatar May 22 '20 16:05 Stebalien

I started working on the splitting logic in subdomains, expect PR soon.

Update: https://github.com/ipfs/go-ipfs/pull/7358

lidel avatar May 25 '20 11:05 lidel

I found an interesting Proposed Standard https://tools.ietf.org/html/rfc4343#section-2.2 that suggests that there may be 230=256-26 different usable byte values in DNS hostnames. But I guess in practice, many servers and clients will not support these as part of FQDNs.

bmwiedemann avatar May 28 '20 17:05 bmwiedemann

Leveraging RFC4343 is a no-go – no browser support afaik..

FYSA I've talked with @Stebalien last week, and we are re-evaluating.

None of us is happy with ramifications of splitting into multiple DNS labels, originally proposed in #7358. It will cause us troubles with TLS in the future, and the ultimate goal of subdomain gateways is seamless UX in web browsers.

Decided to look into alternative approach that prioritizes UX in user agents and removes the problem of TLS errors caused by more than one level of wildcards: #7441

lidel avatar Jun 08 '20 14:06 lidel

@lidel can we close this?

Stebalien avatar Apr 05 '21 15:04 Stebalien

No, we need to solve this in a way that enables people to load all CIDs, no matter what gateway type is used.

Right now, subdomains are limited to subset of CIDs: https://dweb.link/ipfs/bafkriqdv2ut4g2hs57uer3hwwbz2gz3hqaeal2po6kyyk7k7tbhqg3vw36er25pxfwnrkriyyhgvra2sq3i5vgry325d32mlljj6l3lyvbexm → CID incompatible with DNS label length limit of 63

Hot take: our options are limited here; it could be that longer CIDs end up on a separate subdomain with the same sandboxing / local storage / API limitations as the ones proposed for the path gateway (https://github.com/ipfs/in-web-browsers/issues/157). Those would not work as website roots, but would be fine for loading other types of content.

lidel avatar Jun 07 '21 20:06 lidel

Just wanted to add to this discussion with an idea, what if you used queries to hold the ID of the CID, e.g.

bafkreievmw4c7yvuhvxt4qjcgqz4nsejxrw4wy4xkhtq54dc62ptceu6xq becomes:

vmw4c7yvuhvxt4qjcgqz4nsejxrw4wy4xkhtq54dc62ptceu6xq.ipfs.dweb.link/?id="bafkreie" (or maybe keeping the multihash ID in the subdomain is better)

Only CIDs for the same content can share the multihash subdomain, so subdomain isolation should be maintained. (unless I'm missing something major, in which case correct me)

(Also, I think topic/ed25519 can be removed)

Winterhuman avatar Apr 16 '22 20:04 Winterhuman

  • we did solve ED25519 in this issue (see first comment) – keeping the label for discoverability
  • query parameter does not provide Origin isolation, and we already have path gateways for cases where isolation is not necessary, so it adds no value
  • keeping only the multihash part in the DNS label does not help – multihash with sha512 digest won't fit in a single DNS label
    • example: https://cid.ipfs.io/#bafkrgqe3ohjcjplc6n4f3fwunlj6upltggn7xqujbsvnvyw764srszz4u4rshq6ztos4chl4plgg4ffyyxnayrtdi5oc4xb2332g645433aeg
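
A quick Python check of how many bytes a single label can carry at all. This is a sketch under the assumption that the whole 63-character label is payload (no multibase prefix) and that a sha2-512 multihash is 66 bytes (1-byte code, 1-byte length, 64-byte digest); `label_capacity_bytes` is a hypothetical helper:

```python
def label_capacity_bytes(alphabet_size: int, chars: int = 63) -> int:
    """Largest n such that every n-byte value fits in `chars` symbols."""
    usable_bits = (alphabet_size ** chars).bit_length() - 1
    return usable_bits // 8

SHA512_MULTIHASH_BYTES = 1 + 1 + 64  # hash code + digest length + digest

print(label_capacity_bytes(32))  # 39 bytes with base32
print(label_capacity_bytes(36))  # 40 bytes with base36
print(SHA512_MULTIHASH_BYTES)    # 66 bytes needed
```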

lidel avatar Apr 19 '22 12:04 lidel

As another option, using CIDv2 (https://github.com/ipfs/specs/pull/305) may allow for "case-insensitive" CIDs which are actually case-sensitive when parsed.

The difference between foo and FOO can be expressed as 000 and 111, where 0 is lowercase and 1 is uppercase, so if CIDs had metadata to describe their casing, then you could do case-insensitive versions of case-sensitive encoded CIDs. e.g.

  • CIDv1 doesn't fit, but is case-insensitive: id...long-cid
  • CIDv1 fits, but is case-sensitive: ID...LoNg-CiD
  • CIDv2 fits, and is case-insensitive: id+metadata...long-cid (or wherever the metadata for CIDv2 will be placed)

The advantage is that the CID metadata changes the CID slightly, so each CID will still have Origin Isolation. But, if the metadata itself gets too long, then extremely long CID strings will still be too big, however, encoding the case-binary efficiently to take the minimal space should make the limit pretty high in theory.

Winterhuman avatar Oct 21 '22 19:10 Winterhuman

@Winterhuman how can you fit sha512 in the proposed CIDv2 and have no more than 63 characters? Are you suggesting using a different (weaker) hash like sha256 to point at the stronger sha512 one? If so, I am afraid that is not a fix, just a workaround: you are decreasing the security of use cases that need longer hashes.

lidel avatar Oct 21 '22 21:10 lidel

No, that's already described in option C. As in encode a SHA512 CID using a case-sensitive encoding, like base58btc. Then, you store the casing of the characters as metadata, e.g.

zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt

Could be z+metadata+yajkonbau5kiqmhpmsxycvn66da1vlmwbt, where the metadata bytes describe the casing to apply to the all-lowercase multicodec + multihash characters to make it the original case-sensitive encoding, and since the metadata changes the CID slightly each casing would be a unique CID. One complication is that you'd need the metadata to be encoded as case-insensitive inside the case-sensitive CID in order for it to be read
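
Roughly, the casing idea could look like this in Python. It is only a sketch of the concept: the mask here is a plain string of 0/1 flags (one per letter, digits have no case), and how that mask would actually be packed into a CID is left open; `split_case` and `restore_case` are made-up names:

```python
def split_case(text: str) -> tuple[str, str]:
    """Return (lowercased text, one flag per letter; '1' means uppercase)."""
    mask = "".join("1" if c.isupper() else "0" for c in text if c.isalpha())
    return text.lower(), mask

def restore_case(lowered: str, mask: str) -> str:
    """Reapply the flags from split_case to recover the original text."""
    flags = iter(mask)
    out = []
    for c in lowered:
        # Only letters consume a flag; digits pass through unchanged.
        if c.isalpha() and next(flags) == "1":
            c = c.upper()
        out.append(c)
    return "".join(out)

original = "zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt"
lowered, mask = split_case(original)
print(lowered)
print(restore_case(lowered, mask) == original)
```

Note that the mask costs one bit per letter, and those bits also have to be written with case-insensitive characters inside the label.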

Winterhuman avatar Oct 23 '22 14:10 Winterhuman

Either that or you could nest a case-sensitive CIDv1 inside a multibase-esque multiformat so it's constructed like:

<multicasing code><multicasing bytes (variable)><multibase><multicodecs>...<multihash digest>

That'd get around having to encode the casing metadata inside the case-sensitive encoding itself, but, requires making a new multiformat or modifying multibase significantly

Winterhuman avatar Oct 23 '22 14:10 Winterhuman

Couldn't you go with the splitting option, but instead of putting the remainder in a subdomain, you put it in the path? Instead of:

https://ba.fzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipfs.dweb.link/

Do:

https://fzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk.ipfs.dweb.link/remainder/ba/

This is an annoying UX, but it preserves as much subdomain isolation as is possible with 63 characters and doesn't result in TLS wildcard problems.
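
A small Python sketch of that mapping, assuming the last 63 characters go into the subdomain and whatever is left goes under a `/remainder/` path prefix as in the example; `to_url` and `from_url_parts` are hypothetical names:

```python
MAX_LABEL = 63  # DNS label limit

def to_url(cid: str, ns: str = "ipfs", gateway: str = "dweb.link") -> str:
    """Put the last 63 chars in the subdomain and the remainder in the path."""
    head, tail = cid[:-MAX_LABEL], cid[-MAX_LABEL:]
    url = f"https://{tail}.{ns}.{gateway}/"
    return url + (f"remainder/{head}/" if head else "")

def from_url_parts(label: str, path: str) -> str:
    """Rebuild the full CID from the subdomain label and the request path."""
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "remainder":
        return parts[1] + label
    return label

cid = "bafzaajaiaejca4syrpdu6gdx4wsdnokxkprgzxf4wrstuc34gxw5k5jrag2so5gk"
print(to_url(cid))
```

With this layout, CIDs that share their last 63 characters would share an Origin, which is the isolation trade-off mentioned above.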

MicahZoltu avatar Aug 16 '24 13:08 MicahZoltu