multiaddr icon indicating copy to clipboard operation
multiaddr copied to clipboard

Handle unparsable multiaddrs

Open Stebalien opened this issue 5 years ago • 12 comments

Due to the fact that binary multiaddrs don't include protocol definitions, we can't parse them unless we know all the relevant protocols.

However, we can parse a prefix. Therefore, I'd like to propose a special, string-only "unknown" protocol that takes a single multibase encoded argument. That is: `/ip4/1.2.3.4/tcp/123/unknown/bxyz". This would only exist in the string format and would allow us to keep and use multiaddrs we don't fully understand.

This should fix some of the problems described in #6.

Stebalien avatar Jul 18 '18 15:07 Stebalien

A few things I'd like to explore:

  1. How can we retain the original string? In terms of printing these addrs out, it's more useful to be able to print the representation that we recieved, not a /unknown/whatever even if it cannot be used. The use case is an old js-ipfs-api requesting swarm.addrs to be printed out on a webpage. It might not understand how to parse QUIC addrs but it's preferable to print the QUIC addr than an "unknown" addr
  2. How can we distinguish between a multiaddr that is valid or not? If we can pass it any string protocol - how do we determine if it's valid (at least in our version of the codec table)? What would valid mean now? What do we do when passing a string/buffer to the multiaddr constructor?
  3. If we create a "string only" multiaddr - what happens when we request the buffer version - throw error?

alanshaw avatar Nov 13 '18 11:11 alanshaw

@jacobheun it might be useful to have your opinion on this?

alanshaw avatar Nov 13 '18 11:11 alanshaw

The requirement that a multiaddr can be represented as a buffer is tricky. Every user of multiaddr has a stale copy of the buffer-to-protocol mapping. It's trivial to syntacitically validate a multiaddr string, but hard to semantically validate that protocols are valid if the definition of valid is their existance in the mapping table.

We will keep running into situations where peers and apps and api clients are running with different versions of multiaddr, with different mapping tables, and older ones unable to parse newer ones.

Having a string only format for multiaddr that doesn't guarantee it can be converted to a buffer would at least tick the points 1 to 4 on the problem statement https://github.com/multiformats/multiaddr#introduction

olizilla avatar Nov 13 '18 12:11 olizilla

I think we're talking past each other a bit here. My primary concern is relay dialing (which should have been in the issue description...). That is, I need to be able to use a multiaddr prefix (encoded) without the rest.

For example: /ip4/1.2.3.4/tcp/1234/p2p/QmRelay/p2p-circuit/ip4/tcp/1234/special_protocol/p2p/QmTarget.

QmRelay and QmTarget may understand special_protocol while I may not.

Basically, I'm worried about creating unnecessary network partitions because we're being too strict in parsing our inputs.

We still need to solve the UX issues.

How can we retain the original string?

We usually don't and can't know the original string (e.g., binary peer address records from the DHT).

How can we distinguish between a multiaddr that is valid or not? If we can pass it any string protocol - how do we determine if it's valid (at least in our version of the codec table)? What would valid mean now? What do we do when passing a string/buffer to the multiaddr constructor?

This proposal tries to cover the use-case where I don't really care about the rest of the multiaddr. Ideally, I'd be able to validate it, but even if I could, I'm just going to pass it off to someone else anyways.

If we create a "string only" multiaddr - what happens when we request the buffer version - throw error?

By string-only, I meant that /unknown/... would only appear in the string representation. When converting back to the binary representation, we'd parse $x/unknown/$y back into ParseMultiaddr($x) + CastMultiaddrBytes(DecodeMultibase($y)). That is, everything after unknown would just be the part of the multiaddr we failed to parse (binary), multibase-encoded.

Now, there is the case where a user gives me /ip4/.../...../someproto and I don't know what someproto is. This proposal doesn't handle that case but I guess we could:

  1. Allocate a special "string" path multicodec.
  2. Use that multicodec and encode the rest of the string as the argument.

That is, if we ran across $x/someproto/$y, we'd encode this to BinaryMultiaddr($x) + StringMultiaddrCodec + Length + $y. When parsing such a multiaddr, implementations would do their best to lower it to a the correct binary format.

However, really, "upgrade your client" may also be a valid solution here.

Stebalien avatar Nov 14 '18 01:11 Stebalien

Thanks for your thoughts!

That is, if we ran across $x/someproto/$y, we'd encode this to BinaryMultiaddr($x) + StringMultiaddrCodec + Length + $y. When parsing such a multiaddr, implementations would do their best to lower it to a the correct binary format.

Do we need a StringMultiaddrCodec - could we just use the unknown codec?

That aside, I'm interested to see if/how this would work in code - I'll see if I can work up a PR to js-multiaddr and report back :D

alanshaw avatar Nov 19 '18 16:11 alanshaw

Do we need a StringMultiaddrCodec - could we just use the unknown codec?

So, there wouldn't really be an unknown "codec". Basically, we're trying to handle two inverse cases:

  1. We have a string and need to turn it into binary. In this case, we need a protocol that only exists in binary form. When converting back to a string, the protocol would disappear.
  2. We have a binary address and need to turn it into a string. In this case, we need a protocol that only exists in string form. When encoding to the binary representation, it would disappear as we already know the binary protocol code.

Basically, we need a StringInBinary protocol (StringMultiaddrCodec) and a BinaryInString protocol (the "unknown" protocol).

That aside, I'm interested to see if/how this would work in code - I'll see if I can work up a PR to js-multiaddr and report back :D

The second half (binary in strings) has a PR here: https://github.com/multiformats/go-multiaddr/pull/74.

However, I'm still worried about multiple binary representations of multiaddrs. That's why I'm less sure we want StringInBinary.

Stebalien avatar Nov 19 '18 21:11 Stebalien

Could the problem of unknown/new address protocols be solved by the following steps?

  1. require that all new protocols are length-prefixed
  2. require that all implementations support the existing non-length-prefixed and non-value protocols

That'd seem like a simple way forward to me.

ghost avatar Mar 27 '19 15:03 ghost

@lgierth: How do we know whether a protocol even requires an argument at all? /utp, /quic and about 10 others do not, for instance. Also why not simply require all protocols to be length-prefixed then? While it would increase many binary representations by 1 byte per protocol, it would make parsing byte representations much simpler in general and solve the “unknown protocol” issue almost as a side-effect. (Although it would not help with the problem parsing unknown string representations, of course.)

ntninja avatar Mar 27 '19 21:03 ntninja

How do we know whether a protocol even requires an argument at all? /utp, /quic and about 10 others do not, for instance.

Yeah, actually, point 2 above should include most or all existing protocols, not just the length-prefixed ones.

Also why not simply require all protocols to be length-prefixed then?

Backward compatibility. It'd be a huuuuge hassle to fundamentally change existing multiaddr protocols, while it's easy to freeze them as-is and introduce new rules for new protocols.

We must avoid breaking changes at all cost.

ghost avatar Mar 27 '19 22:03 ghost

We must avoid breaking changes at all cost.

Note: We can also transition. That is, we can say introduce a new "multiaddr-2" protocol and parse everything after that using the new system. Eventually, we can remove support for multiaddr-1 (the network will forget old multiaddrs pretty quickly). This would give us a chance to revisit this from scratch.

Stebalien avatar Mar 28 '19 02:03 Stebalien

Has there been any more progress made on this?

This would make multiaddr much more useful for projects that need to add complexity to what is already defined by multiaddr without requiring per-application changes to the table.

saul-jb avatar Mar 13 '22 21:03 saul-jb

Unfortunately, no. If you're interested, I'd start by forking go-multiaddr and adding an implementation there. I say "fork" because adding support for this will require a non-trivial refactor of everything depending on multiaddrs, so I expect we'll end up with two "flavors" of multiaddrs: partially validated ones and fully validated ones".

Stebalien avatar Mar 20 '22 20:03 Stebalien