nim-libp2p feat: Add callback for pessimistic transport start

Fixes: #667.

Add switch.listenError callback that is called when a transport fails to start. This is a pessimistic approach meaning that it will fail on the first address that fails to start in the transport, which has been discussed in #662.

This remains in draft to kick the discussion off if we want to be optimistic inside of the transport.start (a la #661), which can be added in a new commit.

Dec 07 '21 12:12 emizzle

I had another idea, let me know what you think:

We build a seq[(MultiAddress, CatchableError)] (or equivalent), and put all failures in it. Then, at the the of the start, we call the callback with the seq. The default one would be if seq.len > 0: throw

This allows for more sophisticated logic in the handle, and we just have to call/handle it once.

wdyt?

Dec 07 '21 15:12 Menduist

I had another idea, let me know what you think:

We build a seq[(MultiAddress, CatchableError)] (or equivalent), and put all failures in it. Then, at the the of the start, we call the callback with the seq. The default one would be if seq.len > 0: throw

This allows for more sophisticated logic in the handle, and we just have to call/handle it once.

wdyt?

Seems less flexible than the callback, also see my comment which is related - https://github.com/status-im/nim-libp2p/pull/668#discussion_r764147551

Dec 07 '21 16:12 dryajov

Maybe I wasn't clear, but the idea would be to keep the callback, just call it once at the end. Pseudocode:

proc start*(s: Switch) =
  let failures: seq[(MultiAddress, CatchableError)]
  for t in s.transport:
    failures &= await t.start(s.addrs.filterIt(t.handles(it))

  if failures.len > 0:
    let toThrow = callback(failures)
    if toThrow: raise toThrow

Pros:

Allows more sophisticated logic in the callback (for instance, if both ipv4 & ipv6 failed, throw, but swallow error if one of them succeeded. This can be done with our current callback, but not elegantly) Cons:
If the first address failed and we want to stop on that one, we'll still start everything before shutting down, since the callback is only called at the end

Dec 09 '21 11:12 Menduist

Maybe I wasn't clear, but the idea would be to keep the callback, just call it once at the end. Pseudocode:
proc start*(s: Switch) =
  let failures: seq[(MultiAddress, CatchableError)]
  for t in s.transport:
    failures &= await t.start(s.addrs.filterIt(t.handles(it))

  if failures.len > 0:
    let toThrow = callback(failures)
    if toThrow: raise toThrow
Pros:

Allows more sophisticated logic in the callback (for instance, if both ipv4 & ipv6 failed, throw, but swallow error if one of them succeeded. This can be done with our current callback, but not elegantly) Cons:

If the first address failed and we want to stop on that one, we'll still start everything before shutting down, since the callback is only called at the end

This would require the transport to start all of its addresses optimistically in order to return a seq of exceptions, right? Didn't we want it to be a) pessimistic by default and b) customisable from the caller?

Dec 10 '21 00:12 emizzle

If the first address failed and we want to stop on that one, we'll still start everything before shutting down, since the callback is only called at the end

Yeah, to me this defeats the purpose of the callback.

Dec 10 '21 03:12 dryajov

This PR now has:

support for switch.listenError callback
support for transport.listenError callback, unique per transport
tests that have fully and partially optimistic/pessimistic cases
tests for changes made to tcp and ws transports to support listenError
transport addrs sequence only contains handleable addresses once start is called, which has been handled in the Transport base class' start method.
tests for changes made to tcp and ws transports to support filtering of unhandleable addresses

There is one particular use case for pessimism/optimism that should be pointed out as it may or may not be what is expected. When a switch is pessimistic and all its transports are optimistic, then one might assume that after the first failure occurs in a transport, that the subsequent transport would not be started (the switch is supposed to be pessimistic). This, however, does not occur as the optimism in the transport prevents the switch.listenError from being called at all -- the optimistic transport.listenError does not return an error, and therefore an error is not re-raised inside of switch.start, which means switch.listenError will not be called. Is this a use that we want to handle or is this expected behaviour?

Dec 13 '21 08:12 emizzle

switch.listenError has been removed, along with filtering of addresses in transport's base handles.

Dec 15 '21 01:12 emizzle

@emizzle we don't need to mock connections, instead we use the BufferStream abstraction for that. Please take a look at test/helpers.nim where a wrapper TestBufferStream is defined which allows you to set a writer handler. In order to "write" data to the Connection backed by the BufferStream use await conn.pushData(@[...]). BufferStream inherits Connection, so it can be already used interchangeably everywhere you need a stream, for more examples take a look at testmplex.nim or I can provide some if you need.

Dec 15 '21 14:12 dryajov

I'm still a bit conflicted about how we're approaching this from an API standpoint.

In our last call @Menduist brought up a good point about passing the callback only to the transports, which doesn't allow us to handle switch specific issues, the main example used was addresses that don't have a transport to handle them. This in turn raises an additional question of what is really a fatal vs a non fatal error and at which point should we detect it? So I have this two questions right now:

Should a missing address (or similar) be considered a fatal error?
What is the appropriate stage to detect and report this kind of errors?

I don't have a good answer for the first question, but on the second I'm of the opinion that errors should be detected and reported/handled as soon as possible. In the case of missing transports, we can detect it a lot earlier than calling start, either before instantiating the Switch or during instantiation, with an assert or similar. This goes for must everything that can be detected early and we shouldn't allow instantiation with incorrect/unwanted states in general.

The second point that @Menduist brought up was in regards to https://github.com/status-im/nim-libp2p/pull/668#discussion_r765708049, which allows passing the callback to start instead the constructor. I don't have a strong opinion here, except that passing the callback during instantiation allows us to clearly separate the two stages, for example, a user might instantiate the switch in one palace and hold on to the instance until starting it in a completely different part of the application? Dunno... just my 2c on this one.

Dec 15 '21 16:12 dryajov

@emizzle we don't need to mock connections, instead we use the BufferStream abstraction for that. Please take a look at test/helpers.nim where a wrapper TestBufferStream is defined which allows you to set a writer handler. In order to "write" data to the Connection backed by the BufferStream use await conn.pushData(@[...]). BufferStream inherits Connection, so it can be already used interchangeably everywhere you need a stream, for more examples take a look at testmplex.nim or I can provide some if you need.

Thanks, ~~I'll remove the mock connection and use a TestBufferStream instead in the tests.~~ the MockConnection has been removed and replaced with TestBufferStream. The point of having the mock connection was to attempt to remove as much logic as possible that was not being tested. Using a TestBufferStream didn't seem to remove as much logic because of the parent type BufferStream still have fleshed out implementations that might be called. But it appears that BufferStream is well covered in unit tests, so using it in tests should be less of an issue (if something in BufferStream has broken, then those tests would pick it up).

Dec 15 '21 23:12 emizzle

I'm still a bit conflicted about how we're approaching this from an API standpoint.

In our last call @Menduist brought up a good point about passing the callback only to the transports, which doesn't allow us to handle switch specific issues, the main example used was addresses that don't have a transport to handle them. This in turn raises an additional question of what is really a fatal vs a non fatal error and at which point should we detect it? So I have this two questions right now:

Should a missing address (or similar) be considered a fatal error?

What is the appropriate stage to detect and report this kind of errors?

I don't have a good answer for the first question, but on the second I'm of the opinion that errors should be detected and reported/handled as soon as possible. In the case of missing transports, we can detect it a lot earlier than calling start, either before instantiating the Switch or during instantiation, with an assert or similar. This goes for must everything that can be detected early and we shouldn't allow instantiation with incorrect/unwanted states in general.

I like the idea of asserting during instantiation for things that are required for the switch and transports to work. If a switch needs one or more transports to function, (and will not function otherwise) we could assert this in the switch constructor. If a transport needs one or more addresses to start (and will not start otherwise), we could assert this in start. This handles detecting and reporting errors as soon as possible and it also describes expected behaviour of the switch and transport.

I'm quite possibly missing some context, but it feels like the issue here is whether we should allow a switch to be instantiated without transports or addresses, then allow the consumer to add transports and addresses afterwards as needed, prior to calling start. Are there particular use cases in mind (now or in the future) in which we would be interested in supporting this? If it needs to be supported, maybe we could use a booldefine to allow consumers to disable the defect at compilation, at which point they could instantiate a switch and start to build it as needed after instantiation.

Regarding whether or not a missing address should be considered a fatal error. If the calling application could recover from providing no addresses, then we should use an exception. Again, not being super experienced in libp2p, but doesn't it make sense that an application would simply check if it has an address before constructing a switch? In my opinion, this should be a defect, unless there are situations where getting addresses is not something that can be checked by the application before building the switch.

Dec 16 '21 01:12 emizzle

The second point that @Menduist brought up was in regards to #668 (comment), which allows passing the callback to start instead the constructor. I don't have a strong opinion here, except that passing the callback during instantiation allows us to clearly separate the two stages, for example, a user might instantiate the switch in one palace and hold on to the instance until starting it in a completely different part of the application? Dunno... just my 2c on this one.

Agreed, I don't have an entirely strong opinion either direction. It is an interesting thought that if a switch is created in ModuleA and injected into ModuleB where start is called, then there may be some context missing in ModuleB to reason about an error. However, the way it works right now, switch.listenError can be provided during instantiation or overridden at any point thereafter, so ModuleB could override listenError prior to calling start, if needed. This setup seems to provide a fair bit of flexibility. Providing listenError as a parameter to start would remove that flexibility as well as instantiation context (eg ModuleA context) in order to reason about an error.

Dec 16 '21 01:12 emizzle

Latest commit adds two things:

During switch instantiation, raises defect if no transports or addresses supplied
During transport.start, raises defect if no addresses supplied.

I implemented the points discussed above because sometimes it's easier to reason once written out.

Dec 17 '21 10:12 emizzle

Anyone know why CI is failing for testnative on Linux? I've run this in a Ubuntu 20.04.3 VM using nim 1.4.8 and 1.2.16 and had no failures...

Dec 21 '21 07:12 emizzle

Can't repro locally either

Dec 22 '21 12:12 Menduist

All tests passing.

Jan 18 '22 05:01 emizzle

I'll admit, I'm still not pleased with the state of this PR. We still don't raise an error when an address is not handled by any transport, and that would imply yet-another-callback

I don't find it very user friendly, you have to make a lot of different callbacks without clear benefits.

ATM, I don't have anything better to propose (outside what I've already proposed here), so we may have to brainstorm a bit

Jan 21 '22 11:01 Menduist

We brainstormed a bit with @dryajov and this is what we came up with: https://github.com/status-im/nim-libp2p/issues/693 (I think someone proposed that a while ago actually)

I think it's more elegant and powerful than the handler approach, sorry that we launched you on the wrong path

Feb 02 '22 10:02 Menduist

nim-libp2p nim-libp2p copied to clipboard

feat: Add callback for pessimistic transport start

nim-libp2p
nim-libp2p copied to clipboard