go-libp2p icon indicating copy to clipboard operation
go-libp2p copied to clipboard

webrtc: add WebRTC transport

Open ckousik opened this issue 1 year ago • 4 comments

This PR implements the webrtc transport spec according to https://github.com/libp2p/specs/pull/412 . This PR is not yet ready for review.

The webrtc protocol for multiaddr is implemented in this PR and needs to be implemented in go-multiaddr prior to merging this PR. This PR also uses multibase encoded multihash for DTLS fingerprint verification after the NOISE handshake.

ckousik avatar Jul 12 '22 13:07 ckousik

@marten-seemann @mxinden I've implemented the multibase multihash noise handshake here. For verification, we use the remote certificate of the underlying DTLS transport.

ckousik avatar Jul 12 '22 17:07 ckousik

@marten-seemann @mxinden this is ready for review

ckousik avatar Jul 15 '22 13:07 ckousik

2022-08-05 triage conversation: with the latest spec changes, https://github.com/libp2p/go-libp2p/pull/1663 is a prereq for this.

BigLep avatar Aug 05 '22 16:08 BigLep

@BigLep That PR is approved and awaiting merging.

ckousik avatar Aug 05 '22 16:08 ckousik

@marten-seemann or @MarcoPolo do you have capacity to give this another review? If not, I will take a deeper look myself, though I think your review is way more valuable than mine.

mxinden avatar Sep 30 '22 11:09 mxinden

2022-11-17 triage conversation: @marten-seemann plans to do the review early the week of 2022-11-21. He'll then be out until 2022-12-05, so he'll have to look at new revisions then.

BigLep avatar Nov 17 '22 21:11 BigLep

Follow up to go-libp2p triage session:

@ckousik let's rebase and address or close open comments. Chiefly: https://github.com/libp2p/go-libp2p/pull/1655#discussion_r985486271 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r999754935 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r985473552 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r973052723 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r973051121 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r922799947 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r922800772 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r922800077 https://github.com/libp2p/go-libp2p/pull/1655#discussion_r922800554

Please prioritize this over adding Circuit Relay v2 in js-libp2p for now. If you think it's worthwhile to pair program with @marten-seemann please schedule a call (he's in NZST) before Nov 23.

The rust-libp2p implementation has been merged so there's opportunity to do some interop testing as well

p-shahi avatar Nov 18 '22 00:11 p-shahi

@p-shahi Most of the comments have been resolved. There are a bunch of comments at the end for udp_mux.go and udp_mux_conn.go. These are files that we take directly from pion with slight modification.

ckousik avatar Nov 18 '22 05:11 ckousik

Hi @ckousik, I tried playing around with WebRTC, and I found that this code here doesn't work very well.

Server:

func main() {
	h1, err := libp2p.New(
		libp2p.Transport(libp2pwebrtc.New),
		libp2p.ListenAddrStrings("/ip4/127.0.0.1/udp/1234/webrtc"),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Server: %s/p2p/%s\n", h1.Addrs()[0], h1.ID())
	select {}
}

Client:

func main() {
	h2, err := libp2p.New(
		libp2p.NoListenAddrs,
		libp2p.Transport(libp2pwebrtc.New),
	)
	if err != nil {
		log.Fatal(err)
	}

	ai, err := peer.AddrInfoFromString(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	if err := h2.Connect(context.Background(), *ai); err != nil {
		log.Fatal(err)
	}
	conn := h2.Network().ConnsToPeer(ai.ID)[0]
	str, err := conn.NewStream(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	str.Write([]byte("foo"))
	select {}
}

Here's a some super trivial things that doesn't work:

  • Right after connecting to the server, I get this error:

❯ go run ./cmd/client/main.go /ip4/127.0.0.1/udp/1234/webrtc/certhash/uEiCUingRvm4mNSTQQ0D2SuKWj97fEpUyi8l5SvGd1iM8Hw/p2p/12D3KooWBHii1mmbbDFJmJyZ8SWAz63atDZXBNmXj96zTgnHhzhD sctp ERROR: 2022/12/02 22:09:45 [0x1400033e1c0] stream 2 not found)

There's an equivalent error for the server side.

  • On the server side (where I'm not even accepting the stream), another error is thrown:

2022-12-02T22:09:55.662+1300 ERROR webrtc-transport webrtc/datachannel.go:376 could not read messageEOF

  • It is impossible to connect to the same server more than once, this leads to a connection timeout.

I was initially planning to have a look at the resource usage, but I don't think I can't continue until these bugs are fixed.

marten-seemann avatar Dec 02 '22 09:12 marten-seemann

@marten-seemann I checked the example you posted and the connection is getting established. The write also takes place successfully. I'm looking into why pion is reporting stream not found for a datachannel that we do not create. I fixed the issue where connections time out. This was due to a deadlock on my end. The EOF message is caused because a protocol is not negotiated on the stream that you create, therefore the server resets the stream.

ckousik avatar Dec 02 '22 17:12 ckousik

@marten-seemann Yes, muxedConnection just wraps a packetQueue and adds a reverse mapping from connection to its remote addresses, in addition to allowing writes to the underlying net.PacketConn. While it is easy to roll them into a single struct, it feels more readable to implement them as separate structs as it completely separates any access to the underlying connection/networking details from the queuing and blocking logic.

ckousik avatar Dec 07 '22 13:12 ckousik

What are the next steps here? Is this blocked until we get more performance insight from https://github.com/libp2p/go-libp2p/issues/1917 ?

BigLep avatar Jan 07 '23 00:01 BigLep

It requires @marten-seemann 's review afaik it's not blocked on the performance work.

p-shahi avatar Jan 07 '23 00:01 p-shahi

It requires @marten-seemann 's review afaik it's not blocked on the performance work.

Absolutely it is blocked on performance work. We

  1. have the policy to not build foot-guns and
  2. we have anecdotal evidence that WebRTC's resource usage will be much larger than other transports'

Now if this is a factor of 2 or 3, it probably doesn't matter that much. If it's a factor of 10 or 20, things might look different. Getting some concrete numbers on what the resource usage of the WebRTC transport looks like as compared to other transports is therefore a prerequisite for merging this into master, and even more so for enabling WebRTC as a default transport.

marten-seemann avatar Jan 07 '23 01:01 marten-seemann

Sorry, I only assumed it was a blocker for merging the PR but not subsequent re-reviews. Chinmay says he was blocked on re-review here and here. There may have been a separate convo that I'm not aware of.

https://github.com/libp2p/go-libp2p/issues/1917 has been unblocked so let's treat this as P0 @ckousik ?

p-shahi avatar Jan 07 '23 02:01 p-shahi

https://github.com/libp2p/go-libp2p/pull/1655#discussion_r1063940219 seems to be an older commit which was addressed.

ckousik avatar Jan 07 '23 16:01 ckousik

We should figure out if it's possible to disable the DTLS Hello Verify Request. There's not really a good reason to send it, and it costs us 1 RTT on every handshake. See https://github.com/libp2p/blog/pull/18#discussion_r1063068104 for context and links to the respective Pion code.

marten-seemann avatar Jan 07 '23 21:01 marten-seemann

Trying to connect to a node that's presenting a different than expect cert hash results in the following error message:

2023/01/08 11:17:25 failed to dial 12D3KooWE3AwZFT9zEWDUxhya62hmvEbRxYBWaosn7Kiqw5wsu73:
  * [/ip4/127.0.0.1/udp/1234/webrtc/certhash/uEiBUvyKrUqPcYoJk9T2JW_I6IWCnMZnF0WS3D9hpaifKwg] peerconnection failed: peerconnection move to failed state: <nil>

That's not a very helpful error.

marten-seemann avatar Jan 07 '23 22:01 marten-seemann

I added a minimal WebRTC client and server in https://github.com/libp2p/go-libp2p/commit/42e20e3a3e9c786f6a16530b0cee2370a079c62a to play around with this PR a bit, and I'm occasionally seeing SCTP errors are printed to stdout / stderr (?) directly. Most times, I need to hit the server a few times before they show up:

 go run ./cmd/webrtc/server/main.go
2023/01/08 11:26:23 I am /ip4/127.0.0.1/udp/1234/webrtc/certhash/uEiCwYOwbyh2EBSrHCovTbyAPE3e27i_mpkqGJ2lNTYmkag/p2p/12D3KooWE3AwZFT9zEWDUxhya62hmvEbRxYBWaosn7Kiqw5wsu73

2023/01/08 11:26:37 new stream from 12D3KooWLUz7G4ejfn8DE1YV1KHc343zLHChf8ztyrhkPRkmCZ91
2023/01/08 11:26:37 copied 6
2023/01/08 11:27:58 new stream from 12D3KooWGPaGns4WpD7pNwf9RjhhFHELVCGTz46hkUtJZjCCw8Fo
2023/01/08 11:27:58 copied 6
2023/01/08 11:28:03 new stream from 12D3KooWRLnpdWFciYvfuNTm2BwAFPwo1GMG7K2RMb8ALtFKeYiA
2023/01/08 11:28:03 copied 6
sctp ERROR: 2023/01/08 11:28:04 [0x1400179a000] stream 4 not found)
sctp ERROR: 2023/01/08 11:28:04 [0x1400179a000] stream 4 not found)
sctp ERROR: 2023/01/08 11:28:06 [0x1400179a000] stream 4 not found)
sctp ERROR: 2023/01/08 11:28:06 [0x1400179a000] stream 4 not found)

This will spam the logs of everybody running the library. I assume this is because the client goes away before the connection is properly closed. This will happen on live networks. At the very least, these errors need to be suppressed somehow.

marten-seemann avatar Jan 07 '23 22:01 marten-seemann

CI is failing with errors in the WebRTC package (example: https://github.com/libp2p/go-libp2p/actions/runs/3863157724/jobs/6585170202):

  === RUN   TestTransportWebRTC_DialerCanCreateStreams
      transport_test.go:245: dialer opened connection
      transport_test.go:256: timed out
  --- FAIL: TestTransportWebRTC_DialerCanCreateStreams (10.00s)

@ckousik, please pay attention to the CI output for failures in the code that you're adding to the code base. (I'm aware that there are other sources of flaky tests in our code base, and yes, this is annoying and we're working on this). I expect the WebRTC tests to be rock-solid though and not to introduce any additional flakiness.

marten-seemann avatar Jan 07 '23 22:01 marten-seemann

There seems to be a bug in the stream state machine. Using this modified example: https://github.com/libp2p/go-libp2p/compare/webrtc-cmd...webrtc-cmd-timeout

Run a transfer from server to client. The server does not close the stream, so the client's ReadAll call will block. Now kill the server.

What you'd expect to happen: the connection times out eventually, and the client's ReadAll call errors. What actually happens: the connection times out, and the client's ReadAll call succeeds.

This is a very serious bug. It would lead the client to believe that a transfer was completed successfully, when in reality it failed. It's also a bug that's expected to be exceedingly common in practice: nodes go away all time time, often times without properly shutting down the connection.

marten-seemann avatar Jan 07 '23 23:01 marten-seemann

Trying to connect to a node that's presenting a different than expect cert hash results in the following error message:

2023/01/08 11:17:25 failed to dial 12D3KooWE3AwZFT9zEWDUxhya62hmvEbRxYBWaosn7Kiqw5wsu73:
  * [/ip4/127.0.0.1/udp/1234/webrtc/certhash/uEiBUvyKrUqPcYoJk9T2JW_I6IWCnMZnF0WS3D9hpaifKwg] peerconnection failed: peerconnection move to failed state: <nil>

That's not a very helpful error.

So, this case is actually covered in a test. We can only know that the certhash was bad when the peerconnection fails. I've actually tested this with the peerconnection.SCTP().OnError() callback, but that does not surface the error.

ckousik avatar Jan 08 '23 06:01 ckousik

@marten-seemann Any chance the global pool could set off the race detector? I got the following race condition and then ran an experiment to confirm:

WARNING: DATA RACE
Write at 0x00c001921002 by goroutine 10685:
  runtime.slicecopy()
      /usr/lib/go/src/runtime/slice.go:307 +0x0
  github.com/flynn/noise.(*HandshakeState).WriteMessage()
      /home/ckousik/go/pkg/mod/github.com/flynn/[email protected]/state.go:379 +0x764
  github.com/libp2p/go-libp2p/p2p/security/noise.(*secureSession).sendHandshakeMessage()
      /home/ckousik/projects/go-libp2p/p2p/security/noise/handshake.go:175 +0xb1
  github.com/libp2p/go-libp2p/p2p/security/noise.(*secureSession).runHandshake()
      /home/ckousik/projects/go-libp2p/p2p/security/noise/handshake.go:81 +0x691
  github.com/libp2p/go-libp2p/p2p/security/noise.newSecureSession.func1()
      /home/ckousik/projects/go-libp2p/p2p/security/noise/session.go:69 +0x58

Previous read at 0x00c001921000 by goroutine 3133:
  runtime.slicecopy()
      /usr/lib/go/src/runtime/slice.go:307 +0x0
  github.com/libp2p/go-libp2p/p2p/transport/webrtc/udpmux.(*packetQueue).pop()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/udpmux/packetqueue.go:54 +0x44e
  github.com/libp2p/go-libp2p/p2p/transport/webrtc/udpmux.(*muxedConnection).ReadFrom()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/udpmux/muxed_connection.go:56 +0x65
  github.com/pion/ice/v2.(*candidateBase).recvLoop()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/candidate_base.go:222 +0x222
  github.com/pion/ice/v2.(*candidateBase).start.func1()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/candidate_base.go:206 +0x47

Goroutine 10685 (running) created at:
  github.com/libp2p/go-libp2p/p2p/security/noise.newSecureSession()
      /home/ckousik/projects/go-libp2p/p2p/security/noise/session.go:68 +0x6d9
  github.com/libp2p/go-libp2p/p2p/security/noise.(*SessionTransport).SecureOutbound()
      /home/ckousik/projects/go-libp2p/p2p/security/noise/session_transport.go:96 +0x1c7
  github.com/libp2p/go-libp2p/p2p/transport/webrtc.(*WebRTCTransport).noiseHandshake()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/transport.go:482 +0x2b1
  github.com/libp2p/go-libp2p/p2p/transport/webrtc.(*listener).setupConnection()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/listener.go:311 +0xd3c
  github.com/libp2p/go-libp2p/p2p/transport/webrtc.(*listener).handleCandidate()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/listener.go:189 +0x18b
  github.com/libp2p/go-libp2p/p2p/transport/webrtc.(*listener).handleIncomingCandidates.func1()
      /home/ckousik/projects/go-libp2p/p2p/transport/webrtc/listener.go:134 +0x115

Goroutine 3133 (running) created at:
  github.com/pion/ice/v2.(*candidateBase).start()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/candidate_base.go:206 +0x28e
  github.com/pion/ice/v2.(*CandidateHost).start()
      <autogenerated>:1 +0x6b
  github.com/pion/ice/v2.(*Agent).addCandidate.func1()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/agent.go:837 +0x1bb
  github.com/pion/ice/v2.(*Agent).taskLoop()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/agent.go:229 +0x13b
  github.com/pion/ice/v2.NewAgent.func3()
      /home/ckousik/go/pkg/mod/github.com/pion/ice/[email protected]/agent.go:363 +0x39
==================

The receiveMTU for the UDP mux is 1500 bytes, so it should be requesting buffers of capacity 2048 from the global pool. The noise handshake requests buffers of length 2048 too https://github.com/libp2p/go-libp2p/blob/4ad3734091da59d91747706a5f49ff0a5cfdad92/p2p/security/noise/handshake.go#L75 .

I temporarily changed the receiveMTU to request 4096 sized buffers and the race condition doesn't seem to happen.

ckousik avatar Jan 08 '23 11:01 ckousik

I am not finished with my review, but I can already tell that it doesn't surprise me that there are data races here. You're playing very loose with low-level concurrency primitives such as the Mutex.

GlenDC avatar Jan 09 '23 21:01 GlenDC

This PR can be closed in favour of https://github.com/libp2p/go-libp2p/pull/1999.

GlenDC avatar Jan 17 '23 22:01 GlenDC

Closing in favor of #1999

ckousik avatar Jan 18 '23 05:01 ckousik