dtls icon indicating copy to clipboard operation
dtls copied to clipboard

Listener doesn't process parallel handshakes

Open jkralik opened this issue 3 years ago • 16 comments

Your environment.

https://github.com/go-ocf/go-coap/issues/109

What did you do?

Just look at last comment. But it seems that other handshakes waits each other.

What did you expect?

dlts.Listener is able to process parallel handshakes.

jkralik avatar Aug 17 '20 13:08 jkralik

The issue is here. Every time we get a new inbound stream we block until the handshake is complete. This behavior also isn't a regression f7a68a36523bf733b1c4139f88e109b4b31e02f8

We should look at how crypto/tls buffers internally so that handshakes don't block either. I don't know what the API should look like. I don't know if this is an established Go pattern, but you could call accept n times. That is one way you could accomplish the concurrency. Let me confirm first.

@backkem any ideas?

Sean-Der avatar Aug 19 '20 00:08 Sean-Der

https://gist.github.com/Sean-Der/a3a6abbe9b78ced613590312a18dc8a4

@jkralik @boaks @jdbruijn this worked^ If one Server process hangs the other 4 work just fine. If I remove the change in the example one client hanging hurts everything.

You just need to decide how many goroutines you are willing to spawn.. Maybe this should be something that Pion automatically does? I am not sure what is idiomatic.

Sean-Der avatar Aug 19 '20 01:08 Sean-Der

I've checked in my test setup and also works there. See the readme of the repo I've just referenced this in. Will be testing this in my application code, but should work for the time being at lest. @Sean-Der Thanks so much for the quick analysis and fix.

We should look at how crypto/tls buffers internally so that handshakes don't block either.

I have no idea what is the idiomatic but I'll have a look at crypto/tls to see what they do. I can't imagine that handshakes would be blocking there.

You just need to decide how many goroutines you are willing to spawn..

I'll have a look at how the go routines work exactly and what spawning N of them means in CPU, RAM etc. to determine how much I'd be willing to spawn.

For reference to anyone reading: In most applications this wouldn't be that much of an issue since the handshake over an internet connection goes relatively quickly, so context deadlines can be configured to be very tight (e.g. 100 ms). With embedded devices, the full handshake generally takes 5 seconds, which is much more blocking even if it doesn't hang and 500 devices are trying to make connection at the same time.

jdbruijn avatar Aug 19 '20 05:08 jdbruijn

I'm probably misunderstanding the problem, but can't we use a buffered channel to get around this?

var conns := make(chan net.Conn, 100)

hub := util.NewHub()

go func(l net.Listener) {
    for {
        conn, err := l.Accept()
        if err != nil {
            .. do something ..
            return
        }
        conns <- conn
    }
}(listener)

for {
    select {
    case c := <-conns:
        hub.Register(conn)
    }
}
hub.Chat()

daenney avatar Aug 19 '20 16:08 daenney

Oh mm, the Accept is on dtls.Listener, so that triggers the whole handshake machinery. We should be able to Accept() on the underlying listener first and then spin the rest off into a goroutine? I suppose the problem is this: https://github.com/pion/dtls/blob/c6dbd482fa3a0a7655e9dc338a0f15be905cdf35/listener.go#L65? We do this inline, instead of spawning a goroutine there that'll then take care of the handshake?

daenney avatar Aug 19 '20 16:08 daenney

@Sean-Der From my point view, spawning goroutine's below to underlayer of dtls/listener, because we want to have similar API as tls.Listener. and it is more convenient for using. Of course there must be option to limit spawned goroutines.

jkralik avatar Aug 20 '20 08:08 jkralik

@daenney yep exactly!

@jkralik agree!


Server in crypto/tls doesn't actually do anything so we need to match this API. Instead the first call to Read or Write should start the handshake.

Sean-Der avatar Aug 21 '20 06:08 Sean-Der

It looks like we will need to do a v3 and add Handshake

I can't believe I missed this before :( this is a pretty big breaking change, but I don't think it will be difficult to do!

Sean-Der avatar Aug 21 '20 06:08 Sean-Der

For reference to anyone reading: In most applications this wouldn't be that much of an issue since the handshake over an internet connection goes relatively quickly, so context deadlines can be configured to be very tight (e.g. 100 ms). With embedded devices, the full handshake generally takes 5 seconds, which is much more blocking even if it doesn't hang and 500 devices are trying to make connection at the same time.

I'm not sure, when exactly the spawned goroutine is required. If this is already required for RFC 6347 - 4.2.1. Denial-of-Service Countermeasures, I think it violates the "In order to counter both of these attacks, DTLS borrows the stateless cookie technique used by Photuris [PHOTURIS] and IKE [IKEv2]." there. Then each (spoofed) CLIENT_HELLO may trigger/consume such a goroutine. Even if that number is limited, it will make it easy for spoofer to perform a DoS attack.

boaks avatar Aug 22 '20 14:08 boaks

I'm not sure, when exactly the spawned goroutine is required.

@boaks Each Accept call would be in one. You would do a pre-allocated pool, so I don't think the initial ClientHello will cause a problem!

The API difference is that crypto/tls does the handshake on the first call to Read or Write. pion/dtls does it on the call to Accept. All the blocking occurs in Read and Write which users usually move to a goroutine.


If users do a pool and call Accept multiple times they can handle multiple handshakes at once. That is the best option to unblock right away.


Going forward we have two choices to make the API behave exactly like crypto/tls

  • Add an option to do Handshake in the first call to Read/Write. This adds more complexity, but is not a breaking change!
  • Break the API and never do the Handshake when creating the connection. Only when doing the first Read/Write or when calling Handshake explicitly.

One thing that worries me is that this could encourage users to spawn a goroutine each time a ClientHello is received. The user will do the call to Read/Write/Handshake in a new goroutine.

Sean-Der avatar Aug 22 '20 18:08 Sean-Der

If we were to take option 1, realistically we'd want that option on/by default since the current behaviour is undesirable. But we can't flip that on by default since it changes our behaviour quite a bit and we don't know if folks are relying on it somehow. That would result in us having to do a major anyway to be able to flip the default for that option, at which point I'd rather we break the API and align on crypto/tls than take on the tech debt.

daenney avatar May 17 '21 09:05 daenney

Is there any progress on this? We are accepting DTLS connections of potentially many thousand clients (IoT devices) on low bandwidth connections with high timeouts - which is worst case for this scenario ....

niondir avatar Apr 19 '22 18:04 niondir

My experience with (too) many coap and/or dtls stacks is, it's either intended and developed to be used for IoT, or you may be faced much more pain, then you expect. That doesn't mean, other implementations are bad, it just means, they are not really supporting IoT.

boaks avatar Apr 20 '22 08:04 boaks

Hi @niondir

Instead of doing your Handshake or first Read/Write in a goroutine create your dtls/Conn in one. Even when we change the behavior to match crypto/TLS you still have to handle the concurrent handshakes.

Happy to create a JSFiddle showing how to do this. When this issue is closed though I don’t think this library will change dramatically, we are just moving when something blocks.

thanks!

Sean-Der avatar Apr 20 '22 12:04 Sean-Der

FWIW I'm doing like example below in our DTLS server for IoT devices to create 1000 handshake workers/listeners, also using go-coap. Has been working great for over a year now.

func (l *DTLSListener) acceptLoop() {
 	defer l.wg.Done()
 	for i := 0; i < 1000; i++ {
 		go func() {
 			for {
 				conn, err := l.listener.Accept()
 				select {
 				case l.connCh <- connData{conn: conn, err: err}:
 				case <-l.doneCh:
 					return
 				}
 			}
 		}()
 	}
 }

jdbruijn avatar Apr 21 '22 06:04 jdbruijn

This is a pretty big issue. From my experience, 1 bad handshaker and a wonky config can pretty much take down the entire server (default 30s handshake timeout, client may have a much shorter one like 5s. Client handshakes build up infinitely.)

I will submit a PR, but I propose:

type Handshaker interface {
  Handshake() (net.Conn, error)
}

type handshaker struct {
  inner net.Conn
  conf *Config
}

func(h handshaker) Handshake() (net.Conn, error) {
  return Server(h.inner, h.conf)
}

func(ln *listener) AcceptHandshaker() (Handshaker, error) {
  inner, err := ln.parent.Accept()
  return handshaker{inner, ln.config}, err
}

The Handshake method can then be called in a background goroutine.

jordan-bonecutter avatar Mar 18 '24 14:03 jordan-bonecutter