tuncer:send timeout

Open meox opened this issue 2 years ago • 9 comments

I don't understand why I got sporadic {error, timeout}, is this a known issue?

meox avatar Aug 31 '23 15:08 meox

No, I haven't seen any reports of timeouts. It sounds as if a gen_server:call may be timing out.

Checking the code, sends should not time out. For the send/2 code path:

  • send/2 has a timeout set to infinity
  • depending on the tuntap device active mode, either:
    • {active, false}: the data is written directly to the non-blocking file descriptor using write(2)
    • {active, true | integer()}: the data is written to a port holding the fd using erlang:port_command/2
  • according to the docs, the only error returned by port_command/2 is badarg

Is it possible another gen_server:call is timing out?
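
For context, one common way an {error, timeout} surfaces is application code wrapping a gen_server:call with the default 5-second timeout and converting the exit into an error tuple. A minimal sketch of that pattern (safe_send and the message shape are hypothetical, not tuncer code):

safe_send(Pid, Data) ->
    %% gen_server:call/2 uses a 5000 ms default timeout; when it expires the
    %% caller exits with {timeout, ...}, which this wrapper maps to {error, timeout}
    try
        gen_server:call(Pid, {send, Data})
    catch
        exit:{timeout, _} ->
            {error, timeout}
    end.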

msantos avatar Aug 31 '23 18:08 msantos

The scheduler can suspend the caller to apply backpressure if active mode is used and a lot of data is being written to the port. This could cause a call upstream to time out.

The busy check can be bypassed using the force option to port_command/3. Apparently Erlang distribution drivers set this flag, so it makes sense for tunctl to do so too.

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..7dbe7b0 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -744,7 +744,7 @@ handle_call({send, Data}, _From, #state{port = false, fd = FD} = State) ->
     {reply, Reply, State};
 handle_call({send, Data}, _From, #state{port = Port} = State) ->
     Reply =
-        try erlang:port_command(Port, Data) of
+        try erlang:port_command(Port, Data, [force]) of
             true ->
                 ok
         catch
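
Note that erlang:port_command/3 fails with notsup when the force option is passed but the port's driver does not allow forcing data through a busy port. A hedged sketch of how a send helper could fall back in that case (send_force is illustrative only, not a committed change):

send_force(Port, Data) ->
    try erlang:port_command(Port, Data, [force]) of
        true ->
            ok
    catch
        error:notsup ->
            %% this driver has no soft-busy support: fall back to the
            %% default (possibly suspending) busy-port behaviour
            true = erlang:port_command(Port, Data),
            ok
    end.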

msantos avatar Aug 31 '23 18:08 msantos

Yes, I'm using active mode!

I tried but I'm receiving notsup:

[Screenshot 2023-08-31 at 23 14 31]

meox avatar Aug 31 '23 20:08 meox

Thanks for testing. Sorry, there must be a problem with the tuncer tests: the tests passed for me with the change; I'll look into it.

Another option is bypassing both the gen_server and the port for writes: continue using active mode for reads but switch to writing directly to the device fd:

FD = tuncer:getfd(Ref), % calls into the gen_server
tuncer:write(FD, Data).

msantos avatar Aug 31 '23 23:08 msantos

Just to be more precise: I'm using it from a Common Test suite.

meox avatar Sep 01 '23 07:09 meox

Busy Port Limits

open_port/2 has an option to disable the busy limits. The docs list the fd driver as supported.

(Diff updated to include busy_limits_msgq)

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..b44e13a 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -845,7 +845,7 @@ code_change(_OldVsn, State, _Extra) ->
 %%% Internal functions
 %%--------------------------------------------------------------------
 set_mode(active, FD) ->
-    open_port({fd, FD, FD}, [stream, binary]);
+    open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]);
 set_mode(passive, Port) ->
     try erlang:port_close(Port) of
         true ->

Tests and port_command([force])

About the tuncer tests and notsup: the tests only use {active, false}. I'll add some tests covering active mode. Note: even with active mode enabled, I wasn't able to reproduce the notsup error, but using {busy_limits_port, disabled} looks like a better fix.
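
For reference, a rough sketch of what an active-mode test could look like (the device name, payload, and the {tuntap, Ref, Frame} message shape are assumptions here, not taken from the existing suite; the device also needs to be up and carrying traffic for the receive to fire):

active_mode_roundtrip() ->
    {ok, Ref} = tuncer:create(<<"tuncer_ct">>, [tap, no_pi, {active, true}]),
    ok = tuncer:send(Ref, <<0:64/unit:8>>),
    receive
        %% active mode delivers frames as messages to the controlling process
        {tuntap, Ref, Frame} ->
            ct:pal("received ~p byte frame", [byte_size(Frame)])
    after 5000 ->
        ct:fail(no_frame_received)
    end,
    tuncer:destroy(Ref).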

msantos avatar Sep 01 '23 10:09 msantos

It's not clear to me whether there are cases where it's not safe to disable those limits.

meox avatar Sep 01 '23 14:09 meox

Good point. My theory is that tuntap devices have the same issues as writing a distribution driver: there can be a sort of priority inversion where the application is waiting because the scheduler is applying backpressure to the process performing the I/O. As with distribution drivers, the busy limits should be disabled, either using port_command([force]) or open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]).

My guess is that the {error, timeout} you see is caused by, e.g., a gen_server:call/2 somewhere before the call to send. The default timeout is 5 seconds, which is significant latency for a network. Disabling the busy limits might reduce the latency, but note that other processes in the chain, like the tuncer gen_server, may also be suspended.

Without code to reproduce or a stack trace, that's just speculation. So I'll plan to do the following:

  • add tests covering active mode
  • write a stress test (see the sketch after this list) to see if
    • the {error, timeout} is reproducible
    • disabling the busy limits has any effect
  • if disabling busy limits is useful, decide whether it should be enabled or disabled by default, either via a new option or by allowing open_port options to be passed directly to create/2
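
A rough sketch of what such a stress test might look like (the device name, frame size, and iteration count are arbitrary; creating the device needs the right privileges, and the interface likely has to be brought up first, e.g. with tuncer:up/2):

stress() ->
    {ok, Ref} = tuncer:create(<<"tuncer_stress">>, [tap, no_pi, {active, true}]),
    Frame = <<0:1500/unit:8>>,
    %% time each send in microseconds and look for calls approaching the
    %% default 5 second gen_server:call timeout
    Timings = [element(1, timer:tc(tuncer, send, [Ref, Frame])) || _ <- lists:seq(1, 100000)],
    Slow = [T || T <- Timings, T > 1000000],
    io:format("max send: ~p us, sends over 1s: ~p~n", [lists:max(Timings), length(Slow)]),
    tuncer:destroy(Ref).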

msantos avatar Sep 02 '23 14:09 msantos

good! thanks a lot for your support

meox avatar Sep 02 '23 14:09 meox