statix icon indicating copy to clipboard operation
statix copied to clipboard

Functions raise if port dies

Open benwilson512 opened this issue 6 years ago • 10 comments

Hey folks,

Maybe I'm doing something wrong, but if my statsd agent restarts all function calls end up raising as:

> Sensetra.Statsd.increment("elixir.foo", 1)
** (ArgumentError) argument error
    :erlang.port_command(Sensetra.Statsd, [<<1, 31, 189, 172, 31, 22, 97>>, "elixir.foo", 58, "1", 124, "c"])
    (statix) lib/statix/conn.ex:30: Statix.Conn.transmit/2

This seems undesirable, monitoring functions shouldn't nuke the app in the event that the port goes down.

Ideally there'd be some mechanism for trying to reconnect as well, but at a minimum it would seem we probably want to catch this error.

benwilson512 avatar Jan 24 '18 16:01 benwilson512

@benwilson512 I was able to find a solution to the second problem of having the server reconnect. We use a GenServer to manage our Statix connection, which looks a bit like this:

defmodule Metrics do
  use Statix
  use GenServer

  def init(_opts) do
    Process.flag(:trap_exit, true)
    connect()
    {:ok, current_conn()}
  end

  def handle_info({:EXIT, port, reason}, %Statix.Conn{sock: __MODULE__} = state) do
    Logger.error("Port #{inspect(port)} exited with reason #{reason}")
    {:stop, :normal, state}
  end
end

Basically this Metrics module traps exits, so if the port gets closed it can stop and be restarted by its supervisor, which will cause a new connection to be opened.

This could lead to a restart loop, but using appropriate supervisor strategies you can define policies for the restart loop.

By default when you call connect(), your application is not handling exits unless they are something other than :normal, so by explicitly trapping all of them we can cause a reconnect.

--

For the other problem of increment crashing, you can override increment in the GenServer, and catch ArgumentErrors when they happen. Though it's not very elegant.

thecodeboss avatar Feb 13 '18 01:02 thecodeboss

@thecodeboss that solves the issue of reconnecting, but any metrics calls that happen between when it fails and when it restarts will raise. The inability to send a metric shouldn't take down a process.

benwilson512 avatar Mar 07 '18 19:03 benwilson512

Hey, do you have any reliable way to trigger port dying?

lexmag avatar Apr 16 '18 22:04 lexmag

@lexmag :erlang.port_close/1

scrogson avatar May 02 '18 20:05 scrogson

I think the issue brought up in the issue is related to the usage documentation. The problem with the way it is documented is that it tells users to call the connect/0 function directly in the application's start/2 function. As I understand it, this links the port to the application process.

When the port closes and you try to call any of the APIs that call :erlang.port_command/2 it will cause an ArgumentError which will kill the application process.

scrogson avatar May 02 '18 20:05 scrogson

@scrogson thanks, but I wondered more on how (and why) the port gets closed in non-manual way.

lexmag avatar May 02 '18 20:05 lexmag

We've seen the port die if it can't communicate with the agent in a variety of ways: agent restarts, netsplits, etc.

keathley avatar May 02 '18 20:05 keathley

@keathley 👍.

I think the ultimate fix for this would be to provide Statix.child_spec which starts a GenServer that will monitor the port and reopen and re-register it in case of failure.

lexmag avatar May 02 '18 20:05 lexmag

I do think that's a solution, but I also think that wrapping the port operations in a try is critical as well. Even in the best case reconnection isn't instant, and nothing should die in the application just because monitoring is down. In the worst case the re-connection fails (maybe the agent is down).

benwilson512 avatar May 02 '18 20:05 benwilson512

I also think that wrapping the port operations in a try is critical as well.

That's definitely what we should do of course. I think we should also log missed metric in catch/rescue.

lexmag avatar May 02 '18 20:05 lexmag