statix
statix copied to clipboard
Functions raise if port dies
Hey folks,
Maybe I'm doing something wrong, but if my statsd agent restarts all function calls end up raising as:
> Sensetra.Statsd.increment("elixir.foo", 1)
** (ArgumentError) argument error
:erlang.port_command(Sensetra.Statsd, [<<1, 31, 189, 172, 31, 22, 97>>, "elixir.foo", 58, "1", 124, "c"])
(statix) lib/statix/conn.ex:30: Statix.Conn.transmit/2
This seems undesirable, monitoring functions shouldn't nuke the app in the event that the port goes down.
Ideally there'd be some mechanism for trying to reconnect as well, but at a minimum it would seem we probably want to catch this error.
@benwilson512 I was able to find a solution to the second problem of having the server reconnect. We use a GenServer to manage our Statix connection, which looks a bit like this:
defmodule Metrics do
use Statix
use GenServer
def init(_opts) do
Process.flag(:trap_exit, true)
connect()
{:ok, current_conn()}
end
def handle_info({:EXIT, port, reason}, %Statix.Conn{sock: __MODULE__} = state) do
Logger.error("Port #{inspect(port)} exited with reason #{reason}")
{:stop, :normal, state}
end
end
Basically this Metrics module traps exits, so if the port gets closed it can stop and be restarted by its supervisor, which will cause a new connection to be opened.
This could lead to a restart loop, but using appropriate supervisor strategies you can define policies for the restart loop.
By default when you call connect(), your application is not handling exits unless they are something other than :normal, so by explicitly trapping all of them we can cause a reconnect.
--
For the other problem of increment crashing, you can override increment in the GenServer, and catch ArgumentErrors when they happen. Though it's not very elegant.
@thecodeboss that solves the issue of reconnecting, but any metrics calls that happen between when it fails and when it restarts will raise. The inability to send a metric shouldn't take down a process.
Hey, do you have any reliable way to trigger port dying?
@lexmag :erlang.port_close/1
I think the issue brought up in the issue is related to the usage documentation. The problem with the way it is documented is that it tells users to call the connect/0 function directly in the application's start/2 function. As I understand it, this links the port to the application process.
When the port closes and you try to call any of the APIs that call :erlang.port_command/2 it will cause an ArgumentError which will kill the application process.
@scrogson thanks, but I wondered more on how (and why) the port gets closed in non-manual way.
We've seen the port die if it can't communicate with the agent in a variety of ways: agent restarts, netsplits, etc.
@keathley 👍.
I think the ultimate fix for this would be to provide Statix.child_spec which starts a GenServer that will monitor the port and reopen and re-register it in case of failure.
I do think that's a solution, but I also think that wrapping the port operations in a try is critical as well. Even in the best case reconnection isn't instant, and nothing should die in the application just because monitoring is down. In the worst case the re-connection fails (maybe the agent is down).
I also think that wrapping the port operations in a
tryis critical as well.
That's definitely what we should do of course.
I think we should also log missed metric in catch/rescue.