statix
statix copied to clipboard
Functions raise if port dies
Hey folks,
Maybe I'm doing something wrong, but if my statsd agent restarts all function calls end up raising as:
> Sensetra.Statsd.increment("elixir.foo", 1)
** (ArgumentError) argument error
:erlang.port_command(Sensetra.Statsd, [<<1, 31, 189, 172, 31, 22, 97>>, "elixir.foo", 58, "1", 124, "c"])
(statix) lib/statix/conn.ex:30: Statix.Conn.transmit/2
This seems undesirable, monitoring functions shouldn't nuke the app in the event that the port goes down.
Ideally there'd be some mechanism for trying to reconnect as well, but at a minimum it would seem we probably want to catch this error.
@benwilson512 I was able to find a solution to the second problem of having the server reconnect. We use a GenServer
to manage our Statix
connection, which looks a bit like this:
defmodule Metrics do
use Statix
use GenServer
def init(_opts) do
Process.flag(:trap_exit, true)
connect()
{:ok, current_conn()}
end
def handle_info({:EXIT, port, reason}, %Statix.Conn{sock: __MODULE__} = state) do
Logger.error("Port #{inspect(port)} exited with reason #{reason}")
{:stop, :normal, state}
end
end
Basically this Metrics
module traps exits, so if the port gets closed it can stop and be restarted by its supervisor, which will cause a new connection to be opened.
This could lead to a restart loop, but using appropriate supervisor strategies you can define policies for the restart loop.
By default when you call connect()
, your application is not handling exits unless they are something other than :normal
, so by explicitly trapping all of them we can cause a reconnect.
--
For the other problem of increment
crashing, you can override increment
in the GenServer
, and catch ArgumentError
s when they happen. Though it's not very elegant.
@thecodeboss that solves the issue of reconnecting, but any metrics calls that happen between when it fails and when it restarts will raise. The inability to send a metric shouldn't take down a process.
Hey, do you have any reliable way to trigger port dying?
@lexmag :erlang.port_close/1
I think the issue brought up in the issue is related to the usage documentation. The problem with the way it is documented is that it tells users to call the connect/0
function directly in the application's start/2
function. As I understand it, this links the port to the application process.
When the port closes and you try to call any of the APIs that call :erlang.port_command/2
it will cause an ArgumentError
which will kill the application process.
@scrogson thanks, but I wondered more on how (and why) the port gets closed in non-manual way.
We've seen the port die if it can't communicate with the agent in a variety of ways: agent restarts, netsplits, etc.
@keathley 👍.
I think the ultimate fix for this would be to provide Statix.child_spec
which starts a GenServer
that will monitor the port and reopen and re-register it in case of failure.
I do think that's a solution, but I also think that wrapping the port operations in a try
is critical as well. Even in the best case reconnection isn't instant, and nothing should die in the application just because monitoring is down. In the worst case the re-connection fails (maybe the agent is down).
I also think that wrapping the port operations in a
try
is critical as well.
That's definitely what we should do of course.
I think we should also log missed metric in catch
/rescue
.