elixir-nodejs
elixir-nodejs copied to clipboard
Implementation leaks still there (2.0.0)
Oh no, I made it worse :scream:
It looks like my PR that was supposed to get rid of implementation leaks introduced some new ones. I should have tested my changes thoroughly with our production code...
It looks like my GenServer
that calls NodeJS.call(..)
is sometimes getting back the messages sent by Port:
{:ok, [message: {#Port<0.2131>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.1656.0>]}
Also, I'm still getting the timeout messages in my mailbox:
** (FunctionClauseError) no function clause matching in CallerGenServer.handle_info/2
(foo 1.1.0) lib/foo/callerGenServer.ex:91: CallerGenServer.handle_info({#Reference<0.221168823.2908225537.225316>, {:error, :timeout}}, nil)
(stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
(stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
(stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
I suspect some race condition with timeouts...
Also, we might need to add a wrapper GenServer that catches all these messages such that the caller is truly isolated from the worker process.
Unfortunately I wont have much time to investigate exactly when and how it happens, as I'm soon leaving for vacations.
Otherwise I would solve it myself and contribute the solution.
I feel obliged to fix it, but I won't have the time (earliest after ca. 4 weeks).
I'm terribly sorry about that.
@Zinggi could you create a test repo that can easily reproduce this issue?
Yeah, that's the least I can do. Hope I can make it reproducible...
@oohnoitz Alright, I managed to reproduce the issue :relieved: : https://github.com/Zinggi/elixir-nodejs-implementation-leaks
Sample log output:
before call
result
{:error, "Call timed out."}
after call
17:54:37.361 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
(elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}, nil)
(stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
(stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
(stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call
17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}
after call
17:54:40.612 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
(elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}, nil)
(stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
(stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
(stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call
17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}
...
Hope it helps.
I might still be able to reply while on vacation, but I won't be able to do any coding
Had some time to do a bit more digging. It's definitely a timeout issue, with a few other things. Timeous are definitely awkward with the way GenServer works and still a bit confusing for me.
Even though we have a timeout specified for GenServer.call
, it doesn't prevent the underlying process to continue running until there is an actual response. The port
/process
still remains running until there is a response, which means that we can be returning an error early, but we'll still send the response back at a much later time. This would explain why we get another message in the inbox with the resolved value from the GenServer.call
we should have ignored.
https://hexdocs.pm/elixir/GenServer.html#call/3-timeouts has some details about the caller doing this. Though, we don't really want to have developers handle the discarding of that message in their code. Would need to mess with some other timeout values and combinations to be sure.
I have this working branch: https://github.com/revelrylabs/elixir-nodejs/commit/20f3c47a4bfe38591183e9eb7c89d7aa0ea6b8ed which does seem to solve the issue? But would need some more testing to make sure this is a good solution.
Thanks for looking into it. I can't really test it out for some time, but that commit looks like it could work.
However, maybe we also need a timeout on the JS side, since without it our next calls to the same worker would be blocked without canceling the previous call. This could be done with Promise.race
and a setTimeout
Hey there Just wanted to let you know that I'm back from vacations and that I'll look into it again once I find time. Hopefully sometimes this week :crossed_fingers: