elixir-nodejs icon indicating copy to clipboard operation
elixir-nodejs copied to clipboard

Implementation leaks still there (2.0.0)

Open Zinggi opened this issue 3 years ago • 6 comments

Oh no, I made it worse :scream:

It looks like my PR that was supposed to get rid of implementation leaks introduced some new ones. I should have tested my changes thoroughly with our production code...

It looks like my GenServer that calls NodeJS.call(..) is sometimes getting back the messages sent by Port:

{:ok, [message: {#Port<0.2131>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.1656.0>]}

Also, I'm still getting the timeout messages in my mailbox:

** (FunctionClauseError) no function clause matching in CallerGenServer.handle_info/2
    (foo 1.1.0) lib/foo/callerGenServer.ex:91: CallerGenServer.handle_info({#Reference<0.221168823.2908225537.225316>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

I suspect some race condition with timeouts...

Also, we might need to add a wrapper GenServer that catches all these messages such that the caller is truly isolated from the worker process.


Unfortunately I wont have much time to investigate exactly when and how it happens, as I'm soon leaving for vacations.
Otherwise I would solve it myself and contribute the solution. I feel obliged to fix it, but I won't have the time (earliest after ca. 4 weeks).

I'm terribly sorry about that.

Zinggi avatar Aug 21 '20 15:08 Zinggi

@Zinggi could you create a test repo that can easily reproduce this issue?

oohnoitz avatar Aug 21 '20 15:08 oohnoitz

Yeah, that's the least I can do. Hope I can make it reproducible...

Zinggi avatar Aug 21 '20 15:08 Zinggi

@oohnoitz Alright, I managed to reproduce the issue :relieved: : https://github.com/Zinggi/elixir-nodejs-implementation-leaks

Sample log output:

before call
result
{:error, "Call timed out."}
after call

17:54:37.361 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}
after call

17:54:40.612 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}


...

Hope it helps.

I might still be able to reply while on vacation, but I won't be able to do any coding

Zinggi avatar Aug 21 '20 15:08 Zinggi

Had some time to do a bit more digging. It's definitely a timeout issue, with a few other things. Timeous are definitely awkward with the way GenServer works and still a bit confusing for me.

Even though we have a timeout specified for GenServer.call, it doesn't prevent the underlying process to continue running until there is an actual response. The port/process still remains running until there is a response, which means that we can be returning an error early, but we'll still send the response back at a much later time. This would explain why we get another message in the inbox with the resolved value from the GenServer.call we should have ignored.

https://hexdocs.pm/elixir/GenServer.html#call/3-timeouts has some details about the caller doing this. Though, we don't really want to have developers handle the discarding of that message in their code. Would need to mess with some other timeout values and combinations to be sure.

I have this working branch: https://github.com/revelrylabs/elixir-nodejs/commit/20f3c47a4bfe38591183e9eb7c89d7aa0ea6b8ed which does seem to solve the issue? But would need some more testing to make sure this is a good solution.

oohnoitz avatar Aug 25 '20 01:08 oohnoitz

Thanks for looking into it. I can't really test it out for some time, but that commit looks like it could work.

However, maybe we also need a timeout on the JS side, since without it our next calls to the same worker would be blocked without canceling the previous call. This could be done with Promise.race and a setTimeout

Zinggi avatar Aug 26 '20 15:08 Zinggi

Hey there Just wanted to let you know that I'm back from vacations and that I'll look into it again once I find time. Hopefully sometimes this week :crossed_fingers:

Zinggi avatar Sep 21 '20 11:09 Zinggi