`erpc:call` behavior is different on local and remote nodes

Open juhlig opened this issue 7 months ago • 10 comments

Describe the bug

erpc:call usually runs the given function in a spawned process. However, if the given Node is the local node and the given Timeout is infinity (implicit in erpc:call/2 and /4), an optimization kicks in that runs the function directly via erlang:apply instead:

https://github.com/erlang/otp/blob/d05de4c400200fabf3b22edd8cd7f75a02d44602/lib/kernel/src/erpc.erl#L255-L260

https://github.com/erlang/otp/blob/d05de4c400200fabf3b22edd8cd7f75a02d44602/lib/kernel/src/erpc.erl#L1267-L1270

Using apply means that the given function is executed in the context of the process calling erpc:call, which can have a number of unintended consequences:

- The function could accidentally corrupt the state of the calling process or interfere with its workings, e.g. by modifying the process dictionary or private/protected ETS tables, closing files or sockets, stealing messages, changing the trap_exit flag, linking/monitoring (or unlinking/demonitoring) other processes, etc. None of this can happen if the function executes in a separate process.
- The function could leak resources by opening but not closing files or sockets, creating ETS tables, etc. If the function executes in a separate process, it can create quite a mess, all of which is neatly cleaned up when that process finishes. When executed in the calling process, the mess remains.
- When the calling process exits (e.g. via a link, an explicit exit/kill, etc.) while executing the function, the function is interrupted at whatever it is doing. When executed in a separate process, it runs through to the end, no matter what happens to the calling process.
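A quick way to see which process the function actually runs in is to compare self() inside and outside the call. The following is a minimal sketch; the module name erpc_ctx_check and the function runs_in_caller/1 are made up for illustration and are not part of OTP.

```erlang
%% Hypothetical helper module, for illustration only (not part of OTP).
-module(erpc_ctx_check).
-export([runs_in_caller/1]).

%% Returns true if erpc:call/3 executed the fun in the calling process,
%% and false if it executed it in a separate process.
runs_in_caller(Timeout) ->
    Caller = self(),
    Caller =:= erpc:call(node(), fun() -> self() end, Timeout).
```

With a finite timeout such as erpc_ctx_check:runs_in_caller(1000), this returns false, because the function runs in a spawned process. With infinity, it should return true on the affected versions if the optimization described above applies.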

To Reproduce

The first example creates a private ETS table. The distinction between executing the function in the calling process (1> and 2>) and in a separate one (3> and 4>) is forced by the timeout.

1> erpc:call(node(), fun() -> ets:new(foo, [named_table, private]) end, infinity).
foo
2> ets:tab2list(foo).
[]
3> erpc:call(node(), fun() -> ets:new(bar, [named_table, private]) end, 1000).
bar
4> ets:tab2list(bar).
** exception error: bad argument
     in function  ets:match_object/2
        called as ets:match_object(bar,'_')
        *** argument 1: the table identifier does not refer to an existing ETS table
     in call from ets:tab2list/1 (ets.erl, line 2266)

As can be seen at 2>, the ets table created in the call at 1> still exists, and is also accessible by the calling process. Conversely at 4>, the ets table created in the call at 3> is gone.

The second example shows the function being interrupted when the calling process dies. A spawned process sets a timer to kill itself after 1s, then uses erpc:call to execute a function that waits 2s before sending done back to the shell.

1> Self = self().
<0.81.0>
2> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, infinity) end).
<0.108.0>
3> receive done -> ok after 5000 -> error end.
error
4> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, 5000) end).
<0.104.0>
5> receive done -> done after 5000 -> error end.
done

As can be seen at 3>, the done message never arrives, because the calling process started at 2> is killed while the function is still sleeping. Conversely, at 5> we get the done message even though the calling process was killed.

Expected behavior

erpc:call should behave the same, no matter the node or timeout. Specifically, it should always execute the given function in a separate process.
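Until the behavior is uniform, a caller that needs isolation on the local node can spawn the process itself. The following is a rough sketch of that idea, not the fix adopted in OTP. The module isolated_call and its call/1 function are hypothetical names, and the error handling is deliberately simpler than erpc's: any failure in the function surfaces as an error exception.

```erlang
%% Hypothetical workaround sketch, not part of OTP and not erpc's own code.
-module(isolated_call).
-export([call/1]).

%% Runs Fun in a fresh, monitored (not linked) local process and returns
%% its result. Because the worker is not linked to the caller, it keeps
%% running to completion even if the caller dies in the meantime.
call(Fun) when is_function(Fun, 0) ->
    Tag = make_ref(),
    %% The worker hands its result back through its exit reason.
    {Pid, MRef} = spawn_monitor(fun() -> exit({Tag, Fun()}) end),
    receive
        {'DOWN', MRef, process, Pid, {Tag, Result}} ->
            Result;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Assumption: failures are reported as one generic error,
            %% unlike erpc, which distinguishes throw, exit and error.
            error({isolated_call_failed, Reason})
    end.
```

For example, isolated_call:call(fun() -> put(k, v) end) leaves the caller's process dictionary untouched, since put/2 ran in the worker process.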

Affected versions

OTP 27, but probably going back all the way to OTP 23, when erpc was introduced.

Additional context

The optimization seems pointless to me. When the function is executed on the local node, it is already fast, so there is no need to shave off a few microseconds. Also, the optimization only applies when the given Timeout is infinity, which indicates that the caller is prepared to wait for however long it takes and is not in a hurry.

Jul 03 '24 15:07 juhlig
