otp
otp copied to clipboard
Loading a module that contains on_load may hang
Describe the bug
If module foo
contains an on_load
annotation, then a call to code:load_binary/3
(and probably code:load_abs/1,2
) may hang if there are concurrent calls to these functions, or even code:ensure_loaded/1
) on the same module.
To Reproduce Given these two modules:
%% file: foo.erl
-module(foo).
-on_load(init_module/0).
-export([bar/0]).
bar() -> ok.
init_module() ->
io:format("Initializing ~p....~n", [?MODULE]),
% wait some time, to make the race condition likely to happen
timer:sleep(2_000),
io:format(" done initializing ~p!~n", [?MODULE]).
%% file: repro.erl
-module(repro).
-export([go/0, do_load_binary/1]).
do_load_binary(Parent) ->
{ok, Bin} = file:read_file("foo.beam"),
code:load_binary(foo, "foo.beam", Bin),
Parent ! {done, self()}.
go() ->
Self = self(),
X = spawn(fun() -> do_load_binary(Self) end),
Y = spawn(fun() -> do_load_binary(Self) end),
A = receive
{done, X} -> ok
after 10_000 -> not_ok
end,
B = receive
{done, Y} -> ok
after 10_000 -> not_ok
end,
{A, B}.
Run :
$ erlc foo.erl repro.erl && erl -noshell -eval 'Res = repro:go(), io:format("~p~n", [Res]), init:stop()'
Expected behavior
Both concurrent calls to code:load_binary/3
in repro:go/0
should eventually return, so we expect go()
to return {ok, ok}
. Instead, we see:
Initializing foo....
done initializing foo!
Initializing foo....
done initializing foo!
{ok,not_ok}
Without the after
clause, the second receive will wait forever.
Affected versions Both R25 and R26 show this problem, I didn't try it on older versions.
Additional context
As far as I can see, the problem is in code_server:try_finish_module
, since the action that is passed to handle_pending_on_load
will always return a {noreply, ...}
when the module to load contains an on_load
; however, if the action ends up waiting since there is an on_load in progress, this means that no reply will ever be sent to the caller.
We found the race condition investigating tests that would sometimes block. In our case, the race was between code:ensure_loaded/1
and code:load_binary/3
and the issue happens, again, when the action for load_binary
waits on the on_load from ensure_loaded
to finish, but this combination is harder to reproduce consistently.
Thanks for your report! Just as a heads-up, we're a bit short-handed over the summer vacation period so we likely won't get around to fixing this until most of us are back from vacation.
I don't think on_load/1
is intended to permit arbitrary Erlang code. Some years ago I mentioned using on_load/1
to implement linking support for a statically strongly typed language compiled to the BEAM, and @bjorng strongly advised against that, on the grounds that on_load/1
was only intended for loading NIFs. Maybe that has changed, but I wouldn't bet on it. (But if there are limitations they should be documented.)
on_load/1
complicates code loading in general (it still needs to appear atomic) so it's a fairly bad idea to do anything complicated in it, but it nevertheless shouldn't fail silently in this manner. Even if all you do is call load_nif
it's still possible for this race to occur so we'll need to fix it before long.
Here is a patch that fixes this issue:
diff --git a/lib/kernel/src/code_server.erl b/lib/kernel/src/code_server.erl
index 7689acf124..4a5dbfc3c2 100644
--- a/lib/kernel/src/code_server.erl
+++ b/lib/kernel/src/code_server.erl
@@ -1387,16 +1387,19 @@ handle_pending_on_load_1(Mod, From, [H|T]) ->
handle_pending_on_load_1(_, _, []) -> [].
finish_on_load(PidRef, OnLoadRes, #state{on_load=OnLoad0}=St0) ->
- case lists:keyfind(PidRef, 1, OnLoad0) of
- false ->
- %% Since this process in general silently ignores messages
- %% it doesn't understand, it should also ignore a 'DOWN'
- %% message with an unknown reference.
- St0;
- {PidRef,Mod,Waiting} ->
- St = finish_on_load_1(Mod, OnLoadRes, Waiting, St0),
- OnLoad = [E || {R,_,_}=E <- OnLoad0, R =/= PidRef],
- St#state{on_load=OnLoad}
+ case lists:keytake(PidRef, 1, OnLoad0) of
+ false ->
+ %% Since this process in general silently ignores messages
+ %% it doesn't understand, it should also ignore a 'DOWN'
+ %% message with an unknown reference.
+ St0;
+ {value, {PidRef,Mod,Waiting}, OnLoad} ->
+ %% The first callback we register (the last on the list),
+ %% must be the first to run, as it is the one that effectively
+ %% loads the module. We also must reset the OnLoad state before
+ %% running the actions, otherwise we may add new entries to
+ %% Waiting (which won't ever be picked up).
+ finish_on_load_1(Mod, OnLoadRes, lists:reverse(Waiting), St0#state{on_load=OnLoad})
end.
finish_on_load_1(Mod, OnLoadRes, Waiting, St) ->
The root cause was that we were not clearing up the OnLoad state for that module before triggering the actions, which would cause new entries to be added to Waiting (and those would never be picked up).
I was not able to write a test though due to limited CT knowledge, so I only have the fix. :) Anyone is welcome to use the patch above.
I will try to submit a patch for this one in the upcoming week. EDIT: I will do it once OTP 27 is out.