otp icon indicating copy to clipboard operation
otp copied to clipboard

Loading a module that contains on_load may hang

Open jcpetruzza opened this issue 1 year ago • 5 comments

Describe the bug If module foo contains an on_load annotation, then a call to code:load_binary/3 (and probably code:load_abs/1,2) may hang if there are concurrent calls to these functions, or even code:ensure_loaded/1) on the same module.

To Reproduce Given these two modules:

%% file: foo.erl
-module(foo).

-on_load(init_module/0).

-export([bar/0]).

bar() -> ok.

init_module() ->
    io:format("Initializing ~p....~n", [?MODULE]),
    % wait some time, to make the race condition likely to happen
    timer:sleep(2_000),
    io:format(" done initializing ~p!~n", [?MODULE]).
%% file: repro.erl
-module(repro).

-export([go/0, do_load_binary/1]).

do_load_binary(Parent) ->
  {ok, Bin} = file:read_file("foo.beam"),
  code:load_binary(foo, "foo.beam", Bin),
  Parent ! {done, self()}.

go() ->
  Self = self(),
  X = spawn(fun() -> do_load_binary(Self) end),
  Y = spawn(fun() -> do_load_binary(Self) end),
  A = receive
    {done, X} -> ok
    after 10_000 -> not_ok
  end,
  B = receive
    {done, Y} -> ok
    after 10_000 -> not_ok
  end,
  {A, B}.

Run :

$ erlc foo.erl repro.erl && erl -noshell -eval 'Res = repro:go(), io:format("~p~n", [Res]), init:stop()'

Expected behavior Both concurrent calls to code:load_binary/3 in repro:go/0 should eventually return, so we expect go() to return {ok, ok}. Instead, we see:

Initializing foo....
 done initializing foo!
Initializing foo....
 done initializing foo!
{ok,not_ok}

Without the after clause, the second receive will wait forever.

Affected versions Both R25 and R26 show this problem, I didn't try it on older versions.

Additional context As far as I can see, the problem is in code_server:try_finish_module, since the action that is passed to handle_pending_on_load will always return a {noreply, ...} when the module to load contains an on_load; however, if the action ends up waiting since there is an on_load in progress, this means that no reply will ever be sent to the caller.

We found the race condition investigating tests that would sometimes block. In our case, the race was between code:ensure_loaded/1 and code:load_binary/3 and the issue happens, again, when the action for load_binary waits on the on_load from ensure_loaded to finish, but this combination is harder to reproduce consistently.

jcpetruzza avatar Jul 03 '23 20:07 jcpetruzza

Thanks for your report! Just as a heads-up, we're a bit short-handed over the summer vacation period so we likely won't get around to fixing this until most of us are back from vacation.

jhogberg avatar Jul 04 '23 09:07 jhogberg

I don't think on_load/1 is intended to permit arbitrary Erlang code. Some years ago I mentioned using on_load/1 to implement linking support for a statically strongly typed language compiled to the BEAM, and @bjorng strongly advised against that, on the grounds that on_load/1 was only intended for loading NIFs. Maybe that has changed, but I wouldn't bet on it. (But if there are limitations they should be documented.)

mikpe avatar Jul 05 '23 17:07 mikpe

on_load/1 complicates code loading in general (it still needs to appear atomic) so it's a fairly bad idea to do anything complicated in it, but it nevertheless shouldn't fail silently in this manner. Even if all you do is call load_nif it's still possible for this race to occur so we'll need to fix it before long.

jhogberg avatar Jul 06 '23 08:07 jhogberg

Here is a patch that fixes this issue:

diff --git a/lib/kernel/src/code_server.erl b/lib/kernel/src/code_server.erl
index 7689acf124..4a5dbfc3c2 100644
--- a/lib/kernel/src/code_server.erl
+++ b/lib/kernel/src/code_server.erl
@@ -1387,16 +1387,19 @@ handle_pending_on_load_1(Mod, From, [H|T]) ->
 handle_pending_on_load_1(_, _, []) -> [].
 
 finish_on_load(PidRef, OnLoadRes, #state{on_load=OnLoad0}=St0) ->
-    case lists:keyfind(PidRef, 1, OnLoad0) of
-	false ->
-	    %% Since this process in general silently ignores messages
-	    %% it doesn't understand, it should also ignore a 'DOWN'
-	    %% message with an unknown reference.
-	    St0;
-	{PidRef,Mod,Waiting} ->
-	    St = finish_on_load_1(Mod, OnLoadRes, Waiting, St0),
-	    OnLoad = [E || {R,_,_}=E <- OnLoad0, R =/= PidRef],
-	    St#state{on_load=OnLoad}
+    case lists:keytake(PidRef, 1, OnLoad0) of
+        false ->
+            %% Since this process in general silently ignores messages
+            %% it doesn't understand, it should also ignore a 'DOWN'
+            %% message with an unknown reference.
+            St0;
+        {value, {PidRef,Mod,Waiting}, OnLoad} ->
+            %% The first callback we register (the last on the list),
+            %% must be the first to run, as it is the one that effectively
+            %% loads the module. We also must reset the OnLoad state before
+            %% running the actions, otherwise we may add new entries to
+            %% Waiting (which won't ever be picked up).
+            finish_on_load_1(Mod, OnLoadRes, lists:reverse(Waiting), St0#state{on_load=OnLoad})
     end.
 
 finish_on_load_1(Mod, OnLoadRes, Waiting, St) ->

The root cause was that we were not clearing up the OnLoad state for that module before triggering the actions, which would cause new entries to be added to Waiting (and those would never be picked up).

I was not able to write a test though due to limited CT knowledge, so I only have the fix. :) Anyone is welcome to use the patch above.

josevalim avatar Feb 13 '24 10:02 josevalim

I will try to submit a patch for this one in the upcoming week. EDIT: I will do it once OTP 27 is out.

josevalim avatar Feb 18 '24 17:02 josevalim