Concuerror icon indicating copy to clipboard operation
Concuerror copied to clipboard

Introduce an option similar to --after-timeout for timers

Open dszoboszlay opened this issue 5 years ago • 6 comments

Is your feature request related to a problem? Please describe. Concuerror already supports treating timeouts in receive/after statements as infinity above some threshold with the --after-timeout option. However, this option does not cover the case of timers started with erlang:start_timer. My immediate problem is with gen_statem, that unlike gen_server uses timers which I cannot prevent from firing in my Concuerror tests.

Describe the solution you'd like I'd like either the --after-timeout option to apply to timers started with erlang:start_timer (that is, timers above the specified threshold shall never fire while running my tests), or I'd like a new option, such as --timer-timeout to be introduced specifically for timers. Whichever is more convenient to implement.

Describe alternatives you've considered My only alternative at the moment is to change the code I'm testing to not use finite timeouts with gen_statem when used from a Concuerror test case. However, this would make the code I test and the code I use in production differ for no good reason.

Additional context N/A

dszoboszlay avatar Dec 05 '18 10:12 dszoboszlay

The --after-timeout applying to timers as well makes perfect sense. Thanks for pointing this out! Will be fixed (soon)!

aronisstav avatar Dec 06 '18 15:12 aronisstav

This could be sufficiently solved by adding a suitable call to

concuerror_inspect:inspect(receive, ..., ignored)

above here https://github.com/parapluu/Concuerror/blob/master/src/concuerror_callback.erl#L899.

aronisstav avatar Dec 06 '18 15:12 aronisstav

I tried to implement this feature according to your advise in b76b9ab, but it doesn't work.

I have a test case that tries all three constructs that should be governed by --after-timeout: receive, send_after and start_timer. But only receive works as expected. For the other two cases Concuerror finds a race and then crashes like this:

################################################################################
Interleaving #1
--------------------------------------------------------------------------------
Errors found:
* Blocked at a 'receive' ("deadlocked"; other processes have exited):
    <P> in after_timeout.erl line 34
     Mailbox contents: []
--------------------------------------------------------------------------------
Event trace:
   1: <P>: #Ref<0.0.1.291> = erlang:start_timer(1000, <P>, foo)
    in after_timeout.erl line 32
   2: <P>: 1 = erlang:cancel_timer(#Ref<0.0.1.291>)
    in after_timeout.erl line 33
   3: <Timer #Ref<0.0.1.291>>: is removed
--------------------------------------------------------------------------------
New races found:
*    2: <P>: 1 = erlang:cancel_timer(#Ref<0.0.1.291>)
     3: <Timer #Ref<0.0.1.291>>: is removed

################################################################################
Interleaving #2
--------------------------------------------------------------------------------
Errors found:
* Concuerror crashed
--------------------------------------------------------------------------------
Event trace:
   1: <P>: #Ref<0.0.1.291> = erlang:start_timer(1000, <P>, foo)
    in after_timeout.erl line 32
################################################################################
Exploration completed!
################################################################################
Errors:
--------------------------------------------------------------------------------
* On step 2, replaying a built-in returned a different result than expected:
  original:
    <0.84.0>: is removed
  new:
    blocked
 Please notify the developers, as this is a bug of Concuerror.

(I tweaked the test case a bit, so that <P> blocks and therefore I get an event trace for the first interleaving too.)

I would appreciate some help on which part of the codebase to poke around with.

dszoboszlay avatar Jan 10 '19 10:01 dszoboszlay

Ah, very sorry for the bad lead.

Another acceptable way to do this is to add an extra case for timer-related primitives, that checks the after_timeout setting and just returns a reference (via the relevant callback), but also emitting a warning similar to the one for 'ignoring after timeouts' for receives.

This case should be placed before this: https://github.com/parapluu/Concuerror/blob/master/src/concuerror_callback.erl#L855

aronisstav avatar Jan 10 '19 14:01 aronisstav

But that would mean cancel_timer wouldn't find a running timer with that reference and return false instead of an integer here: https://github.com/parapluu/Concuerror/blob/master/src/concuerror_callback.erl#L822-L823

I think my approach should work too, but I miss some rules about what events race in a timer-specific scenario. The test case with the receive after construct does exactly the same thing internally: it spawns a new process that gets blocked in a receive statement. Only that this time the new process is a special timer process, and therefore Concuerror doesn't recognise the causality between stop_timer and the process terminating.

dszoboszlay avatar Jan 10 '19 14:01 dszoboszlay

Eek. Didn't think about that case.

Indeed the assumption has been that one can always reverse cancel_timer bifs and fake timer processes dying. With a quick look I think the relevant line is:

https://github.com/parapluu/Concuerror/blob/master/src/concuerror_dependencies.erl#L332

and the "time order" is irrelevant due to:

https://github.com/parapluu/Concuerror/blob/master/src/concuerror_dependencies.erl#L75

Unfortunately I can't think of an easy solution.

Faking the rest of the timer API so that e.g. 'cancel_timer' would be an exit signal, and 'read_timer' a 'is_process_alive' requires an extra abstraction layer above built-in ops. This would get you correct dependencies "for free".

Another, 'ugly traditional' way to fix such things in the past has been to add more info to events in the extra field and using it for dependency detection. So you could add the after_timeout to some/all exit events and then fix the offending lines above to only have a dependency with the ones that can really be reversed. But the code in _dependencies has reached the point where this is really painful.

aronisstav avatar Jan 10 '19 14:01 aronisstav