aws-flow-ruby icon indicating copy to clipboard operation
aws-flow-ruby copied to clipboard

rescue StandardError instead of Exception & support ruby 2.0

Open brightball opened this issue 11 years ago • 9 comments

I'm trying to use the Flow framework with Foreman, which just sends a SIGINT to the running process.

I'd like to be able to do something like this:

if __FILE__ == $0
  begin
    worker.start
  rescue SystemExit, Interrupt
    worker.stop # Or some appropriate message that will actually shut things down
    raise
  end
end

Right now, whenever I press CTRL+C with Foreman it's sending a SIGINT which results in this error:

─$ foreman start
11:58:56 decider.1 | started with pid 17132
^CSIGINT received
/usr/local/foreman/lib/foreman/engine.rb:226:in `synchronize': can't be called from trap context (ThreadError)
    from /usr/local/foreman/lib/foreman/engine.rb:226:in `output_with_mutex'
    from /usr/local/foreman/lib/foreman/engine.rb:232:in `system'
    from /usr/local/foreman/lib/foreman/engine.rb:304:in `terminate_gracefully'
    from /usr/local/foreman/lib/foreman/engine.rb:41:in `block in start'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `call'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `wait2'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `watch_for_termination'
    from /usr/local/foreman/lib/foreman/engine.rb:48:in `start'
    from /usr/local/foreman/lib/foreman/cli.rb:40:in `start'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/task.rb:27:in `run'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/invocation.rb:120:in `invoke_task'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor.rb:275:in `dispatch'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/base.rb:425:in `start'
    from /usr/bin/foreman:15:in `<main>'

brightball avatar Jan 16 '14 17:01 brightball

At first glance, looks like an issue with using trap context in a way that is deadlockable(see this). In theory flow workers are supposed to have a graceful shutdown mechanic, but I can see how it would be problematic in 2.0 if they have deadlockable detection on traps. Can you confirm that you are getting this error while running ruby >= 2.0 ?

mjsteger avatar Jan 17 '14 01:01 mjsteger

It is running 2.1.0 actually.

brightball avatar Jan 17 '14 03:01 brightball

Still working on this, sorry for the slow progress.

mjsteger avatar Jan 31 '14 21:01 mjsteger

FWIW, the flow framework does rescue Exception everywhere which blocks signals regardless of the deadlock issue you describe. You should consider replacing all of these with less aggressive rescues.

https://github.com/aws/aws-flow-ruby/blob/master/aws-flow/lib/aws/decider/task_poller.rb#L259 is a big culprit.

marcbowes avatar Feb 14 '14 06:02 marcbowes

This is something we flagged pretty early on, but we unfortunately can't change the behavior without doing a major version bump. We definitely are planning to replace all the rescue Exception with something less aggressive like rescue StandardError. This may take a bit of time, as we will want to package together reverse-incompatible changes. I'll get together a list of the changes we are planning on making that may be reverse-incompatible and add them to the repository.

mjsteger avatar Feb 17 '14 22:02 mjsteger

IMO, rescue Exception is a bug and not a backwards-incompatible change. There should not be code depending on this behavior.

justincampbell avatar Feb 25 '14 16:02 justincampbell

Any ETA on this being resolved?

kinsersh avatar Oct 16 '14 02:10 kinsersh

Changing the title to better reflect the current issue.

  1. The original issue was that ruby 2.0 doesn't allow any locking (rightly so) in trap contexts. While we don't explicitly acquire any locks, we do initiate shutdown of workers when we receive signals. This forces Flow to log some stuff. Logger internally tries to lock on a mutex and it fails.

  2. Workers do indeed shutdown when they receive signals just not in the most graceful way. i.e. they don't handle the long polling behavior of SWF well. Refer to #31 to get more context.

  3. We need to rescue StandardError instead of Exception.

@kinsersh can you tell me what issue you were referring to?

pmohan6 avatar Oct 30 '14 19:10 pmohan6

I was referring to number 3 in your list.

kinsersh avatar Oct 31 '14 16:10 kinsersh