
Future drilling/wrapping with exponential retry

Open nabeelamjad opened this issue 10 years ago • 17 comments

Hi,

We're having a bit of an issue trying to get a future's result with exponential_retry set; this seems to be the same issue as described in https://github.com/aws/aws-flow-ruby/issues/11 and https://github.com/aws/aws-flow-ruby/issues/48.

We're just trying a very simple asynchronous call that returns "done" when called with no input, but future.get does not return the result "done"; instead it errors out because get does not appear to be a method on the object that comes back. We've set the activity to retry up to 3 times using exponential_retry.
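
Roughly, the setup looks like the sketch below. This is a hypothetical reconstruction for illustration only: class and method names (SimpleActivities, return_done, SimpleWorkflow, the task list names and timeouts) are made up, and the option hashes follow the aws-flow samples rather than our real code.

    require 'aws/decider'

    # Hypothetical reconstruction: a single activity with no input that
    # returns "done", retried up to 3 times via exponential_retry.
    class SimpleActivities
      extend AWS::Flow::Activities

      activity :return_done do
        {
          version: "1.0",
          default_task_list: "simple_activity_task_list",
          default_task_schedule_to_start_timeout: 60,
          default_task_start_to_close_timeout: 60,
          exponential_retry: { maximum_attempts: 3 }
        }
      end

      def return_done
        "done"
      end
    end

    class SimpleWorkflow
      extend AWS::Flow::Workflows

      workflow :run do
        {
          version: "1.0",
          default_task_list: "simple_workflow_task_list",
          default_execution_start_to_close_timeout: 300
        }
      end

      activity_client(:client) { { from_class: "SimpleActivities" } }

      def run
        future = client.send_async(:return_done)
        # With exponential_retry enabled, future.get comes back wrapped in
        # further AddressableFuture objects instead of the plain "done" string.
        future.get
      end
    end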

This is the current response from this basic workflow/activity when using exponential retry with an asynchronous call:

    [#<AWS::Flow::Utilities::AddressableFuture:0x000000016cc018 @_metadata=nil,
      @return_value=#<AWS::Flow::Core::Future:0x000000016d7f80
        @conditional=#<AWS::Flow::Core::FiberConditionVariable:0x000000016d7f58 @waiters=[]>, @set=true,
        @result=#<AWS::Flow::Utilities::AddressableFuture:0x000000016dc378 @_metadata=nil,
          @return_value=#<AWS::Flow::Core::Future:0x000000016dc350
            @conditional=#<AWS::Flow::Core::FiberConditionVariable:0x000000016dc328 @waiters=[]>, @set=true,
            @result=#<AWS::Flow::Utilities::AddressableFuture:0x000000016e8d80
              @_metadata=#<AWS::Flow::ActivityMetadata:0x000000016e8b78 @activity_id="Activity1">,
              @return_value=#<AWS::Flow::Core::Future:0x000000016e8d58
                @conditional=#<AWS::Flow::Core::FiberConditionVariable:0x000000016e8ce0 @waiters=[]>, @set=true,
                @result="done">>>>,
        @listeners=[#<Proc:0x000000016dd930@/home/ec2-user/.rvm/gems/ruby-2.2.2/gems/aws-flow-3.1.0/lib/aws/flow/implementation.rb:221>]>>]

It appears the response is wrapped in AddressableFuture a few times. If we do not use exponential_retry then this issue does not occur and we can use future.get successfully.

I have tried the drill_on_future methods described in the other tickets, but none of them seems to give me the "result".

Any advice is greatly appreciated. Thanks.

nabeelamjad avatar Aug 27 '15 21:08 nabeelamjad

I'm seeing this as well, and I too am using exponential retries. However, I haven't tested whether it works without them.

Here is some example code:

        futures = []
        futures << activities_client.send_async(:activity_one)
        futures << activities_client.send_async(:activity_two)
        futures << activities_client.send_async(:activity_three)
        wait_for_all(futures)

I want to do something like this: futures.map(&:get)

But the result is buried in layered futures: futures.first.get.get.get

Please advise.

jcavalieri avatar Sep 10 '15 19:09 jcavalieri

Just verified that removing exponential_retry fixes this issue, but I really want to have the exponential_retry option enabled.

Please advise. Thanks.

jcavalieri avatar Sep 10 '15 20:09 jcavalieri

@jcavalieri I'm not able to replicate the problem you were seeing with layered/buried futures. Would it be possible to post the workflow history and workflow/activity definitions that caused you to get a bunch of layered futures? I tried replicating here. You can run just the recent test with rspec -t focus, which (pretty naively) tries to replicate the example you gave above.

mjsteger avatar Oct 05 '15 23:10 mjsteger

Hi @mjsteger, what if it has failed attempts? That might be what is causing the nesting.

I'm pretty sure I had failed activities that were retried.

jcavalieri avatar Oct 07 '15 13:10 jcavalieri

So, I'm able to get to a point where the futures are nested under some number of layers (by having an activity which fails at first but eventually succeeds, as you suggested), but I'm still able to use AWS::Flow::Utilities::drill_on_future to get the value. Mapping it over a list required the slightly obtuse futures.map(&AWS::Flow::Utilities.method(:drill_on_future)), so there's probably room for a convenience method that hides some of that ugliness away. It's especially bad because drill_on_future will gladly accept a list as an argument and return an addressable future, so if you weren't aware of the (not well specified) contract that it takes only a single future, you could get majorly tripped up. @nabeelamjad I see you were trying to operate on a list: does your problem persist if you try the map method I laid out above, or drill_on_future on just a single one of the futures you get back?
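
Concretely, something like the sketch below. The drill_all helper is just an illustration of the convenience method mentioned above; it does not exist in the gem today.

    # Drill each future individually; drill_on_future expects a single
    # future even though it will happily accept a list.
    results = futures.map(&AWS::Flow::Utilities.method(:drill_on_future))

    # Sketch of a possible convenience wrapper (not in the gem today):
    def drill_all(futures)
      futures.map { |f| AWS::Flow::Utilities.drill_on_future(f) }
    end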

mjsteger avatar Oct 07 '15 17:10 mjsteger

Sorry for the late response @mjsteger, I will test it today and let you know. We decided not to use SWF as it seems to leak memory (even with the default tutorial Booking app). Our custom SWF app starts at 1.8 GB of memory and within 12 hours it's at 4.2 GB. I've also noticed that SWF has crashed our instance a couple of times as the memory usage keeps growing. Using SQS for the same approach takes a total of 70 MB and has no issues whatsoever.

I will in any case test the future drilling with the map method and see if it returns the values.

Thanks again.

nabeelamjad avatar Oct 08 '15 11:10 nabeelamjad

@nabeelamjad sorry to hear about your experience :(. I'm not able to personally replicate the memory leak: what I've tried so far is to run the booking example's "starter.rb" a couple thousand times while running the workers in a separate session. What I've found is that the memory usage bounces up and down (as might be expected with GC), but doesn't grow unboundedly. What were the steps that led you to the memory leak? (It's definitely possible that my test is too simplistic; I should also note that I'm running Ruby 2.2.0 on an MBP.)

It definitely hoovers up a bunch of memory to start (it launches ~5 of each worker, and 10 forks per activity worker, which is likely excessive).

mjsteger avatar Oct 09 '15 22:10 mjsteger

@mjsteger the map method seems to have worked at least, thanks!

We were using Ruby 2.2.2; it ramps up and then peaks at a certain memory percentage. We noticed that the more instances we had polling SWF on the same task_list, the more the memory would accumulate (for example, with 2 instances it would slowly ramp up and then peak; with 4 instances it would ramp up almost twice as fast and then peak even higher).

Here is an example from one of the three instances we launched yesterday to test this out (running t2.large on Amazon Linux, with SWF running in a Docker container).

http://i.imgur.com/QJUyU48.png (both red and yellow are SWF apps; they simply carry out some API calls against our internal API, some API calls with the AWS SDK, and store the data in DynamoDB)

nabeelamjad avatar Oct 09 '15 23:10 nabeelamjad

@nabeelamjad glad to hear it! The docs should definitely be changed and/or drill_on_future should be changed to accept a list of arguments (though I'd argue that in future versions the default for "get" should be to just drill down all the way). As to the leak issue, that sounds like the Ruby processes grow to a certain ceiling and then GC within it. Crashing your machines is obviously a serious problem, and the initial memory footprint can/should probably be optimized. When you were getting crashes, did you tune the number of workers, or use a default set somewhere?
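
Something along these lines is what I have in mind for "get" drilling down by default; deep_get here is purely illustrative and not part of aws-flow.

    # Illustrative only: keep calling get until the value no longer looks
    # like a future, i.e. no longer responds to #get.
    def deep_get(future)
      value = future
      value = value.get while value.respond_to?(:get)
      value
    end

    results = futures.map { |f| deep_get(f) }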

mjsteger avatar Oct 09 '15 23:10 mjsteger

@mjsteger We noticed the crashes after 3 days of SWF running; as a band-aid fix we decided to just restart the SWF apps via cron on each instance every day, at different times during low/empty workload.

We did tune the workers to 2 workflow workers, 2 activity workers, and 2 forks per activity worker. We've since made the transition to SQS, which has been running far better: it's easier to debug and test and has almost no memory footprint (for the memory we spent on SWF we can launch literally 100+ SQS workers). We've got de-duplication in place as well (plus progress tracking using counters in DynamoDB to record which stage a job was at, so if an SQS worker crashes it resumes where it left off). I just feel personally that SWF isn't really a finished product. It does have its perks, and NASA has used it, so no doubt we're doing something wrong! (Or they're using the Java version rather than Ruby and are generally a lot smarter than we are :P)

nabeelamjad avatar Oct 09 '15 23:10 nabeelamjad

That's definitely pretty worrisome. It sounds like you weren't running that many workers and were still seeing memory accumulate. There definitely don't appear to be a lot of longer smoke tests, so it's possible that something like what you've described slipped through. I'll see if I can replicate it by having some machines do empty polling against SWF (and maybe simulate loaded polling if that doesn't work, since stress testing against SWF is probably pricier than I'd prefer). Sorry again about the lackluster experience, and glad to hear you were able to get a working solution nevertheless :)

mjsteger avatar Oct 13 '15 22:10 mjsteger

I had a response yesterday from AWS saying that they've successfully replicated the increasing-memory issue using Ruby 2.2 on their activity workers with simple tasks, and they've reported it to the SWF team.

nabeelamjad avatar Oct 13 '15 22:10 nabeelamjad

Well, glad to hear that they're looking into it. Thanks for keeping me abreast; that's definitely a problem I was not looking forward to debugging. Out of curiosity, were there other problems that made you feel that SWF was unfinished? (The one that most comes to mind for me is that the API design is a little rough.)

mjsteger avatar Oct 15 '15 22:10 mjsteger

Everything else was fine, to be honest; no other real issues that we stumbled upon other than the huge memory hog (debugging SWF was never fun either when you're presented with something like >32k characters).

nabeelamjad avatar Oct 15 '15 22:10 nabeelamjad

I'm also seeing what look like memory leaks under Ruby 2.2.

mjbellantoni avatar Nov 04 '15 11:11 mjbellantoni

Any progress on this? We too see mem leaks.

mustafashabib avatar Jan 20 '16 17:01 mustafashabib

My 2 cents: use Sidekiq. We bought Enterprise and it has been rock solid and easy to use.

jcavalieri avatar Jan 20 '16 18:01 jcavalieri