
[ec2-terminate-by-tag] Handle interval 0/1 case

[Open] yogeek opened this issue 3 years ago • 1 comment

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

What happened:

In the case of ec2-terminate-by-tag with MANAGED_SUBGROUP=enable (when the EC2 instances are managed by an ASG), there is an issue when trying to execute the chaos only once. In general, setting CHAOS_INTERVAL=TOTAL_CHAOS_DURATION is the way to get a single execution. But if we set CHAOS_INTERVAL < TOTAL_CHAOS_DURATION, the chaos fails because of the following behavior:

It seems the code loops for the whole TOTAL_CHAOS_DURATION (https://github.com/litmuschaos/litmus-go/blob/b6d04fbd2b15b7dc8e54f892df2ac13ed3e5ad81/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go#L82), and inside that loop it iterates over instanceIDList, so it can try to stop the same instance multiple times during the chaos duration. With MANAGED_SUBGROUP=disable, the instance is only "stopped" rather than terminated, so it will stop/start/stop the same instance without any issue. But with MANAGED_SUBGROUP=enable, the instance is "terminated"; since its ID is never removed from instanceIDList, the next iteration tries to stop an instance that no longer exists, and the experiment fails.

The only way to get a success is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then the experiment waits a full CHAOS_INTERVAL for nothing at the end of the chaos's first (and only) iteration.

=> the case where only a single iteration runs (interval 0/1) should be handled

The details are explained in this Slack discussion : https://kubernetes.slack.com/archives/CNXNB0ZTN/p1643826054494339?thread_ts=1643739932.025119&cid=CNXNB0ZTN

What you expected to happen:

In the case of MANAGED_SUBGROUP=enable, the instance should be removed from instanceIDList once it has been terminated, to avoid trying to stop it again in subsequent iterations.

How to reproduce it (as minimally and precisely as possible):

  • tag an instance with chaos:allowed
  • launch the experiment with MANAGED_SUBGROUP=enable, TOTAL_CHAOS_DURATION=500s (enough time for the ASG to terminate the stopped instance), CHAOS_INTERVAL=0 (or any value < TOTAL_CHAOS_DURATION), and INSTANCE_TAG='chaos:allowed'
  • the instance is stopped; after several minutes, it is terminated by the ASG
  • the code waits CHAOS_INTERVAL => 0 seconds
  • the instance is still in the list of instances to stop => the experiment fails with: err: ec2 instance failed to stop, err: IncorrectInstanceState: This instance 'i-0fd0da669ea93c044' is not in a state from which it can be stopped.

The only way to avoid the failure is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then:

  • the instance is stopped; after several minutes, it is terminated by the ASG
  • the code still waits the full CHAOS_INTERVAL (so 400s for nothing)
  • the experiment is successful (but with a useless waiting period of CHAOS_INTERVAL)

Anything else we need to know?:

@ksatchit and @uditgaurav have already agreed this behavior is missing; thanks to them for the support in understanding this issue 👍

yogeek avatar Feb 07 '22 15:02 yogeek

Additional info: the above behavior also causes an issue if I tag more than one instance with chaos:allowed. If I tag 2 instances, both are detected as targets; the first is stopped, then terminated, and since its instanceId is still in the list, the code tries to stop that first instance again, which is already terminated, and it loops over this error. The workflow never ends, I have to delete it manually, and of course the 2nd instance is never stopped.

yogeek avatar Feb 08 '22 09:02 yogeek