
Inability to DescribeRegions does not cause panic

Open gabegorelick opened this issue 5 years ago • 5 comments

Issue type

Bug Report

Build number

Custom build of 9b438dc with no diff.

Environment

  • AWS region: us-east-1
  • Type of environment: VPC

Summary

The AutoSpotting Lambda logs errors when it fails to describe regions, but it does not panic. This means that unless you have something monitoring your logs for errors, AutoSpotting may silently fail (since it exits normally, AWS will not consider the Lambda invocation to be an error).

In other places AutoSpotting panics, which has the desirable effect of informing AWS that the Lambda invocation failed. Here it only logs the error and returns, so its error handling is inconsistent.

Relevant code: https://github.com/AutoSpotting/AutoSpotting/blob/9b438dc47cbfd9587b042d5b2aaac2765c28a4e0/core/main.go#L39-L44

Steps to reproduce

Remove ec2:DescribeRegions permission.

Expected results

Error count goes up. Alarms built off error count go off.

Actual results

No errors. Alarms do not fire despite the error being in the logs.

gabegorelick avatar Jan 23 '20 17:01 gabegorelick

Thanks for reporting this issue, please create a PR with a fix.

cristim avatar Jan 23 '20 17:01 cristim

@cristim The DescribeRegions case is easy enough to fix, since there's no reason not to exit with an error code in that case. But what about errors that aren't necessarily fatal? For example, failing to describe a CloudFormation stack. Presumably, such an error is not worth exiting immediately over; instead, we should continue on.

Is there a recommended way to monitor AutoSpotting for these kinds of errors? For example, are the logs consistent enough that one can monitor for the string "Failed"?

gabegorelick avatar Jan 24 '20 01:01 gabegorelick

To be honest, I don't think it's critical to handle/monitor these cases. So far it was enough to assume the IAM policy is correct; people are not supposed to change it.

But if you strongly believe it should be handled differently feel free to send a PR and I'll gladly accept it.

But I would like to learn more about your use case for these requirements.

cristim avatar Jan 24 '20 08:01 cristim

> But I would like to learn more about your use case for these requirements.

I'm setting up AutoSpotting in a multi-tenant environment where I don't want misconfiguration errors, e.g. the wrong TAG_FILTERS, to mess with other ASGs. So I'm limiting the IAM permissions to specific resources. This means I have to use a custom CloudFormation stack, which means I may get things wrong 😉.

> so far it was enough to assume the IAM policy is correct

Authorization errors are just one example. AutoSpotting could also run into rate limiting errors, AWS service limits, etc. I think those are worth monitoring, no?

gabegorelick avatar Jan 24 '20 14:01 gabegorelick

I see. Good luck with that project, it sounds like fun!

In a previous setup we used to monitor all these using a custom Splunk search executed over the CloudTrail logs. But AutoSpotting was always executed with the upstream permissions.

Once you're done, I'd love it if you could share how you did this. It might be useful for other folks as well.

cristim avatar Jan 24 '20 15:01 cristim

hi @gabegorelick,

I'll close this for now but I'd love to have a chat in more detail about your use case of using AutoSpotting in such a multi-tenant setup.

cristim avatar Mar 06 '23 15:03 cristim