
Send alerts when share verifier ECS task fails

Open TejasRGitHub opened this issue 1 year ago • 5 comments

Is your idea related to a problem? Please describe. Currently, the ECS task for the share verifier runs on a schedule. When the share verifier task fails, it does not send any alert to inform the user about the failure.

Describe the solution you'd like Send an alert (for example, an email) to inform the user when the ECS task fails while running.


TejasRGitHub avatar May 23 '24 14:05 TejasRGitHub

Hi @TejasRGitHub - curious to hear more about what you had in mind here...

Are we talking about notifications for when the scheduled share verifier task marks a share item as unhealthy? Or about when the ECS task itself fails with some type of error while running?

If the former (notifications when state changes to unhealthy):

  • (1) We can change the ECS Task Definition to add a notify step after unhealthy items are detected
  • (2) Maybe we also include this behavior for on-demand runs of the verifier, and add logic to the persistent reminders feature to cover this as well

If the latter (notifications on failing ECS task):

  • We do include alarms around ECS, but maybe we can think of a way to propagate those alarms so they are more apparent to DA Admin teams

I would guess you mean the case where state changes to Unhealthy, but please correct me if I am wrong!

Curious to hear your thoughts here and how you would want data.all notifications to be handled for this use case

noah-paige avatar Jul 12 '24 18:07 noah-paige

Hi @noah-paige, thanks for taking a look at this issue.

I was actually talking about the latter. I know that for share-manager an alert is raised if something fails. I think this functionality is missing from the share verifier ECS task.

Can you tell me where these alerts from share manager are sent? Are they sent to an SNS topic?

TejasRGitHub avatar Jul 15 '24 16:07 TejasRGitHub

Hi @TejasRGitHub - I see, yes you are correct. For verify - if any of the checks raises an exception, we add it to the error list and deem the share unhealthy

But you are correct - we do not handle a verify failure when some other step in the verify process (separate from the checks) fails, for example if update_share_item_health_status hypothetically raised an error

We can look to find some time to handle those edge cases the same way they are handled for approve and revoke shares
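As an illustration of that edge case, here is a minimal sketch (not data.all's actual implementation) of wrapping a task entry point so that any unhandled exception is published to an SNS topic before the task exits. The helper name run_with_alert and its parameters are hypothetical; sns_client is assumed to be a boto3 SNS client (boto3.client("sns")), and topic_arn is assumed to be the alarmsTopic ARN.

```python
import traceback


def run_with_alert(task, sns_client, topic_arn, task_name):
    """Run task(); on any exception publish the traceback to SNS, then re-raise.

    Hypothetical helper for illustration only. `sns_client` is expected to
    expose the boto3 SNS `publish(TopicArn=..., Subject=..., Message=...)` call.
    """
    try:
        return task()
    except Exception:
        # SNS subjects are limited to 100 characters, so truncate defensively.
        subject = f"data.all ECS task failed: {task_name}"[:100]
        sns_client.publish(
            TopicArn=topic_arn,
            Subject=subject,
            Message=traceback.format_exc(),
        )
        # Re-raise so the ECS task still exits with a failure status.
        raise
```

Re-raising after publishing keeps the task's exit status unchanged, so any existing ECS alarms would still fire in addition to the SNS message.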

And yes, the alerts from share manager are sent to an SNS Topic - the ARN of the topic is stored as an SSM parameter with the parameter name /dataall/{self.envname}/sns/alarmsTopic. This alarmsTopic is created in the MonitoringStack of backend resources at dataall/deploy/stacks/monitoring.py

DA Admins can subscribe to the SNS topic to receive errors or send to downstream processing
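One possible way to do that subscription from a script is sketched below. It assumes boto3 clients created with boto3.client("ssm") and boto3.client("sns") in the deployment account and region, that the caller has permission to read the parameter and subscribe to the topic, and that envname matches the deployed environment name. The function names are hypothetical and not part of data.all.

```python
def alarms_topic_parameter_name(envname):
    """Build the SSM parameter name quoted above for a given environment."""
    return f"/dataall/{envname}/sns/alarmsTopic"


def subscribe_email_to_alarms(ssm_client, sns_client, envname, email):
    """Look up the alarms topic ARN in SSM and subscribe an email address.

    Hypothetical helper: `ssm_client` and `sns_client` are assumed to be
    boto3 SSM and SNS clients. Returns the subscription ARN.
    """
    param = ssm_client.get_parameter(Name=alarms_topic_parameter_name(envname))
    topic_arn = param["Parameter"]["Value"]
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint=email,
        # Ask for the ARN even though the subscription is not yet confirmed.
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```

With email subscriptions, SNS sends a confirmation message to the address, and nothing is delivered until the recipient confirms it.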

noah-paige avatar Sep 10 '24 21:09 noah-paige

Thanks @noah-paige for that information. Let me check our alarmsTopic SNS topic and configure it to send emails. Do you know if this can be done easily? Do you maybe have any steps?

Also, I think that for a few ECS tasks, like the share verifier and re-applier, failure information is not sent to the SNS topic (alarmsTopic) at all

TejasRGitHub avatar Sep 10 '24 22:09 TejasRGitHub

Will be contributed as a part of - https://github.com/data-dot-all/dataall/issues/1420#issuecomment-2552403340

TejasRGitHub avatar May 27 '25 22:05 TejasRGitHub