[storagenode] The email about suspension is coming only after check-in to satellite

Open AlexeyALeonov opened this issue 3 years ago • 10 comments

Description

See

  • https://forum.storj.io/t/whats-the-point-of-sending-the-emails-about-node-suspension/15753
  • https://forum.storj.io/t/power-outage-wiped-my-node-config-then-more-woes-came/14994/6?u=alexey

The suspension email arrives only once the problem has been fixed, or at least once the node has checked in with the satellites, not before. The whole point of a suspension email is to notify the operator ASAP.

Steps to reproduce the issue:

  1. Take the node offline long enough to reach the suspension level (online score below 60%)
  2. No email arrives
  3. Bring your node online
  4. Receive suspension emails.

Describe the results you expected: The suspension email should arrive when the suspension happens, not when the node comes back online.

Describe the results you received: The suspension email arrives only when you bring your node back online.
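
The expected behavior can be illustrated with a minimal Python sketch. It is not the satellite's actual code: `Node`, `update_online_score`, and the `outbox` list are hypothetical names, and only the 60% threshold comes from this issue. The point is that the email is queued when the score crosses the threshold, independent of any later check-in.

```python
from dataclasses import dataclass

# Suspension level taken from the issue text: online score below 60%.
SUSPENSION_THRESHOLD = 0.60


@dataclass
class Node:
    """Hypothetical node record, for illustration only."""
    node_id: str
    email: str
    online_score: float
    suspended: bool = False


def update_online_score(node: Node, new_score: float, outbox: list) -> None:
    """Record a new online score and queue an email at the moment of suspension.

    The notification depends only on the score crossing the threshold,
    not on whether the node has checked in since.
    """
    node.online_score = new_score
    if new_score < SUSPENSION_THRESHOLD and not node.suspended:
        node.suspended = True
        outbox.append(
            (node.email, f"Node {node.node_id} suspended: online score {new_score:.0%}")
        )
    elif new_score >= SUSPENSION_THRESHOLD and node.suspended:
        # The node recovered; a later drop would trigger a new email.
        node.suspended = False
```

In this sketch a node that stays below the threshold gets one email per suspension, not one per score update.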

Your environment

  • Operating system and version: any
  • Additional environment details (Raspberry PI, Docker, VMWare, etc.): binary, docker

AlexeyALeonov avatar Oct 21 '21 05:10 AlexeyALeonov

I believe we filter out offline nodes because we don't want to send emails to nodes that left the network intentionally. We can change that but the next issue would be that we send out too many spam mails. We need to find a good balance.

Personally, I would prefer if we could get rid of the satellite guessing and sending emails and replace that with grafana. That way the operator has full control over email alerts and can fine-tune the alerts much better.

littleskunk avatar Oct 22 '21 09:10 littleskunk

Where should this Grafana be hosted? On the SNO's side or on Storj's side?

AlexeyALeonov avatar Oct 29 '21 08:10 AlexeyALeonov

I believe we filter out offline nodes because we don't want to send emails to nodes that left the network intentionally. We can change that but the next issue would be that we send out too many spam mails. We need to find a good balance.

Personally, I would prefer if we could get rid of the satellite guessing and sending emails and replace that with grafana. That way the operator has full control over email alerts and can fine-tune the alerts much better.

A node can be offline for reasons other than someone intentionally leaving the network.

https://forum.storj.io/t/unable-to-get-satellite-system-time/15725/2

This happened to me 2 weeks ago. How are we supposed to fix things like this if Storj doesn't want to advise us that there is some problem? Solutions like UptimeRobot can't detect things like this. The only ones that can truly see whether a node is connected to the network, and warn the operator, are the satellites.

camulodunum avatar Oct 29 '21 16:10 camulodunum

Where should this Grafana be hosted? On the SNO's side or on Storj's side?

It would be hosted on the storage node side. Storage node operators would be able to add any alert rules they need.

How are we supposed to fix things like this if Storj doesn't want to advise us that there is some problem?

I don't understand your question. I explained what the root cause is and that we would get a lot of spam mails if we change that. Feel free to share your ideas on how to avoid spam mails.

littleskunk avatar Nov 04 '21 14:11 littleskunk

I don't understand your question. I explained what the root cause is and that we would get a lot of spam mails if we change that. Feel free to share your ideas on how to avoid spam mails.

I don't know what the rate of node operators shutting down is, but if the worry is spamming people, you could define one or more conditions to decide whether someone is likely leaving the network without a graceful exit or is having problems.

For example, only send emails about suddenly disconnected nodes to operators who have been in the network for X time.

Tune those conditions to a level where you feel comfortable sending emails, as you already do for other node issues that do not imply an offline node.

But the current system is worthless unless you have a problem with the data on your node (a problem that is more likely to end in corrupted data and a need to rebuild the server or exit the network than a problem that merely takes you offline).
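
As a rough illustration of such a condition, here is a minimal Python sketch. The function name, the `graceful_exit_started` flag, and the 30-day value standing in for "X time" are all assumptions made for the example, not anything the satellite implements.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for "X time in the network".
MIN_TENURE = timedelta(days=30)


def should_email_offline_node(
    joined_at: datetime,
    now: datetime,
    graceful_exit_started: bool,
    min_tenure: timedelta = MIN_TENURE,
) -> bool:
    """Guess whether an offline node's operator would want an alert.

    Operators who started a graceful exit are leaving on purpose, so they
    are skipped. Nodes younger than min_tenure are skipped as well, on the
    assumption that short-lived nodes are more likely to be abandoned.
    """
    if graceful_exit_started:
        return False
    return now - joined_at >= min_tenure
```

The tenure value is the knob to tune: raising it sends fewer emails at the cost of leaving newer operators without a warning.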

camulodunum avatar Nov 05 '21 11:11 camulodunum

I think the better approach would be to allow SNOs to subscribe to email alerts. It doesn't really matter to me how it would be implemented: as a Grafana instance somewhere, Prometheus Alertmanager, whatever.

AlexeyALeonov avatar Nov 11 '21 06:11 AlexeyALeonov

Please fix this so that the emails are sent once a node is suspended and not only once it is online again.

Having the SNO run additional monitoring when Storj already has the needed data just seems to overcomplicate things for no real reason. This would require an SNO to have a separate machine, ideally on a separate network?

I don't understand your question. I explained what the root cause is and that we would get a lot of spam mails if we change that. Feel free to share your ideas on how to avoid spam mails.

I don't think anyone would consider such an email spam. After all, SNOs are trying to provide a service. It would make sense to have a threshold below which an email is sent. Sending it on a single offline audit would not be OK, sure, but that's not what this issue is about. And I don't believe that many nodes are repeatedly going below 90% online score and back.
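
A threshold like that could be combined with a cooldown so that a node hovering around the limit does not trigger repeated emails. The following Python sketch is only an illustration: `ThresholdAlert` and the one-day cooldown are made up for the example, and the 90% default is just the figure mentioned above.

```python
from datetime import datetime, timedelta


class ThresholdAlert:
    """Hypothetical gate: allow an email below a threshold, at most once per cooldown."""

    def __init__(self, threshold: float = 0.90, cooldown: timedelta = timedelta(days=1)):
        self.threshold = threshold
        self.cooldown = cooldown
        self.last_sent = None  # time of the last email, if any

    def should_send(self, online_score: float, now: datetime) -> bool:
        # At or above the threshold there is nothing to report.
        if online_score >= self.threshold:
            return False
        # Below the threshold, but an email went out recently: stay quiet.
        if self.last_sent is not None and now - self.last_sent < self.cooldown:
            return False
        self.last_sent = now
        return True
```

With these defaults, a node whose score dips below 90% several times in one day would produce a single email.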

iSOcH avatar Dec 11 '21 08:12 iSOcH

The email from the satellites will always be too late. If the satellite has noticed the problem, your node is likely already suspended or disqualified.

Please take a look: https://forum.storj.io/t/tech-preview-email-alerts-with-grafana-and-prometheus/16156

AlexeyALeonov avatar Dec 19 '21 19:12 AlexeyALeonov

And I don't believe that many nodes are repeatedly going below 90% online score and back.

I don't think that's even a case to consider. Suspension happens when your node's online score gets below 60%. But yeah, totally agree with @iSOcH.

How could an e-mail alerting you that you're about to lose your node (as in, your node just entered suspension mode) be considered spam? I don't get that. In my opinion, that's an issue that should be reported as soon as the satellite suspends the node, whether it is online or offline. Anyone who intentionally leaves the network won't be bothered by the only 2 e-mails they will receive telling them that their node got suspended, and then disqualified (as expected).

Personally, I would prefer if we could get rid of the satellite guessing and sending emails and replace that with grafana. That way the operator has full control over email alerts and can fine-tune the alerts much better.

I really don't think you should force SNOs to install massive monitoring systems like Grafana just to be notified when the node needs urgent attention. You may add an option to disable e-mail notifications sent by satellites for those who want to handle all that themselves, but in my opinion satellite e-mail notifications for suspension mode should be enabled by default and sent at the time of suspension.

pacproduct avatar Mar 17 '22 12:03 pacproduct

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/node-startet-nicht-mehr/19361/10

storjrobot avatar Aug 12 '22 07:08 storjrobot

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/suspension-without-reason/20193/11

storjrobot avatar Nov 11 '22 07:11 storjrobot

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/v1-67-1-suspension-percentage/20331/13

storjrobot avatar Nov 18 '22 15:11 storjrobot

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/freezed-onlinescore/20473/5

storjrobot avatar Nov 27 '22 04:11 storjrobot

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/node-suspended-no-email/11556/20

storjrobot avatar Jan 31 '23 04:01 storjrobot

We implemented a new email notification system.

AlexeyALeonov avatar Mar 29 '23 08:03 AlexeyALeonov