Retry PTR lookup when authoritative servers aren't responding properly
For: https://github.com/mail-in-a-box/mailinabox/issues/628
I have been able to somewhat reproduce the reverse lookup failures on a DO droplet. I have created this branch, which adds a command-line option to the status checks that runs only the PTR check:
management/status_checks.py --check-ptr
I added this to a cron job that ran the command every minute. Whenever the check failed, I sent myself an email. The check failed approximately 1 in 10 times. Whenever the job failed, I would also fire off several dig requests:
dig MY_IP.in-addr.arpa
dig PTR MY_IP.in-addr.arpa
dig PTR MY_IP.in-addr.arpa @ns1.digitalocean.com
dig PTR MY_IP.in-addr.arpa @ns2.digitalocean.com
dig PTR MY_IP.in-addr.arpa @ns3.digitalocean.com
Comparing these to a normal dig showed that, for dig PTR MY_IP.in-addr.arpa, there were sometimes fewer servers available to respond to the request, indicating a problem; response times also increased.
Adding extra logging to the status checks showed that the DNS library reported the response from the servers as bad.
Adding a retry, only to the PTR lookups and only in this specific case, results in far fewer errors.
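To illustrate the retry idea, here is a rough sketch (not the actual patch; it assumes dnspython, and the exception handling in status_checks.py may well differ):

```python
import dns.exception
import dns.resolver
import dns.reversename

def check_ptr(ip, retries=3):
    # Build the reverse name, e.g. 4.3.2.1.in-addr.arpa. for 1.2.3.4.
    rev = dns.reversename.from_address(ip)
    for attempt in range(retries):
        try:
            # dnspython >= 2.0; older versions use dns.resolver.query()
            answer = dns.resolver.resolve(rev, "PTR")
            return str(answer[0])  # any successful attempt passes the check
        except (dns.resolver.NoNameservers, dns.exception.Timeout):
            # Bad or missing response from the authoritative servers: the
            # transient case seen in testing, so simply try again.
            continue
        except dns.resolver.NXDOMAIN:
            # Reverse DNS genuinely isn't configured; retrying won't help.
            return None
    return None
```

Raising `retries` is all the later suggestion of 4 or 5 attempts amounts to; it only changes the worst-case delay.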
With the changes made, I set my cron job to trigger the check 3 times a minute and left it running for about 12 hours. This resulted in one failed lookup (and a lookup would have to fail on 3 consecutive tries for the check itself to fail). This simulated nearly 6 years of status checks.
This will not actually fix the authoritative servers, of course, but it will verify that MIAB is configured properly, which is the intent of the status checks.
The retries cause a delay of up to 4 seconds; the failed responses come back quite fast.
If somebody could try this on a different environment, it would be appreciated.
Quick status update: I have stopped running the checks every minute. I haven't had a single failure since opening the PR. If you agree conceptually with the solution, I could ask others to test.
I haven't been getting the problem on my DO box lately, so it might be a coincidence...
I think it might just be luck. When I ran the check 3 times every minute I saw it occurring every 10-20 minutes. That roughly matches once every 1-2 months of normal status checks. While testing on Scaleway I had it twice in 10 days (that is, without this patch).
I think we decided this was a coincidence?
I am still receiving these notifications. Might it be worthwhile to reconsider this patch?
@ponychicken Yes, I agree. This patch would be a great benefit.
I'd even consider upping the retry to 4 times, which would make extra sure we don't get that faulty notification.
Try that and report back, please.
The last 21 days haven't produced any reverse DNS failure reports. It appears the multiple checks have "fixed" the issue. I'd suggest merging this.
I still have this sometimes. I merged the branch with master, it works for me.
@yodax You mean the PR fixes the problem for you?
Well both no and yes. I sometimes get the error. Maybe bi-weekly.
Technically this isn't a problem with MIAB. During my debugging I captured the dns responses and they do sometimes report that the rdns isn't found/configured.
This does remove the symptoms of the error. It will try a few times and if one of those requests is successful it passes the status check.
For the status checks we care whether the server is configured correctly, not whether it is reliable.
So no, it doesn't solve the underlying problem (we can't), but yes, it does report whether the rDNS is configured properly.
Imho merging this depends on whether we want to hide reliability issues with the DNS server. (And on code quality, of course.)
Bi-weekly (twice a week? once every two weeks?) is still a lot, so I don't really think this counts much as a fix. And you're right that we may not want to hide reliability issues of the ISP. So I'm -1 on this for now.
@JoshData If you were to increase the number of checks to 4 (or even 5), it would make this happen much less often (a few times a year). You could always replace this multi-check system down the road if a proper fix comes along.
You could also add a note in the status check response (Reverse DNS is set correctly at ISP. [%s ↦ %s] (but the responses weren't consistent)), although I'm not sure whether that note would also be sent in an email. Alternatively, you could log it; just something that doesn't spam people with emails.
With so many people having this issue, and there not being any "real" fix in sight (as this has nothing to do with MIAB), I see no reason why this shouldn't be merged for the benefit of the MIAB community.
It happens once every two weeks at least without the patch. With the patch it doesn't occur.
(I seem to have a problem expressing my self clearly with this issue :smile:)
But I agree on the -1 for hiding errors. We could let the admin opt in to hiding this error by disabling the check during scheduled status checks.
Ok, ok.
I'm getting these mails every other day too, on my box hosted at Vultr.
@yodax wondered "if we want to hide reliability issues with the dns server". I have the same concern, so I wonder if there are alternatives to the "retry approach." @sbuller noted on #628 "I have yet to see dig in ptr 190.31.xx.yy.in-addr.arpa @ns1.digitalocean.com. +short fail." I'm not qualified to say whether that's comparable to the currently-used DNS resolver query - can anybody weigh in on that? If it IS comparable, is that possibly a better approach (because it wouldn't "hide reliability issues with the dns server")?
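To make that comparison concrete, here is a rough dnspython sketch (an assumption-laden illustration, not necessarily what status_checks.py does): the first query goes through the locally configured resolver, the second asks one authoritative server directly, like the dig command quoted above.

```python
import dns.message
import dns.query
import dns.resolver
import dns.reversename

ip = "203.0.113.1"                       # example address; substitute your own
rev = dns.reversename.from_address(ip)

# Through the locally configured resolver (roughly what I assume the status
# check does via the dns library).
print([str(r) for r in dns.resolver.resolve(rev, "PTR")])

# Directly against one authoritative server, like
# `dig PTR ... @ns1.digitalocean.com +short`.
ns_ip = str(dns.resolver.resolve("ns1.digitalocean.com", "A")[0])
response = dns.query.udp(dns.message.make_query(rev, "PTR"), ns_ip, timeout=5)
print([str(rr) for rr in response.answer])
```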
P.S. I'm running on DO and get these emails weekly on average (although they can be as frequent as 2-3 times per week).
So, whose fault is it? Namecheap (in the case of my domain) or Vultr (in the case of my box)? I'd like to report the issue to them.
I agree that MIAB shouldn't hide issues with the service provider. It may be worth patching something to help the user understand this problem, however, since it seems to be common. Maybe by appending to the status message a note saying there are known reliability issues with some services.
Reverse DNS lookups are answered by the DNS servers of the IP's owner, which is generally your hosting provider. So in your case, Vultr.
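If you want to check whom to report it to, something like this (a small dnspython sketch; the IP is a placeholder) shows which zone your reverse record lives in and which nameservers are responsible for answering it:

```python
import dns.resolver
import dns.reversename

ip = "203.0.113.1"                       # placeholder; use your mail server's IP
rev = dns.reversename.from_address(ip)

# The reverse zone is delegated to whoever owns the IP block, i.e. the
# hosting provider, and these are the servers that must answer the PTR query.
zone = dns.resolver.zone_for_name(rev)
nameservers = dns.resolver.resolve(zone, "NS")
print(zone, [str(ns) for ns in nameservers])
```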
I contacted Vultr the first couple of times this happened to me, but they say their servers are fine. Maybe they aren't responding fast enough and MIAB just doesn't wait long enough?
This isn't a timing issue. As I noted above, adding extra logging to the status checks showed that the DNS library reported the response from the servers as bad.
The implication of my testing was that the DNS provider was fine, and that the test script was buggy. I didn't dig any further.