
Retry PTR lookup when authoritative servers aren't responding properly

Open · yodax opened this issue 9 years ago • 23 comments

For: https://github.com/mail-in-a-box/mailinabox/issues/628

I have been able to somewhat reproduce the reverse lookup failures on a DO droplet. I have created this branch, which adds a command-line option to the status checks to run only the PTR check:

management/status_checks.py --check-ptr

I added this to a cron job that ran the command every minute. Whenever the check failed, I sent myself an email. The check would fail approximately 1 in 10 times. When it failed, I would also fire off several dig requests:

dig MY_IP.in-addr.arpa
dig PTR MY_IP.in-addr.arpa
dig PTR MY_IP.in-addr.arpa @ns1.digitalocean.com
dig PTR MY_IP.in-addr.arpa @ns2.digitalocean.com
dig PTR MY_IP.in-addr.arpa @ns3.digitalocean.com

Comparing these to a normal dig showed that for dig PTR MY_IP.in-addr.arpa, sometimes fewer servers were available to respond to the request, indicating a problem; the response times would also increase.
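For reference, a rough Python equivalent of those dig queries using dnspython (a sketch only, not code from the PR; the IP is a placeholder, and the octet reversal needed for in-addr.arpa is handled by dns.reversename):

import dns.message
import dns.query
import dns.resolver
import dns.reversename

MY_IP = "203.0.113.25"  # placeholder; substitute the droplet's IP

rev_name = dns.reversename.from_address(MY_IP)  # builds the in-addr.arpa name
request = dns.message.make_query(rev_name, "PTR")

for ns in ("ns1.digitalocean.com", "ns2.digitalocean.com", "ns3.digitalocean.com"):
    # dns.resolver.resolve is dns.resolver.query on dnspython < 2.0
    ns_ip = str(dns.resolver.resolve(ns, "A")[0])
    try:
        response = dns.query.udp(request, ns_ip, timeout=5)
        print(ns, response.rcode(), response.answer)
    except Exception as exc:
        print(ns, "failed:", exc)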

Adding some extra logging to the status checks showed that the dns library said that the response from the servers was bad.

Adding a retry, only to the PTR lookups and only in this specific case, results in far fewer errors.
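As a minimal sketch of the idea (illustrative only, not the actual PR diff; the function name and parameters are made up), using the dnspython library the status checks already rely on:

import time
import dns.exception
import dns.resolver
import dns.reversename

def query_ptr_with_retry(ip, attempts=3, delay=2):
    # Retry only the PTR lookup: a transiently bad response from the
    # authoritative servers should not fail the whole status check.
    rev_name = dns.reversename.from_address(ip)
    for attempt in range(attempts):
        try:
            # dns.resolver.resolve is dns.resolver.query on dnspython < 2.0
            answers = dns.resolver.resolve(rev_name, "PTR")
            return str(answers[0])
        except dns.exception.DNSException:
            if attempt + 1 == attempts:
                raise  # every attempt failed; let the status check report it
            time.sleep(delay)  # two 2-second waits = the "up to 4 seconds" below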

With the changes in place, I made my cron job trigger 3 times a minute and left it running for about 12 hours. This resulted in one failed lookup (and a check only fails when the lookup fails 3 consecutive times). At one check per day, that run simulated nearly 6 years of status checks.

This will not actually fix the authoritative servers of course, but it will check that MIAB is configured properly, which is the intent of the status checks.

The retries will cause a delay of up to 4 seconds, though the failed responses come back quite fast.

If somebody could try this on a different environment, it would be appreciated.

yodax avatar Mar 01 '16 10:03 yodax

Quick status update: I have stopped running the checks every minute. I haven't had a single failure since opening the PR. If you agree conceptually with the solution, I could ask others to test.

yodax avatar Mar 10 '16 09:03 yodax

I haven't been getting the problem on my DO box lately, so it might be a coincidence...

JoshData avatar Mar 23 '16 20:03 JoshData

I think it might just be luck. When I did the check 3 times every minute, I saw it occurring every 10-20 minutes. That sort of matches once every 1-2 months for the regular status checks. While testing on Scaleway I had it twice in 10 days (that is, without this patch).

yodax avatar Mar 24 '16 14:03 yodax

I think we decided this was a coincidence?

JoshData avatar Jul 29 '16 12:07 JoshData

I am still receiving these notifications. Might it be worthwhile to reconsider this patch?

ponychicken avatar Oct 20 '16 17:10 ponychicken

@ponychicken Yes, I agree. This patch would be a great benefit.

chris13524 avatar Nov 02 '16 13:11 chris13524

I'd even consider upping the retry count to 4; that would make extra sure we don't get that faulty notification.

chris13524 avatar Nov 02 '16 13:11 chris13524

Try that and report back, please.

JoshData avatar Nov 02 '16 13:11 JoshData

The last 21 days haven't reported any changes in the reverse DNS. It appears the multiple checks have "fixed" the issue. I'd suggest merging this.

chris13524 avatar Nov 23 '16 15:11 chris13524

I still see this sometimes. I merged the branch with master, and it works for me.

yodax avatar Apr 01 '17 10:04 yodax

@yodax You mean the PR fixes the problem for you?

JoshData avatar Apr 02 '17 11:04 JoshData

Well, both no and yes. I still sometimes get the error, maybe bi-weekly.

Technically this isn't a problem with MIAB. During my debugging I captured the DNS responses, and they do sometimes report that the rDNS isn't found/configured.

This does remove the symptoms of the error: it will try a few times, and if one of those requests is successful, the status check passes.

For the status checks, we care whether the server is configured correctly, not whether it is reliable.

So no, it doesn't solve the underlying problem (we can't), but yes, it reports whether the rDNS is configured properly.

IMHO, merging this depends on whether we want to hide reliability issues with the DNS server (and on code quality, of course).

yodax avatar Apr 02 '17 11:04 yodax

Bi-weekly (twice a week? once every two weeks?) is still a lot, so I don't really think this counts much as a fix. And you're right that we may not want to hide reliability issues of the ISP. So I'm -1 on this for now.

JoshData avatar Apr 02 '17 12:04 JoshData

@JoshData If you were to increase the number of checks to 4 (or even 5), it would make this happen much less often (a few times a year). You could always replace this multi-check system down the road if a proper fix comes along.

You could also add a note to the status check response ("Reverse DNS is set correctly at ISP. [%s ↦ %s] (but the responses weren't consistent)"), although I'm not sure whether that note would be sent in an email too. Alternatively, you could log it; just something that doesn't spam people with emails.
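A sketch of what that might look like (assuming the existing output.print_ok helper in status_checks.py; successful_attempt, my_ip, and my_hostname are hypothetical names):

# successful_attempt would track which retry finally succeeded (0 = first try)
note = " (but the responses weren't consistent)" if successful_attempt > 0 else ""
output.print_ok("Reverse DNS is set correctly at ISP. [%s ↦ %s]%s"
    % (my_ip, my_hostname, note))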

With so many people having this issue, and there not being any "real" fix in sight (as this has nothing to do with MIAB), I see no reason why this shouldn't be merged for the benefit of the MIAB community.

chris13524 avatar Apr 02 '17 12:04 chris13524

Without the patch, it happens at least once every two weeks. With the patch, it doesn't occur.

(I seem to have a problem expressing myself clearly with this issue :smile:)

But I agree with the -1 on hiding errors. We could allow the admin to opt in to hiding this error, i.e. disable the check during scheduled status checks.

yodax avatar Apr 02 '17 12:04 yodax

Ok, ok.

JoshData avatar Apr 02 '17 12:04 JoshData

I'm getting these mails every other day too, on my box hosted at Vultr.

ThisDevDane avatar Sep 15 '17 10:09 ThisDevDane

@yodax wondered "if we want to hide reliability issues with the dns server". I have the same concern, so I wonder if there are alternatives to the retry approach. @sbuller noted on #628: "I have yet to see dig in ptr 190.31.xx.yy.in-addr.arpa @ns1.digitalocean.com. +short fail." I'm not qualified to say whether that's comparable to the currently used DNS resolver query; can anybody weigh in on that? If it IS comparable, is it possibly a better approach (because it wouldn't "hide reliability issues with the dns server")?

P.S. I'm running on DO and get these emails weekly on average (although they can be as frequent as 2-3 times per week).

dan-jensen avatar Oct 04 '17 16:10 dan-jensen

So, whose fault is it? Namecheap (in the case of my domain) or Vultr (in the case of my box)? I'd like to report the issue to them.

I agree that MIAB shouldn't hide issues with the service provider. It may be worth patching something to help the user understand this problem, however, since it seems to be common. Maybe by appending a note to the status message saying there are known reliability issues with some services.

alexgleason avatar Oct 15 '17 18:10 alexgleason

Reverse DNS lookups are answered by the DNS servers of the IP's owner, which is generally your hosting provider. So in your case, Vultr.
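One way to see which servers are actually authoritative for your reverse zone, and therefore who to report the problem to (a dnspython sketch; the IP is a placeholder):

import dns.resolver
import dns.reversename

rev_name = dns.reversename.from_address("203.0.113.25")  # placeholder IP
zone = dns.resolver.zone_for_name(rev_name)  # the delegated in-addr.arpa zone
for rr in dns.resolver.resolve(zone, "NS"):
    print(rr.target)  # these belong to whoever owns the IP block, i.e. the host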

yodax avatar Oct 15 '17 18:10 yodax

I contacted Vultr the first couple of times this happened to me, but they say their servers are fine. Maybe they aren't responding fast enough and MIAB just doesn't wait long enough?

ThisDevDane avatar Oct 15 '17 20:10 ThisDevDane

This isn’t a timing issue.

Adding some extra logging to the status checks showed that the dns library said that the response from the servers was bad.
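For what it's worth, dnspython raises different exceptions for the two cases, so a slow server is distinguishable from a bad response (a sketch of the distinction, not the PR's actual logging code; the IP is a placeholder):

import dns.exception
import dns.resolver
import dns.reversename

rev_name = dns.reversename.from_address("203.0.113.25")  # placeholder IP
try:
    dns.resolver.resolve(rev_name, "PTR", lifetime=5)
except dns.exception.Timeout:
    print("timing issue: no answer within 5 seconds")
except dns.resolver.NoNameservers:
    print("servers answered, but the responses were bad")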

yodax avatar Oct 16 '17 07:10 yodax

The implication of my testing was that the DNS provider was fine, and that the test script was buggy. I didn't dig any further.

sbuller avatar Oct 16 '17 16:10 sbuller