Haraka
Haraka copied to clipboard
outbound: constant reconnect to unreachable servers
Describe the bug Haraka tries non-stop to connect to unreachable servers or servers that refuse a connection.
Expected behavior Haraka should not try to connect non-stop.
Additional context
This issue is same as https://github.com/haraka/Haraka/issues/2694 and possibly https://github.com/haraka/Haraka/issues/2507 .
I have debugged this problem and found that the core problem is this:
-
Outbound uses generic-pool module to manage socket connections per IP. When a server is unreachable (i.e MX points to an IP address which as no mail server) or the server refuses connection (maybe some IP based gray listing), then the socket creation fails at https://github.com/haraka/Haraka/blob/master/outbound/client_pool.js#L50 . This results in
create
factory method returning an error to generic-pool module. -
The issue is upstream generic-pool module has no code to make
acquire
return an error. So, https://github.com/haraka/Haraka/blob/master/outbound/client_pool.js#L110 never returns! The relevant issues upstream are https://github.com/coopernurse/node-pool/issues/175 and https://github.com/coopernurse/node-pool/issues/197 . The upstream code foracquire
is (in essence) a while true loop callingcreate
non-stop until it succeeds. There is no back off logic or anything. This causes Haraka to non-stop connect to unreachable servers.
Issue is easy to reproduce: just enable debug logs and send mail to a random IP.
A way to mitigate the issue is to set acquireTimeoutMillis
. This will make acquire
fail after acquireTimeoutMillis
milli seconds (the default is to try forever). Note that this will busy loop for acquireTimeoutMillis
at the minimum :-(
Not sure, what else we can do here since upstream module lacks abort feature.
First, great job debugging the problem. I think the bits missing from outbound is an .on('factoryCreateError', ....
handler. It's referenced in both of the upstream issues you cited, we have in smtp_client, be I didn't see it in a quick glance of the outbound code.
@msimerson the factoryCreateError
workaround posted in the upstream issues, just deque the last item. This won't work if we have multiple items (sockets) being created in parallel, no? There is a race and that code is not making sure that the work item being dequed, is the one that actually errored.
It seems this is creating quite a few 100% CPU on many of our servers. @msimerson Do you have any suggestions for a workaround/fix ?
For the moment, I can submit a PR to set acquireTimeoutMillis
to be a hardcoded 10 seconds. Does that sound like a reasonable timeout ? (Don't want to make this configurable since this is just a workaround till we work on a clear fix).
Yes, anything between 10 and 60 seconds seems eminently reasonable to me.
In looking into this further, I'm sorely tempted to just jettison all the pool code. It has been a persistent source of headaches and difficult to track down bugs for ages. And the documentation for it stinks. While debugging this and a couple other issues related to generic-pool
, I've just found a place where it was throwing errors because a function in the module was removed. It wasn't mentioned in the upgrading guide.
I would agree with the change. From my (casual) reading of the code when I was debugging this, that module offers a lot of options and feature which we don't quite need. And also the usage in Haraka is quite different from what the module is designed for. generic-pool
wants to pool database connections. A database is expected to be always available. In Haraka, we try to pool connections to servers which may or may not be there. The pooling logic has to be aware of such, which it currently isn't. We can either propose a fix upstream but IMO it's easier to just write a our own simple pool (with similar API as upstream, to help migration initially).
I agree with this (with the caveat that it sounds like a lot of work). Thankfully I can say I didn't write the original :)
On Wed, Nov 30, 2022 at 4:12 AM Girish Ramakrishnan < @.***> wrote:
I would agree with the change. From my (casual) reading of the code when I was debugging this, that module offers a lot of options and feature which we don't quite need. And also the usage in Haraka is quite different from what the module is designed for. generic-pool wants to pool database connections. A database is expected to be always available. In Haraka, we try to pool connections to servers which may or may not be there. The pooling logic has to be aware of such, which it currently isn't. We can either propose a fix upstream but IMO it's easier to just write a our own simple pool (with similar API as upstream, to help migration initially).
— Reply to this email directly, view it on GitHub https://github.com/haraka/Haraka/issues/3100#issuecomment-1331848221, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFBWYZZWBROZYIEXXGEQADWK4LBBANCNFSM6AAAAAAQVUSHKE . You are receiving this because you are subscribed to this thread.Message ID: @.***>