maddy icon indicating copy to clipboard operation
maddy copied to clipboard

queue: infinite retries after reducing max_tries and bogus retry interval

Open lukastribus opened this issue 1 year ago • 2 comments

Describe the bug

Likely to due a == comparison as opposed to >= , when a email in queue has a retry count higher than max_retries, it will be retried infinitely as the condition never matches:

https://github.com/foxcpp/maddy/blob/cee577790b4c6c9533c46492282d520eef420f7e/internal/target/queue/queue.go#L405

A unrelated secondary issue is seen with the retry delay; which does not match the documented formula; in fact the first 4 retry delays are the same exact 14m59.99 (see logs below). I can file a separate issue if needed.

Steps to reproduce

This can happen in the following situation:

  • an email is send, while the configuration is "max_tries = 20"
  • email fails temporarily multiple times, email retry counter is now at 10
  • admin reduces max_tries to 8
  • email will not permfail anymore as the attemt counter is well above max_retries so a == comparison will not stop it anymore

Log files

root@srv1:/var/lib/maddy/remote_queue# journalctl -xeu maddy.service | grep "will retry" | grep ef97411f
Feb 09 16:52:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":1},"msg_id":"ef97411f","next_try_delay":"14m59.999996196s","rcpts":["[email protected]"]}
Feb 09 17:07:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":2},"msg_id":"ef97411f","next_try_delay":"14m59.99999822s","rcpts":["[email protected]"]}
Feb 09 17:22:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":3},"msg_id":"ef97411f","next_try_delay":"14m59.999997532s","rcpts":["[email protected]"]}
Feb 09 17:37:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":4},"msg_id":"ef97411f","next_try_delay":"14m59.999996182s","rcpts":["[email protected]"]}
Feb 09 17:52:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":5},"msg_id":"ef97411f","next_try_delay":"29m59.999995718s","rcpts":["[email protected]"]}
Feb 09 18:22:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":6},"msg_id":"ef97411f","next_try_delay":"44m59.999996344s","rcpts":["[email protected]"]}
Feb 09 19:07:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":7},"msg_id":"ef97411f","next_try_delay":"44m59.999996027s","rcpts":["[email protected]"]}
Feb 09 19:52:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":8},"msg_id":"ef97411f","next_try_delay":"59m59.999996174s","rcpts":["[email protected]"]}
Feb 09 20:52:42 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"[email protected]":9},"msg_id":"ef97411f","next_try_delay":"1h14m59.999997755s","rcpts":["[email protected]"]}
Feb 09 22:07:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"[email protected]":10},"msg_id":"ef97411f","next_try_delay":"1h44m59.999995644s","rcpts":["[email protected]"]}
Feb 09 23:52:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"[email protected]":11},"msg_id":"ef97411f","next_try_delay":"2h14m59.99999673s","rcpts":["[email protected]"]}
Feb 10 02:07:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"[email protected]":12},"msg_id":"ef97411f","next_try_delay":"2h44m59.99999787s","rcpts":["[email protected]"]}
Feb 10 04:52:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"[email protected]":13},"msg_id":"ef97411f","next_try_delay":"3h29m59.999997578s","rcpts":["[email protected]"]}
Feb 10 08:22:43 srv1 maddy[923]: queue: will retry        {"attempts_count":{"[email protected]":14},"msg_id":"ef97411f","next_try_delay":"4h29m59.999997642s","rcpts":["[email protected]"]}
Feb 10 12:52:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"[email protected]":15},"msg_id":"ef97411f","next_try_delay":"5h29m59.999996433s","rcpts":["[email protected]"]}
Feb 10 18:22:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"[email protected]":16},"msg_id":"ef97411f","next_try_delay":"6h59m59.999997628s","rcpts":["[email protected]"]}
Feb 11 01:22:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"[email protected]":17},"msg_id":"ef97411f","next_try_delay":"8h44m59.999996512s","rcpts":["[email protected]"]}
Feb 11 10:07:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"[email protected]":18},"msg_id":"ef97411f","next_try_delay":"10h59m59.999996882s","rcpts":["[email protected]"]}
root@srv1:/var/lib/maddy/remote_queue#

When maddy was restarted (with different configuration) can be seen by the changing PID.

Configuration file

target.queue remote_queue {
    max_tries 8
    target &outbound_delivery

Environment information

  • maddy version: cee577790b4c6c9533c46492282d520eef420f7e

lukastribus avatar Feb 11 '24 16:02 lukastribus

Pushed 2da4ece that changes the condition to use >=. Will look into what is wrong with delay calculation...

foxcpp avatar Feb 12 '24 19:02 foxcpp

For the record, here's an output without any restarts in between and with the default of 20 retries. The To address [email protected] is real (and unreachable, so you can reproduce; there is a TLS issue which makes it retry) :

lukas@srv1:~$ sudo journalctl -xeu maddy.service | grep -E "retry.+c5eb84cd|permanent.+c5eb84cd"
Feb 17 23:54:35 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":1},"msg_id":"c5eb84cd","next_try_delay":"14m59.99999903s","rcpts":["[email protected]"]}
Feb 18 00:09:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":2},"msg_id":"c5eb84cd","next_try_delay":"14m59.999997305s","rcpts":["[email protected]"]}
Feb 18 00:24:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":3},"msg_id":"c5eb84cd","next_try_delay":"14m59.999996852s","rcpts":["[email protected]"]}
Feb 18 00:39:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":4},"msg_id":"c5eb84cd","next_try_delay":"14m59.99999707s","rcpts":["[email protected]"]}
Feb 18 00:54:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":5},"msg_id":"c5eb84cd","next_try_delay":"29m59.999997265s","rcpts":["[email protected]"]}
Feb 18 01:24:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":6},"msg_id":"c5eb84cd","next_try_delay":"44m59.999996767s","rcpts":["[email protected]"]}
Feb 18 02:09:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":7},"msg_id":"c5eb84cd","next_try_delay":"44m59.999997176s","rcpts":["[email protected]"]}
Feb 18 02:54:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":8},"msg_id":"c5eb84cd","next_try_delay":"59m59.999996711s","rcpts":["[email protected]"]}
Feb 18 03:54:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":9},"msg_id":"c5eb84cd","next_try_delay":"1h14m59.999997516s","rcpts":["[email protected]"]}
Feb 18 05:09:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":10},"msg_id":"c5eb84cd","next_try_delay":"1h44m59.999997928s","rcpts":["[email protected]"]}
Feb 18 06:54:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":11},"msg_id":"c5eb84cd","next_try_delay":"2h14m59.999998559s","rcpts":["[email protected]"]}
Feb 18 09:09:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":12},"msg_id":"c5eb84cd","next_try_delay":"2h44m59.99995706s","rcpts":["[email protected]"]}
Feb 18 11:54:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":13},"msg_id":"c5eb84cd","next_try_delay":"3h29m59.999994638s","rcpts":["[email protected]"]}
Feb 18 15:24:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":14},"msg_id":"c5eb84cd","next_try_delay":"4h29m59.999997815s","rcpts":["[email protected]"]}
Feb 18 19:54:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":15},"msg_id":"c5eb84cd","next_try_delay":"5h29m59.999998277s","rcpts":["[email protected]"]}
Feb 19 01:24:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":16},"msg_id":"c5eb84cd","next_try_delay":"6h59m59.999996306s","rcpts":["[email protected]"]}
Feb 19 08:24:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":17},"msg_id":"c5eb84cd","next_try_delay":"8h44m59.999997053s","rcpts":["[email protected]"]}
Feb 19 17:09:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":18},"msg_id":"c5eb84cd","next_try_delay":"10h59m59.999998151s","rcpts":["[email protected]"]}
Feb 20 04:09:40 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"[email protected]":19},"msg_id":"c5eb84cd","next_try_delay":"13h44m59.999995836s","rcpts":["[email protected]"]}
Feb 20 17:54:40 srv1 maddy[501510]: queue: not delivered, permanent error        {"msg_id":"c5eb84cd","rcpt":"[email protected]"}
lukas@srv1:~$

I did notice the the documentations talks about a factor of 1.2 while the code has 1.25, but clearly there is more to this than a simple disparity of the factor.

lukastribus avatar Feb 20 '24 17:02 lukastribus