Timeout error in mail URL fetcher
Hi, I have a problem with the Mail URL collector bot, when the file size is more than 80M it sometimes fails to download the file, but it marks the email as read, and it's really confusing cause there is no information about which email caused the timeout. here is the error message:
2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - Bot has found a problem.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1717, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1523, in _raise_ssl_error
raise WantReadError()
OpenSSL.SSL.WantReadError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1717, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1523, in _raise_ssl_error
raise WantReadError()
OpenSSL.SSL.WantReadError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 437, in _error_catcher
yield
File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/usr/lib/python3.6/http/client.py", line 459, in read
n = self.readinto(b)
File "/usr/lib/python3.6/http/client.py", line 503, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 328, in recv_into
return self.recv_into(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
raise timeout("The read operation timed out")
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 751, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 442, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/dev_intelmq/intelmq/lib/bot.py", line 267, in start
self.process()
File "/opt/dev_intelmq/intelmq/bots/collectors/_mail/lib.py", line 54, in process
if self.process_message(uid, message):
File "/opt/dev_intelmq/intelmq/bots/collectors/_mail/collector_mail_url.py", line 49, in process_message
resp = self.session.get(url=url)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 683, in send
r.content
File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 829, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 758, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.
As you haven't posted the full log but just an excerpt I need to guess the other parts. Therefore, the conclusions might not be correct.
I think I identified the culprit and can remember a discussion in the past about this. If the download times out multiple times (setting http_timeout_max_tries), the mail is marked as read even if there was an error. However, this behavior prevents endless connections (and timeout) as the connection already needs to timeout repeatedly. As a countermeasure, the error is logged. What we have here is a semi-permanent error and endless retries are probably not fixing the issue in your setup.
Have you already tried to increase the http_timeout_sec setting?
Thanks for your reply.
This is the whole error message from my URL-fetcher logs.
I'm going to try increasing the timeout setting. 🙏
I think that it would be helpful if instead of an exception I could get an error message in my logs, maybe with the email subject/URL of the failed message.
I just noticed that the thrown exception is requests.exceptions.ConnectionError and not requests.exceptions.Timeout. The latter would be already treated specially. What is your error_procedure setting? If it is pass, then the behavior (marking the mail as read) is expected, like I described above, otherwise not.
I think that it would be helpful if instead of an exception I could get an error message in my logs,
You can disable the exception logging by setting error_log_exception to false.
maybe with the email subject/URL of the failed message.
Logging the Subject (and date?) is a good idea. The URL is already logged at INFO level, but not specifically on errors as well.
I was trying to regenerate the error but it seems like everything works fine right now.
If I set error_log_exception to false does it log any error messages? (My log level is INFO)
cause last time I just got this exception without any other information about that email.
I'm afraid if I set that to false I don't get any trace that an error had even happened.
And if I may make a suggestion, maybe it would be useful if there was an option to move the read emails to some folder and if there was an error, send those emails to another folder.
I was trying to regenerate the error but it seems like everything works fine right now. If I set
error_log_exceptionto false does it log any error messages? (My log level is INFO) cause last time I just got this exception without any other information about that email. I'm afraid if I set that to false I don't get any trace that an error had even happened.
From the stacktrace above, the only information left would be:
2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.
2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - Bot has found a problem.
And if I may make a suggestion, maybe it would be useful if there was an option to move the read emails to some folder and if there was an error, send those emails to another folder.
Good idea, I will create a separate issue for this.