intelmq icon indicating copy to clipboard operation
intelmq copied to clipboard

Timeout error in mail URL fetcher

Open G0meisa opened this issue 5 years ago • 5 comments

Hi, I have a problem with the Mail URL collector bot, when the file size is more than 80M it sometimes fails to download the file, but it marks the email as read, and it's really confusing cause there is no information about which email caused the timeout. here is the error message:

2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - Bot has found a problem.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1717, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1523, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1717, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1523, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 437, in _error_catcher
    yield
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.6/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib/python3.6/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 328, in recv_into
    return self.recv_into(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
    raise timeout("The read operation timed out")
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 442, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/dev_intelmq/intelmq/lib/bot.py", line 267, in start
    self.process()
  File "/opt/dev_intelmq/intelmq/bots/collectors/_mail/lib.py", line 54, in process
    if self.process_message(uid, message):
  File "/opt/dev_intelmq/intelmq/bots/collectors/_mail/collector_mail_url.py", line 49, in process_message
    resp = self.session.get(url=url)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 683, in send
    r.content
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 829, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 758, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.

G0meisa avatar Oct 14 '20 08:10 G0meisa

As you haven't posted the full log but just an excerpt I need to guess the other parts. Therefore, the conclusions might not be correct.

I think I identified the culprit and can remember a discussion in the past about this. If the download times out multiple times (setting http_timeout_max_tries), the mail is marked as read even if there was an error. However, this behavior prevents endless connections (and timeout) as the connection already needs to timeout repeatedly. As a countermeasure, the error is logged. What we have here is a semi-permanent error and endless retries are probably not fixing the issue in your setup.

Have you already tried to increase the http_timeout_sec setting?

ghost avatar Oct 21 '20 08:10 ghost

Thanks for your reply. This is the whole error message from my URL-fetcher logs. I'm going to try increasing the timeout setting. 🙏
I think that it would be helpful if instead of an exception I could get an error message in my logs, maybe with the email subject/URL of the failed message.

G0meisa avatar Oct 21 '20 11:10 G0meisa

I just noticed that the thrown exception is requests.exceptions.ConnectionError and not requests.exceptions.Timeout. The latter would be already treated specially. What is your error_procedure setting? If it is pass, then the behavior (marking the mail as read) is expected, like I described above, otherwise not.

I think that it would be helpful if instead of an exception I could get an error message in my logs,

You can disable the exception logging by setting error_log_exception to false.

maybe with the email subject/URL of the failed message.

Logging the Subject (and date?) is a good idea. The URL is already logged at INFO level, but not specifically on errors as well.

ghost avatar Oct 21 '20 11:10 ghost

I was trying to regenerate the error but it seems like everything works fine right now. If I set error_log_exception to false does it log any error messages? (My log level is INFO) cause last time I just got this exception without any other information about that email. I'm afraid if I set that to false I don't get any trace that an error had even happened.

And if I may make a suggestion, maybe it would be useful if there was an option to move the read emails to some folder and if there was an error, send those emails to another folder.

G0meisa avatar Oct 22 '20 08:10 G0meisa

I was trying to regenerate the error but it seems like everything works fine right now. If I set error_log_exception to false does it log any error messages? (My log level is INFO) cause last time I just got this exception without any other information about that email. I'm afraid if I set that to false I don't get any trace that an error had even happened.

From the stacktrace above, the only information left would be:

2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dl.shadowserver.org', port=443): Read timed out.
2020-10-13 09:15:05,114 - Mail-URL-Fetcher-Shadow - ERROR - Bot has found a problem.

And if I may make a suggestion, maybe it would be useful if there was an option to move the read emails to some folder and if there was an error, send those emails to another folder.

Good idea, I will create a separate issue for this.

ghost avatar Oct 22 '20 13:10 ghost