Incomplete download from ShadowServer URLs
Hi, I'm using the Mail URL Fetcher bot and I have experienced some problems with the Shadowserver emails. Sometimes the files downloaded from the URLs are empty (with a "Got empty reponse from server" warning), or smaller than they should be (with no error or warning at all). When I download those files again, everything is fine. Since Shadowserver states the file size and the number of records in the email, it would be useful to raise an error, or re-download the file, if it was not downloaded completely.
(Maybe I'm the only one who faced this issue. :worried:)
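To illustrate the idea: the counts that Shadowserver announces in the mail could be parsed out and compared against the download. This is only a sketch; the exact mail phrasing ("contains N events", "157K bytes") is an assumption based on typical Shadowserver notification mails, so the regexes may need adjusting.

```python
import re

def parse_reported_counts(body: str):
    """Extract the record count and (rounded) size that Shadowserver
    states in its notification mail. The exact phrases are assumptions;
    adjust the regexes to the actual mail wording."""
    # Shadowserver writes e.g. "contains 1,234 events"; drop the
    # thousands separator before matching.
    events = re.search(r'contains (\d+) event', body.replace(',', ''))
    # The size is only given rounded, e.g. "157K bytes".
    size = re.search(r'\d+[KMG]? bytes', body)
    return (int(events.group(1)) if events else None,
            size.group(0) if size else None)

example = "The report contains 1,234 events and is 157K bytes in size."
print(parse_reported_counts(example))  # (1234, '157K bytes')
```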
Hi.
I have had the same experience and wasn't sure whether we, our code, or ShadowServer were to blame. We're using RTIR and therefore have RT collector bots fetching the ShadowServer reports. Regularly (every few days) I see the error message WARNING - Ignoring report without raw field. Possible bug or misconfiguration of this bot. which indicates that the download resulted in no data. That is not an error per se, as a report can legitimately be empty.
Checking the size of the downloaded archive against the size given in the message sounds like a good idea!
Another option would be to treat the warnings we are getting ("Got empty reponse from server" in your case with the IMAP collector, "Ignoring report without raw field" in our case with the RT collector) as errors if a certain parameter, e.g. raise_for_empty_report, is set. That parameter would then be set to true for the ShadowServer feeds, independent of the collector used. I'm not sure which solution is better.
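A rough sketch of how that parameter could behave. The class, method name and the use of ValueError are illustrative assumptions, not existing IntelMQ API:

```python
class CollectorSketch:
    """Illustrative only: the proposed raise_for_empty_report parameter
    turns an empty download from a warning into a hard error."""

    def __init__(self, raise_for_empty_report=False):
        self.raise_for_empty_report = raise_for_empty_report

    def check_report(self, content: bytes) -> bytes:
        if not content and self.raise_for_empty_report:
            # Raise instead of warning, so the collector retries the
            # download rather than silently dropping the report.
            raise ValueError('Got empty response from server.')
        return content
```

With the parameter unset the current behaviour (accept the empty report) is kept, so non-Shadowserver feeds are unaffected.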
In any case, I will additionally reach out to ShadowServer to check whether they are aware of any issues, as this affects multiple organizations.
Thanks for the reply. I made some changes to the code to store the downloaded files in a folder so that I can check for empty files whenever a warning occurs. As I mentioned, some of the reports are only partially downloaded with no warning at all (e.g. the file should be 100 MB, but an incomplete 2 MB version is downloaded). For this reason, just raising an error (on a warning or an empty download) may not be enough, because in these cases there is nothing to detect.
As I mentioned, I noticed that some of the reports are partially downloaded with no warning (e.g. the size should be 100 MB but an incomplete 2 MB version is downloaded).
I wonder how that can work, given that the downloaded file is an archive, isn't it?
I also thought that files are archived, but now I'm not sure anymore.
The downloads are just CSV. If a download stopped in the middle of a CSV line, there would be parsing errors, but I have seen none here yet. That means the downloads either never start (empty raw / no response) or stop during transmission exactly between two lines.
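Along those lines, one cheap integrity check: a complete CSV download should end with a newline, so a transfer truncated mid-line is detectable without parsing, and a truncation exactly between lines is caught by the line count. This is a heuristic sketch; it assumes one header line per report, which may vary per feed:

```python
def looks_complete(content: bytes, expected_events: int) -> bool:
    """Heuristic completeness check for a downloaded CSV report.
    expected_events is the count announced in the notification mail
    (an assumption; see the parsing discussion above)."""
    if not content:
        # An empty download is only fine if zero events were announced.
        return expected_events == 0
    if not content.endswith(b'\n'):
        # Transfer stopped in the middle of a line.
        return False
    # One header line plus one line per event.
    return content.count(b'\n') == expected_events + 1

print(looks_complete(b'src,dst\n1.2.3.4,5.6.7.8\n8.8.8.8,9.9.9.9\n', 2))  # True
print(looks_complete(b'src,dst\n1.2.3.4,5.6.7.8\n8.8.8.8,9.9.', 2))       # False
```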
I fixed my problem with the following code. I know it is not a good solution (someone might use the Mail URL bot for other sources), but in my case it solved the problem.
```python
def process_message(self, uid, message):
    erroneous = False  # If errors occurred, this will be set to true.
    seen = False       # Whether the mail may be marked as read afterwards.
    for body in message.body['plain']:
        incomplete = False
        body = body.decode('utf-8') if isinstance(body, bytes) else body
        match = re.search(self.parameters.url_regex, body)
        # Remove the thousands separator before extracting the count.
        event_num = re.search(r'contains (\d+) event', body.replace(',', ''))
        if match:
            event_num = int(event_num.group(1)) if event_num else 0
            # Strip leading and trailing spaces, newlines and
            # carriage returns.
            url = match.group().strip()
            self.logger.info("Downloading report from %r.", url)
            try:
                resp = self.session.get(url=url)
            except requests.exceptions.Timeout:
                self.logger.error("Request timed out %i times in a row.",
                                  self.http_timeout_max_tries)
                erroneous = True
                # The download timed out too often, try the next body part.
                continue
            if resp.status_code // 100 != 2:
                self.logger.error('HTTP response status code was %i.',
                                  resp.status_code)
                erroneous = True
                continue
            if not resp.content:
                self.logger.warning('Got empty reponse from server.')
            else:
                self.logger.info("Report downloaded.")
                template = self.new_report()
                template["feed.url"] = url
                template["extra.email_subject"] = message.subject
                template["extra.email_from"] = ','.join(x['email'] for x in message.sent_from)
                template["extra.email_message_id"] = message.message_id
                template["extra.file_name"] = file_name_from_response(resp)
                # Count the CSV lines: a complete report has one header
                # line plus one line per event announced in the mail.
                data = io.BytesIO(resp.content)
                reader = csv.reader(codecs.iterdecode(data, 'utf-8'))
                lines = len(list(reader))
                if event_num == lines - 1:
                    for report in generate_reports(template, io.BytesIO(resp.content),
                                                   self.chunk_size,
                                                   self.chunk_replicate_header):
                        self.send_message(report)
                else:
                    self.logger.warning('%s downloaded incompletely.',
                                        template["extra.file_name"])
                    incomplete = True
                    erroneous = True
        # Only mark the mail as seen if the download was complete, so
        # an incomplete report is fetched again on the next run.
        seen = not incomplete
    return seen
```
If I get an incomplete download, I keep the email unread so it is downloaded again the next time. By the way, I get incomplete-download warnings almost every day!
Thanks for your PoC!
The line event_num = re.search('contains (\d+) event', body.replace(',','')) could be generalized by moving the regex to a separate parameter. lines = len(list(reader)) is not really performant; counting newlines would be faster. But it still depends on the format and its extra data (like CSV headers and possibly comments) and is therefore not future-proof. Comparing bytes has the big advantage that it does not depend on these issues, but unfortunately Shadowserver only provides a rounded size ("157K bytes").
I hate (network) connection issues like this.
cc @elsif2