
Incomplete download from ShadowServer URLs

Open G0meisa opened this issue 5 years ago • 8 comments

Hi, I'm using the Mail URL Fetcher bot and I have experienced some problems with the Shadowserver emails. Sometimes the files downloaded from the URLs are empty (with a "Got empty reponse from server" warning) or smaller than they should be (with no error or warning at all). When I download those files again, everything is fine. Since Shadowserver states the file size and the number of records in the email, it would be useful to report an error, or simply re-download the file, if it was not downloaded correctly (roughly as in the sketch below).

(Maybe I'm the only one who faced this issue. :worried:)
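
Roughly what I have in mind, as a sketch only: all names (download_report, expected_records) are placeholders and the retry limit is arbitrary, this is not how the bot currently works:

import csv
import io

import requests


def download_report(session: requests.Session, url: str, expected_records: int) -> bytes:
    """Rough sketch: re-download a report until the CSV record count matches
    the count announced in the Shadowserver notification mail."""
    for attempt in range(1, 4):  # arbitrary retry limit, just for illustration
        resp = session.get(url)
        resp.raise_for_status()
        # Count CSV records, minus the header line (one header line assumed).
        records = sum(1 for _ in csv.reader(io.StringIO(resp.text))) - 1
        if resp.content and records == expected_records:
            return resp.content
    raise RuntimeError('Report %r still incomplete after %d attempts' % (url, attempt))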

G0meisa avatar Dec 16 '20 08:12 G0meisa

Hi.

I have had the same experience and I wasn't sure whether it was us, some code, or ShadowServer that was to blame. We're using RTIR and therefore have RT collector bots to fetch the ShadowServer reports. I regularly (every few days) see the error message WARNING - Ignoring report without raw field. Possible bug or misconfiguration of this bot. which indicates that the download resulted in no data. That is not an error per se, as a report could legitimately be empty.

Checking the size of the downloaded archive against the size given in the message sounds like a good idea! Another option would be to treat the warnings we are getting ("Got empty reponse from server" in your case with the IMAP collector, "Ignoring report without raw field" in our case with the RT collector) as an error if a certain parameter, e.g. raise_for_empty_report, is set. That parameter would then be set to true for the ShadowServer feeds, independent of the collector used. I'm not sure which solution is better.
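
For illustration, roughly how such a parameter could look inside a collector's download handling (raise_for_empty_report is only a proposed name, nothing that exists yet):

# Hypothetical check inside a collector bot, right after the HTTP download:
if not resp.content:
    if getattr(self.parameters, 'raise_for_empty_report', False):
        # Fail hard so the mail / RT ticket is not marked as processed
        # and the report gets fetched again on the next run.
        raise ValueError('Got empty report from %r.' % url)
    self.logger.warning('Got empty reponse from server.')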

In any case, I will additionally reach out to ShadowServer to check if they are aware of any issues, as the issue affects multiple organizations.

ghost avatar Dec 18 '20 15:12 ghost

Thanks for the reply. I made some changes to the code to store the downloaded files in a folder so that I can check for empty files whenever a warning occurs. As I mentioned, I noticed that some of the reports are partially downloaded with no warning at all (e.g. the size should be 100 MB but an incomplete 2 MB version is downloaded). For this reason, just raising an error when there is a warning or an empty file may not be enough, because these partial downloads produce neither.

G0meisa avatar Dec 18 '20 16:12 G0meisa

As I mentioned, I noticed that some of the reports are partially downloaded with no warning at all (e.g. the size should be 100 MB but an incomplete 2 MB version is downloaded).

I wonder how that can happen, given that the downloaded file is an archive, isn't it?

ghost avatar Dec 19 '20 10:12 ghost

I wonder how that can happen, given that the downloaded file is an archive, isn't it?

I also thought the files were archived, but now I'm not sure anymore.

G0meisa avatar Dec 20 '20 04:12 G0meisa

I wonder how that can happen, given that the downloaded file is an archive, isn't it?

I also thought the files were archived, but now I'm not sure anymore.

The downloads are just CSV. If a download stopped in the middle of a CSV line, there would be parsing errors, but I have not seen any here yet. That means the downloads either never start (empty raw / no response) or stop during transmission, but exactly between two lines.

ghost avatar Dec 23 '20 10:12 ghost

I worked around my problem with this code. I know it is not a good general solution (because someone might use the Mail_URL bot for other sources), but in my case it solved the problem.

# Relies on the collector module's existing imports and helpers used below
# (re, io, csv, codecs, requests, file_name_from_response, generate_reports).
def process_message(self, uid, message):
    erroneous = False  # If errors occurred this will be set to True.
    seen = False
    incomplete = False
    for body in message.body['plain']:
        incomplete = False
        body = str(body.decode('utf-8') if isinstance(body, bytes) else body)
        match = re.search(self.parameters.url_regex, body)
        # Remove the thousands separator, then extract the announced event count.
        event_num = re.search(r'contains (\d+) event', body.replace(',', ''))

        if match:
            if event_num:
                event_num = int(event_num.group(1))
            else:
                event_num = 0
            url = match.group()
            # Strip leading and trailing spaces, newlines and carriage returns.
            url = url.strip()

            self.logger.info("Downloading report from %r.", url)
            try:
                resp = self.session.get(url=url)
            except requests.exceptions.Timeout:
                self.logger.error("Request timed out %i times in a row.",
                                  self.http_timeout_max_tries)
                erroneous = True
                # The download timed out too often, skip this URL.
                continue

            if resp.status_code // 100 != 2:
                self.logger.error('HTTP response status code was %i.',
                                  resp.status_code)
                erroneous = True
                continue

            if not resp.content:
                self.logger.warning('Got empty reponse from server.')
            else:
                self.logger.info("Report downloaded.")

                template = self.new_report()
                template["feed.url"] = url
                template["extra.email_subject"] = message.subject
                template["extra.email_from"] = ','.join(x['email'] for x in message.sent_from)
                template["extra.email_message_id"] = message.message_id
                template["extra.file_name"] = file_name_from_response(resp)

                # Count the CSV lines of the download and compare against the
                # event count announced in the mail (one header line expected).
                data = io.BytesIO(resp.content)
                reader = csv.reader(codecs.iterdecode(data, 'utf-8'))
                lines = len(list(reader))
                if event_num == lines - 1:
                    for report in generate_reports(template, io.BytesIO(resp.content),
                                                   self.chunk_size,
                                                   self.chunk_replicate_header):
                        self.send_message(report)
                else:
                    self.logger.warning('%s downloaded incompletely.',
                                        template["extra.file_name"])
                    incomplete = True
                    erroneous = True

            if incomplete:
                seen = False
            else:
                seen = True

    # Assumption: the surrounding mail collector only marks the mail as seen when
    # this method returns a true value, so incomplete downloads stay unread and
    # get fetched again on the next run.
    return seen

If I get an incomplete download, I keep the email as unread so it can be downloaded again next time. BTW, almost every day I get some incomplete-download warnings!

G0meisa avatar Jan 03 '21 05:01 G0meisa

Thanks for your PoC!

The line event_num = re.search(r'contains (\d+) event', body.replace(',','')) could be generalized by moving the regex to a separate parameter. lines = len(list(reader)) is not really performant; counting newlines would be faster. But it still depends on the format and its extra data (like CSV headers and possibly comments) and is therefore not future-proof. Comparing byte sizes has the big advantage that it does not depend on these issues, but unfortunately Shadowserver only provides a rounded size ("157K bytes").
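
Roughly, the counting part of your snippet could then be replaced by something like this (still assuming exactly one header line and a trailing newline, which is exactly why it is not future-proof):

# Faster than materializing the whole CSV via len(list(reader)):
# count raw newlines and subtract the header line.
lines = resp.content.count(b'\n')      # assumes the file ends with a newline
complete = (event_num == lines - 1)    # assumes exactly one CSV header line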

I hate (network) connection issues like this.

ghost avatar Jan 05 '21 10:01 ghost

cc @elsif2

sebix avatar Jul 01 '22 19:07 sebix