Can't parse Team Cymru feed data
Hello,
I have daily country feed data from Team Cymru. The data is in CSV format.
I pull the data to my IntelMQ pipeline through a File Collector. However, I am having trouble parsing the data.
I've tried parsing the csv file with the Generic CSV bot, but the messages take forever to be processed in the queue.
My file contains the following columns: ip, asn & timestamp.
I tried renaming the columns to match the IntelMQ harmonization field names (for example, ip to destination.ip), but the result was the same.
Any suggestions where I am getting it wrong?
Many thanks, Stefan.
Hi, could you provide a small data sample (the first few lines of the file, anonymize IPs if you need to) and your complete collector and parser configuration, so that I can attempt to replicate your problem?
What IntelMQ version are you using?
Hi @gethvi ,
Thank you for the response.
Sample data:
ip asn timestamp
141.136.14.188 57134 06.3.2023 23:41
141.136.14.188 57134 06.3.2023 23:43
141.136.14.188 57134 06.3.2023 23:46
141.136.14.188 57134 06.3.2023 23:54
151.236.247.230 199128 06.3.2023 11:32
151.236.247.230 199128 06.3.2023 19:35
188.117.204.5 41557 06.3.2023 00:04
Configuration:
Bot 1: File collector
run_mode: continuous
delete_file: true
path: linux directory
postfix: .csv
rate_limit: 30
Bot 2: GenericCsv-Parser
columns: I have tried both leaving this blank and setting the column names ([ip, asn, timestamp]). I also tried changing the names of the columns in the file to match the harmonization field names.
skip_header: I've tried both true and false.
// no other configurations here
Moreover, my pipeline has expert bots (deduplicator, taxonomy, url2fqdn, gethostbyname, cymru) as well as a file output. However, the data does not move past the parser.
My versions:
IntelMQ: 3.0.2
IntelMQ API: 3.0.1
IntelMQ Manager: 3.0.1
Ok so in theory your parser config should look something like this (don't use it just yet):
parameters:
  skip_header: true  # you want to skip the first line
  delimiter: " "  # space is your delimiter
  columns:
    - source.ip  # first column should be mapped to "source.ip"
    - source.asn  # second column should be mapped to "source.asn"
    - time.source  # third column should be mapped to "time.source"
However, there is an issue. The timestamp you provided contains a space, which is going to be recognized as the delimiter, so the parser will treat the date and the time as separate columns. You can put them back together, though.
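To illustrate the splitting problem outside of IntelMQ, here is a minimal sketch using only Python's standard csv module. It assumes the fields in the file are separated by single spaces, as in the sample above; the variable names are just for illustration and this is not how the parser bot itself is implemented.

```python
import csv
import io

# Two rows from the sample above, assuming single spaces between fields.
sample = (
    "ip asn timestamp\n"
    "141.136.14.188 57134 06.3.2023 23:41\n"
    "151.236.247.230 199128 06.3.2023 11:32\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=" ")
next(reader)  # skip the header line
rows = list(reader)

# The space inside the timestamp splits it, so each row has four fields.
first = rows[0]
print(first)  # ['141.136.14.188', '57134', '06.3.2023', '23:41']

# Rejoin columns 2 (date) and 3 (time), indexed from 0.
timestamp = "{2} {3}".format(*first)
print(timestamp)  # 06.3.2023 23:41
```

The header line names three columns, but every data row yields four, which is why the date and time have to be rejoined before they can be mapped to a single field.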
Try this configuration of your parser:
parameters:
  skip_header: true  # you want to skip the first line
  delimiter: " "  # space is your delimiter
  columns:
    - source.ip  # first column should be mapped to "source.ip"
    - source.asn  # second column should be mapped to "source.asn"
  compose_fields:
    time.source: "{2} {3}"  # "time.source" is composed from columns 2 (date) and 3 (time) (columns are indexed from 0)
This configuration works (for me), however it parses the datetime wrong: the day and month end up swapped. :disappointed:
>>> from intelmq.lib import harmonization
>>> harmonization.DateTime.convert_fuzzy('06.3.2023 23:41')
'2023-06-03T23:41:00+00:00'
@jovanovskistef I suggest you ask Team Cymru to fix their data format. It is prone to parsing errors.
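For reference, here is a small standard-library sketch of the ambiguity. It assumes the feed writes dates day-first (so 06.3.2023 means 6 March 2023) and that the timestamps are in UTC; both are assumptions on my side, not something confirmed by Team Cymru.

```python
from datetime import datetime, timezone

raw = "06.3.2023 23:41"

# Day-first reading (assumed to be what the feed intends): 6 March 2023.
day_first = datetime.strptime(raw, "%d.%m.%Y %H:%M").replace(tzinfo=timezone.utc)

# Month-first reading, matching the convert_fuzzy output above: 3 June 2023.
month_first = datetime.strptime(raw, "%m.%d.%Y %H:%M").replace(tzinfo=timezone.utc)

print(day_first.isoformat())    # 2023-03-06T23:41:00+00:00
print(month_first.isoformat())  # 2023-06-03T23:41:00+00:00
```

Both readings are valid for any date whose day is 12 or lower, so the parser has no way to tell them apart without being given the format explicitly.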