spark-netflow
Consider adding aggregation similar to flow-tools
Aggregation should be flexible, e.g. specifying groupBy and aggregations on numeric columns. Also need to investigate why flow-tools drops records when doing a report in some cases.
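Loosely, this is the kind of aggregation meant here, sketched with plain Spark SQL functions on a DataFrame of flow records (the column names srcip, dstip, octets, and packets are illustrative, not necessarily the library's exact schema):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, sum}

// "flows" is assumed to be a DataFrame of NetFlow records, e.g. one loaded
// with this library; the column names below are illustrative only.
def trafficReport(flows: DataFrame): DataFrame = {
  flows
    .groupBy("srcip", "dstip")            // group key, similar to a flow-tools report
    .agg(
      count("*").as("flows"),             // number of flow records per pair
      sum("octets").as("total_octets"),   // total bytes transferred
      sum("packets").as("total_packets")  // total packets transferred
    )
}
```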
Any update on this? Also, are you considering implementing the ability to calculate flows directly from pcap files?
Hi,
I have not done much work on this. I normally use it with Spark, so aggregation can be done there (it is still slow, in my opinion, but I will address that later). I do not have a pcap file sample to implement this functionality, so this issue is sort of stuck.
If you could help with pcap files, that would be great.
P.S. Could you give me a link to the pcap format and explain a little bit what it is structurally? I normally work with NetFlow files only, so I do not have much experience with other formats.
This should give you an idea about the pcap file format: https://wiki.wireshark.org/Development/LibpcapFileFormat. It's pretty straightforward.
You will also find a lot of pcap samples on the same website.
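For anyone following along, here is a minimal REPL-style sketch of reading the 24-byte libpcap global header described on that wiki page (the file name and variable names are just placeholders):

```scala
import java.io.{DataInputStream, FileInputStream}
import java.nio.{ByteBuffer, ByteOrder}

// Read the 24-byte libpcap global header (see the Wireshark wiki page above).
val in = new DataInputStream(new FileInputStream("sample.pcap"))
val headerBytes = new Array[Byte](24)
in.readFully(headerBytes)
in.close()

val buf = ByteBuffer.wrap(headerBytes)
// Magic number 0xa1b2c3d4 is written in the capturing machine's byte order;
// if it does not match when read little-endian, the file is big-endian.
buf.order(ByteOrder.LITTLE_ENDIAN)
val magic = buf.getInt()
if (magic != 0xa1b2c3d4) buf.order(ByteOrder.BIG_ENDIAN)

val versionMajor = buf.getShort() & 0xffff // usually 2
val versionMinor = buf.getShort() & 0xffff // usually 4
val thisZone     = buf.getInt()            // GMT offset, typically 0
val sigFigs      = buf.getInt()            // timestamp accuracy, typically 0
val snapLen      = buf.getInt()            // max captured length per packet
val network      = buf.getInt()            // link-layer type, e.g. 1 = Ethernet
```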
I have a few questions so I can better understand whether the feature I requested makes sense to implement in this library.
What software produces NetFlow files for you? What is the main use case of this library, and how is it supposed to be used?
Is there a gitter channel available so we can take this discussion further?
Thanks for the link.
pcap files look similar to netflow files, though the header is simpler, which is a good thing. I can generate sample netflow files using flow-gen, which comes with flow-tools; I believe one can still install it with apt-get install flow-tools. You could also use nfdump to read those files.
The specification is here (this is the streaming variant; I use files, which are slightly different): http://netflow.caligare.com/netflow_v5.htm
Normally, we get files delivered in this format already (I assume collected and compressed by some Cisco software and hardware); files can be somewhat large (hundreds of megabytes of compressed binaries).
This library is written mainly to use Apache Spark (http://spark.apache.org/) to read the files and utilize a cluster to do easy ETL, since the library converts netflow data into a DataFrame, but it can also be used as Java code to read files. There is a section in the README on how to do a very simple test. Some sample files are also included in the repository as test resources.
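For illustration, this is roughly what loading and a simple ETL step look like in spark-shell; the format string, option name, and filter column are quoted from memory and should be checked against the README:

```scala
// Sketch only: the format identifier, option, and column names are assumptions,
// see the project README for the exact invocation.
val df = sqlContext.read
  .format("com.github.sadikovi.spark.netflow")
  .option("version", "5")
  .load("file:/path/to/netflow/files")

// A typical ETL step: filter records and persist them as Parquet for later analysis.
df.filter(df("srcport") === 443).write.parquet("file:/path/to/output")
```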
Do you use Spark to read pcap files?
Unfortunately there is no gitter channel.
@sadikovi Thanks for the explanation. Is the process to dump netflow files automated out of the box - meaning, is the Cisco hardware capable of doing that, or is there some additional code that extracts the netflow files from the hardware and dumps them in the place your Spark job is looking for?
Yes, we read pcap files using Spark, and technically speaking we should be able to calculate flows directly from pcap records. I guess I am stuck at researching that bit :)
@r4ravi2008 Something like that; I am not exactly sure how collection happens - my main work is making sure that Spark can read whatever files were delivered :)
I will have a look at pcap files this weekend to see how difficult it is to implement or use an existing reader; I will try to make it not rely on any external commands.
How do you read pcap files? Do you use PipedRDDs and call a shell command to read the files?
In addition to the wiki, I will also be using this repo as a reference (it looks like it has quite a few examples): https://github.com/markofu/pcaps/tree/master/PracticalPacketAnalysis/ppa-capture-files
@sadikovi To read pcap files I used PortableDataStream and parsed the binary data. You can do a similar thing with newHadoopApi if you want an RDD, or by specifying a DefaultSource if you want a DataFrame directly.
For parsing I used references from multiple sources, namely: this and this.
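For context, a rough sketch of the PortableDataStream approach described above; it assumes individual pcap files fit in executor memory and leaves the actual packet parsing as a comment:

```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pcap-reader"))

// binaryFiles gives (path, PortableDataStream) pairs; each stream wraps an
// entire pcap file, so this assumes single files fit in executor memory.
val pcapFiles = sc.binaryFiles("hdfs:///path/to/pcaps/*.pcap")

val sizes = pcapFiles.map { case (path, stream: PortableDataStream) =>
  val bytes = stream.toArray() // full file contents as a byte array
  // Real code would walk the 24-byte global header and the per-packet record
  // headers here (see the libpcap format linked earlier); this sketch just
  // reports the file size to show the plumbing.
  (path, bytes.length)
}

sizes.collect().foreach { case (path, size) => println(s"$path: $size bytes") }
```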
If you are aiming for this library to be something like ntop/nprobe but with scalability, I think it makes sense to add the feature I mentioned, and I will be happy to help in that aspect :)
@r4ravi2008 I would appreciate your help with this, thanks!