gocsv
Grouping/stacking datasets
Feature request: It would be amazing to be able to perform the following analysis with gocsv - unless of course I've missed something!
https://www.fireeye.com/blog/threat-research/2012/11/indepth-data-stacking.html
Thanks for the link, I always enjoy a read about aggregating/grouping, forensics, and statistical thinking.
From what I read, nothing magical or special about the tooling jumped out at me. They didn't show any of the tools, though, so maybe there's some magic in how efficiently they handle large datasets.
Considering the first example... if the services data looked something like:
big_data.csv
Service Name | Path | Service DLL |
---|---|---|
Seclogon | system32\svchost.exe | system32\seclogon.dll |
Seclogon | system32\svchost.exe | system32\seclogon.dll |
Seclogon | system32\svchost.exe | system32\seclogon.dll |
... 5595 more | ... rows | ... like this |
Seclogon | system32\svchost.exe | system32\selogon.dll |
Seclogon | system32\svchost.exe | system32\selogon.dll |
iprip | system32\svchost.exe | system32\iprip.dll |
... 5234 more | ... rows | ... like this |
iprip | system32\svchost.exe | system32\iprinp.dll |
iprip | system32\svchost.exe | system32\iprinp.dll |
iprip | system32\svchost.exe | temp\iprip.dll |
iprip | system32\svchost.exe | temp\iprip.dll |
iprip | system32\svchost.exe | temp\iprip.dll |
This GoCSV pipeline:
gocsv unique -c Service\ Name,Path,Service\ DLL --count big_data.csv | gocsv select -c Count,Service\ Name,Path,Service\ DLL
would produce a table like:
Count | Service Name | Path | Service DLL |
---|---|---|---|
5598 | Seclogon | system32\svchost.exe | system32\seclogon.dll |
2 | Seclogon | system32\svchost.exe | system32\selogon.dll |
5235 | iprip | system32\svchost.exe | system32\iprip.dll |
2 | iprip | system32\svchost.exe | system32\iprinp.dll |
3 | iprip | system32\svchost.exe | temp\iprip.dll |
For unique, choose the columns that together represent the thing you want to check for differences. If a value in any of those columns differs between two rows, those rows fall into separate groups, and each group gets its own count. Then pass the result through select to pare it down and put the columns in the order you want for visual inspection.
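To make the grouping idea concrete, here is a minimal Python sketch of the same logic: count rows per distinct combination of the chosen columns, then list the rarest groups first, since those are usually the ones worth a look. This is not how GoCSV is implemented. The `stack` function, the `SAMPLE` data, and its tiny row counts are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature version of big_data.csv (row counts are invented).
SAMPLE = r"""Service Name,Path,Service DLL
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\selogon.dll
iprip,system32\svchost.exe,temp\iprip.dll
iprip,system32\svchost.exe,temp\iprip.dll
"""

KEY_COLUMNS = ["Service Name", "Path", "Service DLL"]


def stack(rows, key_columns):
    """Count rows per distinct combination of values in key_columns.

    Two rows land in the same group only if every key column matches.
    Returns (group, count) pairs sorted by ascending count.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(counts.items(), key=lambda item: item[1])


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = stack(rows, KEY_COLUMNS)

# Print one line per group, rarest first.
for group, count in result:
    print(count, *group, sep=" | ")
```

The misspelled `selogon.dll` group sorts to the top because it has the fewest rows. Adding or removing a name in `KEY_COLUMNS` changes what counts as "the same", just as the column list passed to unique does.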