Grouping/stacking datasets

Open geekscrapy opened this issue 5 years ago • 1 comments

Feature request: It would be amazing to be able to perform the following analysis with gocsv - unless of course I've missed something!

https://www.fireeye.com/blog/threat-research/2012/11/indepth-data-stacking.html

geekscrapy, Dec 10 '19 09:12

Thanks for the link, I always enjoy a read about aggregating/grouping, forensics, and statistical thinking.

From what I read, nothing magical or special in the tooling jumped out at me. Then again, they didn't show any of the tools, so maybe there's some magic in how they efficiently handle large datasets.

Considering the first example... if the services data looked something like:

big_data.csv

| Service Name | Path | Service DLL |
| --- | --- | --- |
| Seclogon | system32\svchost.exe | system32\seclogon.dll |
| Seclogon | system32\svchost.exe | system32\seclogon.dll |
| Seclogon | system32\svchost.exe | system32\seclogon.dll |
| ... 5595 more ... | rows ... | like this |
| Seclogon | system32\svchost.exe | system32\selogon.dll |
| Seclogon | system32\svchost.exe | system32\selogon.dll |
| iprip | system32\svchost.exe | system32\iprip.dll |
| ... 5234 more ... | rows ... | like this |
| iprip | system32\svchost.exe | system32\iprinp.dll |
| iprip | system32\svchost.exe | system32\iprinp.dll |
| iprip | system32\svchost.exe | temp\iprip.dll |
| iprip | system32\svchost.exe | temp\iprip.dll |
| iprip | system32\svchost.exe | temp\iprip.dll |

This GoCSV pipeline:

gocsv unique -c Service\ Name,Path,Service\ DLL --count big_data.csv | gocsv select -c Count,Service\ Name,Path,Service\ DLL

would produce a table like:

| Count | Service Name | Path | Service DLL |
| --- | --- | --- | --- |
| 5598 | Seclogon | system32\svchost.exe | system32\seclogon.dll |
| 2 | Seclogon | system32\svchost.exe | system32\selogon.dll |
| 5235 | iprip | system32\svchost.exe | system32\iprip.dll |
| 2 | iprip | system32\svchost.exe | system32\iprinp.dll |
| 3 | iprip | system32\svchost.exe | temp\iprip.dll |

For unique, choose the columns that together capture what you want to check for differences. Rows are grouped by the combination of values in those columns: if a row differs from another row in even one of them, it lands in a separate group with its own count. Then pass the result through select to put Count first and pare the table down for visual inspection.
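
To make the grouping idea concrete without gocsv installed, here is a rough plain-shell analogue using sort and uniq. It is only a sketch under some assumptions: sample.csv and the stack_rows function are hypothetical names made up for this illustration, the data is a tiny stand-in for big_data.csv, and the approach treats each line as one record, so it would not cope with quoted fields that contain newlines. It also groups on the whole row rather than on chosen columns.

```shell
# Hypothetical miniature of big_data.csv, just for illustration.
cat > sample.csv <<'EOF'
Service Name,Path,Service DLL
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\seclogon.dll
Seclogon,system32\svchost.exe,system32\selogon.dll
iprip,system32\svchost.exe,system32\iprip.dll
iprip,system32\svchost.exe,system32\iprip.dll
EOF

# stack_rows FILE: skip the header, count identical rows, list rarest first.
stack_rows() {
  tail -n +2 "$1" | sort | uniq -c | sort -n
}

stack_rows sample.csv
```

Sorting by ascending count puts the rarest rows at the top, which is where the outliers (here, the misspelled selogon.dll) tend to show up.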

zacharysyoung, May 25 '21 22:05