tune bulk_extractor's default scanners
Running on ubnist1, here is the time (in nanoseconds) and the number of calls for each of the scanners. Recursive scanners are now run in separate lambdas if the amount of data being scanned is over 4K, so these times are accurate.
It seems that email is the slowest scanner. Sadly, it is also the most useful!
There is no reason to optimize any of the scanners after the first 10, given Amdahl's law.
email and accts can be replaced by lightgrep, but I don't know if it will run faster or slower.
I should note that this is without optimization, but there is no reason to think that optimization will change the ordering.
I should also note that there is no reason to make scan_email multi-threaded, since we already have 100% CPU utilization. We need to make it take less time overall.
I should also note that we are not creating sbufs all the time. This is purely the time spent in the flex scanner, in the case of email, accts and httplogs.
<scanner><name>email</name><ns> 227718072058</ns><calls>2183</calls></scanner>
<scanner><name>aes</name><ns> 126784180419</ns><calls>2183</calls></scanner>
<scanner><name>accts</name><ns> 132036377525</ns><calls>2183</calls></scanner>
<scanner><name>net</name><ns> 106580495404</ns><calls>2183</calls></scanner>
<scanner><name>httplogs</name><ns>47204179243</ns><calls>2183</calls></scanner>
<scanner><name>find</name><ns> 39900576856</ns><calls>2183</calls></scanner>
<scanner><name>rar</name><ns> 32578984879</ns><calls>2183</calls></scanner>
<scanner><name>exif</name><ns> 23340422218</ns><calls>1952</calls></scanner>
<scanner><name>gzip</name><ns> 16205349672</ns><calls>2183</calls></scanner>
<scanner><name>winpe</name><ns> 15328157516</ns><calls>2183</calls></scanner>
<scanner><name>winprefetch</name><ns>13331845641</ns><calls>2128</calls></scanner>
<scanner><name>zip</name><ns> 11328446245</ns><calls>2183</calls></scanner>
<scanner><name>json</name><ns> 12067960716</ns><calls>2183</calls></scanner>
<scanner><name>winlnk</name><ns> 10735573433</ns><calls>2010</calls></scanner>
<scanner><name>elf</name><ns> 9188981562</ns><calls>2183</calls></scanner>
<scanner><name>hiberfile</name><ns>8199420440</ns><calls>2183</calls></scanner>
<scanner><name>base64</name><ns> 8106005868</ns><calls>2183</calls></scanner>
<scanner><name>facebook</name><ns>4813394668</ns><calls>2183</calls></scanner>
<scanner><name>utmp</name><ns> 3231399202</ns><calls>2183</calls></scanner>
<scanner><name>windirs</name><ns>2760372736</ns><calls>80</calls></scanner>
<scanner><name>pdf</name><ns> 2668592167</ns><calls>2183</calls></scanner>
<scanner><name>evtx</name><ns> 1281086989</ns><calls>2183</calls></scanner>
<scanner><name>ntfsusn</name><ns>1582712731</ns><calls>2183</calls></scanner>
<scanner><name>gps</name><ns> 403738195</ns><calls>2183</calls></scanner>
<scanner><name>kml</name><ns> 265192204</ns><calls>2183</calls></scanner>
<scanner><name>sqlite</name><ns> 250249344</ns><calls>2183</calls></scanner>
<scanner><name>vcard</name><ns> 216024356</ns><calls>2183</calls></scanner>
<scanner><name>msxml</name><ns> 143476873</ns><calls>2183</calls></scanner>
<scanner><name>ntfsmft</name><ns>40870371</ns><calls>2183</calls></scanner>
<scanner><name>ntfslogfile</name><ns>19422627</ns><calls>2183</calls></scanner>
<scanner><name>ntfsindx</name><ns>16005992</ns><calls>2183</calls></scanner>
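For anyone who wants to read the numbers above in seconds and see how quickly the slowest scanners add up, here is a minimal sketch. It is not part of bulk_extractor; the function names and the three-line sample are chosen only for illustration, and the regular expression assumes the exact one-line-per-scanner layout shown above.

```python
import re

# Matches one <scanner> line in the layout shown above (assumed format).
SCANNER_RE = re.compile(
    r"<scanner><name>(\w+)</name><ns>\s*(\d+)</ns><calls>(\d+)</calls></scanner>"
)

def parse_scanner_times(text):
    """Return (name, seconds, calls) tuples, slowest scanner first."""
    rows = [
        (name, int(ns) / 1e9, int(calls))
        for name, ns, calls in SCANNER_RE.findall(text)
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

def cumulative_share(rows):
    """Return (name, seconds, running fraction of total scanner time)."""
    total = sum(seconds for _, seconds, _ in rows)
    running = 0.0
    shares = []
    for name, seconds, _ in rows:
        running += seconds
        shares.append((name, seconds, running / total))
    return shares

# Three lines copied from the report above, used here as sample input.
SAMPLE = """\
<scanner><name>email</name><ns> 227718072058</ns><calls>2183</calls></scanner>
<scanner><name>aes</name><ns> 126784180419</ns><calls>2183</calls></scanner>
<scanner><name>accts</name><ns> 132036377525</ns><calls>2183</calls></scanner>
"""

if __name__ == "__main__":
    for name, seconds, share in cumulative_share(parse_scanner_times(SAMPLE)):
        print(f"{name:12s} {seconds:10.1f} s  cumulative {share:6.1%}")
```

Feeding the full list into `parse_scanner_times` instead of `SAMPLE` gives the cumulative share at each rank, which is the figure the Amdahl's law argument depends on.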
Interesting. Which search terms did you use for scan_find?
I hope to get the lightgrep scanners working this week. My recollection was that they were indeed faster than the equivalent flex scanners—there's just one search pass through each sbuf and then dispatch of hits to the different scanners, and the time for that one pass and then hit resolution was significantly faster than the combined time of the flex-based scanners. What I remember was that I could hear the difference on my workstation, as the fans didn't max out.
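To make the "one pass, then dispatch" idea concrete, here is a rough sketch using Python's `re` module. This is not lightgrep's API and not how its engine works internally; the two patterns are deliberately simplified, hypothetical stand-ins for the real scanner patterns, and `scan_once` is a made-up name.

```python
import re
from collections import defaultdict

# Hypothetical, heavily simplified patterns, one per logical scanner.
PATTERNS = {
    "email": rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "url": rb"https?://[^\s<>]+",
}

# Combine every pattern into one alternation of named groups, so the
# buffer is walked once instead of once per scanner.
COMBINED = re.compile(
    b"|".join(
        b"(?P<%s>%s)" % (name.encode("ascii"), pattern)
        for name, pattern in PATTERNS.items()
    )
)

def scan_once(buf):
    """Search buf in a single pass and group hits by the pattern that matched."""
    hits = defaultdict(list)
    for match in COMBINED.finditer(buf):
        # lastgroup is the name of the alternative that matched.
        hits[match.lastgroup].append((match.start(), match.group()))
    return dict(hits)

if __name__ == "__main__":
    data = b"contact alice@example.com or see http://example.org/x now"
    for scanner, found in scan_once(data).items():
        print(scanner, found)
```

Each hit is routed to its scanner by group name, which mirrors the description above: one search pass per sbuf, followed by hit resolution.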
My recollection was also that scan_aes was significantly slower than the other scanners. Interesting to see that's not the case on ubnist1.
Thanks for looking at that. That's a bug! I didn't have any search terms, so scan_find shouldn't be running.
I was thinking of ways to speed up scan_aes. One idea was that if a block only has 2 or 3 different char codes in it, don't bother scanning it, because encryption keys like 11 22 11 22 11 22 11 22... are exceedingly unlikely. Of course, then we will miss those encryption keys.
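A minimal sketch of that pre-filter, just to show the shape of the check. It is not code from scan_aes; the function names and the threshold of 4 distinct values are assumptions taken from the "2 or 3 different char codes" wording above.

```python
def distinct_byte_values(block: bytes) -> int:
    """Count how many different byte values occur in the block."""
    return len(set(block))

def worth_scanning_for_aes(block: bytes, min_distinct: int = 4) -> bool:
    """Return False for blocks with too few distinct byte values.

    min_distinct=4 is a hypothetical threshold: blocks with only
    1 to 3 distinct values are skipped.
    """
    return distinct_byte_values(block) >= min_distinct

if __name__ == "__main__":
    repetitive = bytes.fromhex("1122" * 16)   # 11 22 11 22 ...
    varied = bytes(range(32))                 # 32 different byte values
    print(worth_scanning_for_aes(repetitive), worth_scanning_for_aes(varied))
```

The trade-off is the one already noted: a key that really is 11 22 11 22... would be skipped, so the threshold decides how much coverage is given up for speed.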
With ubnist1, do you use gen1, gen2, or gen3?
Speeding up scan_aes can be represented with #158. I will put some notes in there. I think there's more performance to be gotten from focusing on a rewrite of the valid key schedule functions than adding conditionals.
Gen3
Closing this. We did a good job tuning for BE2.0.