bulk_extractor
bulk_extractor wordlist should be rewritten to use la-strings.
bulk_extractor wordlist currently accepts a byte as part of a word only if isprint(ch) && ch!=' ' && ch<128.
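For reference, here is a minimal sketch of what that ASCII-only check amounts to. It is not the actual bulk_extractor source: `is_word_byte` and `split_ascii_words` are hypothetical names used only for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// The predicate quoted above: printable, not a space, 7-bit ASCII.
// Taking unsigned char avoids undefined behaviour in isprint() for
// bytes >= 0x80 on platforms where plain char is signed.
bool is_word_byte(unsigned char ch) {
    return ch < 128 && ch != ' ' && std::isprint(ch) != 0;
}

// Hypothetical helper: split a buffer into maximal runs of word bytes.
std::vector<std::string> split_ascii_words(const std::string& buf) {
    std::vector<std::string> words;
    std::string cur;
    for (unsigned char ch : buf) {
        if (is_word_byte(ch)) {
            cur.push_back(static_cast<char>(ch));
        } else if (!cur.empty()) {
            words.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;
}
```

With this predicate, a UTF-8 word such as "blåbær" is cut at every non-ASCII byte and comes out as the fragments "bl", "b" and "r", which is the limitation this issue is about.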
An improvement to this would be to support encodings such as UTF-8, UTF-16 and UTF-32, possibly as options specified by the user. The words should then be converted to a single encoding (UTF-8?) and then split/deduped, for possible conversion and use by the target application.
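A rough sketch of the "convert to a single encoding, then dedupe" step, assuming UTF-8 as the common encoding and UTF-16LE as the only other input. The function names are hypothetical, and a real implementation would also need UTF-16BE and UTF-32.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Append one Unicode code point to out as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Read the 16-bit little-endian code unit at byte offset i.
std::uint32_t unit_at(const std::string& in, std::size_t i) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(in[i])) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(in[i + 1])) << 8);
}

// Convert UTF-16LE bytes to UTF-8. Returns false on an odd byte count
// or an unpaired surrogate.
bool utf16le_to_utf8(const std::string& in, std::string& out) {
    out.clear();
    if (in.size() % 2 != 0) return false;
    for (std::size_t i = 0; i < in.size(); i += 2) {
        std::uint32_t u = unit_at(in, i);
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 3 >= in.size()) return false;    // no room for a low one
            std::uint32_t lo = unit_at(in, i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF) return false;
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;                                  // consumed two units
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                            // stray low surrogate
        }
        append_utf8(out, u);
    }
    return true;
}

// Merge words found as UTF-8 and as UTF-16LE into one deduplicated set.
std::set<std::string> normalize_and_dedupe(const std::vector<std::string>& utf8_words,
                                           const std::vector<std::string>& utf16le_words) {
    std::set<std::string> result(utf8_words.begin(), utf8_words.end());
    for (const std::string& w : utf16le_words) {
        std::string converted;
        if (utf16le_to_utf8(w, converted)) result.insert(converted);
    }
    return result;
}
```

A word that appears on the media once as UTF-8 and once as UTF-16LE then ends up as a single entry in the output list.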
This would be nice. The problem with this approach is that almost any random sequence of bytes decodes as valid UTF-16, and much of it comes out as Han characters, so you're going to need to add a language model. So you'll really need to add la-strings and then tell the system which language or languages you want to extract. That's a complete rewrite of this module. Would you like to do it, or would you be happy with just English strings?
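To put a rough number on the random-bytes problem, the sketch below counts how many of the 65536 possible 16-bit code units are surrogates (the only values that can make a UTF-16 stream invalid) and how many fall in the CJK Unified Ideographs block, U+4E00 to U+9FFF. The names are hypothetical. The count covers that one block only, so the true share of Han and other East Asian characters is somewhat higher.

```cpp
#include <cstdint>

struct Utf16Stats {
    unsigned total;       // all 16-bit code units
    unsigned surrogates;  // U+D800..U+DFFF, invalid unless correctly paired
    unsigned cjk;         // CJK Unified Ideographs block, U+4E00..U+9FFF
};

bool is_surrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDFFF; }
bool is_cjk_unified(std::uint32_t u) { return u >= 0x4E00 && u <= 0x9FFF; }

// Exhaustively classify every possible 16-bit code unit.
Utf16Stats count_code_units() {
    Utf16Stats s{0, 0, 0};
    for (std::uint32_t u = 0; u <= 0xFFFF; ++u) {
        ++s.total;
        if (is_surrogate(u)) ++s.surrogates;
        if (is_cjk_unified(u)) ++s.cjk;
    }
    return s;
}
```

The block sizes give 2048 surrogates (about 3%) and 20992 code units in the CJK block (about 32%). Assuming uniformly random bytes, roughly one 16-bit unit in three decodes to a Han character, and about 97% decode to some code point without error.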
Wouldn't it suffice if the character set (or language) was specified by the user? For example, if I were searching for Norwegian words or passwords written on a Norwegian keyboard, I could grab the full list of Norwegian UTF-8 characters from somewhere like http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet and specify that as my UTF-8 input character set (which could also be converted to UTF-16 and UTF-32). Bundling some charsets with bulk_extractor would be useful as well.
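A minimal sketch of the user-specified character set idea, assuming the input is scanned as UTF-8 and the allowed set is given as Unicode code points. None of this is existing bulk_extractor code: `decode_utf8`, `extract_words` and `norwegian_letters` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <set>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at buf[i]. On success, store the code
// point in cp and return the sequence length; return 0 on malformed input.
std::size_t decode_utf8(const std::string& buf, std::size_t i, std::uint32_t& cp) {
    unsigned char b0 = static_cast<unsigned char>(buf[i]);
    std::size_t len = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (i + len > buf.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char b = static_cast<unsigned char>(buf[i + k]);
        if ((b & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, surrogates and values above U+10FFFF.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Return maximal runs of characters whose code points are all in `allowed`.
std::vector<std::string> extract_words(const std::string& buf,
                                       const std::set<std::uint32_t>& allowed) {
    std::vector<std::string> words;
    std::string cur;
    std::size_t i = 0;
    while (i < buf.size()) {
        std::uint32_t cp = 0;
        std::size_t len = decode_utf8(buf, i, cp);
        if (len > 0 && allowed.count(cp) > 0) {
            cur.append(buf, i, len);
            i += len;
        } else {
            if (!cur.empty()) { words.push_back(cur); cur.clear(); }
            i += (len > 0 ? len : 1);  // skip the character, or one bad byte
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;
}

// Example character set: ASCII letters plus the Norwegian letters
// ae-ligature, o-slash and a-ring in both cases.
std::set<std::uint32_t> norwegian_letters() {
    std::set<std::uint32_t> s;
    for (std::uint32_t c = 'a'; c <= 'z'; ++c) s.insert(c);
    for (std::uint32_t c = 'A'; c <= 'Z'; ++c) s.insert(c);
    for (std::uint32_t c : {0xE6u, 0xF8u, 0xE5u, 0xC6u, 0xD8u, 0xC5u}) s.insert(c);
    return s;
}
```

Because this filter only tests set membership and uses no frequency model, it would also pick up non-dictionary strings such as passwords typed on a Norwegian keyboard. The trade-off is that the user has to know which character set to ask for.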
I'm not sure how the language models in la-strings work, but if it tries to detect language based on character and word frequencies, it might fail on passwords that aren't words in any language, and specifically on passwords written on a non-English keyboard that are stored together with text in a different language, for example English-language URLs or database table names. I may easily have misunderstood how la-strings works, though.
I could try to put some code together, but don't hold your breath! :-) I certainly won't feel bad if someone else beats me to it.
Straying off topic: Is there a generic way to pass parameters to modules without changing bulk_extractor core to support the parameters? That would be useful for this and other potential modules.
bulk_extractor uses the -S option to pass name=value pairs to modules. This is better supported in the 1.4 codebase currently on GitHub. Regarding specifying the language: yes, that's possible, but our experience to date is that the examiner frequently doesn't know which languages are present on the media.
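As an illustration of the name=value idea only, and not of how bulk_extractor implements -S, here is a small sketch that collects such pairs into a map a module could query. `parse_settings` and the setting names in the usage note are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: turn "name=value" strings, such as those supplied
// through repeated -S arguments, into a lookup table.
std::map<std::string, std::string> parse_settings(const std::vector<std::string>& args) {
    std::map<std::string, std::string> settings;
    for (const std::string& arg : args) {
        std::string::size_type eq = arg.find('=');
        if (eq == std::string::npos || eq == 0) continue;  // skip malformed entries
        settings[arg.substr(0, eq)] = arg.substr(eq + 1);  // later values override earlier ones
    }
    return settings;
}
```

A wordlist module could then look up a key such as `wordlist_charset` (a made-up name) and fall back to the current ASCII behaviour when it is absent.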
Hi. I'm reviewing this for BE2.0. What is the status of Language-Aware Strings?