
bulk_extractor wordlist should be rewritten to use la-strings.

Open kefir- opened this issue 11 years ago • 4 comments

bulk_extractor wordlist currently checks whether each byte satisfies isprint(ch) && ch!=' ' && ch<128, that is, whether it is a printable, non-space ASCII character.
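To make the limitation concrete, here is a minimal Python sketch of that byte test. It is not the wordlist scanner's actual code: the function names `is_word_byte` and `ascii_words` are hypothetical, and the range 0x21 to 0x7E assumes `isprint` behaves as in the C locale.

```python
def is_word_byte(b: int) -> bool:
    """Mirror isprint(ch) && ch != ' ' && ch < 128, assuming the C locale."""
    return 0x21 <= b <= 0x7E


def ascii_words(data: bytes, min_len: int = 1) -> list:
    """Split data into runs of word bytes, keeping runs of at least min_len bytes."""
    words = []
    current = bytearray()
    for b in data:
        if is_word_byte(b):
            current.append(b)
            continue
        # A non-word byte ends the current run (if any).
        if len(current) >= min_len:
            words.append(current.decode("ascii"))
        current.clear()
    # Flush a run that reaches the end of the buffer.
    if len(current) >= min_len:
        words.append(current.decode("ascii"))
    return words
```

With this test, the UTF-8 bytes of a word such as "blåbær" are split at every non-ASCII character: `ascii_words("blåbær".encode("utf-8"))` returns `["bl", "b", "r"]`.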

An improvement to this would be to support encodings such as UTF-8, UTF-16 and UTF-32, possibly as options specified by the user. The words should then be converted to a single encoding (UTF-8?) and then split/deduped, for possible conversion and use by the target application.

kefir- avatar May 14 '13 08:05 kefir-

This would be nice. The problem with this approach is that pretty much any random sequence of bytes decodes as valid UTF-16, much of it as Han characters, so you're going to need a language model to filter the results. So you'll really need to add la-strings and then tell the system which language or languages you want to extract. That's a complete rewrite of this module. Would you like to do it, or would you be happy with just English strings?
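The point about random bytes can be checked with a rough sketch. The helper below (a hypothetical name, not part of bulk_extractor or la-strings) reads a buffer as 16-bit little-endian code units and counts how many land in the CJK Unified Ideographs block (U+4E00 to U+9FFF). It deliberately ignores the CJK extension blocks, so it undercounts Han characters.

```python
import random

HAN_START, HAN_END = 0x4E00, 0x9FFF              # CJK Unified Ideographs block only
SURROGATE_START, SURROGATE_END = 0xD800, 0xDFFF  # only valid in UTF-16 as pairs


def classify_code_units(data: bytes) -> dict:
    """Count 16-bit little-endian code units by rough class; a trailing odd byte is ignored."""
    counts = {"han": 0, "surrogate": 0, "ascii_printable": 0, "other": 0}
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if HAN_START <= unit <= HAN_END:
            counts["han"] += 1
        elif SURROGATE_START <= unit <= SURROGATE_END:
            counts["surrogate"] += 1
        elif 0x21 <= unit <= 0x7E:
            counts["ascii_printable"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(20000))
    print(classify_code_units(noise))
```

Of the 65,536 possible code units, 20,992 (about 32%) fall in that Han block, while only 94 are printable non-space ASCII and 2,048 are surrogates. Uniformly random bytes therefore yield Han characters far more often than ASCII text, which is why a language filter is needed.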

simsong avatar May 14 '13 11:05 simsong

Wouldn't it suffice if the character set (or language) was specified by the user? For example, if I were searching for Norwegian words or passwords written on a Norwegian keyboard, I could grab the full list of Norwegian UTF-8 characters from somewhere like http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet and specify that as my UTF-8 input character set (which could also be converted to UTF-16 and UTF-32). Bundling some charsets with bulk_extractor would be useful as well.
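A minimal sketch of this user-specified character set idea, in Python for brevity. The names `build_pattern` and `extract_words` and the lowercase-only alphabet are illustrative assumptions, not existing bulk_extractor features. Each encoding gets its own byte pattern, and every match is decoded to a Python string so that words found in different encodings are deduplicated together.

```python
import re

# Assumed example alphabet: lowercase Latin letters plus the Norwegian æ, ø, å.
NORWEGIAN = "abcdefghijklmnopqrstuvwxyzæøå"


def build_pattern(alphabet: str, encoding: str, min_len: int) -> "re.Pattern":
    """Compile a bytes regex matching runs of at least min_len alphabet characters."""
    encoded = {ch.encode(encoding) for ch in alphabet}
    # Longest alternatives first so multi-byte characters are tried before shorter ones.
    alternatives = sorted((re.escape(e) for e in encoded), key=len, reverse=True)
    quantifier = b"{" + str(min_len).encode("ascii") + b",}"
    return re.compile(b"(?:" + b"|".join(alternatives) + b")" + quantifier)


def extract_words(data: bytes, alphabet: str = NORWEGIAN,
                  encodings=("utf-8", "utf-16-le", "utf-32-le"),
                  min_len: int = 4) -> set:
    """Return the deduplicated set of words found in data under any of the encodings."""
    words = set()
    for encoding in encodings:
        pattern = build_pattern(alphabet, encoding, min_len)
        for match in pattern.finditer(data):
            # Every match is a concatenation of encoded alphabet characters,
            # so decoding with the same encoding cannot fail.
            words.add(match.group().decode(encoding))
    return words
```

For example, a buffer holding "blåbær" in UTF-8 and "smørbrød" in UTF-16LE, separated by binary junk, yields the set `{"blåbær", "smørbrød"}`. The sketch does not enforce code-unit alignment for UTF-16 or UTF-32, and it cannot recognize strings containing characters outside the supplied alphabet.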

I'm not sure how the language models in la-strings work, but if they detect language from character and word frequencies, they might fail on passwords that aren't words in any language, and specifically on passwords typed on a non-English keyboard that are stored alongside text in a different language, for example English-language URLs or database table names. I may easily have misunderstood how la-strings works, though.

I could try to put some code together, but don't hold your breath! :-) I certainly won't feel bad if someone else beats me to it.

Straying off topic: Is there a generic way to pass parameters to modules without changing bulk_extractor core to support the parameters? That would be useful for this and other potential modules.

kefir- avatar May 14 '13 22:05 kefir-

bulk_extractor uses the -S option to pass name=value pairs to modules. This is better supported in the 1.4 codebase currently on GitHub. As for specifying the language: yes, that's possible, but our experience to date is that the examiner frequently doesn't know which languages are present on the media.
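As an illustration of the name=value convention only, the following Python sketch collects repeated -S arguments into a dictionary that a module could query. It is not bulk_extractor's implementation; the function `parse_module_settings` and the setting names used in the example (`charset`, `min_len`) are hypothetical.

```python
import argparse


def parse_module_settings(argv: list) -> dict:
    """Collect repeated '-S name=value' arguments into a {name: value} dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-S", dest="settings", action="append", default=[],
                        metavar="NAME=VALUE",
                        help="pass a name=value setting through to a module")
    args = parser.parse_args(argv)

    settings = {}
    for item in args.settings:
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {item!r}")
        settings[name] = value
    return settings
```

Called with `["-S", "charset=norwegian", "-S", "min_len=6"]`, this returns `{"charset": "norwegian", "min_len": "6"}`. Values stay as strings, so each module would convert them to the type it expects.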

simsong avatar Jun 06 '13 02:06 simsong

Hi. I'm reviewing this for BE2.0. What is the status of Language-Aware Strings?

simsong avatar Jun 26 '20 03:06 simsong