IPED
Add a RawStringsParser for non-Latin1 languages
The current RawStringsParser, used to extract strings from unallocated, unknown, corrupted or unsupported files, extracts Latin1 script text encoded in windows-1252, UTF-8 or UTF-16, even when the encodings are mixed in the same file. It is a custom implementation and a very fast strings extractor.
We should add a more generic strings extractor in which the encodings or scripts to extract can be configured by the user, even if it is much slower than the default one.
I just came up with an idea for this: we could reuse a heuristic similar to the charset detection implemented months ago for PST/OST emails with unknown charset, running the detection on small blocks with some intersection (overlap) between them. I'm not sure about the block and intersection sizes; this would need testing. It will probably be slower and disabled by default, but it should be generic enough to handle different charsets and scripts.
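To make the block idea concrete, here is a minimal sketch of running a detector over small intersecting (overlapping) blocks with a user-supplied list of candidate charsets. Everything in it is an assumption for illustration: the class and method names (`BlockCharsetSketch`, `blocks`, `score`, `bestCharset`, `detectPerBlock`) are hypothetical, and the scoring (fraction of decoded characters that are letters, digits or whitespace) is only a naive stand-in, not the heuristic used for PST/OST emails. The block and overlap sizes are parameters precisely because good values are still unknown.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class BlockCharsetSketch {

    /** Returns [start, end) ranges covering 'length' bytes; consecutive blocks share 'overlap' bytes. */
    static List<int[]> blocks(int length, int blockSize, int overlap) {
        if (blockSize <= 0 || overlap < 0 || overlap >= blockSize) {
            throw new IllegalArgumentException("need 0 <= overlap < blockSize");
        }
        List<int[]> ranges = new ArrayList<>();
        int step = blockSize - overlap;
        for (int start = 0; start < length; start += step) {
            int end = Math.min(start + blockSize, length);
            ranges.add(new int[] { start, end });
            if (end == length) {
                break; // last block reached the end of the data
            }
        }
        return ranges;
    }

    /** Naive stand-in score: share of decoded chars that are letters, digits or whitespace. */
    static double score(byte[] data, int start, int end, Charset cs) {
        // This String constructor replaces malformed input with U+FFFD instead of throwing,
        // so characters cut at a block boundary only lower the score slightly.
        String text = new String(data, start, end - start, cs);
        if (text.isEmpty()) {
            return 0.0;
        }
        int good = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                good++;
            }
        }
        return (double) good / text.length();
    }

    /** Picks the candidate with the highest score; on ties the earlier candidate wins. */
    static Charset bestCharset(byte[] data, int start, int end, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1.0;
        for (Charset cs : candidates) {
            double s = score(data, start, end, cs);
            if (s > bestScore) {
                bestScore = s;
                best = cs;
            }
        }
        return best;
    }

    /** Runs the detection on every overlapping block and returns one charset per block. */
    static List<Charset> detectPerBlock(byte[] data, int blockSize, int overlap, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (int[] r : blocks(data.length, blockSize, overlap)) {
            result.add(bestCharset(data, r[0], r[1], candidates));
        }
        return result;
    }
}
```

A caller would pass the user-configured charsets as `candidates` and then decode each block with the charset chosen for it. A score this simple is easy to fool (for example, UTF-16 turns almost any byte pair into a CJK or Hangul letter), so a real implementation would need a stronger detector and tests to tune the block and intersection sizes.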