
Add a RawStringsParser for non Latin1 languages

Open · lfcnassif opened this issue 3 years ago · 1 comment

The current RawStringsParser, used to extract strings from unallocated space and from unknown, corrupted or unsupported files, extracts Latin1 text encoded as windows-1252, UTF-8 or UTF-16, even when those encodings are mixed in the same file. It is a custom implementation and a very fast strings extractor.

We should add a more generic strings extractor whose target encodings or scripts can be configured by the user, even if it is much slower than the default one.

lfcnassif · Apr 03 '21 15:04

I just came up with an idea for this: we could reuse a heuristic similar to the charset detection implemented months ago for PST/OST emails with unknown charset, running the detection on small, partially overlapping blocks. I'm not sure about the block and overlap sizes; that would need testing. It will probably be slower and disabled by default, but it should be generic enough to handle different charsets and scripts.
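
To make the block idea concrete, here is a minimal sketch using only the JDK. It is not IPED code and it does not reproduce the PST/OST heuristic: the class name `BlockCharsetSketch`, the scoring rule (share of decoded characters that are letters, digits or spaces) and the block and intersection sizes are all assumptions made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of IPED.
final class BlockCharsetSketch {

    /** Splits [0, length) into {start, end} windows of blockSize bytes sharing `overlap` bytes. */
    static List<int[]> blocks(int length, int blockSize, int overlap) {
        if (blockSize <= 0 || overlap < 0 || overlap >= blockSize) {
            throw new IllegalArgumentException("need 0 <= overlap < blockSize");
        }
        List<int[]> out = new ArrayList<>();
        int step = blockSize - overlap;
        for (int start = 0; start < length; start += step) {
            int end = Math.min(start + blockSize, length);
            out.add(new int[] { start, end });
            if (end == length) {
                break; // the last window already reaches the end of the data
            }
        }
        return out;
    }

    /** Assumed heuristic: share of decoded chars that are letters, digits or spaces. */
    static double score(byte[] data, int start, int end, Charset cs) {
        // new String(...) replaces malformed input with U+FFFD, which counts as "bad" here.
        String text = new String(data, start, end - start, cs);
        if (text.isEmpty()) {
            return 0.0;
        }
        int good = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == ' ') {
                good++;
            }
        }
        return (double) good / text.length();
    }

    /** Returns the best-scoring candidate for each block; ties go to the earlier candidate. */
    static List<Charset> detectPerBlock(byte[] data, int blockSize, int overlap,
                                        List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (int[] b : blocks(data.length, blockSize, overlap)) {
            Charset best = null;
            double bestScore = -1.0;
            for (Charset cs : candidates) {
                double s = score(data, b[0], b[1], cs);
                if (s > bestScore) {
                    bestScore = s;
                    best = cs;
                }
            }
            result.add(best);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample Cyrillic text written with Unicode escapes, encoded as UTF-8.
        byte[] data = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440 "
                .getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // Block size 16 and overlap 4 are arbitrary values chosen for the demo.
        System.out.println(detectPerBlock(data, 16, 4, candidates));
    }
}
```

The overlap is there so that a string cut by one block boundary still appears whole in the neighbouring block. Two caveats that real testing would have to address: a block boundary can split a multi-byte character (here that only costs a replacement character or two in the score), and UTF-16 candidates need care, since plain ASCII bytes read as UTF-16 often decode to CJK letters and would score well under this naive rule.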

lfcnassif · May 24 '22 18:05