
Magika to detect text file encoding

mikacousin opened this issue 1 year ago · 5 comments

Hi, I was wondering if magika could be used in the future to detect the encoding of text files (utf-8, ascii, iso-8859-1, cp-1252, etc.), as this is not an easy task. Thanks

mikacousin avatar Feb 18 '24 15:02 mikacousin

You can try this: chardet
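For what it's worth, a minimal usage sketch (the file name is just a placeholder):

```python
import chardet

# Read the raw bytes; chardet guesses the encoding from the byte statistics.
with open("some_file.txt", "rb") as f:
    raw = f.read()

result = chardet.detect(raw)  # dict with 'encoding', 'confidence', 'language'
print(result["encoding"], result["confidence"])
```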

Byxs20 avatar Feb 19 '24 11:02 Byxs20

Thanks for the link. For the time being, I'm using charset_normalizer for this task. But I'm curious to know if another method is possible and more efficient.
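In case it helps for comparison, a minimal sketch of how I use it (the file name is a placeholder):

```python
from charset_normalizer import from_path

# from_path() returns candidate matches ranked by plausibility; best() picks the top one.
best = from_path("some_file.txt").best()
if best is not None:
    print(best.encoding)   # e.g. "utf_8" or "cp1252"
    print(str(best)[:80])  # content decoded to Unicode
```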

mikacousin avatar Feb 19 '24 11:02 mikacousin

> Thanks for the link. For the time being, I'm using charset_normalizer for this task. But I'm curious to know if another method is possible and more efficient.

Thanks for sharing. I'll try it sometime.

Byxs20 avatar Feb 19 '24 14:02 Byxs20

Thanks all! We would be interested in checking this out, but first we would try to understand whether existing approaches have problems (and which ones).

I'm wondering:

  • What is the approach used by existing tools? Heuristics? Trial and error?
  • Does anyone have instances of misdetections, or any intuition about why existing approaches could have problems?
  • Are existing problems more related to efficiency/performance than to accuracy?

reyammer avatar Feb 19 '24 18:02 reyammer

Hello @mikacousin!

The short answer is no: we're not extending Magika's scope beyond file-type detection for now. That's already a large enough scope for this project, and we prefer to do that job really well before considering any extension.

Long answer: The Magika approach is pretty much agnostic to content types. That means that, as long as you can collect a large labeled dataset of text files in various encodings, you could train a Magika model on it to detect encodings (a toy data-generation sketch follows the table below). Given what we've seen with Magika so far, I'd expect such a model to reach 99.xx% precision and recall. That said, the top-of-the-line stats that charset_normalizer reports already look great, so any improvement edge is likely to be marginal.

Package              Accuracy   Mean per file   Files per sec (est.)
charset-normalizer   98 %       10 ms           100 files/sec
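To make the labeled-dataset idea above concrete, here is a toy, hypothetical sketch of how training pairs could be generated by re-encoding known Unicode text; the sample strings and encoding list are illustrative only:

```python
# Hypothetical data-generation sketch: re-encode known Unicode text under
# several encodings to obtain (bytes, encoding-label) training pairs.
samples = ["Déjà vu, naïveté, café", "Şişli'de güneşli bir gün"]
encodings = ["utf-8", "cp1252", "iso-8859-9"]

dataset = []
for text in samples:
    for enc in encodings:
        try:
            dataset.append((text.encode(enc), enc))  # label = encoding used
        except UnicodeEncodeError:
            pass  # text not representable in this encoding; skip the pair

print(len(dataset), "labeled examples")
```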

An ML approach could potentially have two edges over solutions like charset_normalizer:

  • faster batch speed (through use of CPU features like this)
  • address the following stated limitation of charset_normalizer:

> Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))

That's because ML is generally context-aware: when properly trained, it could recognise the HTML structure (or other common formats) and give more weight to the Turkish content.
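As an illustration of that mixed-content case (the payload and encoding choice are made up for the example), one could feed English markup wrapping Turkish text to charset_normalizer and inspect both guesses:

```python
from charset_normalizer import from_bytes

# English HTML tags around Turkish body text, encoded as Turkish Latin-5.
payload = "<html><body><p>Güneşli bir gün</p></body></html>".encode("iso-8859-9")

best = from_bytes(payload).best()
if best is not None:
    # The encoding guess can be right while the language guess stays
    # ambiguous, which is the limitation quoted above.
    print(best.encoding, best.language)
```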

invernizzi avatar Feb 19 '24 18:02 invernizzi

Thank you for your detailed answer. Shall I close this issue?

mikacousin avatar Feb 20 '24 12:02 mikacousin

Closing this for now, but I've tagged it so that we keep it in mind in case this becomes relevant in the future. Thanks!

reyammer avatar Feb 20 '24 12:02 reyammer