Handling of BOM leading characters

Open j256 opened this issue 8 years ago • 0 comments

From @yongminyan.

Hey @j256, I found these issues while parsing HTML content that starts with a BOM. One example is the byte array "-17, -69, -65, 60, 104, 116, 109, 108, 32" (the first three bytes are the UTF-8 BOM, followed by the <html tag). Another is "-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0" (the first two bytes are the UTF-16 little-endian BOM, followed by the <html tag). In both cases the library fails to detect the content as text/html. To make this work, I think we need to fix the issues first and then add proper magic entries, something like:

+0      byte 0xEF               
+!:mime text/html
+>1     byte 0xBB               
+>>2    byte 0xBF               UTF-8 Unicode text with BOM
+>>>3   search/1/cb \<html              

and

+# UTF-16 LE
+0      byte 0xFF               
+!:mime text/html
+>1     byte 0xFE               
+>>2    lestring16 \<html                Little-endian UTF-16 Unicode text with BOM

I did not include the magic entries in the pull request because I feel those changes are not very generic: the same problem could occur with other types such as XML (e.g., in a different encoding). I am not sure what the best solution is.
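One generic direction, sketched here only as an illustration, is to recognize and skip a leading BOM before any type-specific matching, so that HTML, XML, and other text types do not each need their own BOM entries. The `BomSniffer` class and its methods below are hypothetical names and are not part of simplemagic. The sketch covers only the UTF-8 and UTF-16 BOMs discussed above and assumes UTF-8 when no BOM is present.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of simplemagic): detects and skips a leading BOM. */
final class BomSniffer {

    /** Returns the number of BOM bytes at the start of data, or 0 if none is recognized. */
    static int bomLength(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return 3; // UTF-8 BOM
        }
        // Limitation: a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE
        // and is not distinguished here.
        if (startsWith(data, 0xFF, 0xFE) || startsWith(data, 0xFE, 0xFF)) {
            return 2; // UTF-16 little-endian or big-endian BOM
        }
        return 0;
    }

    /** Returns the charset implied by the BOM; UTF-8 is an assumed default when there is no BOM. */
    static Charset charsetFor(byte[] data) {
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the bytes that follow the BOM, so later checks see plain text such as "<html". */
    static String decodeWithoutBom(byte[] data) {
        int skip = bomLength(data);
        return new String(data, skip, data.length - skip, charsetFor(data));
    }

    /** Compares the first bytes of data, treated as unsigned values, against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the two byte arrays from the report, `BomSniffer.decodeWithoutBom(...)` yields text beginning with `<html` in both cases, which a single BOM-free `<html` check could then match. Whether such a pre-processing step fits the library's design is an open question.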

Also, I am not sure whether lestring16/bestring16 support the [Bbc] options. The magic(5) spec does not say so, but I see that lestring16/bestring16 extend StringTypes. In other words, can we write something like lestring16/cb?

It would be great if you could take a look and answer my two questions above. Thanks a lot!

j256, Dec 13 '16 15:12