simplemagic
Handling of BOM leading characters
From @yongminyan.
Hey @j256, I found these issues while parsing HTML content that starts with a BOM. One example is the byte array "-17, -69, -65, 60, 104, 116, 109, 108, 32" (the first three bytes are the UTF-8 BOM, followed by the <html tag). Another is "-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0" (the first two bytes are the UTF-16 little-endian BOM, followed by the <html tag). In both cases the library fails to detect the content as text/html. To make this work, I think we need to fix the issues first and then add proper magic entries, something like:
+0      byte 0xEF
+!:mime text/html
+>1     byte 0xBB
+>>2    byte 0xBF               UTF-8 Unicode text with BOM
+>>>3   search/1/cb \<html
and
+# UTF-16 LE
+0      byte 0xFF
+!:mime text/html
+>1     byte 0xFE
+>>2    lestring16 \<html                Little-endian UTF-16 Unicode text with BOM
I did not include the magic entries in the pull request because they do not feel generic enough: the same problem could affect other types such as XML, and other encodings. I am not sure what the best solution is.
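To make the generic idea a bit more concrete, here is a minimal Java sketch of one possible direction: skip a leading BOM, decode the rest using the charset the BOM implies, and only then look for the tag. This is not simplemagic code. `BomSketch` and its methods are hypothetical names, and falling back to UTF-8 when there is no BOM is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of simplemagic.
final class BomSketch {

    /** Length in bytes of a leading UTF-8 or UTF-16 BOM, or 0 if none is found. */
    static int bomLength(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return 3; // UTF-8
        }
        if (startsWith(data, 0xFF, 0xFE) || startsWith(data, 0xFE, 0xFF)) {
            return 2; // UTF-16 LE or BE
        }
        return 0;
    }

    /** Charset implied by the BOM; assumes UTF-8 when there is no BOM. */
    static Charset charsetOf(byte[] data) {
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return StandardCharsets.UTF_8;
    }

    /** True if the content after any BOM starts with the tag, ignoring case. */
    static boolean startsWithTag(byte[] data, String tag) {
        int skip = bomLength(data);
        String text = new String(data, skip, data.length - skip, charsetOf(data));
        return text.regionMatches(true, 0, tag, 0, tag.length());
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the two byte arrays from the top of this issue, `BomSketch.startsWithTag(bytes, "<html")` returns true for both, and the same call with `"<?xml"` would cover XML without a separate entry per encoding.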
I am also not sure whether lestring16/bestring16 support the [Bbc] options. The magic(5) spec does not say so, but I see that lestring16/bestring16 extend StringTypes. In other words, can we write something like lestring16/cb?
It would be great if you could take a look and answer my two questions above. Thanks a lot!