Robots.txt-Parser-Class

Invalid encodings are not ignored

Open JanPetterMG opened this issue 8 years ago • 2 comments

No errors or warnings should be generated when parsing, yet I still get these:

mb_internal_encoding(): Unknown encoding "OSF10020402" // valid, but not installed
mb_internal_encoding(): Unknown encoding "UTF9" // invalid
mb_internal_encoding(): Unknown encoding "ASCI" // invalid
mb_internal_encoding(): Unknown encoding "ISO8859" // invalid

Such typos and invalid encoding names aren't uncommon when parsing the HTTP header to detect the character encoding.
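One way to avoid the warning in the first place is to check the name taken from the header against what mbstring actually supports before calling mb_internal_encoding(). This is only a rough sketch; `isSupportedEncoding()` is a hypothetical helper name, not something that exists in this library.

```php
<?php
// Hypothetical helper (not part of the library): returns true only if the
// local mbstring build knows the given encoding name or one of its aliases.
function isSupportedEncoding(string $name): bool
{
    $name = strtolower(trim($name));
    if ($name === '') {
        return false;
    }
    foreach (mb_list_encodings() as $encoding) {
        if (strtolower($encoding) === $name) {
            return true;
        }
        // Aliases cover spellings such as "utf8" for "UTF-8".
        foreach (mb_encoding_aliases($encoding) ?: [] as $alias) {
            if (strtolower($alias) === $name) {
                return true;
            }
        }
    }
    return false;
}

// Only switch the internal encoding when the name is recognised.
$fromHeader = 'UTF9'; // example value, as it might arrive in a Content-Type header
if (isSupportedEncoding($fromHeader)) {
    mb_internal_encoding($fromHeader);
}
```

With a check like this, a name such as "UTF9" would simply be skipped and the content would be treated as UTF-8, which is what the spec expects anyway.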

I think it's a good thing to try converting everything to UTF-8, but according to the spec, the content is expected to be UTF-8, and any invalid content (due to parsing errors, invalid rules, or anything else) shall be ignored without warnings or errors.

What we need is a custom error handler...
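A minimal sketch of what such a handler could look like, assuming the goal is only to silence failures from mb_internal_encoding(). `trySetInternalEncoding()` is a hypothetical name. On PHP 7 an unknown encoding raises a warning, while PHP 8 throws a ValueError instead, so the sketch covers both.

```php
<?php
// Hypothetical helper (not part of the library): try to switch the mbstring
// internal encoding without letting a warning or exception reach the caller.
function trySetInternalEncoding(string $encoding): bool
{
    // Temporarily swallow PHP diagnostics raised inside this function.
    set_error_handler(static function (int $errno, string $errstr): bool {
        return true; // mark as handled so the default handler does not run
    });
    try {
        return mb_internal_encoding($encoding) === true;
    } catch (\ValueError $e) {
        // PHP 8+ reports unknown encodings this way.
        return false;
    } finally {
        // Always put the previous handler back.
        restore_error_handler();
    }
}

// Invalid names from the header are ignored; the current encoding is kept.
foreach (['UTF9', 'ASCI', 'ISO8859', 'UTF-8'] as $name) {
    trySetInternalEncoding($name);
}
```

Restoring the previous handler in `finally` keeps the suppression scoped to this one call, so errors elsewhere in the application are still reported as usual.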

If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.

Only valid records will be considered; all other content will be ignored. (...) only valid text lines will be taken into account, the rest will be discarded without warning or error.
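To illustrate the "discard without warning or error" behaviour quoted above, here is a rough sketch of a line filter. `extractValidRecords()` and its list of accepted field names are assumptions made for this example and do not reflect how the parser class is actually implemented.

```php
<?php
// Hypothetical helper: keep only lines that look like "<field>: <value>"
// with a known field name, and silently drop everything else.
function extractValidRecords(string $content): array
{
    // Assumed set of field names for this sketch.
    $known = ['user-agent', 'allow', 'disallow', 'sitemap'];
    $records = [];
    foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
        // Strip an optional trailing comment and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // not a record, ignore quietly
        }
        $field = strtolower(trim($parts[0]));
        if (!in_array($field, $known, true)) {
            continue; // unknown field, ignore quietly
        }
        $records[] = [$field, trim($parts[1])];
    }
    return $records;
}

$records = extractValidRecords("User-agent: *\nfoo bar\nDisallow: /tmp # private\n");
```

Nothing in this loop can raise a warning for malformed input, which is the behaviour the spec asks for.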

JanPetterMG avatar Jul 30 '16 02:07 JanPetterMG

but according to the spec, the content is expected to be UTF-8

Could you please provide a spec link for this?

t1gor avatar Jul 21 '17 13:07 t1gor

Sorry about the missing spec source, here it is: https://developers.google.com/search/reference/robots_txt#file-format

Last updated April 18, 2017.

JanPetterMG avatar Jul 21 '17 13:07 JanPetterMG