Robots.txt-Parser-Class Invalid encodings are not ignored

Open JanPetterMG opened this issue 8 years ago • 2 comments

No errors or warnings should be generated when parsing, yet I still get these:

mb_internal_encoding(): Unknown encoding "OSF10020402" // valid, but not installed
mb_internal_encoding(): Unknown encoding "UTF9" // invalid
mb_internal_encoding(): Unknown encoding "ASCI" // invalid
mb_internal_encoding(): Unknown encoding "ISO8859" // invalid

Such typos and invalid encoding names aren't uncommon when parsing the HTTP header to detect the character encoding.

I think trying to convert everything to UTF-8 is a good thing, but according to the spec, the content is expected to be UTF-8, and any invalid content (due to parsing errors, invalid rules, or anything else) shall be ignored without warnings or errors.

What we need is a custom error handler...

If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.

Only valid records will be considered; all other content will be ignored. (...) only valid text lines will be taken into account, the rest will be discarded without warning or error.
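
The custom error handler suggested above could look roughly like the sketch below. It is only an illustration, not the library's code: the function name `trySetInternalEncoding` is hypothetical, and it assumes PHP 7.0 or later with the mbstring extension loaded. It temporarily installs a handler that swallows the `E_WARNING` raised for an unknown encoding, and also catches the `ValueError` that PHP 8 throws in the same situation.

```php
<?php
// Hypothetical helper (not part of Robots.txt-Parser-Class).
// Tries to set the mbstring internal encoding without emitting
// warnings or errors; returns false if the name is not usable.
function trySetInternalEncoding(string $name): bool
{
    // On PHP 7, an unknown encoding raises E_WARNING; swallow it here.
    set_error_handler(static function (): bool {
        return true; // mark the warning as handled
    }, E_WARNING);

    try {
        return mb_internal_encoding($name) === true;
    } catch (\ValueError $e) {
        // On PHP 8+, an unknown encoding throws ValueError instead.
        return false;
    } finally {
        // Always restore whatever handler was active before.
        restore_error_handler();
    }
}
```

A caller could then fall back silently, for example by keeping UTF-8 whenever `trySetInternalEncoding($charset)` returns `false`, which matches the "discarded without warning or error" wording quoted above.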

Jul 30 '16 02:07 JanPetterMG

but according to the spec, the content is expected to be UTF-8

Could you please provide a spec link for this?

Jul 21 '17 13:07 t1gor

Sorry about the missing spec source, here it is: https://developers.google.com/search/reference/robots_txt#file-format

Last updated April 18, 2017.

Jul 21 '17 13:07 JanPetterMG
