Robots.txt-Parser-Class
Invalid encodings are not ignored
No errors or warnings should be generated when parsing, yet I still get these:
mb_internal_encoding(): Unknown encoding "OSF10020402" // valid, but not installed
mb_internal_encoding(): Unknown encoding "UTF9" // invalid
mb_internal_encoding(): Unknown encoding "ASCI" // invalid
mb_internal_encoding(): Unknown encoding "ISO8859" // invalid
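One way to avoid these warnings entirely is to validate the encoding name before handing it to mbstring. This is only a sketch, not the library's actual code: the helper name `safeEncoding` is hypothetical, and it checks the candidate against `mb_list_encodings()` and `mb_encoding_aliases()`, silently falling back to UTF-8 for unknown or misspelled names.

```php
<?php
// Hypothetical helper: return a known mbstring encoding name, or fall
// back to UTF-8 silently when the candidate (e.g. taken from an HTTP
// header) is unknown or misspelled.
function safeEncoding(string $candidate): string
{
    foreach (mb_list_encodings() as $known) {
        if (strcasecmp($known, $candidate) === 0) {
            return $known;
        }
        // Also accept registered aliases, compared case-insensitively.
        $aliases = array_map('strtolower', mb_encoding_aliases($known));
        if (in_array(strtolower($candidate), $aliases, true)) {
            return $known;
        }
    }
    return 'UTF-8'; // unknown encoding name: ignore it, per the spec
}

mb_internal_encoding(safeEncoding('UTF9')); // no warning, uses UTF-8
```

Because the invalid name never reaches `mb_internal_encoding()`, no warning (PHP 7) or `ValueError` (PHP 8) is ever triggered.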
Such typos and invalid encoding names aren't uncommon when parsing the HTTP header to detect the character encoding.
Trying to convert everything to UTF-8 is a good thing, but according to the spec the content is expected to be UTF-8, and any invalid content (parsing errors, invalid rules, or anything else) shall be ignored without warnings or errors.
What we need is a custom error handler...
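A minimal sketch of such a handler, assuming PHP 7 behaviour where `mb_internal_encoding()` raises a warning for unknown encodings (on PHP 8 it throws a `ValueError` instead, so a `try`/`catch` would be needed there):

```php
<?php
// Scoped error handler: swallow mbstring "Unknown encoding" warnings
// so parsing stays silent, as the spec requires. Returning true tells
// PHP the warning has been handled and should not be reported.
set_error_handler(function (int $errno, string $errstr): bool {
    return strpos($errstr, 'Unknown encoding') !== false;
});

mb_internal_encoding('ISO8859'); // warning suppressed on PHP 7

restore_error_handler(); // always restore the previous handler
```

The handler is installed only around the encoding call and restored immediately, so unrelated warnings elsewhere in the application are still reported.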
If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.
Only valid records will be considered; all other content will be ignored. (...) only valid text lines will be taken into account, the rest will be discarded without warning or error.
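The "discard silently" behaviour the quoted spec describes could be sketched like this (an illustrative helper, not part of the library): keep only lines that are valid UTF-8 and drop everything else without any notice.

```php
<?php
// Sketch: filter a robots.txt body down to its valid UTF-8 lines,
// discarding the rest without warnings or errors.
function validLines(string $robotsTxt): array
{
    $kept = [];
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        if (mb_check_encoding($line, 'UTF-8')) {
            $kept[] = $line;
        }
    }
    return $kept;
}
```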
but according to the spec, the content is expected to be UTF-8
Could you please provide a spec link for this?
Sorry about the missing spec source; here it is: https://developers.google.com/search/reference/robots_txt#file-format
Last updated April 18, 2017.