Robots.txt-Parser-Class
Regex meta char escape using preg_quote
When a pattern is built as preg_match('@...@', ...), the input is expected to be escaped with preg_quote($rule, '@'). Currently, one of the following warnings occurs when a path contains a regex meta character:
PHP Warning: preg_match(): Compilation failed: missing ) at offset 15 in /path/to/vendor/t1gor/robots-txt-parser/source/robotstxtparser.php on line 836
PHP Warning: preg_match(): Compilation failed: unmatched parentheses at offset 1 in /path/to/vendor/t1gor/robots-txt-parser/source/robotstxtparser.php on line 836
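For reference, a minimal sketch of the escaping described above. The helper name matchesLiteral is hypothetical and not part of the library; it only shows that preg_quote() with the '@' delimiter turns a rule into a literal pattern, so an unbalanced parenthesis no longer breaks compilation.

```php
<?php
// Hypothetical helper, not part of robots-txt-parser: treat $rule as a
// literal string and test whether it occurs anywhere in $path.
function matchesLiteral(string $rule, string $path): bool
{
    // The second argument makes preg_quote() escape the '@' delimiter too.
    $quoted = preg_quote($rule, '@');
    return preg_match('@' . $quoted . '@', $path) === 1;
}

var_dump(matchesLiteral('/(', '/(foo'));   // bool(true), no compilation warning
var_dump(matchesLiteral('/.', '/x'));      // bool(false), '.' is literal now
var_dump(matchesLiteral('/a@b', '/a@b'));  // bool(true), '@' is escaped as well
```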
I've seen it in some rare cases, but unfortunately never had the time to investigate it... This is indeed a bug.
Regex is not my expertise, but could this be as simple as using a character that is not valid in URLs as the delimiter instead of "@"? All of the "@"s should already be escaped as far as I can see, but I'm clearly wrong about that... It's not my code, and I don't fully understand it either, to be honest...
rawurlencode()ing paths, as the code currently does, is a good approach I think, since a URL may contain any character code. But that does not make regex escaping unnecessary, as it is only URL escaping. I only took a glance at the code, so I may be wrong about this. Anyway, sorry for being too lazy to add a failing case earlier. Tested on e1b052c.
require_once(__DIR__ . '/vendor/autoload.php');
$parser = new \RobotsTxtParser('User-agent: webcrawler
Disallow: /(
Disallow: /)
Disallow: /.
');
var_dump($parser->isAllowed('/%5C.', 'webcrawler') == true); // prints bool(false); expected bool(true)
var_dump($parser->isAllowed('/(', 'webcrawler') == false); // prints bool(false); expected bool(true)
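To illustrate the difference between URL escaping and regex escaping, here is a small sketch using only built-in PHP functions. It assumes nothing about the library's own encode_url() method, whose behavior may differ from plain rawurlencode().

```php
<?php
// rawurlencode() percent-encodes parentheses but leaves '.' untouched,
// because '.' is an unreserved URL character.
$encodedParen = rawurlencode('(');  // "%28"
$encodedDot   = rawurlencode('.');  // "."

// Without regex escaping, the '.' in the rule matches any character.
$unquoted = preg_match('@/.@', '/x');                             // 1
// With preg_quote(), the '.' only matches a literal dot.
$quoted = preg_match('@' . preg_quote('/.', '@') . '@', '/x');    // 0

var_dump($encodedParen, $encodedDot, $unquoted, $quoted);
```

So even a fully URL-encoded rule can still contain characters that mean something to the regex engine.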
I just took another look at the issue. I was unable to fix it (for now), but here is something to build on for the next person who tries to fix it...
private function checkBasicRule($rule, $path)
{
    $rule = $this->encode_url($rule);
    $rule = preg_quote($rule);
    // match result
    if (preg_match('@' . $rule . '@', $path)) {
        if (mb_stripos($rule, '$') !== false) {
            if (mb_strlen($rule) - 1 == mb_strlen($path)) {
                return true;
            }
        } else {
            $this->log[] = "Rule match: Path";
            return true;
        }
    }
    return false;
}
I'm not sure what the problem is, but I think this template is a good place to start...
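One thing that may be worth checking in the snippet above: preg_quote() also escapes `*` and `$`, which robots.txt rules use as a wildcard and an end-of-path anchor. The sketch below is one possible way to keep those two meanings while escaping everything else. The function names ruleToPattern and ruleMatches are hypothetical, URL encoding is left out entirely, and anchoring at the start of the path is an assumption. This is not the library's implementation.

```php
<?php
// Hypothetical helper: convert a robots.txt path rule into a PCRE pattern.
// '*' becomes '.*', a trailing '$' becomes an end anchor, and every other
// character is escaped so that it matches literally.
function ruleToPattern(string $rule): string
{
    $anchored = substr($rule, -1) === '$';
    if ($anchored) {
        $rule = substr($rule, 0, -1);
    }
    // Quote each literal piece separately, then rejoin the pieces with '.*'.
    $pieces = array_map(
        function (string $piece): string {
            return preg_quote($piece, '@');
        },
        explode('*', $rule)
    );
    // Assumption: rules are matched from the start of the path.
    // The D modifier makes '$' match only at the very end of the subject.
    return '@^' . implode('.*', $pieces) . ($anchored ? '$' : '') . '@D';
}

function ruleMatches(string $rule, string $path): bool
{
    return preg_match(ruleToPattern($rule), $path) === 1;
}

var_dump(ruleMatches('/(', '/(foo'));             // bool(true)
var_dump(ruleMatches('/.', '/x'));                // bool(false)
var_dump(ruleMatches('/*.php$', '/index.php'));   // bool(true)
var_dump(ruleMatches('/*.php$', '/index.php5'));  // bool(false)
```

Splitting on `*` before quoting avoids having to un-escape anything afterwards, which is where a plain preg_quote() over the whole rule gets awkward.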