detect invalid regex in lexer
in Lexer.php, preg_match could return false in case of regex error.
the false return should be tested and explained at https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L270
Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.
Hello :-),
This will change with this PR https://github.com/hoaproject/Compiler/pull/6.
And additionnaly, this can be done separately in this block.
PR #6 still has this issue
on another side, I like the idea of detecting wrong regexes during pp parsing, like you suggest it, but I think having to test each regex at this stage has an overhead, and preg_match could still return false at line 270 (even with valid regex - for example when offset > length(string)… this should never happen, right…)
moreover, just testing false === $preg does not seem a huge overhead to me, but the error would be reported late in the process, which is sad
I'm adding the required check and benchmarking it.
Here is the script I use to bench:
<?php
require_once __DIR__ . '/vendor/autoload.php';
passthru(PHP_BINARY . ' -n -v');
$contents = <<<PP
#rule:
choice()
choice:
concatenation() ( ::or:: concatenation() #choice )*
concatenation:
repetition() ( repetition() #concatenation )*
repetition:
simple() ( quantifier() #repetition )? <node>?
simple:
::capturing_:: choice() ::_capturing::
| ::skipped:: <token> ( ::unification_:: <unification> ::_unification:: )?
::skipped:: #skipped
| ::kept_:: <token> ( ::unification_:: <unification> ::_unification:: )?
::_kept:: #kept
| <token> ::named::
( ::unification_:: <unification> ::_unification:: )? #named
quantifier:
<zero_or_one>
| <one_or_more>
| <zero_or_more>
| <n_to_m>
| <n_or_more>
| <exactly_n>
PP;
$bench = new Hoa\Bench\Bench();
$bench->one_iteration->start();
try {
$parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
$parser->parse($contents);
} catch (\exception $e) {}
$bench->one_iteration->stop();
$bench->hundred_iterations->start();
for ($i = 100; $i > 0; $i--) {
try {
$parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
$parser->parse($contents);
} catch (\exception $e) {}
}
$bench->hundred_iterations->stop();
echo $bench;
Executing it with php -n test.phpproduces the following results:
No PCRE check
PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__ |||||||||||||||||||||||||||||||||||||||||||| 696ms, 100.0%
one_iteration | 18ms, 2.6%
hundred_iterations ||||||||||||||||||||||||||||||||||||||||||| 678ms, 97.4%
With PCRE check
PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__ |||||||||||||||||||||||||||||||||||||||||||| 954ms, 100.0%
one_iteration | 19ms, 2.0%
hundred_iterations ||||||||||||||||||||||||||||||||||||||||||| 934ms, 97.9%
This was implemented using $preg = @preg_match(...); if (false === $preg) { throw ... } and a custom error handler.
The check adds up to 30% computation time: this is too costly. I think this should only be enabled when running in a debug mode (see #14)
Maybe we can check tokens once when “instanciating” the lexer. Is it where your placed it?
ping @CircleCode?
The regexes come from the tokens
https://github.com/hoaproject/Compiler/blob/fd6f3f943514193b349109347c56241c1c34a331/Llk/Lexer.php#L192-L201
These are supplied in
https://github.com/hoaproject/Compiler/blob/fd6f3f943514193b349109347c56241c1c34a331/Llk/Lexer.php#L111
The interesting places where lexMe is called is as follows:
- https://github.com/hoaproject/Compiler/blob/ff934cce1745f2a5bd470abfaf0046d7b8e9025b/Llk/Rule/Analyzer.php#L149
- https://github.com/hoaproject/Compiler/blob/ff934cce1745f2a5bd470abfaf0046d7b8e9025b/Llk/Parser.php#L163
The first code calls lexMe in a loop with the same tokens each time. Possibly this is an opportunity for optimizing this code.
Anyway i both cases it seems fine to place the check in lexMe and throw an exception.
As for how to check if the regex is valid i created this feature request which describes it in more detail https://bugs.php.net/bug.php?id=77521
<?php
// returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred
// The @ is used (unfortunately) to stop the warnings in your error_log
// The ! revedrses the condition, so the real one will be true, the bad one false
function isRegex($string)
{
return !(@preg_match($string, null) === false);
}
var_dump(isRegex('/\w+/'));
var_dump(isRegex('thisis|64*%notRe3!ex/'));