Compiler icon indicating copy to clipboard operation
Compiler copied to clipboard

detect invalid regex in lexer

Open CircleCode opened this issue 12 years ago • 8 comments

in Lexer.php, preg_match could return false in case of regex error.

the false return should be tested and explained at https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L270


Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

CircleCode avatar Dec 02 '13 09:12 CircleCode

Hello :-),

This will change with this PR https://github.com/hoaproject/Compiler/pull/6.

Hywan avatar Dec 02 '13 09:12 Hywan

And additionnaly, this can be done separately in this block.

Hywan avatar Dec 02 '13 09:12 Hywan

PR #6 still has this issue

on another side, I like the idea of detecting wrong regexes during pp parsing, like you suggest it, but I think having to test each regex at this stage has an overhead, and preg_match could still return false at line 270 (even with valid regex - for example when offset > length(string)… this should never happen, right…) moreover, just testing false === $preg does not seem a huge overhead to me, but the error would be reported late in the process, which is sad

CircleCode avatar Dec 02 '13 10:12 CircleCode

I'm adding the required check and benchmarking it.

Here is the script I use to bench:

<?php

require_once __DIR__ . '/vendor/autoload.php';

passthru(PHP_BINARY . ' -n -v');
$contents = <<<PP
#rule:
    choice()

choice:
    concatenation() ( ::or:: concatenation() #choice )*

concatenation:
    repetition() ( repetition() #concatenation )*

repetition:
    simple() ( quantifier() #repetition )? <node>?

simple:
    ::capturing_:: choice() ::_capturing::
  | ::skipped:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::skipped:: #skipped
  | ::kept_:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::_kept:: #kept
  | <token> ::named::
    ( ::unification_:: <unification> ::_unification:: )? #named

quantifier:
    <zero_or_one>
  | <one_or_more>
  | <zero_or_more>
  | <n_to_m>
  | <n_or_more>
  | <exactly_n>
PP;


$bench = new Hoa\Bench\Bench();
$bench->one_iteration->start();
try {
    $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
    $parser->parse($contents);
} catch (\exception $e) {}
$bench->one_iteration->stop();

$bench->hundred_iterations->start();

for ($i = 100; $i > 0; $i--) {
    try {
        $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
        $parser->parse($contents);
    } catch (\exception $e) {}
}

$bench->hundred_iterations->stop();

echo $bench;

Executing it with php -n test.phpproduces the following results:

No PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   696ms, 100.0%
one_iteration       |                                               18ms,   2.6%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    678ms,  97.4%

With PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   954ms, 100.0%
one_iteration       |                                               19ms,   2.0%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    934ms,  97.9%

This was implemented using $preg = @preg_match(...); if (false === $preg) { throw ... } and a custom error handler.

The check adds up to 30% computation time: this is too costly. I think this should only be enabled when running in a debug mode (see #14)

jubianchi avatar Mar 11 '16 17:03 jubianchi

Maybe we can check tokens once when “instanciating” the lexer. Is it where your placed it?

Hywan avatar Mar 15 '16 07:03 Hywan

ping @CircleCode?

Hywan avatar Aug 15 '16 09:08 Hywan

The regexes come from the tokens

https://github.com/hoaproject/Compiler/blob/fd6f3f943514193b349109347c56241c1c34a331/Llk/Lexer.php#L192-L201

These are supplied in

https://github.com/hoaproject/Compiler/blob/fd6f3f943514193b349109347c56241c1c34a331/Llk/Lexer.php#L111

The interesting places where lexMe is called is as follows:

  • https://github.com/hoaproject/Compiler/blob/ff934cce1745f2a5bd470abfaf0046d7b8e9025b/Llk/Rule/Analyzer.php#L149
  • https://github.com/hoaproject/Compiler/blob/ff934cce1745f2a5bd470abfaf0046d7b8e9025b/Llk/Parser.php#L163

The first code calls lexMe in a loop with the same tokens each time. Possibly this is an opportunity for optimizing this code.

Anyway i both cases it seems fine to place the check in lexMe and throw an exception.

As for how to check if the regex is valid i created this feature request which describes it in more detail https://bugs.php.net/bug.php?id=77521

flip111 avatar Jan 25 '19 15:01 flip111

<?php
// returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred
// The @ is used (unfortunately) to stop the warnings in your error_log
// The ! revedrses the condition, so the real one will be true, the bad one false

function isRegex($string)
{
    return !(@preg_match($string, null) === false);
}

var_dump(isRegex('/\w+/'));
var_dump(isRegex('thisis|64*%notRe3!ex/'));

flip111 avatar Jan 25 '19 17:01 flip111