mbregex missing documentation about supported RE syntax

Open dashiad opened this issue 3 years ago • 7 comments

Description

The following code:

<?php
mb_ereg_search_init("qaaoiuu");
mb_ereg_search_setpos(1);
var_dump(mb_ereg_search_pos("^a{2}(..)u{2}"));

Resulted in this output:

bool(false)

But I expected this output instead:

array(2) {
  [0]=>
  int(1)
  [1]=>
  int(6)
}

The "^" anchor does not apply to the start of the whole string either (independently of "setpos"), as:

<?php
mb_ereg_search_init("aaoiuu");
mb_ereg_search_setpos(1);
var_dump(mb_ereg_search_pos("^a{2}(..)u{2}"));

Returns false too (this should be expected).

bool(false)

Does "^" have a different meaning in the mb_ereg_search functions?

PHP Version

PHP 8.1.12 and previous versions

Operating System

No response

dashiad avatar Nov 18 '22 14:11 dashiad

Does "^" have a different meaning in the mb_ereg_search functions?

Looks like that, but given that there is almost no documentation on the patterns supported by the mb_ereg_search_*() API, it's hard to tell. Anyhow, it seems you're looking for \G (aka. "where the current search attempt begins").
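A minimal sketch of the difference, reusing the subject string and pattern from the report. It assumes the mbstring extension is built with mbregex support; the variable names are only illustrative, and the values noted in the comments are the ones reported in this thread.

```php
<?php
// Sketch only: requires ext/mbstring with mbregex (Oniguruma) enabled.
mb_ereg_search_init("qaaoiuu");

// "^" is not anchored to the search position, so nothing matches from offset 1.
mb_ereg_search_setpos(1);
$withCaret = mb_ereg_search_pos('^a{2}(..)u{2}');

// "\G" is anchored to the position where the current search attempt begins.
mb_ereg_search_setpos(1);
$withG = mb_ereg_search_pos('\Ga{2}(..)u{2}');

var_dump($withCaret); // false, as in the original report
var_dump($withG);     // [1, 6]: byte offset and byte length of the match
```

The pattern is written in single quotes so that PHP passes the backslash in `\G` through to the regex engine unchanged.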

cmb69 avatar Nov 18 '22 15:11 cmb69

Yes! Thank you, adding \G works. Not sure if "^" should work too, but \G is a good workaround... So I don't know if this should be closed or not!

dashiad avatar Nov 18 '22 15:11 dashiad

So I don't know if this should be closed or not!

The ticket should probably be transferred to doc-en, but I'm not sure.

cmb69 avatar Nov 18 '22 16:11 cmb69

The ticket should probably be transferred to doc-en, but I'm not sure.

I think so. A comment on the mb_ereg page mentions geoffgarside/oniguruma as a syntax reference, where ^ is described as "beginning of the line". Not exactly precise, but it seems to behave with the same semantics as PCRE's ^ which also won't match after the start of the string.

damianwadley avatar Nov 18 '22 16:11 damianwadley

Hi all. Edit: mb_regex_set_options() can set only one mode, but multiple options.

<?php

mb_ereg_search_init("aaaoi\r\nuu", '\\Aa{3}[^a]+u{2}\\z', 'pr');
//mb_ereg_search_setpos(1);
preg_match('/\\Aa{3}[^a]+u{2}\\z/', "aaaoi\r\nuu", $matches);
var_dump(mb_ereg_search_pos('\\Aa{3}[^a]+u{2}\\z', 'pr'), $matches);

?>

It appears from the php-src source that the default syntax mode is Ruby and the default option is p.

From the PCRE "Escape sequences" documentation: The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described in anchors) in that they only ever match at the very start and end of the subject string.

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.
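The quoted PCRE behaviour can be checked with preg_match() and a non-zero offset. This is a small sketch using the subject string from the original report; the variable names are only illustrative.

```php
<?php
$subject = "qaaoiuu";
$offset  = 1; // start matching at the first "a"

// "^" and "\A" still refer to the start of the whole subject, so both fail.
$caret = preg_match('/^a{2}(..)u{2}/', $subject, $m1, 0, $offset);
$bigA  = preg_match('/\Aa{2}(..)u{2}/', $subject, $m2, 0, $offset);

// "\G" refers to the offset passed to preg_match(), so this one matches.
$bigG  = preg_match('/\Ga{2}(..)u{2}/', $subject, $m3, 0, $offset);

var_dump($caret, $bigA, $bigG); // int(0), int(0), int(1)
var_dump($m3);                  // ["aaoiuu", "oi"]
```

So PCRE's `^` with an offset behaves the same way as the mb_ereg_search_*() result discussed above, and `\G` is the anchor that follows the offset.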

hormus avatar Nov 18 '22 17:11 hormus

Oh, right, the options! There is some documentation about these on https://www.php.net/manual/en/function.mb-regex-set-options.php, but that likely can be improved (mention defaults, etc.), and we should at the very least link to https://github.com/kkos/oniguruma/blob/master/doc/RE which has detailed info about patterns.
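Until the defaults are documented, one way to see what a given installation uses is to read the options back. This is only a sketch: the option string 'mr' (option m plus Ruby syntax mode r) is chosen purely for illustration, and no particular default is assumed, since the value may differ between PHP versions.

```php
<?php
// Called without arguments, mb_regex_set_options() returns the current
// option string and changes nothing.
$current = mb_regex_set_options();
var_dump($current);

// Switch to option "m" with Ruby syntax mode "r" (illustrative choice only).
mb_regex_set_options('mr');
$changed = mb_regex_set_options();
var_dump($changed);

// Restore whatever was active before, so later mbregex calls are unaffected.
mb_regex_set_options($current);
$restored = mb_regex_set_options();
```

Restoring the saved string matters because these options are global state shared by mb_split() and the other mbregex functions.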

cmb69 avatar Nov 22 '22 10:11 cmb69

Oh, thanks @cmb69 for improving PHP. It's actually a headache: there is an exception, mb_split(), which has no options argument but still uses the options internally.

<?php

$string = "0aaa\r\n0a0";
mb_regex_set_options(''); // An empty string resets both the mode and the options:
// an empty mode falls back to Ruby syntax, and the option set is left empty
$a = mb_split('^', $string);
var_dump(mb_regex_set_options(), $a);

?>

https://github.com/php/php-src/blob/master/ext/mbstring/tests/mb_split_empty_match.phpt Without setting anything manually, the default is Ruby syntax mode with option p, which is the bitwise OR of singleline (s) and multiline (m). As pointed out by the test by @nikic, the expected behavior is the one with m. I wonder whether Oniguruma sees it the right way?

new(string, [options]) → regexp
new(regexp) → regexp
compile(string, [options]) → regexp
compile(regexp) → regexp

Constructs a new regular expression from pattern, which can be either a String or a Regexp (in which case that regexp's options are propagated), and new options may not be specified (a change as of Ruby 1.8).

If options is an Integer, it should be one or more of the constants Regexp::EXTENDED, Regexp::IGNORECASE, and Regexp::MULTILINE, or-ed together. Otherwise, if options is not nil or false, the regexp will be case insensitive.

Sorry, I couldn't see in php/php-src what Oniguruma actually does. If mb_split() can't use any syntax mode other than Ruby, that needs to be documented, as does the fact that an empty string resets the mode and options to the Ruby syntax mode with no options.

hormus avatar Nov 23 '22 01:11 hormus