hashlink icon indicating copy to clipboard operation
hashlink copied to clipboard

Migrate to pcre2

Open tobil4sk opened this issue 3 years ago • 35 comments

PCRE 1 is no longer maintained and doesn't get updates, so this PR ports the regexp module of the standard library to use PCRE 2 instead. All build systems have been updated to link PCRE 2.

See also: https://github.com/HaxeFoundation/haxe/issues/10491

tobil4sk avatar Feb 19 '22 19:02 tobil4sk

I am surprised this was merged on the deprecated nekovm before here:

  • HaxeFoundation/neko#249

Uzume avatar Apr 21 '22 16:04 Uzume

This one isn't green. :P

Simn avatar Apr 21 '22 16:04 Simn

@Simn Check again ;)

tobil4sk avatar Apr 23 '22 08:04 tobil4sk

I'm a bit irritated that there are so many new .c files. Is this just because they reorganized something? I just want to make sure we don't add anything unnecessary here because the stats say that this PR has a net gain of like 26k locs.

Simn avatar Apr 23 '22 09:04 Simn

It's just because we have an entire copy of the pcre library within include/pcre, so updating to pcre2 requires updating all these files. I think the reasoning behind this has been to make it easier to compile on Windows, so I just followed the legacy and kept the library there. (I did try to see if any of them could be removed, however, all of the ones that are left seem necessary)

tobil4sk avatar Apr 23 '22 09:04 tobil4sk

And yes, the files were reorganised with pcre2 as far as I can tell. This the documentation file that describes how to build pcre2 on Windows which I followed: https://github.com/PCRE2Project/pcre2/blob/master/NON-AUTOTOOLS-BUILD

tobil4sk avatar Apr 23 '22 09:04 tobil4sk

@ncannasse Please check if this works for you and merge accordingly. We should make this change, but I'd like to make sure it doesn't break anything for you.

Simn avatar Apr 23 '22 09:04 Simn

By the way, for large dependencies included in the repo you might want to mark them as "vendored" in the attributes: .gitattributes

Aurel300 avatar Apr 24 '22 11:04 Aurel300

Ah, I wasn't aware of that. We should probably just mark the entire include directory, although that might be outside the scope of this PR.

tobil4sk avatar Apr 24 '22 13:04 tobil4sk

#541 should mark these directories as third party libs as you suggested.

tobil4sk avatar Apr 24 '22 14:04 tobil4sk

Few comments:

I'm quite worried with the following:

// this is reinitialised for each new regex object...
	match_context = pcre2_match_context_create_16(NULL);
	pcre2_set_depth_limit_16(match_context, 3500); 

Seems like a memory leak because the context is never free

ncannasse avatar Apr 26 '22 08:04 ncannasse

@tobil4sk Why are you adding the _16 suffixes all over in src/std/regexp.c? The macros in include/pcre/pcre2.h already take care of adding such when PCRE2_CODE_UNIT_WIDTH is defined appropriately (and they complain verbosely when it isn't)? There is no need for that extra noise (and it is possible such things could become problematic too if at some point the direct API changes and the macro API remains the same, etc.).

I also agree with @ncannasse, if you are calling pcre2_match_context_create then there should be a matching call to pcre2_match_context_free somewhere to balance things (unless I am missing some sort of automatic garbage collection somewhere). match_context should probably be moved into ereg and then pcre2_match_context_free called on it in its finalize (regexp_finalize).

Incidentally, since it seems you are always setting the match recursion depth limit to a constant, you could instead set the compile-time option MATCH_LIMIT_DEPTH (either in include/pcre/config.h or externally in the make files) and then pass NULL for the pcre2_match_context parameter of pcre2_match. That would alleviate having to allocate and deallocate such in src/std/regexp.c ereg.

It seems like the same memory leak was already introduced into nekovm with HaxeFoundation/neko#249 (where oddly you decided to not use the _8 suffixes).

Uzume avatar Apr 26 '22 20:04 Uzume

Seems like a memory leak because the context is never free

@ncannasse This is required for the entire lifetime of the program, which is why it could not be freed anywhere. The memory leak existed before as well, however, I guess now is a good time to deal with it. With @Uzume's tip (thanks btw!) about the compile time option, we can avoid this issue altogether.

Why are you adding the _16 suffixes all over in src/std/regexp.c

@Uzume I left the 16 suffixes because they were used before (whereas the neko files didn't have them). I can remove them but I assumed there was some reason to add them if they're optional.

tobil4sk avatar Apr 27 '22 15:04 tobil4sk

Although, the only issue with this is that now the variable must be set externally (either in config.h or every build system) rather then conveniently in the regexp file. Setting it manually in make/cmake/vs builds is a bit messy, but at the same time having it in config.h means that there's a risk that when updating the library in future, someone may miss this configuration out, which would silently cause problems.

EDIT: Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag.

tobil4sk avatar Apr 27 '22 17:04 tobil4sk

@tobil4sk No, I do not think there was a memory leak before. Before we had:

// ...
static pcre16_extra limit;
// ...
// hl_regexp_new_options
// ...
	limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
	limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
	int res = pcre16_exec(e->p,&limit,(PCRE_SPTR16)s,pos+len,pos,PCRE_NO_UTF16_CHECK,e->matches,e->nmatches * 3);
// ...

limit was a statically allocated struct that had some fields repeatedly reinitialized in hl_regexp_new_options and was used repeatedly by pcre16_exec in hl_regexp_match.

Now we have:

// ...
static pcre2_match_context_16 *match_context;
// ...
// hl_regexp_new_options
// ...
	// this is reinitialised for each new regex object...
	match_context = pcre2_match_context_create_16(NULL);
	pcre2_set_depth_limit_16(match_context, 3500); // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
	int res = pcre2_match_16(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
// ...

match_context is a statically allocated pointer to a struct that is repeatedly allocated and initialized by pcre2_match_context_create and has a field set by pcre2_set_depth_limit in hl_regexp_new_options and is used repeatedly by pcre2_match in hl_regexp_match.

The repeated calls of pcre2_match_context_create with no matching calls to pcre2_match_context_free is the introduction of a memory leak. This is why I first suggested moving match_context into ereg and adding pcre2_match_context_free to its finalizer. But I also suggested you move to a compile-time option of changing MATCH_LIMIT_DEPTH instead and removing the run-time code. The downside of using the compile-time option is that it affects the generated library and if we were sharing the generated library with other code that did not want such set by default it would be problematic. As it is we are generating the library only for our hashlink vm and I assume pcre is statically linked (or the dynamically linked library is specific to the built hashlink vm)

nekovm uses 8-bit code units (i.e., ASCII, UTF-8, etc.) but hashlink uses 16-bit code units (i.e., UTF-16).

In the old pcre.h API, 8-bit code unit interfaces have no numeric size references but 16-bit and 32-bit code unit interfaces are all designated with 16 or 32 in the names. The old API defines all three code unit interfaces together.

In the new pcre2.h API, all the different code unit interfaces are created by macros that result in _8, _16, or _32 being appended to them but the macros also require the consumer of the interface to define PCRE2_CODE_UNIT_WIDTH to specify the size of the code unit interface the consumer wants to use and it then provides a set of unsuffixed macro interfaces that automatically add the appropriate suffixes to the interface references based upon the provided (and required) PCRE2_CODE_UNIT_WIDTH. PCRE2_CODE_UNIT_WIDTH can also be defined to be 0 but this is for multi-width code unit applications where one must manually specify the suffixed interfaces, however, that is quite rare and in general a consumer only wants access to a single set of code unit interfaces, so this is simplified so they can all use the same code references that have no suffixes and the suffixes will be automatically applied by the judicious choice of how PCRE2_CODE_UNIT_WIDTH is defined.

For example, in the new pcre2.h API, nekovm and hashlink could in theory use a the same consumer interface code (if it used pcre in the same way) so long as they defined PCRE2_CODE_UNIT_WIDTH differently.

I recommend you drop the _16 suffixes in hashlink and now you know why 8 was not used in the nekovm consumer code and why you could have but did not need to add _8 to the new code there.

Uzume avatar Apr 27 '22 18:04 Uzume

@Uzume Thanks for the insight!

match_context is a statically allocated pointer to a struct that is repeatedly allocated and initialized by pcre2_match_context_create

Yes, I see the difference now (and where the memory leak would be). The only reason it is repeatedly allocated is because, as far as I could tell, there was no way of allocating something when a module is first loaded, the same way that can be done with neko.

I also suggested you move to a compile-time option of changing MATCH_LIMIT_DEPTH

I think this is probably the best way, because aside from that there is no other reason for having a context object. Also, the context should really be global, as there is currently no reason to have a different context per regex object. Just about whether to put it in config.h or to put it in make and the other build systems.

I recommend you drop the _16 suffixes in hashlink

That explanation makes sense, so I do think now it would make sense to get rid of the suffix ;)

tobil4sk avatar Apr 27 '22 18:04 tobil4sk

Alternatively, I guess we could check whether the context object is null before allocating it, but I guess that might have some thread safety concerns attached to it.

tobil4sk avatar Apr 27 '22 18:04 tobil4sk

@tobil4sk

EDIT: Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag. Agreed. I also think the best method is in the build system (over config.h configuration changes; in fact generation of the config.h could be handled by the build system and the option could be provided that way).

There is a lot of indirection in the code but you can see the difference in how pcre2_set_depth_limit vs. MATCH_LIMIT_DEPTH works via this code:

// pcre2.h
// ...
#define PCRE2_TYPES_LIST \
//...
struct pcre2_real_match_context; \
typedef struct pcre2_real_match_context pcre2_match_context; \
// ...
#define PCRE2_JOIN(a,b) a ## b
#define PCRE2_GLUE(a,b) PCRE2_JOIN(a,b)
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a,PCRE2_LOCAL_WIDTH)
// ...
#define pcre2_match_context            PCRE2_SUFFIX(pcre2_match_context_)
// ...
#define PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS \
PCRE2_TYPES_LIST \
// ...
#define PCRE2_LOCAL_WIDTH 8
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH

#define PCRE2_LOCAL_WIDTH 16
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH

#define PCRE2_LOCAL_WIDTH 32
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH
// ...
#undef PCRE2_SUFFIX
// ...
#if PCRE2_CODE_UNIT_WIDTH == 8 || \
    PCRE2_CODE_UNIT_WIDTH == 16 || \
    PCRE2_CODE_UNIT_WIDTH == 32
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a, PCRE2_CODE_UNIT_WIDTH)
#elif PCRE2_CODE_UNIT_WIDTH == 0
#undef PCRE2_JOIN
#undef PCRE2_GLUE
#define PCRE2_SUFFIX(a) a
// ...
#endif
// ...

pcre2_match_context is defined to be struct pcre2_real_match_context (apparently three times).

// pcre2_intmodedep.h
// ...
typedef struct pcre2_real_match_context {
  pcre2_memctl memctl;
#ifdef SUPPORT_JIT
  pcre2_jit_callback jit_callback;
  void *jit_callback_data;
#endif
  int    (*callout)(pcre2_callout_block *, void *);
  void    *callout_data;
  int    (*substitute_callout)(pcre2_substitute_callout_block *, void *);
  void    *substitute_callout_data;
  PCRE2_SIZE offset_limit;
  uint32_t heap_limit;
  uint32_t match_limit;
  uint32_t depth_limit;
} pcre2_real_match_context;
// ...

struct pcre2_real_match_context is define with the last field named depth_limit.

// pcre2_context.c
// ...
const pcre2_match_context PRIV(default_match_context) = {
  { default_malloc, default_free, NULL },
#ifdef SUPPORT_JIT
  NULL,          /* JIT callback */
  NULL,          /* JIT callback data */
#endif
  NULL,          /* Callout function */
  NULL,          /* Callout data */
  NULL,          /* Substitute callout function */
  NULL,          /* Substitute callout data */
  PCRE2_UNSET,   /* Offset limit */
  HEAP_LIMIT,
  MATCH_LIMIT,
  MATCH_LIMIT_DEPTH };
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_set_depth_limit(pcre2_match_context *mcontext, uint32_t limit)
{
mcontext->depth_limit = limit;
return 0;
}
// ...

The last field of default_match_context of type pcre2_match_context is initialized to MATCH_LIMIT_DEPTH. While pcre2_set_depth_limit changes the depth_limit field of pcre2_match_context struct.

// pcre2_match.c
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
  PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext)
// ...
#ifdef SUPPORT_JIT
// ...
  rc = pcre2_jit_match(code, subject, length, start_offset, options,
    match_data, mcontext);
// ...
#endif  /* SUPPORT_JIT */
// ...
if (mcontext == NULL)
  {
  mcontext = (pcre2_match_context *)(&PRIV(default_match_context));
  mb->memctl = re->memctl;
  }
// ...
mb->match_limit_depth = (mcontext->depth_limit < re->limit_depth)?
  mcontext->depth_limit : re->limit_depth;
// ...

pcre2_match take a last parameter named mcontext of type pointer to pcre2_match_data. If SUPPORT_JIT is defined mcontext gets passed as the last argument to pcre2_jit_match. Otherwise if mcontext is NULL it is updated to point to default_match_context and later depth_limit field of mcontext is potentially used to set match_limit_depth field of whatever mb points to (which is immaterial since it will be the depth_limit field of either the provided pcre2_match_data or default_match_context where that field was initialized to MATCH_LIMIT_DEPTH ).

I shall leave it to you to figure out that in the case of SUPPORT_JIT pcre2_jit_match never touches the depth_limit field of the any pcre2_match_data provided or not (via NULL).

Uzume avatar Apr 27 '22 19:04 Uzume

Yes, I see the difference now (and where the memory leak would be). The only reason it is repeatedly allocated is because, as far as I could tell, there was no way of allocating something when a module is first loaded, the same way that can be done with neko.

Alternatively, I guess we could check whether the context object is null before allocating it, but I guess that might have some thread safety concerns attached to it.

@tobil4sk You might also be able to do something along the lines of:

static pcre2_match_context *match_context = pcre2_match_context_create(NULL);

In that case you would still have a memory leak but of just a single solitary pcre2_match_context which you could later repeatedly use pcre2_set_depth_limit on if you wanted. And I am not sure it could really be called a memory leak as per se because it will go away when the process dies just like statically allocated items anyway (meaning it is sort of pseudo static). And there should be no thread safety concerns that way (unless you count repeatedly updating the field but that existed before anyway).

Sadly you cannot do something like this:

static pcre2_match_context match_context;

Because of the way pcre2 has defined pcre2_match_context as an opaque type unlike how pcre_extra and pcre16_extra types were defined in pcre1.

Uzume avatar Apr 27 '22 19:04 Uzume

Agreed. I also think the best method is in the build system (over config.h configuration changes; in fact generation of the config.h could be handled by the build system and the option could be provided that way).

The only issue is that if PCRE2_CODE_UNIT_WIDTH is missed out, an error is given, however, if MATCH_LIMIT_DEPTH somehow ends up missed out then nothing is shown (we could do this ourselves I guess). Generating the file via the build systems would be complicated as this would have to be done for each one of them separately (current make, cmake, and Visual Studio are all available as separate options).

Also, currently, there is a comment explaining why this value for MATCH_LIMIT_DEPTH is chosen, however, I'm not sure where this comment could go if the flag is defined in all these different places (probably will just have to be copied to make and cmake).

static pcre2_match_context *match_context = pcre2_match_context_create(NULL);

Unfortunately, it doesn't allow this because it is not a constant value (would have been a good solution though).

tobil4sk avatar Apr 27 '22 20:04 tobil4sk

Unfortunately, it doesn't allow this because it is not a constant value (would have been a good solution though).

It is a constant value (after initial allocation the pointer would never change during the entire execution) but you are right it is not a so called compile-time constant and static initialization would require such. I had a momentary lapse (sorry about that). It has been a while since I had to be concerned with lifetimes in C.

Uzume avatar Apr 27 '22 20:04 Uzume

Upon thinking about this more, I am also concerned with some other issues (that were already in the original code).

For example match_context (and the earlier limit) are globally scoped despite only ever being used in hl_regexp_match. Whether the value has a static lifetime or not, it should probably be moved to be local to that function. If we consider it local to the function, why not have a solution like this:

HL_PRIM bool hl_regexp_match( ereg *e, vbyte *s, int pos, int len ) {
	static pcre2_match_context *match_context = NULL;
	int res;
	if (match_context == NULL) {
		match_context = pcre2_match_context_create(NULL);
		pcre2_set_depth_limit(match_context, 3500); // adapted based on Windows 1MB stack size;
	}
	res = pcre2_match(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
	e->matched = res >= 0;
	if( res >= 0 )
		return true;
	if( res != PCRE2_ERROR_NOMATCH )
		hl_error("An error occurred while running pcre2_match()");
	return false;
}

It is not a limit of the regular expression but rather a limit of the matching so unless we decide to put this into ereg (for its finalizer), why was this ever initialized in hl_regexp_new_options to begin with?

Uzume avatar Apr 27 '22 21:04 Uzume

@Uzume I think if we can handle it at compile time then that is the preferable solution.

For now I have placed it in each build file as well as config.h just to be safe, in case pcre2 is updated in future and the config.h is replaced, which would cause the setting to be lost. Is this fine? @ncannasse

tobil4sk avatar May 07 '22 11:05 tobil4sk

If we ever decide to link pcre2 dynamically this will have to be handled at run time again, maybe if that happens using a local static might be a fair solution.

tobil4sk avatar May 07 '22 11:05 tobil4sk

@Uzume I think if we can handle it at compile time then that is the preferable solution.

For now I have placed it in each build file as well as config.h just to be safe, in case pcre2 is updated in future and the config.h is replaced, which would cause the setting to be lost. Is this fine? @ncannasse

@tobil4sk I am not sure carrying that match depth limit is really still necessary. I can see how you are just porting that from the PCRE1 integration code in hashlink but the internal architecture has of PCRE has changed significantly so I am not sure it is really even interesting much less necessary to consider keeping that in the first place. For example, in NEWS you can see the following text for Version 10.30 14-August-2017:

  1. The main interpreter, pcre2_match(), has been refactored into a new version that does not use recursive function calls (and therefore the system stack) for remembering backtracking positions. This makes --disable-stack-for-recursion a NOOP. The new implementation allows backtracking into recursive group calls in patterns, making it more compatible with Perl, and also fixes some other previously hard-to-do issues. For patterns that have a lot of backtracking, the heap is now used, and there is an explicit limit on the amount, settable by pcre2_set_heap_limit() or (*LIMIT_HEAP=xxx). The "recursion limit" is retained, but is renamed as "depth limit" (though the old names remain for compatibility).

Based on just the above quote, I'd say newer default implementations (not dfa or jit) have traded some stack pressure for heap pressure.

Also in NON-AUTOTOOLS-BUILD in the section labelled STACK SIZE IN WINDOWS ENVIRONMENTS it states:

Prior to release 10.30 the default system stack size of 1MiB in some Windows environments caused issues with some tests. This should no longer be the case for 10.30 and later releases.

Previous in the same file and section for PCRE1 it stated:

The default processor stack size of 1Mb in some Windows environments is too small for matching patterns that need much recursion. In particular, test 2 may fail because of this. Normally, running out of stack causes a crash, but there have been cases where the test program has just died silently. See your linker documentation for how to increase stack size if you experience problems. The Linux default of 8Mb is a reasonable choice for the stack, though even that can be too small for some pattern/subject combinations.

PCRE has a compile configuration option to disable the use of stack for recursion so that heap is used instead. However, pattern matching is significantly slower when this is done. There is more about stack usage in the "pcrestack" documentation.

You can see similar text in earlier PCRE2 (the latter two about pcre2stack were later removed):

  • https://github.com/PCRE2Project/pcre2/blob/b6b716b54064bf65936f73d8542c0aa7986d73be/NON-AUTOTOOLS-BUILD
  • https://github.com/PCRE2Project/pcre2/blob/b6b716b54064bf65936f73d8542c0aa7986d73be/doc/html/NON-AUTOTOOLS-BUILD.txt
  • https://github.com/PCRE2Project/pcre2/blob/b6b716b54064bf65936f73d8542c0aa7986d73be/doc/pcre2stack.3
  • https://github.com/PCRE2Project/pcre2/blob/b6b716b54064bf65936f73d8542c0aa7986d73be/doc/html/pcre2stack.html

The code you are trying to port seems to refer to the same Windows 1Mb stack size:

	limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
	limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size

And I am sure there were other drastic changes during the changed from PCRE1 to PCRE2, so the question is do we need it and do we bother? I am not sure I am in a position to answer that but methinks it definitely makes sense to ask such.

Uzume avatar May 08 '22 03:05 Uzume

@tobil4sk

[...]This the documentation file that describes how to build pcre2 on Windows which I followed: https://github.com/PCRE2Project/pcre2/blob/master/NON-AUTOTOOLS-BUILD

In following NON-AUTOTOOLS-BUILD, I am not sure why you choose to include config.h (which requires externally defining HAVE_CONFIG_H to have any effect) at all. It is not required.

PCRE2 seems to provide four build possibilities with regard to config.h:

  1. Use a configure and make build that creates a config.h from src/config.h.in (which also defines HAVE_CONFIG_H)
  2. Use a cmake and Visual Studio (really MSBuild) or make build that creates a config.h from cmake/pcre2-config.cmake.in (which also defines HAVE_CONFIG_H)
  3. Use a "manual" (for another or custom build system) build, create a config.h from src/config.h.generic (or some other custom method) and define HAVE_CONFIG_H
  4. Use a "manual" (for another or custom build system) build without config.h and do not define HAVE_CONFIG_H

If I am doing a "manual" or custom build (where I am not using the build scripts from the PCRE sources), methinks I would prefer option 4 over option 3 as that route allows one to integrate things with one less extraneous file.

Although, the only issue with this is that now the variable must be set externally (either in config.h or every build system) rather then conveniently in the regexp file. Setting it manually in make/cmake/vs builds is a bit messy, but at the same time having it in config.h means that there's a risk that when updating the library in future, someone may miss this configuration out, which would silently cause problems.

EDIT: Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag.

Option 4 and setting things externally in the build system(s) also alleviates the future risk you mention above, while also allowing one to integrate the code with one less file.

The only issue is that if PCRE2_CODE_UNIT_WIDTH is missed out, an error is given, however, if MATCH_LIMIT_DEPTH somehow ends up missed out then nothing is shown (we could do this ourselves I guess). Generating the file via the build systems would be complicated as this would have to be done for each one of them separately (current make, cmake, and Visual Studio are all available as separate options).

Also, currently, there is a comment explaining why this value for MATCH_LIMIT_DEPTH is chosen, however, I'm not sure where this comment could go if the flag is defined in all these different places (probably will just have to be copied to make and cmake).

Removing config.h alleviates any such issues in the future (unless someone adds it back).

PCRE1 also optionally used a config.h and the previous integration here also used such but my question is why are we doing this?

Uzume avatar May 08 '22 03:05 Uzume

There are multiple other settings for specifying static linking and disabling jit support that are defined in this config file and it wouldn't make sense to define the long list of flags everywhere. These settings are pretty obvious as they are mentioned by name in the guide. The only reason there is a risk of missing out the match limit depth setting is because you need to know that it was configured before. I guess we could have a file somewhere explaining how to update the library so that the settings are not lost.

tobil4sk avatar May 08 '22 08:05 tobil4sk

The stuff about the stack/heap changes sounds promising. If we can get rid of this setting then all the problems are solved. I doubt there were hashlink tests to make sure whatever cases this setting was for were not crashing, but hopefully we can just safely remove it altogether judging by the fact that the stack is no longer used here?

tobil4sk avatar May 08 '22 08:05 tobil4sk

The stuff about the stack/heap changes sounds promising. If we can get rid of this setting then all the problems are solved.

@tobil4sk The stack is used but not to the same extent so it is much less likely to be causing an issue (which the code you are trying to port seems to be trying to avoid; likely because there was an issue in the past). For that reason, I believe it should not cause issues to now remove it but I have no way to actually test this theory. Maybe @ncannasse knows. He was the one that added that setting in the original PCRE1 integration on 2016-02-21, see df1828b208137c35c43aaf93d00941774d998f9b (search for PCRE_EXTRA_MATCH_LIMIT_RECURSION).

There is another potential option too. Since this setting seems only applicable to Windows, we could only set it on Windows builds. It might be possible to do this externally from the build scripts but if we opted to keep the config.h, it could be done something like:

#if defined(_WIN32) && !defined(MATCH_LIMIT_DEPTH)
#define MATCH_LIMIT_DEPTH 3500 // adapted based on Windows 1MB stack size
#endif

Then at least it would not be applied across all builds when obviously it is a tweak specific to Windows.

But frankly I do not think this does the same thing as it used to do. In digging through PCRE2 history I can see changes in config.h.generic (which is generated but still I am sure I could find the source of such if I tried). Currently:

/* The above limit applies to all backtracks, whether or not they are nested.
   In some environments it is desirable to limit the nesting of backtracking
   (that is, the depth of tree that is searched) more strictly, in order to
   restrict the maximum amount of heap memory that is used. The value of
   MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
   must be less than the value of MATCH_LIMIT. The default is to use the same
   value as MATCH_LIMIT. There is a runtime method for setting a different
   limit. In the case of pcre2_dfa_match(), this limit controls the depth of
   the internal nested function calls that are used for pattern recursions,
   lookarounds, and atomic groups. */
#ifndef MATCH_LIMIT_DEPTH
#define MATCH_LIMIT_DEPTH MATCH_LIMIT
#endif

Notice: "limit the nesting of backtracking[...] in order to restrict the maximum amount of heap memory that is used."

And earlier:

/* The above limit applies to all backtracks, whether or not they are nested.
   In some environments it is desirable to limit the nesting of backtracking
   more strictly, in order to restrict the maximum amount of heap memory that
   is used. The value of MATCH_LIMIT_RECURSION provides this facility. To have
   any useful effect, it must be less than the value of MATCH_LIMIT. The
   default is to use the same value as MATCH_LIMIT. There is a runtime method
   for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif

Apparently this changed names from MATCH_LIMIT_RECURSION.

And earlier still:

/* The above limit applies to all calls of match(), whether or not they
   increase the recursion depth. In some environments it is desirable to limit
   the depth of recursive calls of match() more strictly, in order to restrict
   the maximum amount of stack (or heap, if HEAP_MATCH_RECURSE is defined)
   that is used. The value of MATCH_LIMIT_RECURSION applies only to recursive
   calls of match(). To have any useful effect, it must be less than the value
   of MATCH_LIMIT. The default is to use the same value as MATCH_LIMIT. There
   is a runtime method for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif

Notice: "limit the depth of recursive calls[...] in order to restrict the maximum amount of stack[...] that is used."

So it used to limit stack usage during recursive backtracking but now limits heap usage during backtracking (which no longer uses recursion). I am assuming it was possible to optimize the algorithm to use tail-recursion (if it wasn't already using such) which was then turned in to a loop. Thus less stack allocations and more heap allocations (which doesn't threaten the 1Mb stack default on Windows).

I doubt there were hashlink tests to make sure whatever cases this setting was for were not crashing, but hopefully we can just safely remove it altogether judging by the fact that the stack is no longer used here?

I agree. I doubt there are tests for this and as such, I recommend just ripping this out and see if someone screams. It can always be put back in if necessary.

This should likely also be done for nekovm too as the pcre2_set_depth_limit code seems to be unnecessary. As a side note, there is a newer version of PCRE2 out too. It seems you have integrated pcre2-10.39 dated 2021-10-29 but there is a pcre2-10.40 dated 2022-04-14. When you push an update, rip out the unnecessary pcre2_set_depth_limit code (technically this is also a memory leak but it is not normally repeatedly executed unlike here). I am not sure if you want to bother updating this to 10.40 now or not.

Uzume avatar May 08 '22 10:05 Uzume

see https://github.com/HaxeFoundation/hashlink/commit/df1828b208137c35c43aaf93d00941774d998f9b

The more relevant commit for the limit being added is the neko one, as that is where the hashlink version was ported from: https://github.com/HaxeFoundation/neko/commit/9dcc5e89e86060eb4f195455a48084ff9c52f861

Since this setting seems only applicable to Windows, we could only set it on Windows builds.

I'm not sure about whether it would have been a good idea to have slightly different behaviour between platforms and only apply this setting on Windows, but I think the best thing to do now is to get rid of it.

This should likely also be done for nekovm too as the pcre2_set_depth_limit code seems to be unnecessary. As a side note, there is a newer version of PCRE2 out too.

I guess we might as well update to 10.40 here, but probably not for neko as the main reason was to get rid of pcre1. But it would still be a good idea to fix this on neko as well.

tobil4sk avatar May 08 '22 10:05 tobil4sk