optimize yy_get_next_buffer by switching to memcpy
based on https://lists.defectivebydesign.org/archive/html/help-flex/2013-01/msg00000.html
@jafl @Explorer09 Apart from rather cosmetic things, could you please explain
- under which circumstances `number_to_move` is non-positive here? As always, context matters.
- whether you checked that there is no overlap for `memcpy`?
- whether there is any benchmarking for using `memcpy`? What is the benefit of the change in terms of speed?
The original for loop assumes that there is no overlap, so I don't think we need to add a check for that.
The message I linked to explains the performance issue.
I think that it is the opposite. Because the original for loop is copying one byte at a time, overlap safety is implied.
If number_to_move is bigger than half of the buffer size, then you have an overlap.
That being said, I definitely like the idea to replace a for loop with memcpy/memmove to let gcc and/or glibc apply low-level copy optimizations when available and possible.
I'm not sure what the current state of recent gcc versions is. Maybe gcc has become clever enough to detect for loops used for byte copies and replace them with memcpy/memmove automatically, but I wouldn't bet my life on it...
My take on the issue is that memmove() would be more appropriate here.
Greetings,
Also, the other thing to consider is: how big is the average copy?
I would think that for most lexers, where token sizes are pretty small, those copies must be around 10 bytes or so.
gcc cannot inline memcpy/memmove in yy_get_next_buffer because the copy size must be known at compile time: https://stackoverflow.com/questions/11747891/when-builtin-memcpy-is-replaced-with-libcs-memcpy
So it comes down to figuring out whether the memcpy/memmove speed-up for a few-byte copy is greater than the extra function call overhead. The answer isn't trivial and possibly cannot be found without measurements...
It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment to discover the for loop.
I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to internally allocated processing buffer.
Doesn't memcpy already manage overlapping regions? It's been a while, but I thought that was the whole point of using it, other than that it's also usually optimized for your hardware...
Rick
No, memcpy() doesn't manage overlapping regions. The main benefit of memcpy() is that it is ultra-optimized using architecture-specific assembly instructions, which guarantees the best copy performance possible, as you said.
GCC can even use an inline version for even faster copies, but it can only do so when the copy size is known at compile time, i.e., when the memcpy size parameter is a constant value.
Now, for very small copies (i.e., 10 bytes or less), you need to figure out whether the memcpy() speedup is worth the function call overhead. IMHO, it is negligible, but I could be wrong, and the only way to be absolutely sure is to write a test and measure the execution time of a lexer using memcpy vs. the time of a lexer using the for loop.
From the memcpy man page:
The memcpy() function copies n bytes from memory area src to memory area dest. The memory areas must not overlap. Use memmove(3) if the memory areas do overlap.
On Fri, 2020-03-06 at 18:08 -0800, John Lindal wrote:
> It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment to discover the for loop.
To be convinced of that (and possibly other claims), I would need to get a copy of your lex file and an example input file that lets you measure the performance improvement.
When you say memcpy() is faster, faster by how much? Did you measure the execution time before and after the memcpy patch?
> I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to internally allocated processing buffer.
It can. In my last reply, I gave you the exact condition under which it will happen. Again, when number_to_move is bigger than half of the buffer pointed to by char *dest = YY_CURRENT_BUFFER_LVALUE->yy_ch_buf.
yytext_ptr points inside that buffer, at the current token text to pass back to the lexer actions when/if there is a match.
The copy happens only when new data cannot be pushed further into the current lexer buffer because the end of the buffer has been reached. So it moves the current token text to the beginning of the buffer to be able to append more characters after it. As a last resort, realloc is used.
Now, what you refer to as copying from the externally supplied input buffer to the internally allocated processing buffer is actually done inside YY_INPUT().
Greetings,
I'm going to wrap up my comments by saying this.
Calling memmove() has a cost. For very small copies, that cost will be higher than the speed-up that memmove()'s finely tuned code can provide. Unless you know the equilibrium point, the specific copy size at which memmove()'s execution time equals the for loop's, you don't know whether you are going to improve performance or degrade it. That would need to be measured, but I suspect it is around 10 bytes...
It is true that for some frankenstein, monstrously big (i.e., several hundred bytes long) tokens (e.g., multiline comments), the memmove change will speed things up, but I suspect that it could degrade performance for many existing applications where the average token size is rather small, like most programming language lexers. But maybe not, too. IMHO, you cannot offer a performance improvement patch with simply a claim that it is faster, without backing that claim with some empirical data showing profiling stats for a few typical lexing scenarios.
Don't get me wrong. I love the idea of replacing a for loop with memcpy/memmove, and I'm not overly concerned about a possible performance degradation. If my buffers are big enough to store 200+ tokens, even if I get a performance hit from the change, it shouldn't be that horrible.
All I'm saying is that a performance patch should always come along with proof that it indeed improves performance, and that is lacking here. Linking to a 7-year-old email showing that some dude said it improved his use case, IMHO, isn't good enough.
@lano1106 Agreed. I will run some tests and post the results.
I wrote the following test program:
```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define ITER_COUNT 10000

int main()
{
    char source[1048576], dest[1048576];
    int i, j, k, count;
    struct timeval start, stop;

    printf("loop\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
        {
            char *b1 = source, *b2 = dest;
            for (i = 0; i < count; i++)
                *(b2++) = *(b1++);
        }
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }

    printf("memcpy\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
            memcpy(dest, source, count);
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }

    printf("memmove\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
            memmove(dest, source, count);
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }
    return 0;
}
```
and got the following results on a MacBook Pro 2.8GHz i7:
```
$ ./perf
loop
2: 0.005000 microseconds
4: 0.008300 microseconds
8: 0.016300 microseconds
16: 0.031500 microseconds
32: 0.070400 microseconds
64: 0.122400 microseconds
128: 0.250500 microseconds
256: 0.481800 microseconds
512: 0.933800 microseconds
1024: 1.899400 microseconds
2048: 3.736400 microseconds
4096: 7.669300 microseconds
8192: 14.774500 microseconds
16384: 29.484200 microseconds
32768: 55.893600 microseconds
65536: 115.655800 microseconds
131072: 225.846900 microseconds
262144: 484.298500 microseconds
524288: 956.787700 microseconds
1048576: 1877.529200 microseconds
memcpy
2: 0.006400 microseconds
4: 0.005500 microseconds
8: 0.008600 microseconds
16: 0.004600 microseconds
32: 0.004600 microseconds
64: 0.009500 microseconds
128: 0.016100 microseconds
256: 0.017200 microseconds
512: 0.019500 microseconds
1024: 0.024100 microseconds
2048: 0.033300 microseconds
4096: 0.053600 microseconds
8192: 0.090600 microseconds
16384: 1.302700 microseconds
32768: 0.855000 microseconds
65536: 2.001100 microseconds
131072: 3.707800 microseconds
262144: 8.672900 microseconds
524288: 21.687100 microseconds
1048576: 42.431900 microseconds
memmove
2: 0.008000 microseconds
4: 0.007100 microseconds
8: 0.005700 microseconds
16: 0.004300 microseconds
32: 0.007500 microseconds
64: 0.005900 microseconds
128: 0.006800 microseconds
256: 0.011000 microseconds
512: 0.012400 microseconds
1024: 0.022300 microseconds
2048: 0.031400 microseconds
4096: 0.045500 microseconds
8192: 0.087000 microseconds
16384: 0.254600 microseconds
32768: 0.860600 microseconds
65536: 3.183100 microseconds
131072: 5.627500 microseconds
262144: 9.639500 microseconds
524288: 19.199900 microseconds
1048576: 36.682800 microseconds
```
First, `memmove` is faster than `memcpy`. Second, `memmove` is faster than the loop at 4 bytes!
Excuse me, but why do we need to care about the `memmove` vs. `memcpy` performance difference? The two functions have different purposes and one cannot be substituted for the other. If in some situations `memmove` performs better than `memcpy`, then it's a problem in the libc implementation that we should forward upstream instead of trying to "fix" here.
Also, I consider the microsecond-level performance differences in the above report unconvincing. There are several factors that can impact memory performance: CPU cache size, system load, OS memory management technique, whether you use swap space, etc. If you don't control all of these variables, the performance comparisons are not useful.
I just found it interesting that memmove was faster on my machine than memcpy. I changed to it because others commented that they felt better about memmove.
@jafl Since you performed the `memmove` test last, it is likely that the shorter time you observed is not from the implementation of `memmove` itself, but because the memory is cached.
If you are serious about the perf test, you should test the same copy operation on the same memory block a dozen times (the result in the first few times might be unreliable). That way you could rule out cache misses impacting the performance.
The only issue that the perf test needs to settle is whether or not it is significantly faster than the original for loop, which it clearly is. `memcpy` vs. `memmove` should not be decided based on the perf test. People said they felt `memmove` was safer, so I changed to that, though I believe there can't be any overlap.
Great discussion!
@jafl John,
- I would suggest running the `for` loop/`memcpy`/`memmove` only if `dest != source`, which in this context is certainly a harmless condition. (BTW: a reason for my question regarding overlaps.)
- And would you be able to check, with your mega token sample, the gradual performance change with the `for` loop/`memcpy`/`memmove`?

Copying the token is needed only once (namely at the beginning), such that the innocent `if` condition could provide the nice benefit of improving efficiency for mega tokens from `O(n^2)` down to `O(n)` (where `n` is the token length).
@jannick0 I don't understand your second question. The numbers I posted show that the time for the for loop grows faster than for memcpy/memmove.
Sorry for having been a bit sloppy. I wanted to ask if you could run various flex scanners with your mega token example:
- flex 2.6.4 - the current release version
- with the condition `if (source != dest)` prepended to the `for` loop, to avoid the clearly redundant replacements
- with the `for` loop replaced by `memmove`, but still with the `if` condition in place

and show the gradual performance changes. This should apply to flex scanners, not to your separate sample code. That would help show that the additional `if` condition does matter for mega tokens with length exceeding the scanner's buffer size. Grateful if you could possibly run these tests.
Many thanks.
- John, your test program is very convincing.
- Yes, memmove is needed. Overlap is certainly possible: dest points at the start of yy_ch_buf and source points at the end. That it is faster is just an added bonus. You are not the first to make that observation; it apparently has something to do with the CPU cache. My only objection to your patch proposal was that memmove() would deteriorate performance for small copies, which are, IMHO, the vast majority of normal scenarios, but your test showed that my concern was invalid.
- Overlap means: dest + number_to_move > source, but I really don't see how source == dest could test true... So, IMHO, the test if (source != dest) isn't needed at all.

For what it is worth (I'm just a casual observer and a flex user), you have my blessing for your patch.
Does anybody else have any objections or concerns with this patch? If not, how can we move forward to get it merged?
(IMHO, `sizeof(*source)` isn't necessary since the standard requires that char has size 1, but we can leave it in if others feel strongly about it.)
To work around this while the PR is still open, I modified my `make` rule:

```makefile
%.cpp : %.l
	${LEX} ${LFLAGS} -o$@ $<
	@perl -n -e 's/for\s*\(\s*i\s*=\s*0;\s*i\s*<\s*number_to_move;\s*\+\+i\s*\)/memmove(dest,source,number_to_move);/g;' \
	      -e 's/\s*\*\(dest\+\+\)\s*=\s*\*\(source\+\+\);//g;' \
	      -e 'print;' < $@ > $@.new
	@mv $@.new $@
```
If you can resolve the conflicts and address the concerns raised in the discussion, I could have another look.
@westes resolved
@jafl Could you please rebase instead? It makes no sense to have a merge commit with a parent that points to a years-old commit.
@jafl Could you please rebase instead?
In general, we prefer rebase to merge. And given the age of the initial PR, that will make more sense.
@westes Done.
@jafl The commit message should say memmove rather than memcpy. Apart from this, the commit looks okay.
@Explorer09 Good catch! Fixed.
Thanks; this is now on master.