optimize yy_get_next_buffer by switching to memcpy
based on https://lists.defectivebydesign.org/archive/html/help-flex/2013-01/msg00000.html
@jafl @Explorer09 Apart from rather cosmetic things, could you please explain
- under which circumstances `number_to_move` is non-positive here? As always, context matters.
- whether you checked that there is no overlap for `memcpy`?
- whether there is any benchmarking for using `memcpy`? What is the benefit of the change in terms of speed?
The original for loop assumes that there is no overlap, so I don't think we need to add a check for that.
The message I linked to explains the performance issue.
I think that it is the opposite. Because the original for loop is copying one byte at a time, overlap safety is implied.
If number_to_move is bigger than half of the buffer size, then you have an overlap.
That being said, I definitely like the idea to replace a for loop with memcpy/memmove to let gcc and/or glibc apply low-level copy optimizations when available and possible.
I'm not sure what the current state of recent gcc versions is. Maybe gcc has become clever enough to detect for loops used for byte copies and replace them with memcpy/memmove automatically, but I wouldn't bet my life on it...
My take on the issue is that memmove() would be more appropriate here.
Greetings,
Also, the other thing to consider is: how big is the average copy?
I would think that for most lexers, where token sizes are pretty small, those copies must be around 10 bytes or so.
gcc cannot inline memcpy/memmove in yy_get_next_buffer because the copy size must be known at compile time: https://stackoverflow.com/questions/11747891/when-builtin-memcpy-is-replaced-with-libcs-memcpy
So it comes down to figuring out whether the memcpy/memmove speed-up for a few-byte copy is greater than the extra function call overhead. The answer isn't trivial and possibly cannot be found without measurements...
It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment to discover the for loop.
I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to internally allocated processing buffer.
Doesn't memcpy already manage overlapping regions? It's been a while, but I thought that was the whole point of using it, other than that it's also usually optimized for your hardware...
Rick
No, memcpy() doesn't manage overlapping regions. The main benefit of memcpy() is that it is ultra-optimized using architecture-specific assembly instructions, which guarantees the best copy performance possible, as you said.
GCC can even use an inline version for even faster copies, but it can only do so when the copy size is known at compile time, i.e., when the memcpy size parameter is a constant value.
Now, for very small copies (i.e., 10 bytes or less), you need to figure out whether the memcpy() speedup is worth the function call overhead. IMHO, it is negligible, but I could be wrong, and the only way to be absolutely sure is to write a test and measure the execution time of a lexer using memcpy vs. the time of a lexer using the for loop.
From the memcpy man page:
The memcpy() function copies n bytes from memory area src to memory area dest. The memory areas must not overlap. Use memmove(3) if the memory areas do overlap.
On Fri, 2020-03-06 at 18:08 -0800, John Lindal wrote:
> It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment to discover the for loop.
To be convinced of that (and possibly other claims), I would need to get a copy of your lex file and an example input file that lets you measure the performance improvement.
When you say memcpy() is faster, faster by how much? Did you measure the execution time before and after the memcpy patch?
> I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to internally allocated processing buffer.
It can. In my last reply, I gave you the exact condition under which it will happen. Again, when number_to_move is bigger than half of the buffer pointed to by char *dest = YY_CURRENT_BUFFER_LVALUE->yy_ch_buf.
yytext_ptr points inside that buffer, at the current token text to pass back to the lexer actions when/if there is a match.
The copy happens only when new data cannot be pushed further into the current lexer buffer because the end of the buffer has been reached. So it moves the current token text to the beginning of the buffer to be able to append more characters after it. As a last resort, realloc is used.
Now, what you refer to as copying from the externally supplied input buffer to the internally allocated processing buffer is actually done inside YY_INPUT().
Greetings,
I'm going to wrap up my comments by saying this.
Calling memmove() has a cost. For very small copies, that cost will be higher than the speed-up that memmove()'s finely tuned code can provide. Unless you know the equilibrium point, the specific copy size at which memmove()'s execution time equals the for loop's, you don't know whether you are going to improve performance or degrade it. That would need to be measured, but I suspect it is around 10 bytes...
It is true that for some frankenstein, monstrously big (i.e., several hundred bytes long) tokens (e.g., multiline comments), the memmove change will speed things up, but I suspect that it could degrade performance for many existing applications where the average token size is rather small, like most programming language lexers. But maybe not, too. IMHO, you cannot offer a performance improvement patch with simply a claim that it is faster, without backing that claim with some empirical data showing profiling stats for a few typical lexing scenarios.
Don't get me wrong. I love the idea of replacing a for loop with memcpy/memmove, and I'm not overly concerned about a possible performance degradation. If my buffers are big enough to store 200+ tokens, even if I get a performance hit from the change, it shouldn't be that horrible.
All I'm saying is that a performance patch should always come along with proof that it indeed improves performance, and that is lacking here. Linking to a 7-year-old email showing that some dude said it improved his use case, IMHO, isn't good enough.
@lano1106 Agreed. I will run some tests and post the results.
I wrote the following test program:
```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define ITER_COUNT 10000

int main()
{
    char source[1048576], dest[1048576];
    int i, j, k, count;
    struct timeval start, stop;

    printf("loop\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
        {
            char *b1 = source, *b2 = dest;
            for (i = 0; i < count; i++)
                *(b2++) = *(b1++);
        }
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }

    printf("memcpy\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
            memcpy(dest, source, count);
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }

    printf("memmove\n");
    count = 1;
    for (j = 0; j < 20; j++)
    {
        count *= 2;
        gettimeofday(&start, NULL);
        for (k = 0; k < ITER_COUNT; k++)
            memmove(dest, source, count);
        gettimeofday(&stop, NULL);
        printf("%d: %f microseconds\n", count,
               ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec) / ITER_COUNT);
    }
    return 0;
}
```
and got the following results on a MacBook Pro 2.8GHz i7:
```
$ ./perf
loop
2: 0.005000 microseconds
4: 0.008300 microseconds
8: 0.016300 microseconds
16: 0.031500 microseconds
32: 0.070400 microseconds
64: 0.122400 microseconds
128: 0.250500 microseconds
256: 0.481800 microseconds
512: 0.933800 microseconds
1024: 1.899400 microseconds
2048: 3.736400 microseconds
4096: 7.669300 microseconds
8192: 14.774500 microseconds
16384: 29.484200 microseconds
32768: 55.893600 microseconds
65536: 115.655800 microseconds
131072: 225.846900 microseconds
262144: 484.298500 microseconds
524288: 956.787700 microseconds
1048576: 1877.529200 microseconds
memcpy
2: 0.006400 microseconds
4: 0.005500 microseconds
8: 0.008600 microseconds
16: 0.004600 microseconds
32: 0.004600 microseconds
64: 0.009500 microseconds
128: 0.016100 microseconds
256: 0.017200 microseconds
512: 0.019500 microseconds
1024: 0.024100 microseconds
2048: 0.033300 microseconds
4096: 0.053600 microseconds
8192: 0.090600 microseconds
16384: 1.302700 microseconds
32768: 0.855000 microseconds
65536: 2.001100 microseconds
131072: 3.707800 microseconds
262144: 8.672900 microseconds
524288: 21.687100 microseconds
1048576: 42.431900 microseconds
memmove
2: 0.008000 microseconds
4: 0.007100 microseconds
8: 0.005700 microseconds
16: 0.004300 microseconds
32: 0.007500 microseconds
64: 0.005900 microseconds
128: 0.006800 microseconds
256: 0.011000 microseconds
512: 0.012400 microseconds
1024: 0.022300 microseconds
2048: 0.031400 microseconds
4096: 0.045500 microseconds
8192: 0.087000 microseconds
16384: 0.254600 microseconds
32768: 0.860600 microseconds
65536: 3.183100 microseconds
131072: 5.627500 microseconds
262144: 9.639500 microseconds
524288: 19.199900 microseconds
1048576: 36.682800 microseconds
```
First, `memmove` is faster than `memcpy`. Second, `memmove` is faster than the loop at 4 bytes!
Excuse me, but why do we need to care about the `memmove` vs. `memcpy` performance difference? The two functions have different purposes and one cannot be substituted for the other. If in some situations `memmove` performs better than `memcpy`, then it's a problem in the libc implementation that we should forward upstream instead of trying to "fix" here.
Also, I consider the microsecond-level performance differences in the above report unconvincing. There are several factors that can impact memory performance: CPU cache size, system load, OS memory management technique, whether you use swap space, etc. If you don't control all of these variables, the performance comparisons are not useful.
I just found it interesting that memmove was faster on my machine than memcpy. I changed to it because others commented that they felt better about memmove.
@jafl Since you performed the `memmove` test last, it is likely that the shorter time you observed is not from the implementation of `memmove` itself, but because the memory is cached.
If you are serious about the perf test, you should test the same copy operation on the same memory block a dozen times (the result in the first few times might be unreliable). That way you could rule out cache misses impacting the performance.
The only issue that the perf test needs to settle is whether or not it is significantly faster than the original for loop, which it clearly is. `memcpy` vs. `memmove` should not be decided based on the perf test. People said they felt `memmove` was safer, so I changed to that, though I believe there can't be any overlap.
Great discussion!
@jafl John,
- I would suggest running the `for` loop/`memcpy`/`memmove` only if `dest != source`, which in this context is certainly a harmless condition. (BTW: a reason for my question regarding overlaps.)
- And would you be able to check, with your mega token sample, the gradual performance change with the `for` loop/`memcpy`/`memmove`?

Copying the token is needed only once (namely at the beginning), such that the innocent `if` condition could provide the nice benefit of improving efficiency for mega tokens from `O(n^2)` down to `O(n)` (where `n` is the token length).
@jannick0 I don't understand your second question. The numbers I posted show that the time for the for loop grows faster than for memcpy/memmove.
Sorry for having been a bit sloppy. I wanted to ask if you could run various flex scanners with your mega token example:
- flex 2.6.4 - the current release version
- with the condition `if (source != dest)` prepended to the `for` loop, to avoid the clearly redundant replacements
- with the `for` loop replaced by `memmove`, but still with the `if` condition in place

and show the gradual performance changes. This should apply to flex scanners, not to your separate sample code. That would help show that the additional `if` condition does matter for mega tokens with length exceeding the scanner's buffer size. Grateful if you could possibly run these tests.
Many thanks.
- John, your test program is very convincing.
- Yes, memmove is needed. Overlap is certainly possible: dest points at the start of yy_ch_buf and source points at the end. That it is faster is just an added bonus. You are not the first to make that observation; it apparently has something to do with the CPU cache. My only objection to your patch proposal was that memmove() would deteriorate performance for small copies, which are, IMHO, the vast majority of normal scenarios, but your test showed that my concern was invalid.
- Overlap means: dest + number_to_move > source, but I really don't see how source == dest could test true... So, IMHO, the test if (source != dest) isn't needed at all.

For what it is worth (I'm just a casual observer and a flex user), you have my blessing for your patch.
Does anybody else have any objections or concerns with this patch? If not, how can we move forward to get it merged?
(IMHO, `sizeof(*source)` isn't necessary since the standard requires that char has size 1, but we can leave it in if others feel strongly about it.)
To work around this while the PR is still open, I modified my `make` rule:

```makefile
%.cpp : %.l
	${LEX} ${LFLAGS} -o$@ $<
	@perl -n -e 's/for\s*\(\s*i\s*=\s*0;\s*i\s*<\s*number_to_move;\s*\+\+i\s*\)/memmove(dest,source,number_to_move);/g;' \
	      -e 's/\s*\*\(dest\+\+\)\s*=\s*\*\(source\+\+\);//g;' \
	      -e 'print;' < $@ > $@.new
	@mv $@.new $@
```
If you can resolve the conflicts and address the concerns raised in the discussion, I could have another look.
@westes resolved
@jafl Could you please rebase instead? It makes no sense to have a merge commit with a parent that points to a years-old commit.
@jafl Could you please rebase instead?
In general, we prefer rebase to merge. And given the age of the initial PR, that will make more sense.
@westes Done.
@jafl The commit message should say memmove rather than memcpy. Apart from this, the commit looks okay.
@Explorer09 Good catch! Fixed.
Thanks; this is now on master.