tenfourfox icon indicating copy to clipboard operation
tenfourfox copied to clipboard

Implement and use vmx_strchr()

Open classilla opened this issue 8 years ago • 23 comments

Like it says. Not as dramatic as memchr but should still improve things.

classilla avatar Sep 29 '17 16:09 classilla

The network code caused JPEG artifacts on Macintosh Repository, probably due to bad HTTP chunking. Disabling it and flushing the cache removed the artifacts. I suspect the problem is the first vec_* can see the same character in adjacent strings and fails to see the null terminator. We probably need a second check that sees if there is both the search character and the null, and chooses a terminal bytewalker based on that.

classilla avatar Nov 16 '17 15:11 classilla

Did you receive my mail a couple weeks ago? Anyway, I should've commented here at first. Since assembly is completely innapropriate for plc I went ahead and incorporated the small improvements I made to plvmx.c using intrinsics. It mainly benefits stchr since it loops faster. I was able to add more intrinsics (among others, non-altivec ones) to memchr and haschr also. This slightly decreases setup/exit penalty when vmx is in use and 64-bit GPRs are used when available. If you're giving it a look, it's here: https://github.com/NapalmSauce/tenfourfox/blob/master/nsprpub/lib/libc/src/plvmx.c Please advise/ask if there's anything.

NapalmSauce avatar Nov 01 '18 01:11 NapalmSauce

Do you have a diff? If it seems reasonable, I'll incorporate it into FPR12 (I'm already typing this in what will be the FPR11 beta).

classilla avatar Nov 01 '18 05:11 classilla

Actually, I just finished reading it. Clever. It definitely needs a full beta bake time, but it's pretty ingenious, I must say. Can you turn it into a PR?

classilla avatar Nov 01 '18 05:11 classilla

Thanks a lot :) Do you mind if I delay the PR a little bit? There's one case where it's detrimental vs the current vmx_strchr. If the character c is the first element of the first vector load, it'll make strchr 4-5% slower than currently and I'm curious if there's a way I can avoid this that's near gratis. I'll PR soon. There's a couple more things than can be done for the loops, manually unroll them of course; since gcc doesn't unroll loops with intrinsics.

My handwritten assembly routines are still significantly faster than what gcc produces but.. I spent about two weeks finishing the whole thing in asm. That was for fun, at least.

NapalmSauce avatar Nov 01 '18 06:11 NapalmSauce

Sure, take your time. This won't get integrated for a few weeks until I'm into FPR12. What are you using to do your benchmarks?

classilla avatar Nov 01 '18 06:11 classilla

It will look silly, but currently I use the unix time; I made a C file that calls the routines 10M times so It can be represented in miliseconds (and in seconds for long strings). I try to only note the percentage it's faster; else that would make it a lot too subjective(though a percentage easily gets subjective too).

NapalmSauce avatar Nov 01 '18 06:11 NapalmSauce

To be more precise: the resulting executable accepts one command-line argument: the offset of the byte that it has to look for. Then it prints the address of the array(alignment), the address of the pointer returned by vmx_*chr, and lastly the pointer address returned by regular libc *chr to confirm it returned the expected result. After that it does the benchmark. I have a small makefile with it. I can provide the whole thing.

NapalmSauce avatar Nov 01 '18 06:11 NapalmSauce

Okay, I believe I'm done. I have a few more changes to commit. I removed vec_lvebx intrinsics and instead put a (const vector unsigned char){char} cast where appropriate because gcc seems to optimize it better. The only exception to this is haschr which seems to trigger a compiler bug (2x slower than memchr to complete with the latter)? It seems to happen with haschr the way it is in the tree right now, too. I'm uploading the makefile I mentioned earlier and the c sources mostly as a testcase. Can you confirm wether it also happens with gcc-4.8.2 or not when you have time? I can reproduce on 4.8.5, 5.5 and 6.4 right now.

The makefile will produce executables haschr, haschr1, memchr, memchr1, strchr and strchr1, where the -1 suffixed ones use the modified plvmx and others use in-tree plvmx.c So what I'm asking for is essentially that you compare results from calling time ./haschr 1000 and time ./memchr 1000. You can also use it to compare in-tree/modified functions as you like.

Otherwise I made my type usage more consistent.

vmxstring_test.zip

Thanks!

NapalmSauce avatar Nov 04 '18 03:11 NapalmSauce

I haven't forgotten about this, I've just been on the Talos II all week. How much faster is the assembly version, for my interest? If it's a big delta, we could convert it to __asm__. Kludgey, but would work.

classilla avatar Nov 08 '18 03:11 classilla

Well, obviously SIMD is SIMD, so there's no major speedup, but the difference was tempting enough that I relinked, silently dropping the assembly version plvmx.o in the objects and using it once I passed all JS tests. It cuts libnss by about 2KB. I think for what it does it's worth it, though I'm not sure what should be "want" and "not worthwhile". Once I've read a bit on how mozilla benchmarks their JS (python, again?) I'll comment benchmarks of the JS and of the individual function themselves run on the 3 archs to compare and clarify to what degree it is faster.

NapalmSauce avatar Nov 08 '18 18:11 NapalmSauce

Sorry for the delay. I made and attached a quick chart filled with time results from 10 millions calls for each function with 3 machines (one for each altivec-enabled arch) I had at hand to try to give a somewhat overall view. There's a couple things you'll notice. Every C source was compiled with gcc-4.8.5 from macports with the good -mcpu value for each subarch.

Just as a note, exceptionally for the G5 in assembly memchr and haschr I've added nops in strategical places to reduce the density of branch instructions. Otherwise the loop was bottleneck with dispatch groups and the speed improvement was tiny.

vmx_results.txt

NapalmSauce avatar Nov 13 '18 02:11 NapalmSauce

@classilla, I know you're busy right now. Given the blog post, I believe you're still interested, and FPRb12 is next on the road map. So I expect no more than a little "ok" before converting the thing to __asm__, and of course this doesn't commit you to take a PR at all if something's not right. I'll rework the readability at the same time, though it's not that ugly as it is ATM.

If it can help reduce hesitation on taking the patch, it's been part of my own build for over 1 month and I have yet to find any problem with it.

NapalmSauce avatar Dec 13 '18 04:12 NapalmSauce

Yes, I'm still interested, I just don't have a large number of cycles just now (between fixing Talos II issues with Firefox like https://bugzilla.mozilla.org/show_bug.cgi?id=1512162 and the holiday) but the idea is to get it into FPR12 when I get back on the G5. So ... "ok" ;)

classilla avatar Dec 13 '18 05:12 classilla

Okay. We're back to this now that I have most of the other stuff for FPR12 landed and I have a testing G5 build up.

classilla avatar Jan 10 '19 17:01 classilla

I'm looking at your strchr implementation. Most of it looks good, but I don't see an early return if a null is found before the character did as the older implementation had. On a short string this could end up doing more work than needed if the character isn't there, unless your testing demonstrates this is negligible.

classilla avatar Jan 10 '19 18:01 classilla

You're right, I can observe this. In this case the in-tree version becomes slightly faster than the C version with my modifications. Though similar happens with the asm version, it's already far ahead so it still completes faster.

I'm no longer convinced the changes brought to the C versions are worthwhile. It represents well the algorithm from the asm version, that said.

NapalmSauce avatar Jan 10 '19 19:01 NapalmSauce

On the topic of readability. I suspect adding comments on top of an asm function would make it even more a mess. So how about I add the original .S files for each function somewhere appropriate in the tree?

NapalmSauce avatar Jan 10 '19 23:01 NapalmSauce

Or, just attach them here (no PR, just a naked set of files or a zip) and I'll figure out where to place them. I'll need to teach NSPR about .S files though. I'm hoping to have the beta out early next week, so please send ASAP.

classilla avatar Jan 11 '19 01:01 classilla

Sorry if it's taking awhile, I was reviewing everything, and now I'm re-passing JS tests on the G5. As soon as it's done I'm uploading.

NapalmSauce avatar Jan 11 '19 04:01 NapalmSauce

PMG5Raph:jit-test raph$ ./jit_test.py --ion -f ../../../obj-ff-dbg/dist/bin/js
[9945|   0|   0|   0] 100% ==========================================>|1400.6s
PASSED ALL

PMG5Raph:tests raph$ ./jstests.py --jitflags=ion --run-slow-tests ../../../obj-ff-dbg/dist/bin/js -j2
[12966|    0|    0| 1098] 100% ======================================>|2827.9s
PASS

string functions.zip

NapalmSauce avatar Jan 11 '19 06:01 NapalmSauce

Very nice! Some really clever stuff in the assembly.

I'm patching NSPR for the build.

classilla avatar Jan 12 '19 02:01 classilla

Thanks! I put a good amount of thought in this.

I'm repeating myself, but if there's anything I can do to help, just let me know. I don't want this to end up being a burden.

NapalmSauce avatar Jan 12 '19 02:01 NapalmSauce