tenfourfox
Implement and use vmx_strchr()
Like it says. Not as dramatic as memchr but should still improve things.
The network code caused JPEG artifacts on Macintosh Repository, probably due to bad HTTP chunking. Disabling it and flushing the cache removed the artifacts. I suspect the problem is that the first vec_* compare can see the search character in an adjacent string and fail to see the null terminator. We probably need a second check for whether the chunk contains both the search character and the null, and choose a terminal bytewalker based on that.
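A scalar model of that dual-check idea, for illustration (all names here are hypothetical; the real routine would use two vec_cmpeq results per 16-byte vector load rather than a byte loop):

```c
#include <stddef.h>
#include <string.h>

/* Scalar sketch: scan 16-byte "chunks".  For each chunk, record whether
 * it contains the target character AND whether it contains the NUL; if
 * either is present, do a terminal byte walk to decide which comes
 * first, so a match past the terminator is never returned. */
static char *model_strchr(const char *s, int c)
{
    const char ch = (char)c;
    for (;; s += 16) {
        int has_c = 0, has_nul = 0, n = 16;
        /* "Vector" step: one pass over the chunk recording what's in it. */
        for (int i = 0; i < 16; i++) {
            if (s[i] == ch)
                has_c = 1;
            if (s[i] == '\0') { has_nul = 1; n = i + 1; break; }
        }
        if (!has_c && !has_nul)
            continue;                       /* nothing here, next chunk */
        /* Terminal byte walk: which comes first, ch or the NUL? */
        for (int i = 0; i < n; i++) {
            if (s[i] == ch)
                return (char *)(s + i);
            if (s[i] == '\0')
                return NULL;                /* terminator before ch */
        }
    }
}
```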
Did you receive my mail a couple weeks ago? Anyway, I should've commented here first. Since assembly is completely inappropriate for plc, I went ahead and incorporated the small improvements I made to plvmx.c using intrinsics. It mainly benefits strchr, since it loops faster. I was also able to add more intrinsics (among others, non-AltiVec ones) to memchr and haschr. This slightly decreases the setup/exit penalty when vmx is in use, and 64-bit GPRs are used when available.
If you're giving it a look, it's here: https://github.com/NapalmSauce/tenfourfox/blob/master/nsprpub/lib/libc/src/plvmx.c
Please advise/ask if there's anything.
Do you have a diff? If it seems reasonable, I'll incorporate it into FPR12 (I'm already typing this in what will be the FPR11 beta).
Actually, I just finished reading it. Clever. It definitely needs a full beta bake time, but it's pretty ingenious, I must say. Can you turn it into a PR?
Thanks a lot :) Do you mind if I delay the PR a little bit? There's one case where it's detrimental versus the current vmx_strchr: if the character c is the first element of the first vector load, it makes strchr 4-5% slower than it is currently, and I'm curious whether there's a near-free way to avoid that. I'll PR soon. There are a couple more things that can be done for the loops, chiefly unrolling them manually, since gcc doesn't unroll loops that contain intrinsics.
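The hand-unrolling being described might look something like this scalar sketch (names hypothetical; each chunk_has() call stands in for one 16-byte vector load-and-compare that gcc would refuse to unroll for us):

```c
#include <stddef.h>
#include <string.h>

/* Models one 16-byte vector load + compare for the byte c. */
static int chunk_has(const unsigned char *p, unsigned char c)
{
    for (int i = 0; i < 16; i++)
        if (p[i] == c)
            return 1;
    return 0;
}

/* Main loop hand-unrolled by 4 (64 bytes per iteration), so there is
 * one backward branch per 64 bytes scanned instead of per 16.  Returns
 * a pointer to the first chunk containing c, or NULL. */
static const unsigned char *find_chunk(const unsigned char *p, size_t len,
                                       unsigned char c)
{
    const unsigned char *end = p + len;
    while (p + 64 <= end) {
        if (chunk_has(p, c))      return p;
        if (chunk_has(p + 16, c)) return p + 16;
        if (chunk_has(p + 32, c)) return p + 32;
        if (chunk_has(p + 48, c)) return p + 48;
        p += 64;
    }
    while (p + 16 <= end) {       /* leftover whole chunks */
        if (chunk_has(p, c))
            return p;
        p += 16;
    }
    return NULL;                  /* tail bytes handled elsewhere */
}
```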
My handwritten assembly routines are still significantly faster than what gcc produces, but... I spent about two weeks finishing the whole thing in asm. It was fun, at least.
Sure, take your time. This won't get integrated for a few weeks until I'm into FPR12. What are you using to do your benchmarks?
It will look silly, but currently I use the Unix time command: I made a C file that calls the routines 10 million times, so the result can be represented in milliseconds (and in seconds for long strings). I try to only note the percentage by which one is faster; anything else would be far too subjective (though a percentage easily gets subjective too).
To be more precise: the resulting executable accepts one command-line argument, the offset of the byte it has to look for. It then prints the address of the array (to show alignment), the address of the pointer returned by vmx_*chr, and lastly the pointer returned by regular libc *chr, to confirm it returned the expected result. After that it runs the benchmark. I have a small makefile to go with it. I can provide the whole thing.
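A self-contained sketch of a harness along those lines (candidate_strchr is a hypothetical stand-in; the real tool links against plvmx.c, takes the offset from argv[1], and uses 10 million iterations):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the vmx_strchr under test, so this sketch
 * compiles on its own; the real harness links against plvmx.c. */
static char *candidate_strchr(const char *s, int c)
{
    return strchr(s, c);
}

/* One run: plant the byte at `off`, print the three addresses the tool
 * shows (array alignment, candidate result, libc result), then time
 * `iters` calls and return the elapsed milliseconds. */
static double bench_ms(char *buf, size_t len, size_t off, long iters)
{
    memset(buf, 'a', len - 1);
    buf[len - 1] = '\0';
    buf[off] = 'Z';

    printf("array:     %p\n", (void *)buf);
    printf("candidate: %p\n", (void *)candidate_strchr(buf, 'Z'));
    printf("libc:      %p\n", (void *)strchr(buf, 'Z'));

    uintptr_t sink = 0;            /* keeps the loop from being elided */
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        sink += (uintptr_t)candidate_strchr(buf, 'Z');
    double ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("(sink %lx)\n", (unsigned long)(sink & 0xff));
    return ms;
}
```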
Okay, I believe I'm done. I have a few more changes to commit. I removed the vec_lvebx intrinsics and instead put a (const vector unsigned char){char} cast where appropriate, because gcc seems to optimize it better. The only exception to this is haschr, which seems to trigger a compiler bug (it takes 2x longer than memchr to complete)? It seems to happen with haschr the way it is in the tree right now, too. I'm uploading the makefile I mentioned earlier and the C sources, mostly as a testcase. Can you confirm whether it also happens with gcc-4.8.2 when you have time? I can reproduce on 4.8.5, 5.5 and 6.4 right now.
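The compound-literal splat being described can be modeled portably with GNU C generic vectors (v16 and both function names here are hypothetical stand-ins for the AltiVec vector unsigned char idiom):

```c
#include <string.h>

/* GNU C generic vector as a portable stand-in for AltiVec's
 * `vector unsigned char`. */
typedef unsigned char v16 __attribute__((vector_size(16)));

/* The compound-literal splat: instead of loading the byte into a lane
 * (vec_lde-style) and splatting it, write the value into every element
 * and let gcc pick whatever materialization it prefers. */
static v16 splat_u8(unsigned char c)
{
    return (v16){c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c};
}

/* Lane-wise equality against the splat; any nonzero lane means the
 * byte is present somewhere in the 16-byte chunk. */
static int chunk_contains(const unsigned char *p, unsigned char c)
{
    v16 v;
    memcpy(&v, p, 16);              /* alignment-safe chunk load */
    v16 eq = (v16)(v == splat_u8(c));
    for (int i = 0; i < 16; i++)
        if (eq[i])
            return 1;
    return 0;
}
```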
The makefile will produce executables haschr, haschr1, memchr, memchr1, strchr and strchr1, where the -1-suffixed ones use the modified plvmx and the others use the in-tree plvmx.c. So what I'm asking for is essentially that you compare the results of time ./haschr 1000 and time ./memchr 1000. You can also use it to compare the in-tree/modified functions as you like.
Otherwise I made my type usage more consistent.
Thanks!
I haven't forgotten about this, I've just been on the Talos II all week.
How much faster is the assembly version, for my interest? If it's a big delta, we could convert it to __asm__. Kludgey, but would work.
Well, obviously SIMD is SIMD, so there's no major speedup, but the difference was tempting enough that I relinked, silently dropping the assembly plvmx.o into the objects, and used it once I had passed all the JS tests. It also cuts libnss by about 2KB. I think it's worth it for what it does, though I'm not sure where the line between "want" and "not worthwhile" sits. Once I've read a bit on how Mozilla benchmarks their JS (Python, again?), I'll post benchmarks of the JS and of the individual functions themselves, run on the 3 archs, to compare and clarify to what degree it is faster.
Sorry for the delay. I made and attached a quick chart filled with time results from 10 million calls to each function on 3 machines (one for each AltiVec-enabled arch) I had at hand, to try to give a somewhat overall view. There are a couple of things you'll notice. Every C source was compiled with gcc-4.8.5 from MacPorts with the correct -mcpu value for each subarch.
Just as a note: exceptionally, for the G5 assembly memchr and haschr, I've added nops in strategic places to reduce the density of branch instructions. Otherwise the loop was bottlenecked by dispatch groups and the speed improvement was tiny.
@classilla, I know you're busy right now. Given the blog post, I believe you're still interested, and the FPR12 beta is next on the road map. So all I'm expecting is a little "ok" before I convert the thing to __asm__, and of course this doesn't commit you to taking a PR at all if something's not right. I'll rework the readability at the same time, though it's not all that ugly as it stands ATM.
If it helps reduce any hesitation about taking the patch: it's been part of my own build for over a month and I have yet to find any problem with it.
Yes, I'm still interested, I just don't have a large number of cycles just now (between fixing Talos II issues with Firefox like https://bugzilla.mozilla.org/show_bug.cgi?id=1512162 and the holiday) but the idea is to get it into FPR12 when I get back on the G5. So ... "ok" ;)
Okay. We're back to this now that I have most of the other stuff for FPR12 landed and I have a testing G5 build up.
I'm looking at your strchr implementation. Most of it looks good, but I don't see an early return if a null is found before the character, as the older implementation had. On a short string this could end up doing more work than needed if the character isn't there, unless your testing demonstrates the cost is negligible.
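For reference, a scalar model of the early return being described (name hypothetical; the 16-byte chunking mirrors the vector loads, where the null check would be a vec_cmpeq against zero):

```c
#include <stddef.h>
#include <string.h>

/* Early-return shape: the moment the terminator is seen before the
 * character, give up instead of scanning any further chunks.  On a
 * short string with no match, this stops at the first NUL. */
static char *strchr_nullcheck(const char *s, int c)
{
    const char ch = (char)c;
    for (;; s += 16) {              /* one iteration per 16-byte chunk */
        for (int i = 0; i < 16; i++) {
            if (s[i] == ch)
                return (char *)(s + i);     /* found it */
            if (s[i] == '\0')
                return NULL;                /* early return: not present */
        }
    }
}
```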
You're right, I can observe this. In this case the in-tree version becomes slightly faster than the C version with my modifications. The same happens with the asm version, but it's already far enough ahead that it still completes faster.
I'm no longer convinced the changes brought to the C versions are worthwhile. That said, they do represent the algorithm from the asm version well.
On the topic of readability: I suspect adding comments on top of an asm function would make it even more of a mess. So how about I add the original .S files for each function somewhere appropriate in the tree?
Or, just attach them here (no PR, just a naked set of files or a zip) and I'll figure out where to place them. I'll need to teach NSPR about .S files though. I'm hoping to have the beta out early next week, so please send ASAP.
Sorry if it's taking a while; I was reviewing everything, and now I'm re-passing the JS tests on the G5. As soon as that's done I'll upload.
PMG5Raph:jit-test raph$ ./jit_test.py --ion -f ../../../obj-ff-dbg/dist/bin/js
[9945| 0| 0| 0] 100% ==========================================>|1400.6s
PASSED ALL
PMG5Raph:tests raph$ ./jstests.py --jitflags=ion --run-slow-tests ../../../obj-ff-dbg/dist/bin/js -j2
[12966| 0| 0| 1098] 100% ======================================>|2827.9s
PASS
Very nice! Some really clever stuff in the assembly.
I'm patching NSPR for the build.
Thanks! I put a good amount of thought in this.
I'm repeating myself, but if there's anything I can do to help, just let me know. I don't want this to end up being a burden.