OpenShadingLanguage icon indicating copy to clipboard operation
OpenShadingLanguage copied to clipboard

Add support for b4_SSE2 batched mode.

Open johnfea opened this issue 1 year ago • 20 comments

Description

Add support for b4_SSE2 batched mode, enabling batched execution for all x86-64 CPUs that don't support AVX.

Is there interest in adding this into OSL? I still support CPUs with vector width of 4 in my project that uses OSL and it would be nice to not to have to resort to scalar processing.

Tests

Quick tests were run to see that output with procedural and texture based materials looked ok and proper SSE2 batched code was being generated for wide/ functions. EDIT: Additionally, all tests pass now.

Checklist:

  • [x] I have read the contribution guidelines.
  • [x] I have updated the documentation, if applicable.
  • [x] I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • [x] My code follows the prevailing code style of this project. If I haven't already run clang-format v17 before submitting, I definitely will look at the CI test that runs clang-format and fix anything that it highlights as being nonconforming.

johnfea avatar Jun 07 '24 14:06 johnfea

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: johnfea (9e5c6749f0faed043a912485ad68fea890a3bea6, ae76938cfc0b0dc8f6a72f4b20851268c2653e36, c6d796abdb0c351258b5e566349d93815b1e0448, b8480ab25afadbfc85e42af3b350a91b0dffd1d5, e0197db665214a071c0578cb177314f1e0973f2a, d13c46730fc5eabf5d8dee0b4fe32f8093880f2e)
  • :white_check_mark: login: lgritz / name: Larry Gritz (a6d89f23e402d5435e96a09cb74c5ede82099875)

@johnfea thanks, I think the question (possibly unanswered) was: With many of the noise and other functions doing internal SIMD using x,y,z or r,g,b,a to take advantage of SSE would running a Batched<4> be an improvement? The Batched<4> would execute the entire shader logic and other features in SIMD, so could be a faster. If you have any feedback on that we would love to hear it.

Also please attempt to add a testing configuration that exercises batched<4> to .github/workflows/ci.yml You should see some batch-b8avx2 config you can copy.

AlexMWells avatar Jun 07 '24 16:06 AlexMWells

Based on a quick benchmark, batched SSE2 was 20% to 40% faster in total render speed with low complexity material when a single material filled the screen.

20% was with a single 3D fBm and 40% was with mix of fBm and two different procedural shaders.

It seems definitely an improvement.

There was at least two maybe related issues, however. A string from a shader didn't end up with correct ustringhash in output closure and its string representation was empty string. Also, backfacing() in a shader didn't work correctly anymore.

johnfea avatar Jun 07 '24 20:06 johnfea

@johnfea great to hear it is speeding things up. Curious how Batched<8> does with AVX on the same workloads vs Batched<4> with SSE.

As far as backfacing() not working, it should just pull a varying value out of the BatchShaderGlobals<4>::varing.backfacing. So make sure you are populating it with value values before executing the shadernetwork. Not sure this would affect this but take a look at: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/pull/1827

As far as the closures, once you have added a b4_SSE2 workflow to the .github/workflows/ci.yml you can craft a failing unit test that reproduces the issue, you can update this pull request and we could take a look at the failing CI.

AlexMWells avatar Jun 08 '24 00:06 AlexMWells

b4_SSE2 ran at 60% the speed of b8_AVX2 mode with single 3D fBm when measured in total render time, but for this measurement all parts of rendering were switched and not just OSL part. b4_SSE2 ran at 55% the speed of b8_AVX2 mode with mix of fBm and two different procedural shaders.

I updated the pull request with failing closure test unit, but I've no idea how to make the tests run so it probably won't run without some changes. Since testshade didn't seem to read output closures and testrender doesn't support batched, I added new testminimal. This issue doesn't happen with printf from a shader like done in some testshade tests.

It seems the backfacing() issue occurs with batched mode at any width but it doesn't affect output. When using backfacing() as input to a mix shader, non-batched mode doesn't output the cancelled shader in closures, but batched mode does with zero weights, which is missed optimization.

Also, the issue reproduced by this unit test is also generic to batched mode like #1801 and #1800 and not just with SSE2.

johnfea avatar Jun 08 '24 13:06 johnfea

I've updated the pull request, now "closure-string" test fails in batched mode only and for some reason also for non-batched icc/icx. This is the only test that fails for b8_AVX2 but for b4_SSE2 there's also "Invalid bitcast" <4x i1> to i8.

johnfea avatar Jun 17 '24 11:06 johnfea

I removed closure-string and testminimal from this PR and added those into a separate PR #1831.

Now for b4_SSE2 only "'Invalid bitcast' <4x i1> to i8" should fail in tests and b8_AVX2 should pass.

johnfea avatar Jun 17 '24 13:06 johnfea

All tests for this PR pass now.

I'm not sure why AVX512 needs cases for width 8 in src/liboslexec/llvm_util.cpp and why AVX needs cases for width 16. I tried removing them and EDIT: it made no difference for either b16 or b8 in tests. Anyway, I added cases in those places too for width 4. There's an alternative branch here with all of those cases removed. EDIT2: Apparently, b8 AVX512 is a path in liboslexec CMakeList.txt but b16 AVX is not. My take would be to not add any b4 paths/cases for AVX512 or AVX, but retain the existing non-typical width cases for AVX512 and AVX.

johnfea avatar Jun 25 '24 10:06 johnfea

Did you close on purpose?

lgritz avatar Jul 04 '24 16:07 lgritz

No, I was just trying to fix CLA check failing.

johnfea avatar Jul 04 '24 16:07 johnfea

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

@AlexMWells Do you have any reservations about it?

lgritz avatar Aug 15 '24 18:08 lgritz

For the Files Changed it all looks good, just adds a 4 wide path utilizing all the same approaches. Want to see the CI run it though and make sure it passes the massive pile of regressions.

Great Work!

AlexMWells avatar Aug 15 '24 21:08 AlexMWells

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

Yes, it can be merged now. One might want to change b4sse2 ci to be more different from b8avx2. Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though.

johnfea avatar Aug 16 '24 11:08 johnfea

@johnfea , can you elaborate or provide example of "old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though."

AlexMWells avatar Aug 16 '24 17:08 AlexMWells

Looking at CI action, I see VFX2021 gcc9/C++17 llvm11 py3.7 exr2.5 oiio2.3 sse2 batch-b4sse2

which successfully executed in (SSE2 batch width 4) 60 different *.regress.batched.opt tests comparing results against scalar results, and another 161 *.batched.opt tests comparing against reference txt or images.

@johnfea what isn't being covered such that "Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though"? Do we need more test suite entries to cover?

AlexMWells avatar Aug 16 '24 17:08 AlexMWells

I mean what I discussed above. Here's the code structure:

if (m_supports_avx512f) {
switch (m_vector_width) {
case 16:
break;
case 8;
// Removing this old case doesn't change test results
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else if (m_supports_avx) {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
// Removing this old case doesn't change test results
break;
case 4:
break;
}
}

Here's maybe all affected functions:

llvm::Value* LLVM_Util::mask_as_int(llvm::Value* mask)
llvm::Value* LLVM_Util::op_gather(llvm::Type* src_type, llvm::Value* src_ptr,llvm::Value* wide_index)
void LLVM_Util::op_scatter(llvm::Value* wide_val, llvm::Type* src_type, llvm::Value* src_ptr, llvm::Value* wide_index)

I haven't looked more into it. For me these are not important since I don't currently use e.g. 8-wide AVX512. I added the completely untested cases for 4 there to not change existing code layout. I wouldn't worry about it but just as a note.

johnfea avatar Aug 16 '24 20:08 johnfea

@johnfea , I got it, so CI doesn't execute all combinations of 4,8,16 and SSE, AVX, AVX2, AVX512 ISA's so portions of llvm_util.cpp maybe untested.

  1. I do think llvm_util.cpp should support all combinations.
  2. Due to turbo frequency differences when executing different ISA's at different widths, AVX512 8 wide is a useful/expected scenario. I could also imaging AVX/2 4 wide as well in some scenarios where top bin turbo frequencies are desired.
  3. It would be easy to add more CI plans, but probably need a balance of CI plan time vs. coverage.
  4. Users of batched should build OSL for their desired configurations and make sure the test suite passes (as the CI may not cover them all)
  5. As noted 4 wide batched scenarios for other than SSE are not in the CI plan and may contain bugs until fully tested.

AlexMWells avatar Aug 16 '24 21:08 AlexMWells

I think the test failures are because of a merge that just happened in OIIO master. I merged a fix into OSL main. If you rebase on top of the current main, I think that will all get cleaned up.

lgritz avatar Aug 17 '24 16:08 lgritz

Fixed now, sorry about that. One more merge/rebase on top of the (new) main ought to clear up the CI for you.

lgritz avatar Aug 18 '24 11:08 lgritz

Looks like all the CI tests are coming in passing now.

I'm inclined to merge this in its current state, and any further enhancements or tests can be a follow-up.

Express objections now; in their absense, I will merge this tomorrow (Monday).

lgritz avatar Aug 18 '24 17:08 lgritz

This one needs to be accepted before I can work on the others. Due to changes there's again trivial merge conflict in ci.yml.

johnfea avatar Sep 03 '24 07:09 johnfea

I'm going to merge this, and then I have a fix for the CI that I will push immediately following that.

lgritz avatar Sep 04 '24 00:09 lgritz

Next time, please submit PRs from a branch other than your own main, that can make things very difficult.

lgritz avatar Sep 04 '24 00:09 lgritz