ImageSharp
Failing tests on .NET7
Prerequisites
- [X] I have written a descriptive issue title
- [X] I have verified that I am running the latest version of ImageSharp
- [X] I have verified if the problem exists in both DEBUG and RELEASE mode
- [X] I have searched open and closed issues to ensure it has not already been reported
ImageSharp version
Current main branch
Other ImageSharp packages and versions
none
Environment (Operating system, version and so on)
Windows 10
.NET Framework version
- 7.0.100-preview.4.22252.9
- 7.0.100-preview.6.22352.1
- 7.0.100-preview.7.22377.5
- 7.0.100-rc.1.22431.12
Description
The following tests fail on .NET 7.0, on Windows only. The failure was introduced in 7.0.100-preview.4.22252.9:
- Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: 20, y: 10)
- Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: -20, y: -10)
The issue seems to occur only in Release mode. It also seems somehow related to the pixel format Bgra32.
Environment Info
OS=Windows 10.0.20348
Intel Xeon Platinum 8272CL CPU 2.60GHz, 1 CPU, 2 logical and 2 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Error Message:
System.PlatformNotSupportedException : Operation is not supported on this platform.
Stack Trace:
at System.Runtime.Intrinsics.Vector256.Create(Single value)
at SixLabors.ImageSharp.Formats.Jpeg.Components.FastFloatingPointDCT.<IDCT8x8_Avx>g__IDCT8x8_1D_Avx|16_0(Block8x8F& block) in /_/src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.Intrinsic.cs:line 141
at SixLabors.ImageSharp.Formats.Jpeg.Components.Decoder.JpegComponentPostProcessor.CopyBlocksToColorBuffer(Int32 spectralStep) in /_/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegComponentPostProcessor.cs:line 70
Environment Info
OS=macOS Big Sur 11.6.7 (20G630) [Darwin 20.6.0]
Intel Xeon CPU E5-1650 v2 3.50GHz (Max: 3.34GHz), 1 CPU, 3 logical and 3 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Steps to Reproduce
Run the tests with .NET 7.0 in Release mode.
Images
No response
Do we have the output images? Are they badly off? If so that would likely indicate a JIT error which we can report upstream.
Visually, I can't tell the difference, but the report says 30% of the pixels are different.
Expected:
Actual:
I could not pinpoint what's going wrong, since it only happens in Release mode. To report an issue upstream, we need a smaller reproduction.
The difference is that the RGB components in the actual output for transparent areas are 255,255,255 vs 0,0,0
That it fails for skew and not rotate is confusing as the code differs only on the transform (though that might make it possible to identify a cause).
I cannot replicate this locally. 😔
I had a look at the latest preview release notes and it would take a smarter person than I to narrow down a potential cause without replication.
https://github.com/dotnet/core/issues/7378#issuecomment-1113865127
I can replicate it locally, but it does not happen every time (though very often).
Intermittent! Whoa that’s weird!
To add another weird observation to the list: When executing the test via visual studio they always pass. Only executing the tests via command line fails occasionally.
I think it's got something to do with the way we unpremultiply Vector4 instances in Numerics where the W value is 0.
We actually get one of two possible values from both the scalar and SIMD versions.
If the input XYZ values are 0 then you get <NaN, NaN, NaN, 0>
If the values are greater than 0 then you get <∞, ∞, ∞, 0>
Now what I think is happening is that when we are converting the values to bytes, on some chipsets NaN is somehow being converted into 255. What the exact cause is (and why it doesn't seem to affect other tests) I do not know yet.
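For illustration, the two W = 0 cases above follow from plain IEEE 754 float division. This is a minimal C sketch, not ImageSharp's code: the vec4 struct and the unpremultiply function are hypothetical stand-ins for Vector4 and the library's unpremultiply routine, and the sketch assumes a compiler with IEEE 754 float semantics.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for Vector4: premultiplied colour in x, y, z and alpha in w. */
typedef struct { float x, y, z, w; } vec4;

/* Naive unpremultiply: divide the colour channels by alpha and keep alpha as-is.
   Assumes IEEE 754 float semantics, so dividing by zero does not trap. */
static vec4 unpremultiply(vec4 v)
{
    vec4 r = { v.x / v.w, v.y / v.w, v.z / v.w, v.w };
    return r;
}

/* Checks the two W == 0 cases described above. */
static void check_unpremultiply_zero_alpha(void)
{
    vec4 zero_colour = unpremultiply((vec4){ 0.0f, 0.0f, 0.0f, 0.0f });
    vec4 some_colour = unpremultiply((vec4){ 0.5f, 0.25f, 1.0f, 0.0f });

    /* 0 / 0 is NaN in every colour channel. */
    assert(isnan(zero_colour.x) && isnan(zero_colour.y) && isnan(zero_colour.z));
    /* positive / 0 is +infinity in every colour channel. */
    assert(isinf(some_colour.x) && isinf(some_colour.y) && isinf(some_colour.z));
    /* Alpha is passed through unchanged. */
    assert(zero_colour.w == 0.0f && some_colour.w == 0.0f);
}
```

Calling check_unpremultiply_zero_alpha() from a main function exercises both cases. Any later clamp or byte conversion therefore has to cope with NaN and infinity as inputs.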
Maybe the handling of NaN/Infinity has changed in .NET 7.0? It's still weird that this does not always happen.
edit: the only NaN-related change I could find in .NET 7.0 seems to be the Equals method: https://docs.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/7.0/equals-nan
I can't see how this could affect the Skew Processor, though.
We need to add a bunch of diagnostic information to the output when something fails. I'm going to write some code in our tests to do this.
@brianpopow I managed to capture the environmental values for the Windows failure.
OS=Windows 10.0.20348
Intel Xeon Platinum 8272CL CPU 2.60GHz, 1 CPU, 2 logical and 2 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
The spec of this processor says it has AVX and AVX2, so this seems to be a different issue than the one reported in #2173.
I still believe it has something to do with how NaN is handled, but it baffles me that this is only happening occasionally.
Yep. Definitely different.
What I don't understand is why Bgra32 only? I couldn't find anything in our pipeline that would make conversion from Vector4 different. We just shuffle the values first.
The PlatformNotSupportedException issue on Avx1-only CPU was fixed in .NET 7.0 (should make it to RC1) via https://github.com/dotnet/runtime/pull/72522
@EgorBo thanks for the feedback, happy to see the Vector256.Create issue fixed!
I think there is another issue with the two Skew_IsNotBoundToSinglePixelType tests failing, maybe unrelated to AVX1 since it was reported on a machine with AVX1 and AVX2 available.
Saw this on Twitter and gave some replies there; pasting them here as well...
At first glance, I don't see anything obvious. The algorithm for division hasn't changed here in a long time.
Conversion of NaN to byte has been zero for a long time as well. Well, at least for x86/x64, since NaN to integral returns 0x8000_0000 then truncate to byte gives 0x00
On Arm64 I believe it saturates instead and so getting 0xFF would be feasible, but I don't expect you're targeting Arm64 in CI yet based on the above.
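The x86/x64 behaviour described above can be checked with the SSE intrinsic for cvttss2si, the instruction that also appears in the disassembly further down this thread. A small C sketch, assuming an x86 or x64 target with SSE; float_to_int32_x86 and float_to_byte_x86 are hypothetical helper names, not runtime or ImageSharp APIs.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Truncating float -> int32 conversion using cvttss2si. */
static int32_t float_to_int32_x86(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Keep only the low 8 bits, like storing the al register to a byte. */
static uint8_t float_to_byte_x86(float f)
{
    return (uint8_t)float_to_int32_x86(f);
}

static void check_nan_to_byte(void)
{
    /* volatile keeps the NaN away from compile-time folding, so the
       conversion is performed by the instruction itself. */
    volatile float nan_value = NAN;

    /* NaN converts to the "integer indefinite" value 0x80000000 ... */
    assert(float_to_int32_x86(nan_value) == INT32_MIN);
    /* ... whose low byte is 0x00. */
    assert(float_to_byte_x86(nan_value) == 0);
    /* Ordinary values are truncated toward zero. */
    assert(float_to_byte_x86(200.9f) == 200);
}
```

On this path a NaN that reaches the conversion becomes 0, not 255, so a 255 in the output suggests the value was already 255 before it was converted.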
Could I get pointers to the test and code that's causing issues and I can look at disassembly dumps and try to reproduce locally? I have a few different machines I can test on across Linux, MacOS, and Windows; just not intimately familiar with the best way to test just the item failing here.
Hi @tannergooding, thanks for trying to help here. I am out of ideas, and any suggestion on how to track down this issue is very much appreciated.
The failing test is Skew_IsNotBoundToSinglePixelType. You can execute the test via:
dotnet test -c Release -f net7.0 --filter FullyQualifiedName~Skew_IsNotBoundToSinglePixelType
Note: you need to add net7.0 to the test project ImageSharp.Tests.csproj first.
I can see this happening on my machine, but it's very rare. It occurs in maybe 1 out of 10 runs.
I am on Windows with dotnet 7.0.100-preview.4.22252.9, my CPU is a Core i7 6700K.
If you want, I can create a branch with a new sub test project containing just the failing test. I'm not sure if this would be more helpful than just executing the test via dotnet test, so let me know.
edit: also note: this only seems to happen in Release mode and also we are not targeting Arm64 in the CI yet.
Thanks @tannergooding for having a look at this!
@brianpopow The intermittent replication on your machine has me thoroughly stumped. I cannot understand why that would be the case at all.
Setting the environment variable SIXLABORS_TESTING_PREVIEW to true will also enable .NET 7 without having to adjust the csproj.
On Arm64 I believe it saturates instead and so getting 0xFF would be feasible, but I don't expect you're targeting Arm64 in CI yet based on the above.
That's a surprise! Perhaps we should be looking at a high performance strategy that allows us to avoid NaN then in our (un)premultiply strategy.
I have updated to 7.0.100-preview.7.22377.5 and it seems now to happen more frequently. I have run the tests in a loop of 20 iterations and the issue occurred in 19 / 20 cases.
Well that's interesting! And it only happens during release?
Yes, only in Release mode.
A bug was found and fixed for floating-point corruption on MacOS: https://github.com/dotnet/runtime/pull/75440
But this was reproing on Windows as well, so it's likely not the root cause.
@tannergooding Maybe actually. We're seeing the issue manifested during skew on Windows exclusively.
I have tried 7.0.100-rc.1.22431.12 and unfortunately still see this issue.
Ok, so I could reliably repro on RC1, but I cannot repro on RC2 (nightly build 7.0.100-rc.2.22464.26).
There were ~161 commits between these two: https://github.com/dotnet/runtime/compare/release/7.0-rc1...release/7.0-rc2
Of those commits, the most likely to impact this would have been:
- https://github.com/dotnet/runtime/pull/74880 - Fix use of uninitialized memory for Vector3 constants
- https://github.com/dotnet/runtime/pull/74980 - Ensure that the SSE fallback for Vector3.Dot masks off the unused element of op1 and op2
Both of these should only have impacted Vector3 and only CG2/R2R (Crossgen or Ready To Run) scenarios. I'd expect most of the hardware being run against was SSE4.1 or later, but it's possible that there is some specific edge case or dependence on some BCL method (which would be cg2/r2r) that was causing this instead.
It would be great if someone else could validate that it is fixed as well, and if so I can dig a little bit deeper to finalize the root cause.
-- Noting that the latest .NET 7 RC2 nightly build may also require the 6.0.10 SDK which I'm not sure where to get atm. I ended up changing the TargetFrameworks under SIXLABORS_TESTING_PREVIEW to only be net7.0 to work around this.
Hmmm, maybe I spoke too soon.
I'm still seeing the following in a clean build, but it won't repro after having been hit once:
[xUnit.net 00:00:05.20] Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: 20, y: 10) [FAIL]
[xUnit.net 00:00:05.20] Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: -20, y: -10) [FAIL]
Failed Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: 20, y: 10) [22 ms]
Error Message:
SixLabors.ImageSharp.Tests.TestUtilities.ImageComparison.ImageDifferenceIsOverThresholdException : Image difference is over threshold!
Test Environment OS : Windows
Test Environment is CI : False
Test Environment is .NET Core : True
Test Environment is Mono : False
Report ImageFrame {i}:
Total difference: 29.9761%
[δ(65535,65535,65535,0) @ (5,0)];
[δ(65535,65535,65535,0) @ (6,0)];
[δ(65535,65535,65535,0) @ (7,0)];
[δ(65535,65535,65535,0) @ (8,0)];
[δ(65535,65535,65535,0) @ (9,0)]...
Stack Trace:
at SixLabors.ImageSharp.Tests.TestUtilities.ImageComparison.ImageComparerExtensions.VerifySimilarity[TPixelA,TPixelB](ImageComparer comparer, Image`1 expected, Image`1 actual) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\ImageComparison\ImageComparer.cs:line 91
at SixLabors.ImageSharp.Tests.TestImageExtensions.CompareToReferenceOutput[TPixel](Image`1 image, ImageComparer comparer, ITestImageProvider provider, Object testOutputDetails, String extension, Boolean grayscale, Boolean appendPixelTypeToFileName, Boolean appendSourceFileOrDescription, IImageDecoder decoder) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\TestImageExtensions.cs:line 227
at SixLabors.ImageSharp.Tests.TestUtils.RunValidatingProcessorTest[TPixel](TestImageProvider`1 provider, Action`1 process, Object testOutputDetails, ImageComparer comparer, Boolean appendPixelTypeToFileName, Boolean appendSourceFileOrDescription) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\TestUtils.cs:line 239
at SixLabors.ImageSharp.Tests.Processing.Processors.Transforms.SkewTests.Skew_IsNotBoundToSinglePixelType[TPixel](TestImageProvider`1 provider, Single x, Single y) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\Processing\Processors\Transforms\SkewTests.cs:line 49
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Failed Skew_IsNotBoundToSinglePixelType<Bgra32>(provider: TestPattern100x50[Bgra32], x: -20, y: -10) [5 ms]
Error Message:
SixLabors.ImageSharp.Tests.TestUtilities.ImageComparison.ImageDifferenceIsOverThresholdException : Image difference is over threshold!
Test Environment OS : Windows
Test Environment is CI : False
Test Environment is .NET Core : True
Test Environment is Mono : False
Report ImageFrame {i}:
Total difference: 29.9761%
[δ(65535,65535,65535,0) @ (0,0)];
[δ(65535,65535,65535,0) @ (1,0)];
[δ(65535,65535,65535,0) @ (2,0)];
[δ(65535,65535,65535,0) @ (3,0)];
[δ(65535,65535,65535,0) @ (4,0)]...
Stack Trace:
at SixLabors.ImageSharp.Tests.TestUtilities.ImageComparison.ImageComparerExtensions.VerifySimilarity[TPixelA,TPixelB](ImageComparer comparer, Image`1 expected, Image`1 actual) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\ImageComparison\ImageComparer.cs:line 91
at SixLabors.ImageSharp.Tests.TestImageExtensions.CompareToReferenceOutput[TPixel](Image`1 image, ImageComparer comparer, ITestImageProvider provider, Object testOutputDetails, String extension, Boolean grayscale, Boolean appendPixelTypeToFileName, Boolean appendSourceFileOrDescription, IImageDecoder decoder) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\TestImageExtensions.cs:line 227
at SixLabors.ImageSharp.Tests.TestUtils.RunValidatingProcessorTest[TPixel](TestImageProvider`1 provider, Action`1 process, Object testOutputDetails, ImageComparer comparer, Boolean appendPixelTypeToFileName, Boolean appendSourceFileOrDescription) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\TestUtilities\TestUtils.cs:line 239
at SixLabors.ImageSharp.Tests.Processing.Processors.Transforms.SkewTests.Skew_IsNotBoundToSinglePixelType[TPixel](TestImageProvider`1 provider, Single x, Single y) in D:\Users\tagoo\source\repos\ImageSharp\tests\ImageSharp.Tests\Processing\Processors\Transforms\SkewTests.cs:line 49
at InvokeStub_SkewTests.Skew_IsNotBoundToSinglePixelType(Object, Object, IntPtr*)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
I've root caused the bug. There looks to be a bug with maxps and minps.
This is the disassembly for Bgra32.FromVector4 in .NET 6:
push rdi
push rsi
sub rsp,38h
vzeroupper
mov rdi,rcx
mov rsi,rdx
vmovupd xmm0,xmmword ptr [rsi]
vmovupd xmmword ptr [rsp+28h],xmm0
mov rcx,7FF8BAA93EA8h
mov edx,156h
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE (07FF919E3B470h)
mov rax,1A9DCFC5410h
mov rax,qword ptr [rax]
vmovupd xmm0,xmmword ptr [rsp+28h]
vmulps xmm0,xmm0,xmmword ptr [rax+8]
vmovupd xmmword ptr [rsi],xmm0
vmovupd xmm0,xmmword ptr [rsi]
mov rax,1A9DCFC5418h
mov rax,qword ptr [rax]
vaddps xmm0,xmm0,xmmword ptr [rax+8]
vmovupd xmmword ptr [rsi],xmm0
vmovupd xmm0,xmmword ptr [rsi]
vxorps xmm1,xmm1,xmm1
mov rax,1A9DCFC5410h
mov rax,qword ptr [rax]
vmovupd xmm2,xmmword ptr [rax+8]
vmaxps xmm0,xmm0,xmm1
vminps xmm0,xmm0,xmm2
vmovupd xmmword ptr [rsi],xmm0
vcvttss2si eax,dword ptr [rsi]
mov byte ptr [rdi+2],al
vcvttss2si eax,dword ptr [rsi+4]
mov byte ptr [rdi+1],al
vcvttss2si eax,dword ptr [rsi+8]
mov byte ptr [rdi],al
vcvttss2si eax,dword ptr [rsi+0Ch]
mov byte ptr [rdi+3],al
add rsp,38h
pop rsi
pop rdi
ret
This is the codegen for the same method in .NET 7:
push rdi
push rsi
sub rsp,38h
vzeroupper
mov rdi,rcx
mov rsi,rdx
vmovupd xmm0,xmmword ptr [rsi]
vmovupd xmmword ptr [rsp+28h],xmm0
mov rcx,7FF887B04688h
mov edx,155h
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE (07FF8E6C4C890h)
mov rax,22D8700EF80h
mov rax,qword ptr [rax]
add rax,8
vmovupd xmm0,xmmword ptr [rsp+28h]
vmulps xmm0,xmm0,xmmword ptr [rax]
vmovupd xmmword ptr [rsi],xmm0
vmovupd xmm0,xmmword ptr [rsi]
mov rdx,22D8700EF88h
mov rdx,qword ptr [rdx]
vaddps xmm0,xmm0,xmmword ptr [rdx+8]
vmovupd xmmword ptr [rsi],xmm0
vxorps xmm0,xmm0,xmm0
vmaxps xmm0,xmm0,xmmword ptr [rsi]
vminps xmm0,xmm0,xmmword ptr [rax]
vmovupd xmmword ptr [rsi],xmm0
vmovss xmm0,dword ptr [rsi]
vcvttss2si eax,xmm0
mov byte ptr [rdi+2],al
vmovss xmm0,dword ptr [rsi+4]
vcvttss2si eax,xmm0
mov byte ptr [rdi+1],al
vmovss xmm0,dword ptr [rsi+8]
vcvttss2si eax,xmm0
mov byte ptr [rdi],al
vmovss xmm0,dword ptr [rsi+0Ch]
vcvttss2si eax,xmm0
mov byte ptr [rdi+3],al
add rsp,38h
pop rsi
pop rdi
ret
You'll note that these are basically identical (you can ignore the GETSHARED_NONGCSTATIC_BASE difference) except .NET 6 does (simplified):
vmovupd xmm0, [vector4] ; read vector4 into xmm0
vxorps xmm1, xmm1, xmm1 ; zero xmm1
vmovupd xmm2, [maxBytes] ; read maxBytes into xmm2
vmaxps xmm0, xmm0, xmm1 ; vector4 = max(vector4, zero)
vminps xmm0, xmm0, xmm2 ; vector4 = min(vector4, maxBytes)
But .NET 7 is doing (simplified):
vxorps xmm0, xmm0, xmm0 ; zero xmm0
vmaxps xmm0, xmm0, [vector4] ; vector4 = max(zero, vector4)
vminps xmm0, xmm0, [maxBytes] ; vector4 = min(vector4, maxBytes)
This might not seem like much, but it has a big impact for NaN because maxps/minps return the right hand side if either operand is NaN. This means .NET 6 propagates up 0 while .NET 7 propagates up NaN.
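The effect of the operand order can be reproduced outside the JIT with the SSE intrinsics for maxps and minps. This is a C sketch assuming an x86 or x64 target with SSE; clamp_value_first and clamp_zero_first are hypothetical names that mirror the two simplified sequences above and are not ImageSharp or runtime functions. Because both instructions return their second operand when either input is NaN, the NaN that survives the swapped max is then replaced by maxBytes in the min. That would be consistent with the 255 reported for transparent pixels earlier in the thread, although this last step is an inference from the instruction semantics and is not stated by the participants.

```c
#include <assert.h>
#include <math.h>
#include <xmmintrin.h>

/* .NET 6 operand order: max(value, zero), then min(result, max_byte). */
static float clamp_value_first(float value, float max_byte)
{
    __m128 v = _mm_max_ps(_mm_set1_ps(value), _mm_setzero_ps());
    v = _mm_min_ps(v, _mm_set1_ps(max_byte));
    return _mm_cvtss_f32(v);
}

/* .NET 7 operand order: max(zero, value), then min(result, max_byte). */
static float clamp_zero_first(float value, float max_byte)
{
    __m128 v = _mm_max_ps(_mm_setzero_ps(), _mm_set1_ps(value));
    v = _mm_min_ps(v, _mm_set1_ps(max_byte));
    return _mm_cvtss_f32(v);
}

static void check_clamp_operand_order(void)
{
    /* volatile keeps the NaN away from compile-time folding. */
    volatile float nan_value = NAN;

    /* max(NaN, 0) returns the second operand, 0; min(0, 255) stays 0. */
    assert(clamp_value_first(nan_value, 255.0f) == 0.0f);
    /* max(0, NaN) returns the second operand, NaN; min(NaN, 255) returns 255. */
    assert(clamp_zero_first(nan_value, 255.0f) == 255.0f);
    /* For ordinary inputs the two orders agree. */
    assert(clamp_value_first(128.5f, 255.0f) == 128.5f);
    assert(clamp_zero_first(128.5f, 255.0f) == 128.5f);
}
```

The two functions differ only in the argument order of the first _mm_max_ps call, which is the same swap visible in the .NET 7 codegen.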
I believe this is non-deterministic because it somewhat depends on TieredCompilation and when the method becomes optimized. It more reliably reproduces if FromVector4 is marked "no-inlining" and both FromVector4/Pack are marked AggressiveOptimization.
Going to see if I can figure out why the JIT is deciding to swap operands here and will try to get a fix up. In the interim, the simple workaround here should be to change Pack(ref Vector4) to just Pack(Vector4). This should be "better" when the method is inlined (and it's being aggressively inlined) but also even when not inlined for non-Windows platforms.
I've root caused the bug. There looks to be a bug with maxps and minps.
@tannergooding: very happy to see that you found the root cause of this. Thanks a lot for working on this issue and providing a fix!
edit:
in the interim, the simple workaround here should be to change Pack(ref Vector4) to just Pack(Vector4)
I will make a PR for that.
closing this now with #2230 merged