Celeste64 icon indicating copy to clipboard operation
Celeste64 copied to clipboard

Camera FieldOfView can be NaN

Open NoelFB opened this issue 1 year ago • 22 comments

Error Log (1/29/2024 11:15:25 PM)
Call Stack:
System.ArgumentOutOfRangeException: fieldOfView ('NaN') must be greater than '0'. (Parameter 'fieldOfView')
Actual value was NaN.
   at System.ArgumentOutOfRangeException.ThrowLessEqual[T](T, T, String)
   at Celeste64.Camera.get_Projection() in /home/noel/Projects/Celeste64/Source/Graphics/Camera.cs:line 128
   at Celeste64.World.Render(Target) in /home/noel/Projects/Celeste64/Source/Scenes/World.cs:line 633
   at Celeste64.Game.Render() in /home/noel/Projects/Celeste64/Source/Game.cs:line 288
   at Foster.Framework.App.Tick()
   at Foster.Framework.App.Run(String, Int32, Int32, Boolean )
   at Foster.Framework.App.Run[T](String, Int32, Int32, Boolean )
   at Celeste64.Program.Main(String[]) in /home/noel/Projects/Celeste64/Source/Program.cs:line 23
Game Output:
Celeste 64 v.1.0.1
Foster: v0.1.14
Platform: Microsoft Windows 10.0.19045 (X64)
Framework: .NET 8.0.1
SDL: v2.28.5
OpenGL: v3.3.13399 Core Profile Forward-Compatible Context 15.201.1151.1008, AMD Radeon HD 6450
FMOD Bindings: v20218
FMOD: v20207
Loaded Bank: C:\Users\*\Desktop\Celeste64-win-x64\Content\Audio\Master.strings.bank
Loaded Bank: C:\Users\*\Desktop\Celeste64-win-x64\Content\Audio\Master.bank
Loaded Bank: C:\Users\*\Desktop\Celeste64-win-x64\Content\Audio\music.bank
Loaded Bank: C:\Users\*\Desktop\Celeste64-win-x64\Content\Audio\sfx.bank
Loaded Assets in 1758ms
Strawb Count: 20
Loaded Map '1' in 268ms```

NoelFB avatar Jan 30 '24 08:01 NoelFB

The only place FieldOfView is ever assigned is here in Player.cs

float targetFOV = Calc.ClampedMap(velocity.XY().Length(), MaxSpeed * 1.2f, 120, 1, 1.2f);
World.Camera.FOVMultiplier = Calc.Approach(World.Camera.FOVMultiplier, targetFOV, Time.Delta / 4);

The only way I could see NaN being assigned is if velocity is also somehow maybe NaN? Weird.

NoelFB avatar Jan 30 '24 20:01 NoelFB

Gonna share some progress debugging this crash.

I managed to narrow down the NaN appearance in one crash instance to this block of code in StNormalUpdate: https://github.com/ExOK/Celeste64/blob/48eebdadc8814451308eef2b96180edf6ac2b91d/Source/Actors/Player.cs#L990-L1120

velXY is being set to NaN somewhere here.

The person says they're crashing whenever they try to move around, but jumping works fine - so does moving the camera. Dashing causes the game to freeze, both grounded and airborne. They use keyboard, and not analog (I even completely commented out the AddLeftJoystick and AddDPad calls - it still crashes.)

Feels like Move.Value is somehow NaN?

SnipUndercover avatar Jan 30 '24 23:01 SnipUndercover

Good news; Move.Value does not contain NaN. Building and running with the below patch does not log that Controls.Move.Value has NaN.

diff --git a/Source/Game.cs b/Source/Game.cs
index 89c0fe8..a5fb4f9 100644
--- a/Source/Game.cs
+++ b/Source/Game.cs
@@ -113,6 +113,9 @@ public class Game : Module

        public override void Update()
        {
+               if (VectorHelpers.HasNaN(Controls.Move.Value))
+                       Log.Error($"{nameof(Controls.Move.Value)} contains NaN! Things are about to go very wrong! ({Controls.Move.Value})");
+
                // update top scene
                if (scenes.TryPeek(out var scene))
                {
diff --git a/Source/Helpers/VectorHelpers.cs b/Source/Helpers/VectorHelpers.cs
new file mode 100644
index 0000000..173d09a
--- /dev/null
+++ b/Source/Helpers/VectorHelpers.cs
@@ -0,0 +1,9 @@
+namespace Celeste64;
+public static class VectorHelpers
+{
+    public static bool HasNaN(this in Vec2 vec2)
+        => float.IsNaN(vec2.X) || float.IsNaN(vec2.Y);
+
+    public static bool HasNaN(this in Vec3 vec3)
+        => float.IsNaN(vec3.X) || float.IsNaN(vec3.Y) || float.IsNaN(vec3.Z);
+}

SnipUndercover avatar Jan 31 '24 20:01 SnipUndercover

Thanks for the investigation! This should narrow it down a lot ...

NoelFB avatar Jan 31 '24 21:01 NoelFB

Found the source of the NaN; apparently it's RelativeMoveInput. (so I was almost there; I considered checking RelativeMoveInput directly but chose against it, as I thought that since Controls.Move.Input is fine, RelativeMoveInput would also be fine.)

I added assertions to every parameter when calculating velXY, logging if it contains NaN. input is NaN when grounded; RelativeMoveInput (and subsequently accel) is NaN when airborne.
I'd take it home from there, but it's currently late; figured I'd share the development. I plan to make a PR in case I get to the bottom of this issue.

SnipUndercover avatar Jan 31 '24 23:01 SnipUndercover

Somehow the NaN surfaces from normalizing the camera's XY components?? This doesn't make any sense..

diff --git a/Source/Actors/Player.cs b/Source/Actors/Player.cs
index d0482d6..03950ae 100644
--- a/Source/Actors/Player.cs
+++ b/Source/Actors/Player.cs
@@ -611,7 +611,15 @@ public class Player : Actor, IHaveModels, IHaveSprites, IRidePlatforms, ICastPoi
                        if (Vec2.Dot(input, Vec2.UnitY) >= .985f)
                                input = Vec2.UnitY;

-                       return forward * input.Y + side * input.X;
+                       Vec2 ret = forward * input.Y + side * input.X;
+                       if (ret.HasNaN())
+                       {
+                               ret.AssertNotNaN($"{nameof(RelativeMoveInput)} contains NaN!");
+                               Log.Error($"{nameof(World.Camera.Forward)} = {World.Camera.Forward}");
+                               Log.Error($"{nameof(forward)} = {forward} (normalized)");
+                               Log.Error($"{nameof(side)} = {side} (normalized)");
+                       }
+                       return ret;
                }
        }

Running this yields these logs:

RelativeMoveInput contains NaN! (<NaN, NaN>)
Forward = <-0.10875643, -0.887422, -0.44794446>
forward = <0.12164313, -Infinity> (normalized)
side = <Infinity, NaN> (normalized)

This is not a one-off, here's what happens when the camera is not moved:

RelativeMoveInput contains NaN! (<NaN, NaN>)
Forward = <-3.9004448E-08, 0.89231783, -0.45140773>
forward = <-4.3711385E-08, Infinity> (normalized)
side = <-Infinity, NaN> (normalized)

SnipUndercover avatar Feb 01 '24 21:02 SnipUndercover

Hmm we do use a custom Normalize method, but all it does is check for x/y being 0 and return 0 in that case. Maybe we need to use an epsilon?

    public static Vector2 Normalized(this Vector2 vector)
    {
        if (MathF.Abs(vector.X) <= float.Epsilon && MathF.Abs(vector.Y) <= float.Epsilon)
            return Vector2.Zero;

        return Vector2.Normalize(vector);
    }

I'm not sure what else would be the cause... but that doesn't entirely make sense because the input values aren't both near zero (only X in the example above).

NoelFB avatar Feb 01 '24 21:02 NoelFB

Getting infinite/NaN values is definitely expected for very small values (Vector2.Normalize(new(float.Epsilon, 0)) produces <∞, NaN> for instance), but the values Snip's test subject is getting this for are definitely not in a range where that should happen (and it doesn't if I test with those values).

Note that even when this is behaving normally, float.Epsilon is definitely too small as a guard to fully avoid this. In my testing this starts happening when the vector components are somewhere in the 1e-23 range (for reference, float.Epsilon is 1e-45).

Kalobi avatar Feb 01 '24 22:02 Kalobi

Could it be processor-dependent? The person is running on a relatively old CPU (AMD Phenom II X4 955), and Vector2 operations' assembly leverages AVX instructions; this sounds kind of crazy but it's the only thing floating on my mind.

It might make sense as I've only seen two people experience this crash in the Celeste server.

SnipUndercover avatar Feb 01 '24 23:02 SnipUndercover

Got bored and decided to make a visualization; Yellow is input, green is expected and red is actual. (red vector's angle with the -Y axis is not to scale for demonstration sake) image

SnipUndercover avatar Feb 01 '24 23:02 SnipUndercover

if we do our own Normal calculation does that work for them? what the hell

return vector / MathF.Sqrt(vector.X * vector.X + vector.Y * vector.Y)

NoelFB avatar Feb 01 '24 23:02 NoelFB

Will try tomorrow. :+1:

SnipUndercover avatar Feb 01 '24 23:02 SnipUndercover

For those who are following this issue:

It turns out my theory wasn't in fact so crazy. A .NET Runtime bug causes the JIT to emit wrong ASM instructions when calling Vector2.Normalize(Vector2) on CPUs which don't support AVX, making the normalization result incorrect. For example, on an AMD Phenom II X4 955, this causes the normalized vector's Y component to be +/- Infinity.

The solution here is to do a manual normalization.

I've forked Foster and made it fall back to manual normalization, if any component of the normalized vector is not finite.
I've also included some camera debug metrics in the top left if Celeste64 is built in debug mode.

I sent the built version to my tester (thank you Tanya!) and they confirmed it worked for them. I'll soon make a release on my fork including the fix.


EDIT: The release is out. I hope this'll help those affected until the .NET bug is fixed.

SnipUndercover avatar Feb 02 '24 22:02 SnipUndercover

Wow, thanks for looking into this and figuring it out!

I guess at this point it's worth submitting findings to .NET runtime? We could add a hack to use manual normalization if it fails but I don't really want to do that on our end if it's something that can be resolved upstream... Unless .NET team decide it's not worth fixing for some reason.

NoelFB avatar Feb 04 '24 00:02 NoelFB

@Popax21 said he'd make an issue soon, but not right now as he's currently pretty busy. I'd rather let him do the talking as I'm not exactly very qualified to be talking about this stuff.

SnipUndercover avatar Feb 04 '24 00:02 SnipUndercover

It would probably be a good idea to figure out the minimum .NET version this bug was introduced in while I am still out of action. I'll try to prepare a writeup for a potential .NET bug report once I have the time for it.

Popax21 avatar Feb 05 '24 00:02 Popax21

This might be related to issues seen on ARM64 Release Mode: https://github.com/ExOK/Celeste64/issues/70#issuecomment-1933066532

NoelFB avatar Feb 07 '24 22:02 NoelFB

The fork fell a bit out of date, so I pulled the changes and published the 1.1.1 release + a few extra commits; remembering to publish new platforms as well. (unfortunately I don't have any machines to test the releases on, so fingers crossed)

SnipUndercover avatar Feb 12 '24 19:02 SnipUndercover

So this is CPU dependent? huh interesting(i have a phenom 2😭)

movercell avatar Mar 01 '24 13:03 movercell

[...] For example, on an AMD Phenom X4 955, this causes the normalized vector's Y component to be +/- Infinity.

Whoops, I meant to say "AMD Phenom II X4 955", but just noticed I forgot to put the "II" in. Regardless, my Celeste64 fork should still make the game playable for you.

SnipUndercover avatar Mar 01 '24 18:03 SnipUndercover

After spending a few hours trying to chase down this bug in the JIT source code, I found out that it's already known & fixed: https://github.com/dotnet/runtime/issues/96939.

Not sure if the fix is in the latest .NET 8 runtime, but it would definitely be worth a try to see if updating fixes this.

Popax21 avatar Jun 26 '24 23:06 Popax21

Yeah I think at this point it's worth it for me to just update everything and rebuild and see where we're at. I'll do that soon.

NoelFB avatar Jun 27 '24 04:06 NoelFB