2ndmix slows down when text is on-screen
2ndmix now: https://www.youtube.com/watch?v=llaQJi2N68s
Properly working 2ndmix for comparison: https://www.youtube.com/watch?v=pdkXfyGqVgA
Confirmed through bisection that the Timer Cleanup / Precision Enhancement change (PR #345, commit 92ef58e) results in toolchains being built with different timer behavior, which breaks 2ndmix's performance.
eta: It might not be the toolchain after all, but just KOS itself.
Don't understand how this is possible, but I'll check it out. Thanks for the repro.
Updating with more information on this issue, because it's very weird.
- It seems to go away sometimes depending on optimization flags (-O2 doesn't work, but -O3 does?)
- It seems to go away sometimes depending on compiler version (GCC 13.2.0 doesn't work; if I remember correctly, 9 and 10 didn't work but 11 did?)
- It seems to exhibit the problem on Gentoo, OpenSUSE Tumbleweed, and Fedora (more bleeding edge rolling release distros) but not exhibit the problem on Debian Sid, Ubuntu, macOS, or FreeBSD
As with my original test on Gentoo, I find that on Fedora I also have the issue with commit 92ef58e (PR #345) and after, but not with commits before it.
Yo, while you guys are there, do there appear to be any other noticeable performance regressions on any of the other examples?
Can one of you run pvrmark_strips_direct to just kind of give me an idea of roughly what your peak PPS is with the TMU regression? I'm curious 1) if it's just barely bad enough to push 2ndmix, which was kind of borderline, over the 30fps edge, and 2) if it shows up with just a minimal rendering benchmark like that, which doesn't have as much going on as 2ndmix?
(I guess this has mostly been tracked down already, but I didn't send this earlier.) My guess has been that there is something off in the implementation of timer_getticks(), since that's the backend for all time requests now and it gets called during pvr stat operations. Adding a few notes about it that might be helpful (I haven't been able to replicate in my dev env yet):
https://github.com/KallistiOS/KallistiOS/blob/1162af57fab083584c7b6c0e553cbcb15723424a/kernel/arch/dreamcast/kernel/timer.c#L271
I think it may make sense to refactor this. Is there any real scenario where someone would pass in shift = 12? Or 23? Since there are only a few values that make sense for this, and the only use of the function is to adjust the output, I'd have it just return the raw value and have the calling functions adjust the output however makes best sense. This applies as well to the lookup tables being passed in. Since it's such a small set of useful combinations, there's likely a heavy cost being paid performance-wise for the theoretical functionality of providing an arbitrary lookup table or shift value.
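Roughly the shape I have in mind, as a sketch only -- every name and constant below is made up for illustration and is not the actual KOS code:

```c
#include <stdint.h>

/* Sketch only -- made-up names, not the real KOS API. The generic helper
 * just returns raw ticks, and each fixed-resolution wrapper applies its own
 * compile-time constant, so no lookup table or shift parameter ever needs
 * to be passed in. */

#define EXAMPLE_NS_PER_TICK 80ULL  /* hypothetical tick period in ns */

static uint64_t example_getticks_raw(void) {
    /* placeholder for the TMU read + overflow bookkeeping */
    return 0;
}

static inline uint64_t example_ns_gettime(void) {
    /* resolution-specific scaling folded into one constant multiply */
    return example_getticks_raw() * EXAMPLE_NS_PER_TICK;
}
```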
https://github.com/KallistiOS/KallistiOS/blob/1162af57fab083584c7b6c0e553cbcb15723424a/kernel/arch/dreamcast/kernel/timer.c#L283
For each of these there could be simple short-circuiting logic if either of the parameters is null (so, if I only want seconds or only ticks). To take full advantage, the base getticks would need to be aware of this, though (for only secs, just return timer_ms_counter without needing to check the regs; for only ticks, don't care about overflow).
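A sketch of what I mean by short-circuiting, again with made-up names rather than the real prototypes:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch with hypothetical names, not the real KOS prototypes. If the
 * caller only wants seconds, the volatile counter alone is enough; if it
 * only wants the sub-second part, the overflow-vs-seconds consistency
 * handling can be skipped. */

static volatile uint32_t example_ms_counter;  /* stand-in for timer_ms_counter */

static uint32_t example_read_tmu(void) {
    return 0;  /* placeholder for the hardware counter read + conversion */
}

static void example_gettime(uint32_t *secs, uint32_t *msecs) {
    if(secs && !msecs) {
        *secs = example_ms_counter;   /* no need to touch the TMU regs */
        return;
    }
    if(msecs && !secs) {
        *msecs = example_read_tmu();  /* overflow relative to secs doesn't matter */
        return;
    }
    /* both requested: the full consistent double-read path would go here (omitted) */
}
```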
https://github.com/KallistiOS/KallistiOS/blob/1162af57fab083584c7b6c0e553cbcb15723424a/kernel/arch/dreamcast/kernel/timer.c#L258-L265
This is really minor, but it's something that stuck out: even though they both only get updated when the two match, secs is updated from the first read and ticks from the second. Of more note, the update to secs could be done outside the loop instead, to match the setting of delta.
@QuzarDC what's the "heavy performance cost being paid"? You mean the 4-value LUT or the right-shift? This code is as optimized as it can be, I think.
As for moving down the initialization of "secs"... Your suggestion does not work. It cannot be set after the loop, because then you lose the guarantee that "timer_ms_counter" did not change between the two reads. It cannot even be set with "unf2", for the exact same reason.
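To make the guarantee concrete, here is a minimal sketch of the pattern with stand-in names (not the actual KOS symbols):

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real symbols. Both outputs have to be
 * captured between two identical reads of the volatile counter; hoisting
 * the secs assignment out of the loop would lose the guarantee that the
 * interrupt didn't fire mid-sample. */

static volatile uint32_t example_ms_counter;          /* bumped by the timer IRQ */
static uint32_t example_read_tmu(void) { return 0; }  /* placeholder HW read */

static void example_consistent_read(uint32_t *secs_out, uint32_t *ticks_out) {
    uint32_t secs, ticks;

    do {
        secs  = example_ms_counter;       /* first read of the volatile counter */
        ticks = example_read_tmu();       /* sample the hardware count register */
    } while(secs != example_ms_counter);  /* retry if the IRQ updated it */

    *secs_out  = secs;
    *ticks_out = ticks;
}
```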
> @QuzarDC what's the "heavy performance cost being paid"? You mean the 4-value LUT or the right-shift? This code is as optimized as it can be, I think.
The code is as optimized as it can be for the highest resolution scenario, and perhaps as a general-purpose system. If nobody is ever going to use anything but the few specific resolutions we provide support for, then there is further optimization to be done with the tradeoff being slightly less precision in the end result.
The root cause of this was found: performance when doing anything but NS-resolution lookups is worse than with the previous implementation, and timer_ms_gettime is being used throughout rendering to track stats. See #471
The simple solution is just to migrate to ns timing as the default, but the ns->ms conversion still has a performance gap.
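For illustration, a hedged sketch of where that gap most likely comes from -- the names below are hypothetical, but the 64-bit divide is the part I'd expect to hurt on SH-4:

```c
#include <stdint.h>

/* Hedged illustration with made-up names, not the KOS API: deriving
 * milliseconds from a nanosecond timestamp needs a 64-bit divide, which on
 * SH-4 is typically lowered to a libgcc software routine rather than a
 * single instruction. */

static uint64_t example_ns_gettime(void) {
    return 0;  /* placeholder for the ns-resolution read */
}

static uint32_t example_ms_from_ns(void) {
    return (uint32_t)(example_ns_gettime() / 1000000ULL);  /* 64-bit divide */
}
```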
> As for moving down the initialization of "secs"... Your suggestion does not work. It cannot be set after the loop, because then you lose the guarantee that "timer_ms_counter" did not change between the two reads. It cannot even be set with "unf2", for the exact same reason.
Yup. The naming got me since timer_ms_counter is effectively volatile, being updated by the interrupt, while timer_ms_countdown is effectively constant, only being set on enabling the timer system.
Could you test with timer_getticks set to __attribute__((always_inline))?
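Something along these lines is what I mean -- example_getticks is just a stand-in, the real prototype may differ:

```c
#include <stdint.h>

/* Just to show where the attribute goes; this is a hypothetical stand-in,
 * not the real timer_getticks(). The point is to force the helper to be
 * inlined even at optimization levels that wouldn't do it on their own. */
static inline uint64_t __attribute__((always_inline)) example_getticks(void) {
    return 0;  /* placeholder for the actual counter read */
}
```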
I can't reproduce with KOS master (ba408f3d) and gcc version 14.0.1 20240207.
> Could you test with timer_getticks set to __attribute__((always_inline))?
I'm also curious as to whether this fixes it or not, since it definitely disappears for those of us who repro'd with the higher optimization levels.
Is this still an issue, btw, after the changes got merged that increased the PVR statistics resolution from millisecond to nanosecond? I actually think that will make this slightly faster...
Unfortunately I won't have access to my main workstation machine that exhibited the issue until July probably, but I did reproduce it in a Fedora VM as well back then, so I'll try to install a Fedora VM soon and see if I can still get this to happen.
Can someone check if this is even still a thing now that we aren't HALVING OUR DATA CACHE by enabling OCRAM on accident? lmao.
@gyrovorbis I'm not getting this issue anymore since the OCRAM fix
> @gyrovorbis I'm not getting this issue anymore since the OCRAM fix
Rofl my waffles... Well... yeah, okay, the super unoptimized demo was definitely a little rough on the DC when we took away one of its testic--errrr, I mean, took away half of the data cache... Gonna finally close this nightmare of an issue. Reopen if it rears its ugly head again, but before long, I hope to have 3rdmix added and in a PR.
Nope, still happening. Pulled latest kos master and rebuilt all toolchain profiles last night.
Tested with stable 13.2.0 toolchain and the latest 14.2.0 toolchain.
Problem happens with -O2 or -O3 but disappears with -O3 -flto=auto. Binaries: 2ndmix-slowdown.zip
This is on a Gentoo Linux host; as mentioned above, last time I had different experiences depending on the host system (a Gentoo host and a Fedora VM showed it, but an Ubuntu VM did not...)
These are the rest of my KOS_CFLAGS:
-fomit-frame-pointer -fno-builtin -mno-lra -fstack-protector-all -m4-single-only -ml -ffunction-sections -fdata-sections -matomic-model=soft-imask -ftls-model=local-exec -D__DREAMCAST__ -I/home/darc/dc-code/0kostree/kos/include -I/home/darc/dc-code/0kostree/kos/kernel/arch/dreamcast/include -I/home/darc/dc-code/0kostree/kos/addons/include -I/home/darc/dc-code/0kostree/kos/../kos-ports/include -D_arch_dreamcast -D_arch_sub_pristine -Wall -g
The OCRAM thing is irrelevant because it didn't coincide with this bug's introduction -- an improvement in performance in one area does not preclude investigating a regression elsewhere. 3rdmix is irrelevant; if codebase A doesn't change but runs at 60fps one day and 30fps another day, that's a regression outside of that codebase, and introducing a new codebase B that runs at 9000fps doesn't address the regression exhibited in codebase A.
Maybe I could try to reproduce in a Fedora docker container or something.
> Could you test with timer_getticks set to __attribute__((always_inline))?
This did not fix it, at least under 14.2.0 with -O2.
Could you evaluate (using counters) the chances of two timer reads having different values? It might be that the algorithm loops way more than it should, just because the timers are way too fast.
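Something like this sketch could give rough numbers -- hypothetical names again; hook it into the existing read path and dump the counter however is convenient:

```c
#include <stdint.h>

/* Quick instrumentation sketch (hypothetical names) for measuring how often
 * the consistency loop actually retries, i.e. how often the two reads of
 * the counter disagree. */

static volatile uint32_t example_ms_counter;
static uint32_t example_read_tmu(void) { return 0; }

static uint32_t example_iterations;  /* dump periodically; more than one
                                        increment per call means a retry happened */

static uint32_t example_instrumented_read(uint32_t *secs_out) {
    uint32_t secs, ticks;

    do {
        example_iterations++;
        secs  = example_ms_counter;
        ticks = example_read_tmu();
    } while(secs != example_ms_counter);

    *secs_out = secs;
    return ticks;
}
```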