Optimize execution
(This might benefit from first addressing #20?)
Currently, execution is gated to real-time by doing execution steps. Execute a number of T states that would occur in the 64 Hz window between NMIs being issued. Once the number of T states is met or exceeded, check to see if the NMI should be called. After NMI is called, wait 15 ms before calling the next execution step. This 15 ms is handled by SDL ticks and probably ends up taking longer than that due to timing.
Not really sure what can be done about this, once everything is more stable it might be worth running some tests on various ways to gate the execution more closely to real time.
@CaptainNic If you got some ideas on how to have an equivalent high accuracy sleep similar to clock_nanosleep() in Windows, I'm all ears.
https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps
https://docs.microsoft.com/en-us/windows/win32/multimedia/multimedia-timers?redirectedfrom=MSDN
TLDR: it's complicated lol
I can take a look this weekend probably.
If its not worth the trouble for Win then Its probably fine to sacrifice some accuracy there. Win support is an afterthought for me since the only thing I need it for is the msloader due to parport. With WiFiStation support, I don't even need that anymore.
Dumping some notes here:
System idling, built for release with gprof.
$ ps -o pid,comm,etime,time -p 22553
PID COMMAND ELAPSED TIME
22553 msemu 04:29 00:01:00
~0.223 load
This is down from ~1 before sleeping.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
28.07 1.10 1.10 z80ex_mread
17.86 1.80 0.70 ms_run
16.97 2.47 0.67 176221055 0.00 0.00 debug_testbp
7.27 2.75 0.29 61879057 0.00 0.00 debug_dasm
5.61 2.97 0.22 z80ex_intread
5.36 3.18 0.21 cf_read
4.08 3.34 0.16 z80ex_mwrite
4.08 3.50 0.16 42313393 0.00 0.00 io_read
2.68 3.61 0.11 17248 0.01 0.01 debug_isbreak
2.68 3.71 0.11 13431881 0.00 0.00 log_debug
1.15 3.76 0.05 11969585 0.00 0.00 ram_read
0.77 3.79 0.03 1 30.01 30.01 cf_deinit
0.64 3.81 0.03 debug_prompt
0.51 3.83 0.02 10106662 0.00 0.00 ram_write
0.51 3.85 0.02 17248 0.00 0.00 ui_update_lcd
0.51 3.87 0.02 1 20.01 20.01 io_deinit
0.38 3.89 0.02 1 15.01 15.01 ram_deinit
0.26 3.90 0.01 41280 0.00 0.00 lcd_write
0.26 3.91 0.01 z80ex_pread
0.26 3.92 0.01 z80ex_pwrite
0.13 3.92 0.01 mw
granularity: each sample hit covers 2 byte(s) for 0.26% of 3.92 seconds
index % time self children called name
<spontaneous>
[1] 45.6 1.10 0.69 z80ex_mread [1]
0.39 0.00 104194052/176221055 debug_testbp [3]
0.16 0.00 41001034/42313393 io_read [8]
0.09 0.00 11969585/13431881 log_debug [10]
0.05 0.00 11969585/11969585 ram_read [12]
Basically, the debug system is taking the majority of the time. But, mread is called extremely often so optimizations there should go a long way too.
changes in 831b512c627f18ab4adb696dfce8103587b4aaea
ps -o pid,comm,etime,time -p 23178
PID COMMAND ELAPSED TIME
23178 msemu 05:37 00:01:13
~0.216
Not too much improvement
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
28.26 1.50 1.50 z80ex_mread
18.09 2.46 0.96 218130000 0.00 0.00 debug_testbp
12.25 3.11 0.65 ms_run
11.59 3.73 0.62 76570927 0.00 0.00 debug_dasm
6.59 4.08 0.35 cf_read
5.84 4.39 0.31 z80ex_mwrite
5.09 4.66 0.27 z80ex_intread
4.14 4.88 0.22 16637633 0.00 0.00 log_debug
2.45 5.01 0.13 21512 0.01 0.01 debug_isbreak
1.41 5.08 0.08 14817656 0.00 0.00 ram_read
1.32 5.15 0.07 1 70.02 70.02 cf_deinit
1.04 5.21 0.06 12474182 0.00 0.00 ram_write
0.66 5.24 0.04 debug_prompt
0.38 5.26 0.02 1 20.01 20.01 ram_deinit
0.38 5.28 0.02 z80ex_pread
0.19 5.29 0.01 21512 0.00 0.00 ui_update_lcd
0.19 5.30 0.01 21511 0.00 0.00 ui_render
0.19 5.31 0.01 io_write
From commit 84bac68c15376e9bbe4e3be048c3a6fda2968e5e
$ ps -o pid,comm,etime,time -p 24000
PID COMMAND ELAPSED TIME
24000 msemu 04:24 00:00:41
Roughly ~0.155, or ~28% reduction of CPU time by removing debug calls.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
36.04 0.58 0.58 z80ex_mread
23.61 0.96 0.38 ms_run
16.47 1.23 0.27 cf_read
11.18 1.41 0.18 z80ex_intread
8.70 1.55 0.14 z80ex_mwrite
1.86 1.58 0.03 ram_read
1.24 1.60 0.02 z80ex_pread
0.62 1.61 0.01 ram_write
0.31 1.61 0.01 1 5.00 5.00 cf_deinit
0.00 1.61 0.00 1186896 0.00 0.00 io_read
mread still dominates, debug could probably use some tuning but doesn't have as much of an impact as I thought. Implementing the sleep optimizations is by far the biggest improvement. For reference, without sleep implemented:
$ ps -o pid,comm,etime,time -p 24550
PID COMMAND ELAPSED TIME
24550 msemu 04:23 00:03:55
Or ~0.893
Rebased against master and this somehow got, better? I also am on a new laptop so there is probably some efficiency gain there. I'll have to reproduce all of the testing steps.
As of e52de24d3b2081fe725ab0821189e31e1be98235
ps -o pid,comm,etime,time -p 110524
PID COMMAND ELAPSED TIME
110524 msemu 04:41 00:00:21
Or ~0.074