msemu Optimize execution

(This might benefit from first addressing #20?)

Currently, execution is gated to real-time by doing execution steps. Execute a number of T states that would occur in the 64 Hz window between NMIs being issued. Once the number of T states is met or exceeded, check to see if the NMI should be called. After NMI is called, wait 15 ms before calling the next execution step. This 15 ms is handled by SDL ticks and probably ends up taking longer than that due to timing.

Not really sure what can be done about this, once everything is more stable it might be worth running some tests on various ways to gate the execution more closely to real time.

Sep 22 '19 22:09 kbembedded

@CaptainNic If you got some ideas on how to have an equivalent high accuracy sleep similar to clock_nanosleep() in Windows, I'm all ears.

Sep 08 '22 06:09 kbembedded

https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps

https://docs.microsoft.com/en-us/windows/win32/multimedia/multimedia-timers?redirectedfrom=MSDN

TLDR: it's complicated lol

I can take a look this weekend probably.

Sep 08 '22 13:09 CaptainNic

If its not worth the trouble for Win then Its probably fine to sacrifice some accuracy there. Win support is an afterthought for me since the only thing I need it for is the msloader due to parport. With WiFiStation support, I don't even need that anymore.

Sep 08 '22 16:09 kbembedded

Dumping some notes here:

System idling, built for release with gprof.

$ ps -o pid,comm,etime,time -p 22553
    PID COMMAND             ELAPSED     TIME
  22553 msemu                 04:29 00:01:00

~0.223 load

This is down from ~1 before sleeping.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 28.07      1.10     1.10                             z80ex_mread
 17.86      1.80     0.70                             ms_run
 16.97      2.47     0.67 176221055     0.00     0.00  debug_testbp
  7.27      2.75     0.29 61879057     0.00     0.00  debug_dasm
  5.61      2.97     0.22                             z80ex_intread
  5.36      3.18     0.21                             cf_read
  4.08      3.34     0.16                             z80ex_mwrite
  4.08      3.50     0.16 42313393     0.00     0.00  io_read
  2.68      3.61     0.11    17248     0.01     0.01  debug_isbreak
  2.68      3.71     0.11 13431881     0.00     0.00  log_debug
  1.15      3.76     0.05 11969585     0.00     0.00  ram_read
  0.77      3.79     0.03        1    30.01    30.01  cf_deinit
  0.64      3.81     0.03                             debug_prompt
  0.51      3.83     0.02 10106662     0.00     0.00  ram_write
  0.51      3.85     0.02    17248     0.00     0.00  ui_update_lcd
  0.51      3.87     0.02        1    20.01    20.01  io_deinit
  0.38      3.89     0.02        1    15.01    15.01  ram_deinit
  0.26      3.90     0.01    41280     0.00     0.00  lcd_write
  0.26      3.91     0.01                             z80ex_pread
  0.26      3.92     0.01                             z80ex_pwrite
  0.13      3.92     0.01                             mw

granularity: each sample hit covers 2 byte(s) for 0.26% of 3.92 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     45.6    1.10    0.69                 z80ex_mread [1]
                0.39    0.00 104194052/176221055     debug_testbp [3]
                0.16    0.00 41001034/42313393     io_read [8]
                0.09    0.00 11969585/13431881     log_debug [10]
                0.05    0.00 11969585/11969585     ram_read [12]

Basically, the debug system is taking the majority of the time. But, mread is called extremely often so optimizations there should go a long way too.

Sep 10 '22 02:09 kbembedded

changes in 831b512c627f18ab4adb696dfce8103587b4aaea

ps -o pid,comm,etime,time -p 23178
    PID COMMAND             ELAPSED     TIME
  23178 msemu                 05:37 00:01:13

~0.216

Not too much improvement

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 28.26      1.50     1.50                             z80ex_mread
 18.09      2.46     0.96 218130000     0.00     0.00  debug_testbp
 12.25      3.11     0.65                             ms_run
 11.59      3.73     0.62 76570927     0.00     0.00  debug_dasm
  6.59      4.08     0.35                             cf_read
  5.84      4.39     0.31                             z80ex_mwrite
  5.09      4.66     0.27                             z80ex_intread
  4.14      4.88     0.22 16637633     0.00     0.00  log_debug
  2.45      5.01     0.13    21512     0.01     0.01  debug_isbreak
  1.41      5.08     0.08 14817656     0.00     0.00  ram_read
  1.32      5.15     0.07        1    70.02    70.02  cf_deinit
  1.04      5.21     0.06 12474182     0.00     0.00  ram_write
  0.66      5.24     0.04                             debug_prompt
  0.38      5.26     0.02        1    20.01    20.01  ram_deinit
  0.38      5.28     0.02                             z80ex_pread
  0.19      5.29     0.01    21512     0.00     0.00  ui_update_lcd
  0.19      5.30     0.01    21511     0.00     0.00  ui_render
  0.19      5.31     0.01                             io_write

Sep 10 '22 05:09 kbembedded

From commit 84bac68c15376e9bbe4e3be048c3a6fda2968e5e

$ ps -o pid,comm,etime,time -p 24000
    PID COMMAND             ELAPSED     TIME
  24000 msemu                 04:24 00:00:41

Roughly ~0.155, or ~28% reduction of CPU time by removing debug calls.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 36.04      0.58     0.58                             z80ex_mread
 23.61      0.96     0.38                             ms_run
 16.47      1.23     0.27                             cf_read
 11.18      1.41     0.18                             z80ex_intread
  8.70      1.55     0.14                             z80ex_mwrite
  1.86      1.58     0.03                             ram_read
  1.24      1.60     0.02                             z80ex_pread
  0.62      1.61     0.01                             ram_write
  0.31      1.61     0.01        1     5.00     5.00  cf_deinit
  0.00      1.61     0.00  1186896     0.00     0.00  io_read

mread still dominates, debug could probably use some tuning but doesn't have as much of an impact as I thought. Implementing the sleep optimizations is by far the biggest improvement. For reference, without sleep implemented:

$ ps -o pid,comm,etime,time -p 24550
    PID COMMAND             ELAPSED     TIME
  24550 msemu                 04:23 00:03:55

Or ~0.893

Sep 17 '22 07:09 kbembedded

Rebased against master and this somehow got, better? I also am on a new laptop so there is probably some efficiency gain there. I'll have to reproduce all of the testing steps.

As of e52de24d3b2081fe725ab0821189e31e1be98235

ps -o pid,comm,etime,time -p 110524
    PID COMMAND             ELAPSED     TIME
 110524 msemu                 04:41 00:00:21

Or ~0.074

Mar 22 '23 06:03 kbembedded