SL train benchmarking

Open Vutshi opened this issue 1 year ago • 7 comments

I benchmarked the notorious sl train on MartyPC, which cycle-accurately emulates 8088 hardware.

Results are as follows. Last year's commit 2d489a19f71e8dd63beb669ba8ea33acf845e5a1:

ELKS 0.8.0-dev
login: root
# time sl
Real    1m57.100s

The current master 34ea442f0cacfa2cd1c75d1ce336d6af05d2a885 is a tiny bit faster:

ELKS 0.8.0
login: root
# time sl 
Real    1m56.770s

The above two builds were done with the following simplified config: bench_conf.txt.

The release image of ELKS 0.8 is noticeably slower:

ELKS 0.8.0
login: root
# time sl
Real    2m5.040s

The key difference is that my simple config builds take 1.66s for the train to appear on the screen, while the release image takes 10.66s.
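As a quick check of the arithmetic, the sketch below converts the time output above to seconds and compares the three runs. The helper name parse_time is only illustrative and is not part of ELKS. It puts the release image 8.27s behind the current master, which is slightly less than the 9.0s difference in start-up delay (10.66s vs 1.66s).

```python
import re

def parse_time(text):
    """Convert a time string such as '1m57.100s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

# "Real" times copied from the three runs reported above.
runs = {
    "old commit": "1m57.100s",
    "master": "1m56.770s",
    "release image": "2m5.040s",
}

baseline = parse_time(runs["master"])
for name, text in runs.items():
    seconds = parse_time(text)
    print(f"{name:14s} {seconds:8.3f}s  {seconds - baseline:+.3f}s vs master")
```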

I wonder why...

Vutshi avatar Sep 25 '24 12:09 Vutshi

It might be related to compressed executable.

tyama501 avatar Sep 25 '24 15:09 tyama501

It might be related to compressed executable.

Indeed, the option is apparently switched off in my config: # CONFIG_APPS_COMPRESS is not set

I’ll try to activate it.

Vutshi avatar Sep 25 '24 15:09 Vutshi

OK, compression explains it. Now I have:

ELKS 0.8.0
login: root
# time sl
Real    2m4.950s

The next question is whether the speed improvement meets @ghaerr's expectation?

Vutshi avatar Sep 25 '24 15:09 Vutshi

Well, I would say that the relatively minor enhancements to the kernel and libc ascii<->int conversion routines didn't amount to getting the train to the station any faster. The actual results indicate it's traveling a bit more slowly, not sure the reason for that. There is still the need to perform hardware DIV and MUL instructions, and DIV is very slow on the 8088. I had hoped for better, but there's a lot going on under the hood. I have some ideas for an upcoming system/application profiler, which would show where most of the time is being spent. This should help @Vutshi get on the tracks faster.

Perhaps a bigger concern is the default 360k floppy distribution using compressed executables - is that taking 9s longer to decompress before even starting?! That's a big tradeoff against getting ttyclock and a few other game type programs on the image (compression results in ~30% more space). Perhaps that decision should be reexamined?

Thanks for the testing @Vutshi!

ghaerr avatar Sep 26 '24 08:09 ghaerr

@ghaerr,

The actual results indicate it's traveling a bit more slowly, not sure the reason for that.

No, no. It is actually a little bit faster (0.33s) now as compared to one year ago. So the new DIV probably helps but the main bottleneck is just somewhere else.

Vutshi avatar Sep 26 '24 17:09 Vutshi

I have some ideas for an upcoming system/application profiler, which would show where most of the time is being spent.

That would be awesome! By the way, doesn’t Open Watcom Compiler have some kind of profiler?

Vutshi avatar Sep 26 '24 18:09 Vutshi

doesn’t Open Watcom Compiler have some kind of profiler?

Good idea. I just checked and yes they do. It only runs on DOS, QNX and OS/2 but it'll still be interesting to look into.

I've been doing some research, and it seems the better profilers use a mechanism saving the entire call stack vs the older alternative of just saving the CS:IP during sampling. It is said that analyzing the call stack allows the user to not only see "how much" but "why" a certain routine is being used.

We can implement a call-stack profiler in ELKS using the gcc -finstrument-functions-simple that is now implemented in our C debug library (which the new --ftrace also uses). Perhaps what I've been missing is to save the function trace call stack into a file rather than displaying it, then somehow write an analyzer to look at it all instead of a person. :)

Even if not automated, a function profiler with timing should be able to tell us more about why and where sl is sl(ow).
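To make the "save the trace to a file, then analyze it" idea concrete, here is a minimal sketch of such an analyzer. Everything in it is an assumption for illustration: the one-event-per-line format (E or X, function name, timestamp in microseconds), the function names in the sample, and the analyze helper are hypothetical and are not the output of the ELKS debug library. It sums inclusive time per call stack, so the same routine is reported separately for each path that reaches it.

```python
from collections import defaultdict

def analyze(lines):
    """Sum inclusive time per call stack from enter/exit events.

    Hypothetical format, one event per line:
      'E <func> <time_us>' on function entry
      'X <func> <time_us>' on function exit
    """
    stack = []                 # (func, entry_time) for each active call
    totals = defaultdict(int)  # call-stack tuple -> inclusive microseconds
    for line in lines:
        kind, func, stamp = line.split()
        stamp = int(stamp)
        if kind == "E":
            stack.append((func, stamp))
        elif kind == "X":
            if not stack or stack[-1][0] != func:
                raise ValueError(f"unbalanced trace at exit of {func}")
            _, entered = stack.pop()
            path = tuple(name for name, _ in stack) + (func,)
            totals[path] += stamp - entered
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return dict(totals)

# Made-up sample trace; the function names are placeholders.
sample = [
    "E main 0",
    "E draw 10",
    "X draw 40",
    "E draw 50",
    "X draw 90",
    "X main 100",
]

report = analyze(sample)
for path, micros in sorted(report.items(), key=lambda item: -item[1]):
    print(f"{micros:8d}us  {' > '.join(path)}")
```

Keying the totals by the whole stack rather than by the function alone is what lets a report answer "why" a routine was called as well as "how much" time it took.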

ghaerr avatar Sep 26 '24 20:09 ghaerr

Hi, Just for information.

Users on Twitter (X) tested sl on PC-98 machines with a Pentium III 533MHz, Celeron 433MHz, and Pentium II 300MHz. These are nearly the fastest, and the last, PC-98 machines, from the beginning of the 21st century.

The sl took about 9 seconds for all of these machines running right to left.

I think these machines are so fast that cycle(40000); in the sl doesn't cause much difference. (and remaining wait is maybe constant interrupt?)

tyama501 avatar Oct 26 '24 16:10 tyama501

I think these machines are so fast that cycle(40000); in the sl doesn't cause much difference.

sl has three modes of operation: sl with no arguments runs in normal mode, which is supposed to wait 40ms (40,000 us) between each "frame" of the train output using cycle(0) and cycle(40000). The normal mode is meant to display the train traveling at the same speed on all systems (unless the system is very slow, like @Vutshi's). If the elapsed time between frames is greater than 40ms, then no wait between frames will be performed.

sl -f is fast mode, where the train is drawn as fast as possible, with no waiting between frames. This mode is useful for testing to see how fast your system is when writing to the terminal as fast as possible.

The last mode sl -s is a "slow" mode, where characters might be drawn outside or on the edge of the terminal window, but never seen anyways. This is the original mode but was changed to speed up basic operation with the initial port to ELKS. We'll ignore that for this discussion.
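The normal-mode pacing described above can be modelled in a few lines. This is only a sketch of the timing rule as stated (wait out the remainder of 40ms, or not at all if drawing already took longer), not the actual sl source; the function names, the frame count of 100, and the per-frame draw times are hypothetical.

```python
FRAME_US = 40_000  # 40 ms between frames in normal mode, as described above

def wait_after_frame(draw_us, frame_us=FRAME_US):
    """Microseconds left to wait after a frame that took draw_us to draw."""
    return max(0, frame_us - draw_us)

def run_time_us(frames, draw_us, fast=False):
    """Estimated total run time when every frame takes draw_us to draw."""
    if fast:
        per_frame = draw_us  # -f: no waiting between frames
    else:
        per_frame = draw_us + wait_after_frame(draw_us)
    return frames * per_frame

# Hypothetical draw times: a very fast, a borderline and a very slow system.
for draw_us in (500, 40_000, 500_000):
    normal = run_time_us(100, draw_us)
    fast = run_time_us(100, draw_us, fast=True)
    print(f"draw={draw_us:>7}us  normal={normal / 1e6:6.2f}s  -f={fast / 1e6:6.2f}s")
```

In this model a fast system shows a large gap between normal mode and -f, while a system that needs more than 40ms per frame takes the same time in both.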

The sl took about 9 seconds for all of these machines running right to left.

It would be interesting for your X users to compare running sl vs sl -f to see if both run at 9 seconds. That should answer the question of whether the systems are drawing much faster than 40ms/frame.

(and remaining wait is maybe constant interrupt?)

What do you mean here?

ghaerr avatar Oct 26 '24 21:10 ghaerr

Hello @ghaerr ,

Oh, I see. I didn't know that cycle is the wait between frames, so I thought the remaining time was some kind of wait between frames.

I only use a 286 or V30, and these are very slow even without -f, so I misunderstood.

Thank you.

tyama501 avatar Oct 26 '24 22:10 tyama501

One of the users retested with -f and got:

0.10 sec for Celeron 433MHz, 0.32 sec for Pentium 133MHz.

Very fast!

tyama501 avatar Oct 27 '24 06:10 tyama501

Additionally,

0.09 sec for Pentium III 533MHz.

tyama501 avatar Oct 27 '24 10:10 tyama501