Porting Advice

Open fhendrikx opened this issue 8 months ago • 39 comments

Hi,

I'm porting ELKS to a new platform. This platform is similar to, but not the same as, the standard PC platform. I have modified EMU86 appropriately and this is working correctly with a "monitor" in ROM. The monitor is used to load the ELKS image and rootfs to the right segments (0xE000 and 0x8000 respectively).

The following questions are probably somewhat hard for you to answer, but I welcome any thoughts you may have about where the issues may lie or what to look out for:

  1. Under EMU86 it boots correctly with no errors. However, I cannot seem to get it to read any characters from my input. Adding debug, I can see that kbd_timer() in kbd-poll.c is not being called. Which probably explains why I don't get any input: conio_poll is never called.

  2. On the real hardware, it boots correctly, but seems unable to find init. It returns error code -8. See the attached screenshot.

Image

  3. On the real hardware, it seems to come to a halted state pretty quickly after hitting the login prompt. I thought this might be power saving in the idle loop, but the timer interrupt should keep that from being in that state too long.

Reading this back just now, it feels like something isn't right with the timer interrupt. That would explain the kbd_timer not being called, and the kernel entering a halted state (and staying there).

Thanks in advance, Ferry

fhendrikx avatar Apr 18 '25 01:04 fhendrikx

Hello @Ferry! Thanks for your interest in ELKS. It looks like you've already managed to come pretty far, cool! :)

I can see that kbd_timer() in kbd-poll.c is not being called

Without special configuration, ELKS requires the hardware INT 0 timer in order to multitask applications, as well as for the kernel to count down registered timeouts, which in this case ends up calling kbd_timer. I would bet that's the issue. Can you emulate INT 0? EMU86 by default doesn't, I think. My EMU86 fork may have that added, I can't remember. If not, then we'll need to hack a solution which involves configuring CONFIG_TIMER_0F, which essentially has the idle task in main.c execute an INT 0F, which maps to an INT 7 (normally the printer interrupt) that ends up being attached to the original INT 0 timer code...
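
To make that dependency concrete, here is a minimal sketch of how a periodic tick can drive a polled keyboard callback through a registered timeout. It is a simplified model, not the ELKS source: `struct sim_timer`, `sim_timer_tick()` and `sim_kbd_poll()` are hypothetical names invented for this illustration, and the real kernel structures will differ.

```c
#include <stddef.h>

/* Hypothetical one-shot timeout entry; not the ELKS timer structure. */
struct sim_timer {
    unsigned long expires;        /* tick count at which to fire */
    void (*fn)(void *);           /* callback, e.g. a keyboard poll routine */
    void *arg;
    int active;
};

unsigned long sim_jiffies;        /* incremented once per timer interrupt */
int sim_poll_calls;               /* how many times the poll callback has run */

/* Stand-in for a polled keyboard routine; it only counts invocations. */
void sim_kbd_poll(void *arg)
{
    (void)arg;
    sim_poll_calls++;
}

/* What a timer interrupt handler would do on every tick in this model. */
void sim_timer_tick(struct sim_timer *t)
{
    sim_jiffies++;
    if (t->active && sim_jiffies >= t->expires) {
        t->active = 0;            /* one-shot: a real callback would re-arm it */
        t->fn(t->arg);
    }
}
```

If the tick never arrives, `sim_timer_tick()` never runs and the poll routine is never called, which matches the symptom of kbd_timer() not being reached.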

On the real hardware, it boots correctly, but seems unable to find init. It returns error code -8.

So the hardware timer's working on real hardware, but /bin/init isn't on the ROM image. If you set CONFIG_SYS_NO_BININIT it will look for and exec /bin/sh instead, which should be on ROMFS. At least I think that's what's happening, errno -8 is ENOEXEC, "Exec format error", which means the executable header isn't proper. Look closely at the end of the build printfs to see what is being built into ROMFS to further debug this.

On the real hardware, it seems to come to a halted state pretty quickly after hitting the login prompt. I thought this might be power saving in the idle loop, but the timer interrupt should keep that from being in that state too long.

Yes, the idle task executes HLT to save power. If there's no timer interrupt or CONFIG_TIMER_0F isn't set, it'll just sit there and never resume. So it seems perhaps the hardware timer isn't working on your hardware. You'll need to port the PIT code in elks/arch/i86/kernel/timer-8254.c (or other) for your system.

ghaerr avatar Apr 18 '25 03:04 ghaerr

ELKS requires the hardware INT 0 timer in order to multitask applications, as well as the kernel counting down registered timeouts, which in this case ends up calling kbd_timer. I would bet that's the issue.

Yep, the hardware has a timer on IRQ 0 (INT 0x20; the first 32 INTs are reserved on this platform) triggered via a standard PIT (8254).

EMU86 by default doesn't, I think. My EMU86 fork may have that added, I can't remember.

I'll pull it down and compare it against the mfld version.

So the hardware timer's working on real hardware

That part is debatable at this moment.

but /bin/init isn't on the ROM image. If you set CONFIG_SYS_NO_BININIT it will look for and exec /bin/sh instead, which should be on ROMFS.

I'm using the same romfs image on both EMU86 and the hardware, just getting different results. Could be that EMU86 is a little more tolerant of certain things than the hardware.

At least I think that's what's happening, errno -8 is ENOEXEC, "Exec format error", which means the executable header isn't proper. Look closely at the end of the build printfs to see what is being built into ROMFS to further debug this.

Okay, will have a look, see what I can find.

the idle task executes HLT to save power. If there's no timer interrupt or CONFIG_TIMER_0F isn't set, it'll just sit there and never resume. So it seems perhaps the hardware timer isn't working on your hardware. You'll need to port the PIT code in elks/arch/i86/kernel/timer-8254.c (or other) for your system.

Yeah, that has been done. I've even checked the ports and values through EMU86... all good there. I will go back and modify the ROM monitor to check the interrupts under EMU86.

fhendrikx avatar Apr 18 '25 04:04 fhendrikx

Now that I think of it, the hardware timer issue may be that the PIC mask is not set to allow the interrupt. The kernel keeps certain masks the way they were set by the BIOS, and I'm thinking this may not have been set up on your system like it is on a PC.

ghaerr avatar Apr 18 '25 04:04 ghaerr

If this is the problem, you can have the kernel set the mask for you, I think the file is irq-8259.c or something like that.

ghaerr avatar Apr 18 '25 04:04 ghaerr

Now that I think of it, the hardware timer issue may be that the PIC mask is not set to allow the interrupt. The kernel keeps certain masks the way they were set by the BIOS, and I'm thinking this may not have been set up on your system like it is on a PC.

That is definitely something to check.. thank you.

I also had a look at the EMU86 in your repositories list, and it has additional code for other platforms (mostly 8018x), but nothing obviously different for the timer stuff.

Just added some debug to the timer_proc() in EMU86, and I can watch it count between 0 and 3000 for elks on both IBMPC and the new platform. I think this part may be working... but I like your mask idea. Will have a look at that now.

fhendrikx avatar Apr 18 '25 04:04 fhendrikx

The only place I can find any "masks" are in arch/i86/kernel/irq-8259.c... which uses them for enable_irq() and disable_irq()... that said, I'm not using that file.. I have a custom irq-solo86.c. The only thing interesting in there is the irq_vector which returns irq + 20h.

fhendrikx avatar Apr 18 '25 04:04 fhendrikx

Is it useful to send you a URL to the changes?

fhendrikx avatar Apr 18 '25 04:04 fhendrikx

Sure, I'll look at them tomorrow.

You can copy the masking code to your file. The point is to check the code to see whether our kernel actually unmasks IRQ 0 so that the CPU gets the interrupt.

Also, by default, our interrupt handler calls the original INT 0 vector every five hardware timer interrupts. So you'll want to make sure that that vector is initialized to point to an IRET. That is in irqtab.S.

ghaerr avatar Apr 18 '25 04:04 ghaerr

The code in irqtab.S doesn't seem to call the original INT 0 unless ARCH_IBMPC is defined. Might not be an issue after all?

My changes to support elks are here:

[ELKS changes for Solo](https://github.com/fhendrikx/elks/tree/ARCH_SOLO86)

Thanks in advance for any suggestions.

fhendrikx avatar Apr 18 '25 09:04 fhendrikx

I have a custom irq-solo86.c.

The point is to check the code to see whether our kernel actually unmasks IRQ 0 so that the CPU gets the interrupt.

The enable_irq() function in your "custom" IRQ handling file is empty! I'm kind of running blind here: how does your system enable/disable external interrupts? Normally enable_irq() and disable_irq() adjust the 8259 mask register for this purpose.
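
For reference, the mask adjustment on a standard 8259A boils down to clearing or setting one bit. The sketch below only computes the new mask value and leaves out the port I/O, so it runs anywhere; the function names are made up for this example and are not the ones in irq-8259.c.

```c
#include <stdint.h>

/* On an 8259A, a 1 bit in the interrupt mask register (IMR) blocks that IRQ
 * line. These helpers only compute the new mask value; reading and writing
 * the IMR itself (port 0x21 on a standard PC master PIC) is left out.
 * Valid for irq values 0..7. */
uint8_t pic_mask_enable(uint8_t imr, unsigned irq)
{
    return (uint8_t)(imr & ~(1u << irq));   /* clear bit: IRQ allowed through */
}

uint8_t pic_mask_disable(uint8_t imr, unsigned irq)
{
    return (uint8_t)(imr | (1u << irq));    /* set bit: IRQ blocked */
}

/* Returns 1 if the given IRQ line is currently masked, 0 otherwise. */
int pic_irq_is_masked(uint8_t imr, unsigned irq)
{
    return (int)((imr >> irq) & 1u);
}
```

With every line masked (0xFF), enabling IRQ 0 yields 0xFE. On hardware without a mask register this step simply has no equivalent.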

It would seem the problem you're having is that the hardware timer interrupt is not being received by the kernel.

The code in irqtab.S doesn't seem to call the original INT 0 unless ARCH_IBMPC is defined. Might not be an issue after all?

Not an issue, sorry for the confusion. I forgot that was for IBM PC only and wasn't looking at the code.

ghaerr avatar Apr 18 '25 15:04 ghaerr

The enable_irq() function in your "custom" IRQ handling file is empty! I'm kind of running blind here: how does your system enable/disable external interrupts? Normally enable_irq() and disable_irq() adjust the 8259 mask register for this purpose.

Yep, that is fine. External interrupts are controlled by the device initialisation, not a mask. For the timer, I simply need to enable the port (TIMER_ENBL_PORT) and it's good to go. This is being done in the timer-8254 code.

It would seem the problem you're having is that the hardware timer interrupt is not being received by the kernel.

Yeah, I agree. I must be missing something somewhere else... possibly the mapping from IRQ0 to INT 0x20?

Thanks for looking.

fhendrikx avatar Apr 18 '25 19:04 fhendrikx

Probably worth saying that this board is like the PC platform, but a lot of the "supporting" chips are replaced by a single CPLD that does all the hard work. Makes some things a lot simpler.

fhendrikx avatar Apr 18 '25 19:04 fhendrikx

possibly the mapping from IRQ0 to INT 0x20?

The IBM PC mapping from IRQ 0 is to INT 8. The interrupt vector table offset is specified in irq_vector, like you have it, so not sure what is going on. To quickly debug this, I sometimes will put a printk in the timer interrupt routine itself: timer.c::timer_tick(). That should show you whether you're ever getting an interrupt. It is also possible you're getting just one interrupt and something needs to be done to get the next one.
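
As a quick sanity check of the two mappings discussed here, the arithmetic can be sketched as follows. The constant and function names are hypothetical; only the base values (8 for the IBM PC, 0x20 for the new board) come from this thread. The 4-byte entry size is the standard real-mode interrupt vector table layout.

```c
/* Vector number = base + IRQ line. Each real-mode interrupt vector table
 * entry is 4 bytes (offset:segment), so the entry lives at vector * 4. */
#define PC_IRQ_BASE    0x08   /* IBM PC: IRQ 0 -> INT 8 */
#define SOLO_IRQ_BASE  0x20   /* as described in this thread: IRQ 0 -> INT 0x20 */

unsigned irq_to_vector(unsigned base, unsigned irq)
{
    return base + irq;
}

/* Byte offset of a vector's entry from the start of the vector table. */
unsigned vector_table_offset(unsigned vector)
{
    return vector * 4u;
}
```

So with a base of 0x20, the timer handler address would be expected at offset 0x80 in the table, which is one more thing that can be inspected from a ROM monitor.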

External interrupts are controlled by the device initialisation, not a mask.

Ok, got it.

I simply need to enable the port (TIMER_ENBL_PORT) and it's good to go. This is being done in the timer-8254 code.

Maybe double check that?

ghaerr avatar Apr 18 '25 20:04 ghaerr

As suspected, I'm not seeing the output of the printk in the timer_tick() function. Now I need to figure out if the issue lies with EMU86, or with the ELKS changes.

fhendrikx avatar Apr 18 '25 21:04 fhendrikx

Now I need to figure out if the issue lies with EMU86

I seem to remember something about counting instructions before simulating an INT 8 (IRQ 0) on EMU86... or it could have been another emulator. The point is that the HLT instruction in the idle main loop could be preventing the generation of the interrupt if that's the case. To test that, just comment out idle_halt() in the idle loop in main.c. This doesn't answer why things aren't working on real hardware though.

ghaerr avatar Apr 18 '25 21:04 ghaerr

Right, I tracked down the issue with EMU86... it's expecting an EOI (or something of that ilk). So I've added one, and it's now working fine. I've also tracked down the issue with the "corrupted" init process (off-by-one error when loading the ROMFS into RAM). EMU86 now seems happy.
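
For anyone hitting the same thing, a toy model of why a missing EOI stops further interrupts might look like this. It is not EMU86 or 8259 code; `struct toy_pic` and its functions are invented purely to illustrate the "in service" latch that an EOI clears.

```c
/* Toy interrupt controller with a single line. It only models the
 * "in service" latch: set on delivery, cleared by end-of-interrupt. */
struct toy_pic {
    int in_service;   /* 1 while an interrupt is awaiting its EOI */
    int delivered;    /* count of interrupts that reached the CPU */
};

/* Device raises the line; returns 1 if the interrupt is delivered. */
int toy_pic_raise(struct toy_pic *p)
{
    if (p->in_service)
        return 0;     /* previous interrupt not acknowledged: hold it back */
    p->in_service = 1;
    p->delivered++;
    return 1;
}

/* Handler signals end-of-interrupt so the next one can be delivered. */
void toy_pic_eoi(struct toy_pic *p)
{
    p->in_service = 0;
}
```

In this model the first tick gets through and every later one is held back until `toy_pic_eoi()` is called, which is the "just one interrupt" pattern mentioned earlier in the thread.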

Finally, I've also commented out the HLT in the idle loop, and all is looking much happier on the actual hardware. This is a screenshot from the actual hardware.

Image

That said, I press enter, and it seems to hang... then I press CTRL-C, and the prompt comes back. Then I type something, and it hangs... press CTRL-C and the prompt returns. Backspace seems to work nicely... I'm thinking it may be \n vs \r stuff... any thoughts?

fhendrikx avatar Apr 20 '25 03:04 fhendrikx

I press enter, and it seems to hang... then I press CTRL-C, and the prompt comes back

Now that I think of it, /bin/sh does send some ANSI strings and other stuff at every shell prompt, and it's possible your terminal emulator isn't handling it. Can you configure the system to run /bin/sash instead of /bin/sh? I'm not sure how you're building the ROMFS image, but usually CONFIG_APP_TAGS=:defash ... rather than :ash ... will do the trick. That installs sash as /bin/sh, which doesn't do anything funky with the terminal.

I've also commented out the HLT in the idle loop

Why was this required for real hardware, do you know?

ghaerr avatar Apr 20 '25 04:04 ghaerr

Thanks for your help in this... it's been very useful. 👍

I have CONFIG_APP_TAGS=":defash|:128k|:192k|", but when booting it says "/bin/init: Exec failed: /bin/sh (errno 2)"?.. what am I missing here?

Why was this required for real hardware, do you know?

I'm not sure... we tested the logic again and can confirm that in the halt state, an interrupt, as it should, takes it out of that state. We tested interrupts with the timer and keyboard, and it works correctly. More investigation is required. Maybe I'm just wrong on that, and it was something else I changed. Tomorrow will tell.

fhendrikx avatar Apr 20 '25 08:04 fhendrikx

but when booting it says "/bin/init: Exec failed: /bin/sh (errno 2)"?.. what am I missing here?

Errno 2 is ENOENT (see include/linuxmt/errno.h) which means it didn't find /bin/sh. Perhaps the :defash isn't working, you'll have to check the output of the mkromfs program at the end of the build as previously discussed to see for sure.
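
The two codes seen so far in this thread (-8 and 2) can be decoded with a small helper like the one below. It is a sketch that uses the host C library's `<errno.h>` and assumes the numbering there matches include/linuxmt/errno.h for these two values; `errno_name()` is a hypothetical helper, not an ELKS function.

```c
#include <errno.h>

/* Map the errno values seen in this thread to their symbolic names.
 * Accepts either sign, since kernel code conventionally returns -errno. */
const char *errno_name(int err)
{
    if (err < 0)
        err = -err;
    switch (err) {
    case ENOENT:  return "ENOENT";   /* no such file or directory */
    case ENOEXEC: return "ENOEXEC";  /* exec format error */
    default:      return "unknown";
    }
}
```

The distinction matters for debugging: ENOEXEC means a file was found but its header was rejected, while ENOENT means the path was not there at all.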

You can also look at init/main.c::do_init_task() to see the steps the kernel is taking in this last stage of boot up. There is also a CONFIG_SYS_NO_BININIT that changes operation, check that and your .config.

The whole business of building an application list from elkscmd/Applications is a bit tricky. See elkscmd/Makefile.install. Make sure you don't have CONFIG_APP_ASH set, as that'll bring in /bin/sh as ash, but it seems you don't have any /bin/sh at all.

Finally, you can punt and ls -l elks/target to see the filesystem tree that is about to be created using image/Make.image::romfs where mkromfs is used to generate the ROMFS. This could be executed manually by copying files to elks/target then executing cd image; make romfs.

I hope that helps.

ghaerr avatar Apr 21 '25 03:04 ghaerr

@fhendrikx What is this hardware platform you are porting to?

toncho11 avatar Apr 21 '25 10:04 toncho11

Errno 2 is ENOENT (see include/linuxmt/errno.h) which means it didn't find /bin/sh. Perhaps the :defash isn't working, you'll have to check the output of the mkromfs program at the end of the build as previously discussed to see for sure.

It wasn't working, but it's actually "defsash". That command works, in that there is a shell, but still not the right one. I will need to dig into that script.

Finally, you can punt and ls -l elks/target to see the filesystem tree that is about to be created using image/Make.image::romfs where mkromfs is used to generate the ROMFS. This could be executed manually by copying files to elks/target then executing cd image; make romfs.

I hope that helps.

It surely does. Thanks again.

It's coming together nicely now. Just need to sort the character translation internally, and we should be good.

fhendrikx avatar Apr 21 '25 20:04 fhendrikx

@fhendrikx What is this hardware platform you are porting to?

We're porting to Solo/86, a newly developed system built around a 286. It is intended to be simple and easy to put together and exists primarily for hobbyists. We didn't want to design yet another PC compatible, so we built something clean and simple without all of the legacy. Some things are done in the tried-and-true ways, and others are wholly new.

This effort with ELKS is to provide us with an OS and prove that everything in the hardware works correctly, before opening up the github repositories (with the designs, etc) publicly.

fhendrikx avatar Apr 21 '25 20:04 fhendrikx

Just a quick update... everything is now working well with the 128K ROM.

Thanks to @ghaerr for all your help. It is much appreciated.

I will move onto the storage hardware next.

fhendrikx avatar Apr 22 '25 01:04 fhendrikx

It's coming together nicely now.

What were the final changes you had to do to make everything work? Did more code have to be written, or was it mostly getting the configuration file(s) organized to work for your system? I'm interested in what you had to do to get things working.

This effort with ELKS is to provide us with an OS and prove that everything in the hardware works correctly,

Nice!! Thank you for choosing ELKS to run on your new hardware! :)

ghaerr avatar Apr 22 '25 03:04 ghaerr

What were the final changes you had to do to make everything work?

I think I fixed the issue I was seeing with the timer some days ago, but it didn't make it in immediately. Probably because the code was right, just the definition kept in a .h file somewhere was wrong.

Did more code have to be written, or was it mostly getting the configuration file(s) organized to work for your system? I'm interested in what you had to do to get things working.

We needed to do some more work on our UART. This handles both a physical screen/keyboard interface and a telnet interface. That meant some changes to the way telnet does the initial connection to our hardware, and then ensuring we handled 0A (LF) and 0D (CR) correctly from both interfaces.
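
As an illustration of the kind of 0A/0D handling involved, here is one possible input translation. This is a sketch under assumptions, not the Solo/86 code: `translate_input()` is a hypothetical name, the CR-to-newline step is similar in spirit to the POSIX ICRNL input flag, and collapsing a CR LF pair into a single newline is an extra choice made for this example.

```c
/* Translate a received byte so that CR (0x0D), LF (0x0A) and a CR LF pair
 * each arrive as a single newline. `last_was_cr` carries state between calls.
 * Returns the translated byte, or -1 if the byte should be dropped. */
int translate_input(unsigned char c, int *last_was_cr)
{
    if (c == 0x0D) {
        *last_was_cr = 1;
        return '\n';          /* report CR as newline immediately */
    }
    if (c == 0x0A && *last_was_cr) {
        *last_was_cr = 0;
        return -1;            /* LF directly after CR: already reported */
    }
    *last_was_cr = 0;
    return c;
}
```

A shell that waits for '\n' would appear to hang on Enter if only a bare 0D ever reached it, which is consistent with the behaviour described earlier.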

There are some remaining questions around interrupt handling that we are discussing internally, but I don't expect it will make much difference at this point.

Finally, I don't think I've bottomed out on the elkscmd stuff yet. Thankfully after all the changes above we were quite fine running the default ash shell. I'll look into this more as we build and debug the storage stuff next.

Nice!! Thank you for choosing ELKS to run on your new hardware! :)

Well, I should be thanking you for your work on ELKS. You made our choice really easy! :)

fhendrikx avatar Apr 22 '25 10:04 fhendrikx

@fhendrikx Let us know when you release Solo86 :) It is already interesting!

toncho11 avatar Apr 25 '25 17:04 toncho11

@fhendrikx Let us know when you release Solo86 :) It is already interesting!

Of course, I'm looking forward to it :)

fhendrikx avatar Apr 27 '25 05:04 fhendrikx

@ghaerr

I'm keen to build a CF driver for Solo. As you know, the CF stuff is just standard ATA really. Any recommendations in terms of integration with existing code? I see the directhd driver is ATA too... but it's marked as "not working".
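
To show what "standard ATA" means in practice, here is a sketch of the register values for a single-sector, 28-bit LBA transfer. The struct and function names are hypothetical, and only the values are computed; how they get written to the card's registers depends on the board.

```c
#include <stdint.h>

#define ATA_CMD_READ_SECTORS   0x20
#define ATA_CMD_WRITE_SECTORS  0x30

/* Values for the ATA task-file registers of a 28-bit LBA transfer. */
struct ata_taskfile {
    uint8_t sector_count;
    uint8_t lba_low;      /* LBA bits 7..0   */
    uint8_t lba_mid;      /* LBA bits 15..8  */
    uint8_t lba_high;     /* LBA bits 23..16 */
    uint8_t device;       /* 0xE0 | LBA bits 27..24 (LBA mode, device 0) */
    uint8_t command;
};

/* Build the register values for a one-sector transfer at the given LBA. */
struct ata_taskfile ata_lba28_taskfile(uint32_t lba, uint8_t command)
{
    struct ata_taskfile tf;

    tf.sector_count = 1;
    tf.lba_low  = (uint8_t)(lba & 0xFF);
    tf.lba_mid  = (uint8_t)((lba >> 8) & 0xFF);
    tf.lba_high = (uint8_t)((lba >> 16) & 0xFF);
    tf.device   = (uint8_t)(0xE0 | ((lba >> 24) & 0x0F));
    tf.command  = command;
    return tf;
}
```

A real driver would write these values to the device, then poll the status register and transfer 512 bytes through the data register.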

Any thoughts welcome.

fhendrikx avatar May 09 '25 02:05 fhendrikx

Hi @fhendrikx,

Yes, I recommend you start from scratch using ssd.c as a base framework for the driver. It was written recently and has none of the huge cruft that the former directhd.c driver has within it.

The ssd.c (/dev/ssd) base framework has most of what you need to start, and you'll only have to write two routines - ssddev_read and ssddev_write. The framework also allows for async (IODELAY) completion, which I don't think you'll want or need - just read and write immediately in the calling kernel thread.
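
A rough idea of what a pair of synchronous read/write routines could look like is sketched below, backed by an in-memory array so it can be tested off-target. The names `sim_dev_read()` and `sim_dev_write()` and their signatures are assumptions for illustration only; the actual ssddev_read and ssddev_write prototypes are not shown in this thread and may well differ.

```c
#include <string.h>

#define SIM_SECTOR_SIZE  512
#define SIM_SECTORS      8          /* tiny in-memory stand-in for the card */

unsigned char sim_media[SIM_SECTORS][SIM_SECTOR_SIZE];

/* Hypothetical synchronous read: copies one sector into buf.
 * Returns the number of sectors transferred (1), or 0 on a bad sector. */
int sim_dev_read(unsigned long sector, unsigned char *buf)
{
    if (sector >= SIM_SECTORS)
        return 0;
    memcpy(buf, sim_media[sector], SIM_SECTOR_SIZE);
    return 1;
}

/* Hypothetical synchronous write: copies one sector from buf.
 * Returns the number of sectors transferred (1), or 0 on a bad sector. */
int sim_dev_write(unsigned long sector, const unsigned char *buf)
{
    if (sector >= SIM_SECTORS)
        return 0;
    memcpy(sim_media[sector], buf, SIM_SECTOR_SIZE);
    return 1;
}
```

Both routines complete in the calling thread, with no deferred completion, which is the simple model suggested above. On real hardware the memcpy calls would be replaced by the ATA command sequence.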

The ssd.c example keeps the actual read/write device in separate files (ssd-xms.c for XMS I/O, ssd_asm.s for an 8018X SSD implementation, and ssd-test.c for testing). You'll want to copy ssd.c to a wholly new file for your driver, and probably include the read/write routines directly in a single file. You'll also see there's an ioctl implementation that allows setting the size of the CF, but that's only useful for ram disks and testing, so it will likely be able to be wholly removed.

When I recently wrote the XMS ramdisk, I was able to use the ssd.c framework as a start and got the driver finished in a couple hours. Yours will likely take longer, but feel free to copy some of the ATA init/read/write stuff out of directhd.c if you like it, otherwise just start from scratch - should be pretty straightforward.

You can choose to keep /dev/ssd as the registered block device, or perhaps better just use the major/minor numbers for /dev/hda. Things get a lot more complicated for partitions, do you think you'll need them?

Thank you!

ghaerr avatar May 09 '25 07:05 ghaerr

@ghaerr thank you again, that was super useful.

The ssd.c (/dev/ssd) base framework has most of what you need to start, and you'll only have to write two routines - ssddev_read and ssddev_write. The framework also allows for async (IODELAY) completion, which I don't think you'll want or need - just read and write immediately in the calling kernel thread.

Agreed.

The ssd.c example keeps the actual read/write device in separate files (ssd-xms.c for XMS I/O, ssd_asm.s for an 8018X SSD implementation, and ssd-test.c for testing).

That all looks very logical. Will look to copy that.

You'll want to copy ssd.c to a wholly new file for your driver, and probably include the read/write routines directly in a single file.

Ah, so that deviates a little from the others, any reason why?

You'll also see there's an ioctl implementation that allows setting the size of the CF, but that's only useful for ram disks and testing, so it will likely be able to be wholly removed.

Nice.

You can choose to keep /dev/ssd as the registered block device, or perhaps better just use the major/minor numbers for /dev/hda. Things get a lot more complicated for partitions, do you think you'll need them?

In the spirit of keeping things simple, we will probably not bother with partitions for the moment.

Any reason to use /dev/hda over /dev/ssd?

Thanks!

fhendrikx avatar May 09 '25 23:05 fhendrikx