
Custom memory layout for tiny model TSR

Open andrewbird opened this issue 4 years ago • 52 comments

I'm looking at getting gcc-ia16 to cross-compile the share.com TSR for FreeDOS. Currently the only compiler capable of building it is Turbo C 2.01. I've had some initial success with GCC using a local keep() function (see https://github.com/andrewbird/share/commit/b94094af4b5bed6efbafb6ee10a38106f7cadfb8), but the original code uses some tricks to throw away the initialisation code and shrink the resident memory size. Could I achieve something similar, or better, with a custom linker script? From what I see in the memory map file, gcc-ia16 lays out the startup code, init and fini before the text of the program and consequently would still be included in the resident memory at TSR. Ideally I'd want something like this for a tiny-model .COM:

(jump to real startup code)
(resident text section)
(resident data section)
        < some way of determining this point to use as the size to retain in the keep() function >
(normal startup code)
(usual init code)
(usual fini code)
(transient text section)
(transient data section)

I've never had cause to write a linker script before, so I am a complete novice. Is this sort of control possible? Is there a better solution?

Also, I'm using the command line below. How would I replicate it as separate compile and link steps?

ia16-elf-gcc -Wall -fpack-struct -fno-toplevel-reorder -mcmodel=tiny -o $@ $< -li86 -Wl,-Map=share.map

andrewbird avatar Sep 12 '21 15:09 andrewbird

I had some success with the linker script and was able to reduce the resident size. However, in order to place the temporary code at the end of the output I needed to move my table data out of the heap into a new data segment, which meant that my on-disk size grew significantly. The only way I can see around this is to copy the transient code higher at install time, jump to it, and then overwrite the original location with my data tables. Here's my latest attempt: https://github.com/andrewbird/share/commit/c4271c2e2a90da25f748a94e629e862d858798f3

One thing I do notice though, there are lots of these that end up in the resident code __libi86_intr_call_0133. They are only 3 bytes each, but when there are 377 of them they become quite significant.

andrewbird avatar Sep 13 '21 17:09 andrewbird

Hello @andrewbird,

Sorry for the late reply! I will see in the meantime if I can add better support for TSRs to the gcc-ia16 toolchain and libi86...

One thing I do notice though, there are lots of these that end up in the resident code __libi86_intr_call_0133.

Actually, under gcc-ia16 + libi86, you can try compiling your code with optimization (e.g. -Os or -O2) — this should help optimize away most of these __libi86_intr_call... routines. These little routines are used to implement the calls to the 256 possible interrupts, for use with int86 etc. E.g.

__libi86_intr_call_0041:
        int     $0x21
        ret
__libi86_intr_call_0042:
        int     $0x22
        ret

If you enable optimization, and your interrupt numbers are compile-time constants, then <i86.h> will use some magic incantations to link in only the needed int invocations.

Thank you!

tkchia avatar Sep 13 '21 18:09 tkchia

Hello @andrewbird,

From what I see in the memory map file, gcc-ia16 lays out the startup code, init and fini before the text of the program and consequently would still be included in the resident memory at TSR.

In case this is useful: the .init and .fini sections are customarily used to insert snippets (not entire routines) of code to call out to initialization routines before main starts, and to termination routines after main exits. In the default arrangement in gcc-ia16 though, the .init section simply contains a call to a routine that goes through function pointers in .ctors.* sections. (The gccint info page has a bit of discussion about this stuff.)

To mark an entire routine as a "startup" routine, the customary way — or at least the GCC-approved way — is to place it in a .text.startup or .text.startup.* section. In fact, by default GCC will automatically place main and any __attribute__((constructor)) routines inside .text.startup.

However, using a .non_resident_text section name should work well too.

Thank you!

tkchia avatar Sep 13 '21 19:09 tkchia

@tkchia yes that -Os really helped! At load of share.com the symbols looked like this

dosdebug> usermap load-gnu ../fdos/share.git/share.map 02c3
dosdebug> usermap list
  02c3:0120     _dos_freemem
  02c3:0132     _dos_getvect
  02c3:0144     _dos_setvect
  02c3:0158     __libi86_ret_really_set_errno
  02c3:0163     __libi86_int86_do
  02c3:01be     __libi86_intr_call_0041
  02c3:01c1     __libi86_intr_call_0057
  02c3:0235     _atoi_r
  02c3:0262     _atol_r
  02c3:027a     __errno
  02c3:02a3     memset
  02c3:02c5     strchr
  02c3:02e0     strlen
  02c3:02fc     _strtol_r
  02c3:052f     strtol
  02c3:0547     write
  02c3:055f     _write_r
  02c3:058b     __call_exitprocs
  02c3:06e3     _reclaim_reent
  02c3:07ec     _free_r
  02c3:08b5     _malloc_r
  02c3:09fd     _sbrk_r
  02c3:0a23     __udivsi3
  02c3:0a34     __umodsi3
  02c3:0a4a     __ia16_ldivmodu
  02c3:0a99     _sbrk
  02c3:0ae5     _write
  02c3:0bae     __ia16_abort_impl
  02c3:0c00     __DTOR_END__
  02c3:0c02     _global_impure_ptr
  02c3:0c04     _ctype_
  02c3:1116     __dso_handle
  02c3:111a     _impure_ptr
  02c3:114c     __ctype_ptr__
  02c3:1150     _heaplen
  02c3:1166     _psp
  02c3:116e     _global_atexit
  02c3:1170     __malloc_sbrk_start
  02c3:1172     __malloc_free_list
  02c3:11f6     table_pool
  02c3:51f6     _start
  02c3:5327     _exit
  02c3:532c     __msdos_crt_exit
  02c3:536a     init_tables
  02c3:53c8     main

And the code size was

02b0:0000 0x0011 [SHARE - Environment]
02c2:0000 0x9d3c [SHARE] (END)

After running main

02c2:0000 0x01ae [SHARE]

Which compares very favourably with Turbo C at 0x0237. Obviously I still have a significant part of the interrupt handler commented out. Have you any more thoughts about register access within the handler? Would replicating the Watcom method of access via union INTPACK r be viable?

I had previously tried to mark _dos_freemem(), _dos_getvect() and _dos_setvect() with .non_resident_text section name as I know they are only called from installation code, but they still ended up in the .text section. Is there any way of marking their section from their caller's section (but it could get complicated if a function was called from both .text and .non_resident_text I guess)?

However, using a .non_resident_text section name should work well too.

Yes I'm making it up as I go along!

Thank you!

andrewbird avatar Sep 13 '21 21:09 andrewbird

I pushed another version now that reduces both the on disk size and the resident memory to better than Turbo C. It does this by placing the non resident text over .bss and heap sections. At startup it copies that text up to just before 0xf000, then proceeds to the normal startup routines that initialize .bss and sets up the heap. Using this method I now only have one special section and malloc can now function as usual.

Type                  Turbo C  Gcc ia16
On disk size          6300     5224
Resident memory size  0x0240   0x01b5

Obviously the new startup code would be better written in assembly than as fixed bytes and, not being sure which registers I need to preserve, I probably saved too many. https://github.com/andrewbird/share/commit/b21cd7dcea727f625adc3cd1402b55f5a097b6bf
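
The relocation step described in this post can be modelled on a host machine. What follows is only a minimal sketch, not the code from the commit: it treats the 64 KiB segment as a plain byte array, and the names relocate_transient, clear_overlaid_bss and SEG_SIZE are made up for illustration. It copies the transient block from its load offset to a higher run offset, then zeroes the bytes it used to occupy so that .bss and the heap can live there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 0x10000u  /* one 64 KiB tiny-model segment (host-side model only) */

/* Copy `len` bytes of transient code/data from where the loader put them
   (load_off) to where they will run (run_off).  memmove is used because
   the two ranges are allowed to overlap. */
void relocate_transient(unsigned char *seg, size_t load_off,
                        size_t run_off, size_t len)
{
    assert(len <= SEG_SIZE);
    assert(load_off <= SEG_SIZE - len);
    assert(run_off <= SEG_SIZE - len);
    memmove(seg + run_off, seg + load_off, len);
}

/* Once the transient block has moved, the bytes at its load offset can be
   reused: clear them so they can serve as .bss. */
void clear_overlaid_bss(unsigned char *seg, size_t bss_off, size_t bss_len)
{
    assert(bss_len <= SEG_SIZE);
    assert(bss_off <= SEG_SIZE - bss_len);
    memset(seg + bss_off, 0, bss_len);
}
```

On the real target the startup code would jump to the relocated copy between these two steps; the model leaves that out because it has no meaning on a host.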

andrewbird avatar Sep 14 '21 13:09 andrewbird

Hello @andrewbird,

I had previously tried to mark _dos_freemem(), _dos_getvect() and _dos_setvect() with .non_resident_text section name as I know they are only called from installation code, but they still ended up in the .text section. Is there any way of marking their section from their caller's section (but it could get complicated if a function was called from both .text and .non_resident_text I guess)?

The short answer is, no unfortunately. Basically, the input sections where _dos_getvect (.) etc. come from are taken from the modules in the library i.e. libi86.a — and the output sections they end up in are determined by the linker script.

There is no easy way (for now) to automatically place _dos_getvect (.) in a special transient section if it will only be called from transient code.

Thank you!

tkchia avatar Sep 18 '21 19:09 tkchia

Yes that aligns with my various experiments. I guess for now it's best to recreate the simple ones locally with int86(), but in any case I'm very happy with the resident and file sizes achieved, I just need to figure out a fairly clean way of accessing/modifying the registers from within the interrupt handler and to implement a chain interrupt function.

Thank you!

andrewbird avatar Sep 18 '21 21:09 andrewbird

Interestingly switching to a local function for getvect() based on int86x() increased the resident size by 6 paragraphs! I see your implementation of _dos_getvect() uses gcc inline assembler, so that would be the way to go if I wasn't trying to avoid sprinkling compiler specific assembly language all over the C source. So for now I'll stick with the i86 library code ending up in the resident portion, and it's good to know your library is lean!

Thank you!

andrewbird avatar Sep 19 '21 11:09 andrewbird

So I'm still looking to shave some bytes off the share tsr. I'm looking at the map file and I see this

 .rodata        0x000000000000125d      0x281 /usr/ia16-elf/lib/libc.a(lib_a-ctype_.o)
                0x000000000000125d                _ctype_                       

What is it, and can I exclude it somehow?

Thank you!

andrewbird avatar Sep 23 '21 11:09 andrewbird

I see it's used by strtol(), which is presumably included to provide atol() which I do use. So I guess the answer is that I'm using it. Oh well!

Thank you!

andrewbird avatar Sep 23 '21 13:09 andrewbird

Hello @andrewbird,

What is it, and can I exclude it somehow?

_ctype_ is the array of flags that is used to implement <ctype.h> 's isalpha (.), isdigit (.), etc. macros, for the default locale ("C"). I think many C runtime libraries employ a similar method to implement <ctype.h>. Thank you!
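
To illustrate the trade-off, here is a small host-side sketch of a flags-table lookup next to a direct range comparison. It is an assumption-laden model, not newlib's code: demo_flags, DEMO_DIGIT and the function names are hypothetical, and the flag value is arbitrary. The point is only that the table version needs the whole array linked in, while the comparison version needs no data at all.

```c
#include <assert.h>

#define DEMO_DIGIT 0x04u  /* hypothetical flag bit, not newlib's actual value */

static unsigned char demo_flags[256];  /* stand-in for a _ctype_-style table */

/* Mark the ten decimal digits in the table. */
void demo_init_flags(void)
{
    int c;
    for (c = '0'; c <= '9'; c++)
        demo_flags[c] |= DEMO_DIGIT;
}

/* Table-driven test: one load and one mask, but the table must be linked in. */
int table_isdigit(unsigned char c)
{
    return (demo_flags[c] & DEMO_DIGIT) != 0;
}

/* Direct comparison: no table needed. */
int direct_isdigit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Host-side check that both versions agree for every byte value. */
void demo_check_agreement(void)
{
    int c;
    for (c = 0; c < 256; c++)
        assert(table_isdigit((unsigned char)c) == direct_isdigit((unsigned char)c));
}
```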

tkchia avatar Sep 23 '21 17:09 tkchia

Hello @andrewbird,

I see it's used by strtol(), which is presumably included to provide atol() which I do use. So I guess the answer is that I'm using it.

Well, there are other options if you're looking for tiny size: you could write your own atol() and use direct comparisons against the digits 0-9, rather than the isdigit() macro, which drags in the flags table behind <ctype.h>. With a small, efficient implementation of atol like the one below included in your source, the standard library version is no longer pulled into your program, and the result should end up quite a bit smaller than what you have, provided it meets your program's requirements.

long atol(const char *s) /* simple, somewhat hacked version of atoi that returns long, for smaller size */
{
    int n = 0; /* int is 16 bits here: declare as long if values above 32767 must be decoded */
    while (*s >= '0' && *s <= '9')
        n = n*10 + *s++ - '0';
    return (long)n;
}

Thank you!

ghaerr avatar Sep 23 '21 17:09 ghaerr

Hello @ghaerr Thanks for the advice, I'd just written but not yet tested this

+/* Naive implementation of atol(), only decimal digits allowed, no signs */
+long int atol(const char *s) NON_RESIDENT;
+long int atol(const char *s) {
+       long int val;
+       const char *p;
+
+       for (val = 0, p = s; *p; p++) {
+               if (*p == ' ')
+                       continue;
+               if (*p < '0' || *p > '9')
+                       break;
+               val *= 10;
+               val += *p - '0';
+       }
+
+       return val;
+}

It saved 1.2k on the file size, plus allowed me to steer it into my new non resident text section which means I save around the same on the resident size.

Thank you again!

andrewbird avatar Sep 23 '21 17:09 andrewbird

Hello @tkchia , @ghaerr , I just wanted to say thank you for all your help to me on this gnu linker voyage. My PR has now been merged into FDOS/share so it's certainly been a worthwhile effort, and I've learnt a little bit on the way. A most interesting experience for me.

Thank you!

andrewbird avatar Sep 24 '21 00:09 andrewbird

Hi @tkchia, I'm just trying out your new TSR support that just landed in the PPA. I'm having a little difficulty getting it to link fully, see https://github.com/andrewbird/share/commit/062910471e291d1e130a779c29d1364007c45982

So here's the build attempt and output message

ia16-elf-gcc -Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os
nasm -f elf gcc_help.asm -o gcc_help.obj
ia16-elf-ld share.obj gcc_help.obj -o share.com -L/usr/ia16-elf/lib -li86 --script=dtr-mts.ld -Map=share.map
ia16-elf-ld: cannot find -l:crtbegin.o
ia16-elf-ld: cannot find -l:crtend.o
make: *** [Makefile:2: share.com] Error 1

So some questions:

  • I tried all the linker scripts, and dtr-mts.ld seemed to be the one that produced the fewest errors. Was that a good choice?
  • Should I need to specify the -L path to find libi86 and the linker script (previously I was doing it in my linker script, which obviously wasn't ideal either)?
  • What do I need to do to find the crt* objects?

Thank you!

andrewbird avatar Oct 01 '21 12:10 andrewbird

After a little look at my previous linker script, it seems I need to specify the -L path to gcc

diff --git a/build.sh b/build.sh
index 04cd88c..eaad3d0 100755
--- a/build.sh
+++ b/build.sh
@@ -5,7 +5,7 @@ if [ x"${COMPILER}" = "xgcc" ] ; then
   export COPT="-Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os"
   export XOBJS="gcc_help.obj"
   export LD="ia16-elf-ld"
-  export LOPT="share.obj ${XOBJS} -o share.com -L/usr/ia16-elf/lib -li86 --script=dtr-mts.ld -Map=share.map"
+  export LOPT="share.obj ${XOBJS} -o share.com -L/usr/lib/x86_64-linux-gnu/gcc/ia16-elf/6.3.0 -L/usr/ia16-elf/lib -li86 --script=dtr-mtsl.ld -Map=share.map"
   make

So, as the link now completes, only one question remains: is it expected that I need to specify the -L paths to libi86 and gcc?

andrewbird avatar Oct 01 '21 13:10 andrewbird

A little comparison between the custom linker script & loading code and your new TSR support.

Type                        Custom  New TSR support
On disk size (bytes)        5416    5424
Resident size (paragraphs)  0x01c0  0x0205

I'm a little surprised by the resident size growth?

andrewbird avatar Oct 01 '21 14:10 andrewbird

Using ia16-elf-gcc to do the link stage means I don't have to specify the paths or the script

ia16-elf-gcc -Wall -fpack-struct -mcmodel=tiny -mtsr -c share.c -o share.obj -Os
nasm -f elf gcc_help.asm -o gcc_help.obj
ia16-elf-gcc -mtsr -Wl,-Map=share.map share.obj gcc_help.obj -o share.com -li86

andrewbird avatar Oct 01 '21 15:10 andrewbird

Hello @andrewbird,

ia16-elf-gcc -mtsr -Wl,-Map=share.map share.obj gcc_help.obj -o share.com -li86

Yes, this is the way.

You probably now need to put the transient code in .text.startup, if you have not done so already. I.e.

#define NON_RESIDENT __attribute__((section(".text.startup")))

I hope to further tweak the gcc-ia16 toolchain — maybe add new function and variable attributes — so that there is a documented and flexible way to specify variables and functions as being transient. Meanwhile, the above should work.
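
As a rough illustration of how the macro might be applied, here is a sketch that also builds on an ordinary host. The function names and the table_entries variable are hypothetical, and the fallback to an empty macro is an assumption made so the example compiles on non-ELF hosts; it relies on GCC predefining __ELF__ for ELF targets. Install-time routines carry NON_RESIDENT, while anything the resident handler calls is left in plain .text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: fall back to an empty macro where the section name is not
   usable, so this sketch still builds on a non-ELF host. */
#if defined(__GNUC__) && defined(__ELF__)
#define NON_RESIDENT __attribute__((section(".text.startup")))
#else
#define NON_RESIDENT
#endif

static unsigned table_entries;  /* resident data, read after installation */

/* Install-time only: parse a decimal argument (digits only, no sign). */
NON_RESIDENT unsigned parse_entries(const char *arg)
{
    unsigned n = 0;
    assert(arg != NULL);  /* host-side sanity check only */
    while (*arg >= '0' && *arg <= '9')
        n = n * 10 + (unsigned)(*arg++ - '0');
    return n;
}

/* Install-time only: record the requested table size. */
NON_RESIDENT void install(const char *arg)
{
    table_entries = parse_entries(arg);
}

/* Resident: stays in .text because the handler needs it after install. */
unsigned resident_entries(void)
{
    return table_entries;
}
```

A resident routine must never call a NON_RESIDENT one, since that code is gone once the program has gone resident.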

Thank you!

tkchia avatar Oct 01 '21 16:10 tkchia

Hello @tkchia,

#define NON_RESIDENT __attribute__((section(".text.startup")))

Yes I had already done this and it seemed to be steering those functions into the non-resident area just fine. I've attached the map file so you can see, but it seems there must be something else swelling the resident sections. share.zip

So, since the TSR script now uses .text.startup as the section to discard, could you mark library functions that you know are startup or exit code (crtbegin, crtend, exit, __call_exitprocs) as belonging to .text.startup? In the normal linker script they would still be linked into the .text section by the wildcard rule .text.*, wouldn't they?

I hope to further tweak the gcc-ia16 toolchain

Sure thing, I'll test as they arrive into the PPA

Thank you!

andrewbird avatar Oct 01 '21 17:10 andrewbird

Hello @tkchia, am I correct in saying that malloc() allocations start at heap_end_minimum? If so, that may be the problem: the share loader works out where the resident portion ends by mallocing a final byte after all other allocations are done, then measuring the distance from the PSP to it. With the custom ld script heap_end_minimum is 0x12a0, whereas with the new TSR support it's 0x16f0. Other than that, the .text, .data and .bss section sizes look very similar to those produced by the custom script.
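
The distance-from-the-PSP measurement boils down to a round-up to 16-byte paragraphs. The sketch below is a hedged illustration rather than the loader's actual code; resident_paragraphs and PARAGRAPH are hypothetical names, and it assumes a tiny-model .COM, where the PSP sits at offset 0 of the single segment.

```c
#include <assert.h>

#define PARAGRAPH 16u  /* DOS measures resident size in 16-byte paragraphs */

/* With the PSP at offset 0 of the segment, the offset of the first unused
   byte is also the number of bytes to keep.  Round it up to paragraphs. */
unsigned resident_paragraphs(unsigned end_offset)
{
    /* Keep the round-up from overflowing a 16-bit unsigned. */
    assert(end_offset <= 0xfff0u);
    return (end_offset + PARAGRAPH - 1) / PARAGRAPH;
}
```

Fed the two heap_end_minimum values quoted above, this gives 0x12a and 0x16f paragraphs. The difference, 0x45 paragraphs, is the same as the gap between 0x01c0 and 0x0205 in the earlier comparison, which is consistent with the heap start being the cause of the growth.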

Thank you!

andrewbird avatar Oct 01 '21 17:10 andrewbird

Hello @andrewbird,

Ah — I think I see the problem. Your custom script placed the heap start right after the resident portion, while my script was placing it at the end of the transient portion. (I had considered treating the heap as part of the resident portion, but this is tricky to get right in the general case.)

Let me try to come up with a good way for the C runtime to report the end of the resident portion. I will let you know.

Thank you!

tkchia avatar Oct 01 '21 18:10 tkchia

Hello @tkchia ,

while my script was placing it at the end of the transient portion.

But doesn't that mean if we want to release the transient portion we can never use malloc'd memory afterwards? In the share program malloc'd memory is used to allocate the share and lock tables that will be used by the resident code. Their sizes are not known at compile time as they are determined by command line parameters.

Let me try to come up with a good way for the C runtime to report the end of the resident portion.

I'm not sure if that's too helpful on its own, but if we could redefine the location of the heap at runtime it could be?

I hope I'm not asking too many silly questions, this level of control on placing memory in a C program is both new and interesting to me.

Thank you!

andrewbird avatar Oct 02 '21 08:10 andrewbird

Can this work?

Entry stage

  • startup data and startup text in position that final bss will occupy
. high
data.startup
text.startup
data
text
. low

Setup stage before main()

  • bss.startup initialised at top of segment, data.startup and text.startup copied just beneath it, jump to text.startup, bss initialised, heap setup, stack.startup setup
  • heap grows up, stack down
. high
bss.startup
data.startup
text.startup
stack.startup
.
heap
bss
data
text
. low

Run main()

  • user code determines size to keep resident, sets up stack, and keep(size)
  • heap grows up, stack down
. high
stack
.
heap
bss
data
text
. low

andrewbird avatar Oct 02 '21 09:10 andrewbird

Hello @andrewbird,

In the share program malloc'd memory is used to allocate the share and lock tables that will be used by the resident code. Their sizes are not known at compile time as they are determined by command line parameters.

Hmm — this is a problem. 😐

One problem with moving .text.startup etc. above the heap, is that the maximum heap size, whether it is 32 KiB, or 48 KiB, or 60 KiB, etc. will probably need to be hard-coded in advance, so that we know where exactly to position .text.startup etc.

It will be nice if we can automatically position the transient code and data according to their size — if the transient portion is large we probably want it to start lower in memory. But it seems the GNU linker scripts do not allow one to position sections "backwards" in memory — e.g. I cannot easily say "allocate a .bss.startup output section in such a way that the section ends at offset 0xff00 or thereabouts".

Thank you!

tkchia avatar Oct 02 '21 18:10 tkchia

Hello @andrewbird and @tkchia,

(I had considered treating the heap as part of the resident portion, but this is tricky to get right in the general case.)

May I ask why this is tricky, if the heap is of fixed size? A fixed heap directly after the resident portion, followed by the transient .text.startup/.data.startup sections, should solve this problem and eliminate the requirement for a custom linker script, right?

Let me try to come up with a good way for the C runtime to report the end of the resident portion.

Wouldn't malloc(1), after all other allocations are completed, return the end of the resident data section, which, along with some manipulation of the DS and CS values, would allow the PSP offset to be known?

Thank you!

ghaerr avatar Oct 02 '21 18:10 ghaerr

Hello @ghaerr, hello @andrewbird,

I am experimenting with some changes to gcc-ia16 and newlib-ia16 that will allow the heap to appear before the transient code and data, but in a more rigorous way (https://github.com/tkchia/newlib-ia16/commit/5c10cbfe9baffadeb31a6a81f09de4f40067b54d etc.). Thank you!

tkchia avatar Oct 02 '21 21:10 tkchia

But it seems the GNU linker scripts do not allow one to position sections "backwards" in memory — e.g. I cannot easily say "allocate a .bss.startup output section in such a way that the section ends at offset 0xff00 or thereabouts".

I'm not sure you need to worry about that as such since it would be NOLOAD, hence not allocated, and the AT final address is computable.

So wouldn't this work? It's similar to my gcc_help.ld, but extended to handle the extra .data.startup and .bss.startup.

section .text {
...
}
section .data {
...
edata = .
}

section .bss NOLOAD {
/*  NOLOAD means it can overlap with text.startup and maybe .data.startup */
}

/* this is where heap ends up, its location and size are not important */

section .text.startup AT (0xff00 - sizeof(.bss.startup) - sizeof(.data.startup) - sizeof(.text.startup)) {
/* load address starts at .edata */
...
}

section .data.startup AT (0xff00 - sizeof(.bss.startup) - sizeof(.data.startup)) {
...
}

section .bss.startup (NOLOAD) AT (0xff00 - sizeof(.bss.startup)) {
/* again it's not allocated, only cleared */
}

section .stack.startup (NOLOAD) AT (0xff00) {
/* again it's not allocated, and only shown here for reference 0xff00 -> 0xffff */
}

Or did I miss something obvious?

andrewbird avatar Oct 02 '21 22:10 andrewbird

I just tried this and it seems to give me what I expect (although SIZEOF didn't seem to work, end of section - start of section did)

    .text.startup (0xfd00 - __xx_bss_startup_length - __xx_data_startup_length - __xx_text_startup_length) :
                        AT (__sbss_keep) {                                      
...
}
__xx_text_startup_length = (__etext_startup - __stext_startup);

    .data.startup (0xfd00 - __xx_bss_startup_length - __xx_data_startup_length) :
                        AT (__sbss_keep + SIZEOF (.text.startup)) {             
...
}
__xx_data_startup_length = (__edata_startup - __sdata_startup);

    .bss.startup (0xfd00 - __xx_bss_startup_length) (NOLOAD) : {                
...
}
__xx_bss_startup_length = (__ebss_startup - __sbss_startup);

And the .MAP file looked like this

.bss            0x00000000000011c4       0xb8                                   
                0x00000000000011c4                __sbss_keep = .               

.text.startup   0x000000000000f88a      0x461 load address 0x00000000000011c4   
                0x00000000000011c4                __stext_startup_load = LOADADDR (.text.startup)

.data.startup   0x000000000000fceb        0xd load address 0x0000000000001625   
                0x0000000000001625                __sdata_startup_load = LOADADDR (.data.startup)

.bss.startup    0x000000000000fcf7        0x9                                   
                0x000000000000fcf7                __sbss_startup = .            
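
The address arithmetic in the script can be checked on a host with a few lines of C. This is only a sketch of the calculation, under the assumption that the three startup sections are packed downwards from a fixed top address and loaded back to back from a base address; struct placement and place_startup are hypothetical names, not part of the toolchain.

```c
#include <assert.h>

struct placement {
    unsigned long text_vma, data_vma, bss_vma;  /* run-time addresses */
    unsigned long text_lma, data_lma;           /* load addresses in the file image */
};

/* Pack .bss.startup, .data.startup and .text.startup downwards from `top`,
   and lay the two loadable sections out upwards from `load_base`. */
struct placement place_startup(unsigned long top, unsigned long load_base,
                               unsigned long text_len, unsigned long data_len,
                               unsigned long bss_len)
{
    struct placement p;
    assert(bss_len + data_len + text_len <= top);
    p.bss_vma  = top - bss_len;
    p.data_vma = p.bss_vma - data_len;
    p.text_vma = p.data_vma - text_len;
    p.text_lma = load_base;             /* corresponds to AT (__sbss_keep) */
    p.data_lma = load_base + text_len;  /* __sbss_keep + SIZEOF (.text.startup) */
    return p;
}
```

With a top of 0xfd00, a load base of 0x11c4, a text length of 0x461 and a bss length of 0x9, this gives a .bss.startup address of 0xfcf7 and a .data.startup load address of 0x1625, the same values that appear in the map output.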

Thank you!

andrewbird avatar Oct 03 '21 13:10 andrewbird

Hello @andrewbird,

This is very cool, thanks! I will update the Newlib package in the PPA to use this.

tkchia avatar Oct 03 '21 14:10 tkchia