QNICE-FPGA
Feature request: Add support for software floating point in the C-compiler
As mentioned in our last meeting, adding software-emulated floating point is a first step before implementing hardware support. That way we can gauge the speed of the floating point calculations and better evaluate the need for hardware support.
We talked about different floating point formats, and here is yet another suggestion. The idea is to choose a format that gives reasonable accuracy and avoids unnecessary bit shifts. So here goes:
Proposal for floating point format
- Each floating point number uses 3 words (i.e. 48 bits).
- One word (i.e. 16 bits) for the exponent (offset by 0x8000).
- Two words (i.e. 32 bits) for the mantissa (in sign magnitude format; bit 31 is sign and replaces the msb of the normalized mantissa).
Examples
| Floating point value | Exponent (real) | Mantissa (real) | Exponent (binary) | Mantissa (binary) |
|---|---|---|---|---|
| 1.0 | 0 | 1.0 | 0x8000 | 0x0000 |
| 2.0 | 1 | 1.0 | 0x8001 | 0x0000 |
| 3.0 | 1 | 1.5 | 0x8001 | 0x4000 |
| -1.0 | 0 | -1.0 | 0x8000 | 0x8000 |
| -2.0 | 1 | -1.0 | 0x8001 | 0x8000 |
| -3.0 | 1 | -1.5 | 0x8001 | 0xC000 |
In other words:
- 1.0 <= |Mantissa (real)| < 2.0.
- Mantissa (binary) = (Mantissa (real) - 1) * 0x8000 for positive numbers.
- Mantissa (binary) = |Mantissa (real)| * 0x8000 for negative numbers.
The value 0 is represented by setting the exponent = 0x0000.
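To make the proposed layout concrete, here is a minimal C sketch of packing and unpacking a value. The names `qfloat`, `qfloat_encode` and `qfloat_decode` are hypothetical. The sketch assumes that the table above shows only the upper word of the mantissa, so the full 32-bit scale factor would be 0x80000000 rather than 0x8000. It uses the host's `double` purely for illustration and expects finite input.

```c
#include <stdint.h>

/* Hypothetical container for the proposed 3-word format. */
typedef struct {
    uint16_t exponent;  /* real exponent + 0x8000; 0x0000 encodes the value 0 */
    uint32_t mantissa;  /* bit 31 = sign, bits 30..0 = fraction after the hidden 1 */
} qfloat;

/* Pack a finite host double (illustration only; extra bits are truncated). */
static qfloat qfloat_encode(double x)
{
    qfloat q = { 0x0000, 0 };
    if (x == 0.0)
        return q;                          /* zero: exponent word is 0x0000 */

    double m = (x < 0.0) ? -x : x;
    int e = 0;
    while (m >= 2.0) { m /= 2.0; e++; }    /* normalize to 1.0 <= m < 2.0 */
    while (m < 1.0)  { m *= 2.0; e--; }

    q.exponent = (uint16_t)(e + 0x8000);
    q.mantissa = (uint32_t)((m - 1.0) * 2147483648.0);  /* fraction * 2^31 */
    if (x < 0.0)
        q.mantissa |= 0x80000000u;         /* sign takes the hidden bit's place */
    return q;
}

/* Unpack back into a host double. */
static double qfloat_decode(qfloat q)
{
    if (q.exponent == 0x0000)
        return 0.0;

    double m = 1.0 + (double)(q.mantissa & 0x7FFFFFFFu) / 2147483648.0;
    int e = (int)q.exponent - 0x8000;
    while (e > 0) { m *= 2.0; e--; }
    while (e < 0) { m /= 2.0; e++; }
    return (q.mantissa & 0x80000000u) ? -m : m;
}
```

Under these assumptions, `qfloat_encode(3.0)` gives exponent 0x8001 and mantissa 0x40000000, whose upper word matches the 0x4000 in the table.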
Having 16 bits for the exponent is certainly a luxury, but this avoids some bit shifting.
What do you think? Is this too much? Should we prefer a 32-bit floating point number instead, with 8 bits for the exponent and 24 bits for the mantissa?
Hi Michael - thank you so much for the issue and your proposal! Please excuse me for not having opened the issue myself as promised, I was so buried in (stupid) work during the last couple of days that it always slipped to the next day... 👍
I like your idea of a proprietary format a lot, and spending 3 words on one FP number might speed things up considerably.
With 32 bits for the mantissa we have plenty of room, so we might want to drop the standard hidden-bit feature, i.e. we could store the leading 1 (the MSB) explicitly instead of treating it as implicitly set. This would further simplify things, as we would not have to reserve an exponent value like 0 to denote the value zero.
We could also just say that the mantissa as well as the exponent are two's complement numbers, which would further simplify the software implementation.
Of course, this is at odds with nearly every FP implementation, but the only problem I see is that we would have to change the C compiler so that it converts FP constants into our format.
What do you think?
I think we should go for a format that makes implementation easy (and fast). With 3 words, we have plenty of bits to use, so storing the mantissa in full 32-bit two's complement seems like a good idea.
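As a rough illustration of why the two's complement variant could be easy to implement, here is a C sketch of normalization and multiplication. Everything in it is an assumption and not something agreed in this thread: the names `tcfloat`, `tc_from_int`, `tc_mul` and `tc_to_double`, the position of the binary point (value = mantissa * 2^(exponent - 30)), and the normalization rule (the top two mantissa bits differ). Zero is simply mantissa 0, and exponent overflow is not handled.

```c
#include <stdint.h>

/* Hypothetical layout: value = mantissa * 2^(exponent - 30). */
typedef struct {
    int16_t exponent;   /* two's complement exponent */
    int32_t mantissa;   /* two's complement mantissa, explicit leading bit */
} tcfloat;

/* Build a normalized number from an integer. */
static tcfloat tc_from_int(int32_t n)
{
    tcfloat r = { 0, 0 };
    if (n == 0)
        return r;                       /* zero needs no reserved exponent */

    uint32_t u = (uint32_t)n;
    int e = 30;
    /* Shift left until bit 31 and bit 30 differ (normalized). */
    while (((u >> 31) & 1u) == ((u >> 30) & 1u)) { u <<= 1; e--; }

    r.mantissa = (int32_t)u;            /* assumes the usual wrap-around conversion */
    r.exponent = (int16_t)e;
    return r;
}

/* Multiply: one 32x32 -> 64 bit product, then renormalize. */
static tcfloat tc_mul(tcfloat a, tcfloat b)
{
    tcfloat r = { 0, 0 };
    int64_t p = (int64_t)a.mantissa * (int64_t)b.mantissa;
    if (p == 0)
        return r;

    int e = a.exponent + b.exponent + 2;
    uint64_t u = (uint64_t)p;
    while (((u >> 63) & 1u) == ((u >> 62) & 1u)) { u <<= 1; e--; }

    r.mantissa = (int32_t)(uint32_t)(u >> 32);  /* keep the top 32 bits */
    r.exponent = (int16_t)e;
    return r;
}

/* Convert to a host double, for checking results only. */
static double tc_to_double(tcfloat a)
{
    double v = (double)a.mantissa;
    int e = a.exponent - 30;
    while (e > 0) { v *= 2.0; e--; }
    while (e < 0) { v /= 2.0; e++; }
    return v;
}
```

In this sketch the sign needs no special casing in the multiply, which is the kind of simplification the two's complement idea is after. The price is that negative powers of two normalize differently from their positive counterparts (for example, -1 is stored as -2 * 2^-1).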
I would assume that changing the C compiler's handling of FP constants is a small task.
Sounds great, gentlemen :-)
About the compiler's floating point constant representation
For your convenience, I did the following experiment. I wrote this C program here:
#include <stdio.h>

int main()
{
    float f = 3.1415;
    printf("f = %f\n", f);
    return 0;
}
When we are done with our implementation, we would expect this output:
f = 3.141500
Here is the assembler code that VBCC generates, so that you gentlemen can investigate the FP constant handling:
.text
.global _main
_main:
incrb
sub 6,R13
move 0x0e56,R2
move 0x4049,R3
move R13,R8
add 2,R8
move R3,@--R13
move R2,@--R13
asub #___flt32toflt64,1
move R13,R11
add 8,R11
move @--R11,@--R13
move @--R11,@--R13
move @--R11,@--R13
move @--R11,@--R13
move #l3,R8
asub #_printf,1
move R8,R0
xor R8,R8
add 6,R13
l1:
add 6,R13
decrb
move @R13++,R15
.type _main,@function
.size _main,$-_main
.type l3,@object
.size l3,16
.text
l3:
.short 102
.short 32
.short 61
.short 32
.short 37
.short 102
.short 10
.short 0
.global ___flt32toflt64
.global _printf
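Our 3.1415 float is stored as the IEEE 754 single-precision bit pattern 0x40490E56, split into two 16-bit words, with the low word in R2 and the high word in R3: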
move 0x0e56,R2
move 0x4049,R3
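The two words can be reproduced on a host machine with a few lines of C. This is only a sketch: the helper names `float_bits`, `low_word` and `high_word` are made up here, and it assumes the host's `float` is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bit pattern of a host float
   (assumes IEEE 754 binary32 on the host). */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to reinterpret the bytes */
    return bits;
}

/* Split the 32-bit pattern into the two 16-bit words seen in the listing. */
static uint16_t low_word(uint32_t bits)  { return (uint16_t)(bits & 0xFFFFu); }
static uint16_t high_word(uint32_t bits) { return (uint16_t)(bits >> 16); }
```

For `3.1415f`, `high_word` gives 0x4049 and `low_word` gives 0x0E56, matching the values moved into R3 and R2.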
It also looks like the compiler converts floats to 64 bits before passing them to printf, which matches C's default argument promotion of float to double for variadic functions:
asub #___flt32toflt64,1
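For reference, such a conversion can be done with integer operations only. The following C sketch widens an IEEE 754 single-precision bit pattern to the double-precision one. The function name `flt32_to_flt64_bits` is hypothetical, and the sketch says nothing about the calling convention or the actual interface that VBCC expects from `___flt32toflt64`.

```c
#include <stdint.h>

/* Hypothetical sketch: widen IEEE 754 binary32 bits to binary64 bits. */
static uint64_t flt32_to_flt64_bits(uint32_t f)
{
    uint64_t sign = (uint64_t)(f >> 31) << 63;
    uint32_t exp  = (f >> 23) & 0xFFu;     /* biased by 127 */
    uint32_t frac = f & 0x7FFFFFu;         /* 23 fraction bits */

    if (exp == 0xFFu)                      /* infinity or NaN */
        return sign | ((uint64_t)0x7FF << 52) | ((uint64_t)frac << 29);

    if (exp == 0) {
        if (frac == 0)
            return sign;                   /* signed zero */

        /* Subnormal single: normalize, it is a normal double. */
        int e = -126;
        while (!(frac & 0x800000u)) { frac <<= 1; e--; }
        frac &= 0x7FFFFFu;                 /* drop the now explicit leading 1 */
        return sign | ((uint64_t)(e + 1023) << 52) | ((uint64_t)frac << 29);
    }

    /* Normal number: rebias the exponent (127 -> 1023), widen the fraction. */
    return sign | ((uint64_t)(exp - 127 + 1023) << 52) | ((uint64_t)frac << 29);
}
```

Applied to the constant above, 0x40490E56 becomes 0x400921CAC0000000. The conversion is exact, since every single-precision value is representable in double precision.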
Attached are the .c and the .asm file: simple-float.zip
You compile it like this: qvc tmp.c -k
Since the floating point C library functions are not implemented yet, this is the output:
tmp.o: In function "_main":
Error 21: tmp.o (.text+0x1a): Reference to undefined symbol ___flt32toflt64.
Dear Mirko, dear Michael - please excuse the long delay. I just wanted to let both of you know how deeply impressed I am with what you have already done in the last couple of days. And I would like to apologize for not having done anything towards a software FP implementation. I am currently buried in work and meetings, which will go on for the remainder of this week, so at the moment I am unable to do anything useful for QNICE. ;-(