
Feature request: Add support for software floating point in the C-compiler

Open MJoergen opened this issue 5 years ago • 5 comments

As mentioned in our last meeting, adding software emulated floating point is a first step before implementing hardware support. That way we can gauge the speed of the floating point calculations and better evaluate the need for hardware support.

We talked about different floating point formats, and I have here yet another suggestion. The idea is to choose a format that gives reasonable accuracy and avoids unnecessary bit shifts etc. So here goes:

Proposal for floating point format

  • Each floating point number uses 3 words (i.e. 48 bits).
  • One word (i.e. 16 bits) for the exponent (offset by 0x8000).
  • Two words (i.e. 32 bits) for the mantissa (in sign magnitude format; bit 31 is sign and replaces the msb of the normalized mantissa).

Examples

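| Floating point value | Exponent (real) | Mantissa (real) | Exponent (binary) | Mantissa (binary) |
|----------------------|-----------------|-----------------|-------------------|-------------------|
| 1.0                  | 0               | 1.0             | 0x8000            | 0x0000            |
| 2.0                  | 1               | 1.0             | 0x8001            | 0x0000            |
| 3.0                  | 1               | 1.5             | 0x8001            | 0x4000            |
| -1.0                 | 0               | -1.0            | 0x8000            | 0x8000            |
| -2.0                 | 1               | -1.0            | 0x8001            | 0x8000            |
| -3.0                 | 1               | -1.5            | 0x8001            | 0xC000            |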

In other words:

  • 1.0 <= |Mantissa (real)| < 2.0.
  • Mantissa (binary) = (Mantissa (real)-1) * 0x8000 for positive numbers.
  • Mantissa (binary) = |Mantissa (real)| * 0x8000 for negative numbers.

The value 0 is represented by setting the exponent = 0x0000.
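The rules above can be sketched in C as follows. This is only an illustration written for a host PC, not QNICE code: the names `qfloat`, `qf_encode` and `qf_decode` are made up, and the sketch assumes that the 16-bit mantissa values in the example table are the upper word of the 32-bit mantissa, with the lower word being zero.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical container for the proposed 3-word format. */
typedef struct {
    uint16_t exp;   /* real exponent + 0x8000; 0x0000 marks the value 0 */
    uint32_t mant;  /* bit 31 = sign, bits 30..0 = fraction below the hidden 1 */
} qfloat;

/* Encode a host double into the proposed format (fraction is truncated). */
static qfloat qf_encode(double x)
{
    qfloat r = {0, 0};
    int e;
    double m;

    assert(isfinite(x));            /* no infinities or NaNs in this sketch */
    if (x == 0.0)
        return r;                   /* exponent 0x0000 represents zero */

    m = frexp(fabs(x), &e);         /* |x| = m * 2^e with 0.5 <= m < 1.0 */
    m *= 2.0;                       /* rescale so that 1.0 <= m < 2.0 */
    e -= 1;

    r.exp  = (uint16_t)(e + 0x8000);
    r.mant = (uint32_t)((m - 1.0) * 2147483648.0);  /* 31 fraction bits */
    if (x < 0.0)
        r.mant |= 0x80000000u;      /* sign bit replaces the hidden msb */
    return r;
}

/* Decode back to a host double. */
static double qf_decode(qfloat f)
{
    double m, v;

    if (f.exp == 0)
        return 0.0;
    m = 1.0 + (double)(f.mant & 0x7FFFFFFFu) / 2147483648.0;
    v = ldexp(m, (int)f.exp - 0x8000);
    return (f.mant & 0x80000000u) ? -v : v;
}
```

Under these assumptions, `qf_encode(-3.0)` gives exponent 0x8001 and upper mantissa word 0xC000, which matches the last row of the table.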

Having 16 bits for the exponent is certainly a luxury, but this avoids some bit shifting.

What do you think? Is this too much? Should we prefer a 32-bit floating point number, where 8 bits are the exponent and 24 bits the mantissa?

MJoergen, Oct 02 '20 06:10

Hi Michael - thank you so much for the issue and your proposal! Please excuse me for not having opened the issue myself as promised; I was so buried in (stupid) work during the last couple of days that it always slipped to the next day... 👍

I like your idea of a proprietary format a lot, and spending 3 words on one FP number might speed things up considerably.

Having 32 bits for the mantissa is plenty, so we might want to drop the standard hidden-bit feature, i.e. we could store the MSB = 1 explicitly instead of treating it as implicitly set. This would further simplify things, as a mantissa of zero would then represent the value zero and we would not have to reserve an exponent value like 0 for that purpose.

We also could just say that the mantissa as well as the exponent are two's complement numbers which would further simplify the software implementation.
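To make the idea concrete, here is a rough host-side sketch in C of such a variant: explicit leading bit, two's complement mantissa, two's complement exponent. Nothing here has been decided in this thread. The names `qfloat2`, `qf2_encode` and `qf2_decode` as well as the scaling "value = mantissa * 2^(exponent - 30)" are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical two's complement variant: value = mant * 2^(exp - 30). */
typedef struct {
    int16_t exp;    /* two's complement exponent, no offset */
    int32_t mant;   /* two's complement mantissa with explicit leading bit */
} qfloat2;

/* Encode a host double; non-zero results have 2^30 <= |mant| < 2^31. */
static qfloat2 qf2_encode(double x)
{
    qfloat2 r = {0, 0};
    int e;
    double m;

    assert(isfinite(x));
    if (x == 0.0)
        return r;                   /* zero is simply mant == 0 */

    m = frexp(x, &e);               /* x = m * 2^e, 0.5 <= |m| < 1, sign in m */
    r.mant = (int32_t)(m * 2147483648.0);   /* scale by 2^31, truncate */
    r.exp  = (int16_t)(e - 1);      /* so that 1.0 gets exponent 0 */
    return r;
}

/* Decode: no special case for zero or for the sign is needed. */
static double qf2_decode(qfloat2 f)
{
    return ldexp((double)f.mant, (int)f.exp - 30);
}
```

In this sketch 3.0 becomes mantissa 0x60000000 with exponent 1, and negating a number is just negating the mantissa.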

Of course, this is at odds with nearly every FP implementation but the only problem we would have with something like that is that we have to change the C compiler in order to convert FP constants into our format.

What do you think?

bernd-ulmann, Oct 02 '20 09:10

I think we should go for a format that makes implementation easy (and fast). With 3 words, we have plenty of bits to use, so storing the mantissa in full 32-bit two's complement seems like a good idea.

I would assume that changing the C compiler's handling of FP constants is a small task.

MJoergen, Oct 02 '20 09:10

Sounds great, gentlemen :-)

sy2002, Oct 02 '20 16:10

About the compiler's floating point constant representation

For your convenience, I did the following experiment. I wrote this C program here:

#include <stdio.h>

int main()
{
    float f = 3.1415;
    printf("f = %f\n", f);
    return 0;
}

When we are done with our implementation, we would expect this output:

f = 3.141500

Here is the assembler code that VBCC generates, so that you gentlemen can investigate the FP constant handling:

	.text
	.global	_main
_main:
	incrb
	sub	6,R13
	move	0x0e56,R2
	move	0x4049,R3
	move	R13,R8
	add	2,R8
	move	R3,@--R13
	move	R2,@--R13
	asub	#___flt32toflt64,1
	move	R13,R11
	add	8,R11
	move	@--R11,@--R13
	move	@--R11,@--R13
	move	@--R11,@--R13
	move	@--R11,@--R13
	move	#l3,R8
	asub	#_printf,1
	move	R8,R0
	xor	R8,R8
	add	6,R13
l1:
	add	6,R13
	decrb
	move	@R13++,R15
	.type	_main,@function
	.size	_main,$-_main
	.type	l3,@object
	.size	l3,16
	.text
l3:
	.short	102
	.short	32
	.short	61
	.short	32
	.short	37
	.short	102
	.short	10
	.short	0
	.global	___flt32toflt64
	.global	_printf

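Obviously, our 3.1415 float is represented by the two 16-bit words 0x0e56 (low word) and 0x4049 (high word), i.e. 0x40490E56, which is the IEEE 754 single-precision encoding of 3.1415: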

	move	0x0e56,R2
	move	0x4049,R3
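The two words can be checked on a host PC with a few lines of C. This is just a sketch: the helper names `join_words` and `bits_to_float` are made up, and it assumes that the host's `float` is an IEEE 754 32-bit single.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Join the two 16-bit words from the listing: 0x4049 is the high word
   (loaded into R3), 0x0e56 is the low word (loaded into R2). */
static uint32_t join_words(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Reinterpret a 32-bit pattern as a host float. */
static float bits_to_float(uint32_t bits)
{
    float f;

    assert(sizeof f == sizeof bits);    /* assumes a 32-bit host float */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Printing `bits_to_float(join_words(0x4049, 0x0e56))` with `printf("%f\n", ...)` on such a host shows 3.141500.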

It also looks like the compiler converts floats to 64-bit before passing them to printf, which is what C's default argument promotions require for float arguments of variadic functions:

	asub	#___flt32toflt64,1
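What a routine like this has to do can be sketched with plain integer operations. The following C code is not VBCC's implementation of `___flt32toflt64`; it is a hypothetical host-side illustration, it assumes IEEE 754 layouts for both the 32-bit and the 64-bit type, and the names `f32_to_f64_bits` and `bits_to_double` are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen an IEEE 754 single (given as bits) to an IEEE 754 double (as bits). */
static uint64_t f32_to_f64_bits(uint32_t a)
{
    uint64_t sign = (uint64_t)(a >> 31) << 63;
    int      exp  = (int)((a >> 23) & 0xFFu);   /* biased by 127 */
    uint64_t frac = a & 0x7FFFFFu;              /* 23 fraction bits */

    if (exp == 0xFF)                            /* infinity or NaN */
        return sign | (0x7FFull << 52) | (frac << 29);
    if (exp == 0) {
        if (frac == 0)
            return sign;                        /* signed zero */
        exp = 1;                                /* subnormal: normalize */
        while ((frac & 0x800000u) == 0) {
            frac <<= 1;
            exp--;
        }
        frac &= 0x7FFFFFu;                      /* drop the now explicit 1 */
    }
    /* Rebias the exponent from 127 to 1023, widen the fraction to 52 bits. */
    return sign | ((uint64_t)(exp - 127 + 1023) << 52) | (frac << 29);
}

/* Reinterpret a 64-bit pattern as a host double, for checking results. */
static double bits_to_double(uint64_t bits)
{
    double d;

    assert(sizeof d == sizeof bits);            /* assumes a 64-bit host double */
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For the constant from the listing, `f32_to_f64_bits(0x40490E56)` yields 0x400921CAC0000000. No rounding is involved, because every single-precision value is exactly representable as a double.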

Attached are the .c and the .asm file: simple-float.zip

You compile it like this: qvc tmp.c -k

Because the floating point C library functions are not implemented yet, linking fails with this output:

tmp.o: In function "_main":
Error 21: tmp.o (.text+0x1a): Reference to undefined symbol ___flt32toflt64.

sy2002, Oct 03 '20 14:10

Dear Mirko, dear Michael - please excuse the long delay. I just wanted to let both of you know how deeply impressed I am with what you have already done in the last couple of days. I would also like to apologize for not having done anything towards a software FP implementation. I am currently buried in work and meetings, which will go on for the remainder of this week, so at the moment I am unable to do anything useful for QNICE. ;-(

bernd-ulmann, Oct 05 '20 08:10