
Stateless quire function support

Open cjdelisle opened this issue 6 years ago • 22 comments

In order to implement an API which can be fulfilled by a stateless FPU yet is able to efficiently support quire operations, I need the following operations:

  • posit[2] posit_add_exact(posit a, posit b):
    • the arguments are 2 posits a and b of the same parameters
    • the return value is a pair of posits: the first is the representable value nearest the exact sum, and the second is the difference between the first value and the exact result (the rounding error)
    • if the exponents of a and b are such that there is no bit-overlap in the mantissas, this function returns max(a,b), min(a,b)
  • posit[2] posit_sub_exact(posit a, posit b): same as add_exact with b negated
  • posit<nbits*2,es+1> posit_mul_promote(posit a, posit b):
    • the arguments are 2 posits a and b of the same parameters
    • the result is a posit with nbits twice that of the arguments and an es one more than that of the arguments
    • this function is equivalent to converting a and b to the larger size and then multiplying.
    • this function should never round (if it does then I've made a mistake)
  • posit<nbits*2,es+1> posit_div_promote(posit a, posit b):
    • Result should be the same as posit_div( posit<nbits*2,es+1>(a), posit<nbits*2,es+1>(b) )
  • posit posit_frexp(posit a, int* exp_out):
    • Defined in https://en.cppreference.com/w/cpp/numeric/math/frexp
  • posit posit_ldexp(posit a, int exp):
    • Defined here https://en.cppreference.com/w/cpp/numeric/math/ldexp

cjdelisle avatar Mar 31 '19 14:03 cjdelisle

A couple of comments about posit_add_exact:

  • If adding a + b can be represented perfectly in 1 posit, the result will be [sum, 0]
  • The first output of this function will always be farther from zero than the second output (whether negative or positive), such that one can take an array of random posits and sort them using:
int comparator(const void *av, const void *bv) {
    positN_t *a = (positN_t *)av;
    positN_t *b = (positN_t *)bv;
    if (*a == 0 && *b == 0) { return 0; }
    if (*a == 0) { return -1; }
    if (*b == 0) { return 1; }
    positNx2_t res = posit_add_exact(*a, *b);  /* res.x = rounded sum, res.y = remainder */
    *a = res.x; *b = res.y;
    return 1;
}
...
qsort(array, length, sizeof(positN_t), comparator);

and after that the sum of the array will be the same as it was before, but the array will have been reduced to the smallest number of non-zero posits that still sum to that value, ordered from nearest zero to farthest from zero.

cjdelisle avatar Mar 31 '19 14:03 cjdelisle

Is there a particular underlying dynamic that forces mul_promote to produce a posit with a different es value? The quires for a posit with es and es+1 are not compatible in the sense that their radix points are different.

In the C++ code, we keep the posit config of the add and multiply for quire accumulation the same as the posit arguments, but simply carry all the extra adder/multiplier bits.

Ravenwater avatar Mar 31 '19 14:03 Ravenwater

The first reason is because I expect hardware to settle on posit<8,0>, posit<16,1>, posit<32,2> and posit<64,3> so if you convert a posit<8,0> to a posit<16,0> that's not super useful because none of the hardware instructions will support it. With this rule, you can do posit<32,2> multiplication and posit<64,3> add_exact to implement dot product with stateless hardware and no software manipulation of posits.

cjdelisle avatar Mar 31 '19 15:03 cjdelisle

That is indeed a very strong argument.

Research questions to be answered:

1. are all the samples of posit<nbits,es> contained inside posit<nbits*2,es+1>?
2. are all the samples of unrounded add and mul in <nbits,es> contained inside <nbits*2,es+1>?

Ravenwater avatar Mar 31 '19 17:03 Ravenwater

Good questions; I have been assuming that multiplication doubles the required bits. My reasoning about why this should work is:

  • max mantissa size is nbits - 3 - es + 1 (hidden bit)
  • posit<8,0> -> 8-3-0+1 -> 6
  • posit<16,1> -> 16-3-1+1 -> 13
  • posit<32,2> -> 32-3-2+1 -> 28
  • posit<64,3> -> 64-3-3+1 -> 59
  • posit<128,4> -> 128-3-4+1 -> 122

If the regime size is larger than 2 bits, it should be simple to reason that the resulting regime size should be smaller than the sum of regime sizes (is it 1/2 the sum?) so regimes larger than 2 should yield a larger number of available bits when multiplying.

For addition we should not care because add_exact() outputs 2 posits of the same size rather than doing a promotion.

disclaimer: I'm a programmer, not a mathematician, if this reasoning looks hacky it's because it is...

cjdelisle avatar Mar 31 '19 18:03 cjdelisle

Added arbitrary posit conversions to get a feel for the promotion. I was worried about the geometric region, but all is good. (attached image: conversion-comparison)

Ravenwater avatar Apr 06 '19 21:04 Ravenwater

With respect to add_exact/sub_exact: the regions between [0, minpos] and [maxpos, NaR] do not have a representation for r.

Posit TwoSum validation  for posit<8,1>
geometric rounding region
a                      : 00000001 : +0.000244140625
b                      : 00000001 : +0.000244140625
s                      : 00000010 : +0.0009765625
aApprox = s - a        : 00000010 : +0.0009765625
bApprox = s - aApprox  : 00000000 : +0
aDiff = a - aApprox    : 11111110 : -0.0009765625
bDiff = b - bApprox    : 00000001 : +0.000244140625
r = aDiff + bDiff      : 11111110 : -0.0009765625
s + r                  : 00000000 : +0
a + b                  : 00000010 : +0.0009765625
FAIL
a                      : 00000001 : +0.000244140625
b                      : 00000010 : +0.0009765625
s                      : 00000010 : +0.0009765625
aApprox = s - a        : 00000000 : +0
bApprox = s - aApprox  : 00000010 : +0.0009765625
aDiff = a - aApprox    : 00000001 : +0.000244140625
bDiff = b - bApprox    : 00000000 : +0
r = aDiff + bDiff      : 00000001 : +0.000244140625
s + r                  : 00000010 : +0.0009765625
a + b                  : 00000010 : +0.0009765625
PASS

This implies that the operators add_exact/sub_exact need to special-case this situation.

@cjdelisle Is that possible for your code?

Ravenwater avatar Apr 07 '19 13:04 Ravenwater

Can you push that to a branch so I can see what I'm looking at in the code? I'm trying to make out just where the problems are...

cjdelisle avatar Apr 07 '19 14:04 cjdelisle

I am experimenting in the https://github.com/stillwater-sc/hpr-blas repo, as we are doing a lot of math experiments for application requirements, so I already had Knuth's twoSum algorithm implemented there.

The test is numerical_two_sum.

Ravenwater avatar Apr 07 '19 14:04 Ravenwater

When you have a posit with es > 0, you end up having three distinct 'regions' of discretization:

  1. the regime, which partitions the dynamic range into blocks scaled by useed = 2^2^es
  2. the exponent, which samples a regime block in steps of 2^exp
  3. the fraction, which samples the exponent region linearly

When you are adding values that operate purely with regime encodings (no exponent or fraction bits), i.e. minpos/maxpos, then you don't have enough bits to express the difference between exact and the rounded result, as they would be expressed in terms of exponent bits, which are lopped off in that extreme operating point of the posit arithmetic.

Ravenwater avatar Apr 07 '19 15:04 Ravenwater

Ahh I see, but I think it should be ok to declare that maxpos and minpos remain the maximum and minimum values, even when using add_exact(). I'm still working on trying to replicate your results with a little test...

cjdelisle avatar Apr 07 '19 15:04 cjdelisle

Thinking further, it strikes me that in no case should add_exact fail: if it can do nothing else, it can just return its arguments, right?

cjdelisle avatar Apr 07 '19 15:04 cjdelisle

I wrote a bruteforce implementation of posit8_add_exact() as a reference:

static posit8x2_t posit8_add_exact_bruteforce(posit8_t a, posit8_t b) {
    // searching for [x,y] where x+y = a+b and y is the smallest
    // obvious first step is swap a,b so that b is always <= a
    if (posit8_cmp(a,b) < 0) {
        posit8_t _a = a;
        a = b;
        b = _a;
    }
    double realsum = posit8_tod(a) + posit8_tod(b);
    posit8x2_t best;
    best.x = a;
    best.y = b;
    // nan special case
    if (isnan(realsum)) {
        // return a NAR and a ZERO so that someone running add_exact
        // as a mutative sorting comparator and discarding zeros
        // will end with 1 NAR only.
        best.x = NAR8;
        best.y = ZERO8;
        return best;
    }
    // search for a smaller value of y which satisfies x+y == a+b
    for (int i = 0; i <= 0xffff; i++) {
        posit8_t y = posit8_reinterpret(i >> 8);
        posit8_t x = posit8_reinterpret(i & 0xff);
        double sum = posit8_tod(x) + posit8_tod(y);
        if (sum != realsum) { continue; }
        if (posit8_cmp(y, best.y) < 0) {
            best.x = x;
            best.y = y;
        }
    }
    return best;
}

I'm not exactly sure how to improve it, but I think it's pretty obvious that it should never fail to return two values which sum to the same as the inputs...

cjdelisle avatar Apr 07 '19 16:04 cjdelisle

I fixed a few issues with this function and improved the performance a little bit, I think this is correct though I'm really never that sure...

static posit8_t posit8_abs(posit8_t x) {
    if (x.v & 0x80) { return posit8_sub(ZERO8, x); }
    return x;
}

static posit8x2_t posit8x2(posit8_t x, posit8_t y) {
    posit8x2_t out;
    out.x = x;
    out.y = y;
    return out;
}

static posit8x2_t posit8_add_exact_bruteforce(posit8_t a, posit8_t b) {
    // searching for [x,y] where x+y = a+b and y is the smallest
    // obvious first step is swap a,b so that b is always <= a
    if (posit8_cmp(posit8_abs(a), posit8_abs(b)) < 0) {
        //printf("abs a < b  %f %f\n", posit8_tod(a), posit8_tod(b));
        return posit8_add_exact_bruteforce(b, a);
    }
    double realsum = posit8_tod(a) + posit8_tod(b);
    // nan special case
    if (isnan(realsum)) {
        // return a NAR and a ZERO so that someone running add_exact
        // as a mutative sorting comparator and discarding zeros
        // will end with 1 NAR only.
        return posit8x2(NAR8, ZERO8);
    }
    // search for a smaller value of y which satisfies x+y == a+b
    // start scanning at 0 and work up switching between positive and negative
    // to find the value of smallest magnitude which can satisfy x+y == a+b
    uint8_t end = posit8_bits(b) & 0x7f;
    for (int i = 0; i < end; i++) {
        posit8_t py = posit8_reinterpret(i);
        posit8_t ny = posit8_sub(ZERO8, py);
        double dpy = posit8_tod(py);
        double dny = posit8_tod(ny);

        for (int j = 0; j < (1<<8); j++) {
            posit8_t x = posit8_reinterpret(j);
            double dx = posit8_tod(x);
            if ((dx + dpy) == realsum) { return posit8x2(x, py); }
            if ((dx + dny) == realsum) { return posit8x2(x, ny); }
        }
    }
    printf("No better solution was found for %f %f\n", posit8_tod(a), posit8_tod(b));
    return posit8x2(a, b);
}

For "smaller" we really want smaller magnitude because the point is that you can use this as a mutative sorting comparator and it will collapse a list of numbers down to the shortest set which represents the same sum.

cjdelisle avatar Apr 07 '19 17:04 cjdelisle

so for adding minpos + minpos, for posit configurations that have exponent bits, the sum is minpos and you would not have a value that is smaller than minpos. In that situation, which I can recognize mathematically, if I return minpos and minpos, will your mutative sorting still work?

Also, I am not familiar with mutative sorting; can you provide a link you think is good to get educated?

Ravenwater avatar Apr 07 '19 18:04 Ravenwater

If I return minpos and minpos, will your mutative sorting still work? <-- yes

"Mutative sorting comparator" is just a term I invented now to describe a form of addition which uses a sorting-like approach. So in general this is a big no-no, but what you can potentially do (depending on the sorting function) is to actually change the inputs inside of the comparator, e.g.

static int compare(const void* a, const void* b) {
    posit8x2_t out = posit8_add_exact( ((posit8_t*)a)[0], ((posit8_t*)b)[0]);
    ((posit8_t*)a)[0] = out.x;
    ((posit8_t*)b)[0] = out.y;
    // we will never return -1 because we already effectively did the swap
    return (out.x == out.y) ? 0 : 1;
}
...later on
qsort(list, list_len, sizeof(posit8_t), compare);

Again, it is generally a Bad Idea to just rely on the implementation of quicksort not to freak out when we modify its inputs from under it, but AFAIK in typical single-threaded sorting algorithms this will actually work, and if you want something more robust, you can copy a well-known sorting algorithm and then use it to build an efficient adder. The result after "sorting" an array of posits should be a bunch of zeros followed by the smallest number of posits which still sum to the same value.

cjdelisle avatar Apr 07 '19 18:04 cjdelisle

An additional feature of this pattern is if you have a CUDA sort function, you can pretty quickly build a CUDA adder from it...

cjdelisle avatar Apr 07 '19 18:04 cjdelisle

Would you be able to give me a sorting algorithm that you want to work with this add_exact, so that I can use it as a target for the development and testing of these two functions: add_exact, sub_exact.

Ravenwater avatar Apr 07 '19 18:04 Ravenwater

I should be able to put together a basic test, and I expect that (ab)using libc qsort will work on our systems...

cjdelisle avatar Apr 07 '19 18:04 cjdelisle

WDYT of this as a way of implementing it:

static posit8x2_t posit8_add_exact1(posit8_t a, posit8_t b) {
    // if either is nar, return (nar,0)
    if (posit8_isnar(a) | posit8_isnar(b)) { return posit8x2(NAR8, ZERO8); }

    // b should always be less than or equal to a, in magnitude
    if (posit8_cmp(posit8_abs(a), posit8_abs(b)) < 0) {
        posit8_t _a = a;
        a = b;
        b = _a;
    }

    // if b is zero then return a,0
    if (!posit8_cmp(b, ZERO8)) { return posit8x2(a, ZERO8); }

    double dsum = posit8_tod(a) + posit8_tod(b);
    posit8_t sum = posit8_fromd(dsum);
    double rdsum = posit8_tod(sum);
    double drem = dsum - rdsum;
    posit8_t rem = posit8_fromd(drem);
    double rdsum2 = posit8_tod(rem);
    double remrem = drem - rdsum2;
    if (remrem != 0) {
        printf("inexact\n");
        abort();
    }
    return posit8x2(sum, rem);
}

cjdelisle avatar Apr 07 '19 21:04 cjdelisle

With that implementation of add_exact I get a nicer add time:

[100%] Linking CXX executable c_api_exact_test
[100%] Built target c_api_exact_test
750.265625 == 750.265625, 1024 posits compressed to 15 in 10 cycles 	485 micros
515.859375 == 515.859375, 1024 posits compressed to 13 in 10 cycles 	547 micros
108.203125 == 108.203125, 1024 posits compressed to 4 in 9 cycles 	520 micros
928.515625 == 928.515625, 1024 posits compressed to 19 in 11 cycles 	518 micros
310.234375 == 310.234375, 1024 posits compressed to 9 in 9 cycles 	490 micros
353.000000 == 353.000000, 1024 posits compressed to 7 in 10 cycles 	498 micros
444.843750 == 444.843750, 1024 posits compressed to 11 in 10 cycles 	506 micros
753.218750 == 753.218750, 1024 posits compressed to 15 in 9 cycles 	496 micros
445.218750 == 445.218750, 1024 posits compressed to 11 in 10 cycles 	467 micros
370.796875 == 370.796875, 1024 posits compressed to 8 in 9 cycles 	469 micros
610.593750 == 610.593750, 1024 posits compressed to 13 in 10 cycles 	487 micros
-33.093750 == -33.093750, 1024 posits compressed to 5 in 11 cycles 	507 micros
485.187500 == 485.187500, 1024 posits compressed to 11 in 11 cycles 	488 micros
674.203125 == 674.203125, 1024 posits compressed to 13 in 10 cycles 	569 micros
785.718750 == 785.718750, 1024 posits compressed to 17 in 10 cycles 	542 micros
518.453125 == 518.453125, 1024 posits compressed to 11 in 10 cycles 	539 micros
500.578125 == 500.578125, 1024 posits compressed to 12 in 10 cycles 	528 micros
837.859375 == 837.859375, 1024 posits compressed to 17 in 9 cycles 	564 micros
154.781250 == 154.781250, 1024 posits compressed to 5 in 10 cycles 	865 micros
256.687500 == 256.687500, 1024 posits compressed to 5 in 11 cycles 	783 micros
392.875000 == 392.875000, 1024 posits compressed to 9 in 10 cycles 	849 micros
390.078125 == 390.078125, 1024 posits compressed to 9 in 10 cycles 	862 micros
335.250000 == 335.250000, 1024 posits compressed to 7 in 9 cycles 	743 micros
351.984375 == 351.984375, 1024 posits compressed to 7 in 10 cycles 	703 micros
587.328125 == 587.328125, 1024 posits compressed to 12 in 10 cycles 	726 micros
363.421875 == 363.421875, 1024 posits compressed to 9 in 10 cycles 	751 micros
891.812500 == 891.812500, 1024 posits compressed to 17 in 10 cycles 	754 micros
570.468750 == 570.468750, 1024 posits compressed to 13 in 9 cycles 	719 micros
469.593750 == 469.593750, 1024 posits compressed to 11 in 9 cycles 	728 micros
488.250000 == 488.250000, 1024 posits compressed to 11 in 9 cycles 	796 micros
719.984375 == 719.984375, 1024 posits compressed to 15 in 10 cycles 	787 micros
-155.312500 == -155.312500, 1024 posits compressed to 5 in 9 cycles 	783 micros
Average time = 627

cjdelisle avatar Apr 07 '19 22:04 cjdelisle

I've implemented a more generalized add_exact() function; here it converts to posit16_t rather than to double in order to do the addition and subtraction. The idea is that if we can exhaustively verify posit8 and posit16, then we can be reasonably confident that posit32 and posit64 are correct as well.

https://github.com/cjdelisle/universal/commit/8549c133a83cb3386d8d97601314ede3cd5be823#diff-15dd6e28f0963f9915bfc3d74ed3a901R74

I also fixed a bug in add_exact_bruteforce() caused by my seemingly chronic incorrect assumption that I can flip the sign just by flipping a bit...

cjdelisle avatar Apr 08 '19 07:04 cjdelisle