n bit integer values

Open AaronGoldman opened this issue 4 years ago • 1 comments

Proposal int0 = -1 and int1 = {-1, 0}

As we encode integers in different formats it makes sence to have types that can restrict the domain to a fixed number of bits These subsets of integers can be referred to as the n bit integers and come in signed and unsigned varieties. Some examples of integers sizes that may make sence are the powers of 2, the vint sizes, and the utf-8 sizes.

raw
| unsigned integer                  | signed integer                                    |
| name | min |                  max | name |                  min |                 max |
| u0   |   0 |                    0 | i0   |                 -1.0 |                -1.0 |
| u1   |   0 |                    1 | i1   |                   -1 |                   0 |
| u2   |   0 |                    3 | i2   |                   -2 |                   1 |
| u4   |   0 |                   15 | i4   |                   -8 |                   7 |
| u8   |   0 |                  255 | i8   |                 -128 |                 127 |
| u16  |   0 |                65535 | i16  |               -32768 |               32767 |
| u32  |   0 |           4294967295 | i32  |          -2147483648 |          2147483647 |
| u64  |   0 | 18446744073709551615 | i64  | -9223372036854775808 | 9223372036854775807 |

vint
| unsigned integer                  | signed integer                                    |
| name | min |                  max | name |                  min |                 max |
| u7   |   0 |                  127 | i7   |                  -64 |                  63 |
| u14  |   0 |                16383 | i14  |                -8192 |                8191 |
| u21  |   0 |              2097151 | i21  |             -1048576 |             1048575 |
| u28  |   0 |            268435455 | i28  |           -134217728 |           134217727 |
| u35  |   0 |          34359738367 | i35  |         -17179869184 |         17179869183 |
| u42  |   0 |        4398046511103 | i42  |       -2199023255552 |       2199023255551 |
| u49  |   0 |      562949953421311 | i49  |     -281474976710656 |     281474976710655 |
| u56  |   0 |    72057594037927935 | i56  |   -36028797018963968 |   36028797018963967 |
| u64  |   0 | 18446744073709551615 | i64  | -9223372036854775808 | 9223372036854775807 |

utf8
| unsigned integer                  | signed integer                                    |
| name | min |                  max | name |                  min |                 max |
| u7   |   0 |                  127 | i7   |                  -64 |                  63 |
| u11  |   0 |                 2047 | i11  |                -1024 |                1023 |
| u16  |   0 |                65535 | i16  |               -32768 |               32767 |
| u21  |   0 |              2097151 | i21  |             -1048576 |             1048575 |
| u26  |   0 |             67108863 | i26  |            -33554432 |            33554431 |
| u32  |   0 |           4294967295 | i32  |          -2147483648 |          2147483647 |

The table of n bit integers can be derived formulaically. for unsigned ints the min is 0 and the max is (2 ^ n) - 1 we steal one of the positive numbers to have a representation for 0 leaving us with the range [0, (2 ^ n) - 1]

for the signed integers we first split the numbers in half leaving just 2 ^ (n-1) numbers for each sign. once again we steal one of the positive numbers for the zero leaving us with the range [-(2 ^ (n - 1)), (2 ^ (n - 1)) - 1]

where this gets interesting is the case of i1 witch can only 2 values once we divide the space in two there is only a single positive value and we steal it to make the 0 leaving us with the range [-1, 0] even weirder is the case of the stateless zero bit integers for u0 it is straightforward 2^0-1 = 0 so the only value is 0 and the range is [0, 0] for the i0 the naive formula would render the i0max as 1/2 and the min at -1/2 theas are not integers and not equal thus not acceptable answers for the value of i0 so we extended the formula with a integer deviation by one just to force range back into the integers giving us the range of [-1, -1] witch is a single value needing no memory and is a signed integer thuse a acceptable value for i0. As a benofit it fits with the intuition of splitting the range between the positive and negative numbers then failing to steal the zero as there is no positive value to take.

#practical concerns

The fact that the. formula is not in and of itself a compelling reson for i1 to be [-1, 0] but in the hardwere we used the fact that the first bit of a signed 2s complement number is set to test that the number is negative. By violating this property for i1 we risk making it a special case that not only the language designer but also the users must deal with whenever that have a signed number that they wish to test the sign of. The case is less compelling for i0 being -1 as it has no first bit as it is a no memory type and would be used as part of constant propagation. That said I still think it might make generics a little cleaner to have u0 and i0 not be equal as a signed and unsigned value are only equal when the bit patterns match and and the first bit is not set on either number. As there is no first bit the first bits can't be 0 and the number cannot be equal.

#union types one nisity we get from i0 = -1 = True and u0 = 0 = False is that i1 is i0 ∪ u0 In general if we have a iN and a uN we need to go up a bit to represent is to i(N+1) this symatery is only preserved if i0 and u0 are different values.

This way all signed ints. and unsigned int can

i0 = -1 = True u0 = 0 = False i1 = {-1, 0} = bool

Python code for the table

iname = lambda n: 'i' + str(n)
imin  = lambda n: -(2**(n - 1)) // 1
imax  = lambda n: (2**(n - 1) - 1) // 1
uname = lambda n: 'u' + str(n)
umin  = lambda n: 0
umax  = lambda n: (2**n) - 1


line  = lambda n: '| {:4} | {:3} | {:20} | {:4} | {:20} | {:19} |'.format(
    uname(n), umin(n), umax(n), iname(n), imin(n), imax(n))
header = """| unsigned integer                  | signed integer                                    |
| name | min |                  max | name |                  min |                 max |
"""

print('raw')
# 0 byte  i0 
# 1 byte  i1  0000 000x
# 1 byte  i2  0000 00xx
# 1 byte  i4  0000 xxxx
# 1 byte  i8  xxxx xxxx
# 2 byte  i16 xxxx xxxx  xxxx xxxx
# 4 byte  i32 xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 8 byte  i64 xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
body = '\n'.join(line(n) for n in [0, 1, 2, 4, 8, 16, 32, 64])
print(header + body)
print()

print('vint')
#               Lo        Hi      LoHi
# print(2**16 - 2 ** 10 - 2**10 + 2**20) # 1,112,064 max Unicode code point
# 3 bytes of vint for any Unicode code point compaired to 4 byte utf-8
# 1 byte  i7  0xxx xxxx
# 2 byte  i14 10xx xxxx  xxxx xxxx
# 3 byte  i21 110x xxxx  xxxx xxxx  xxxx xxxx
# 4 byte  i28 1110 xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 5 byte  i35 1111 0xxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 6 byte  i42 1111 10xx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 7 byte  i49 1111 110x  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 8 byte  i56 1111 1110  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
# 9 byte  i64 1111 1111  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
body = '\n'.join(line(n) for n in [7, 14, 21, 28, 35, 42, 49, 56, 64])
print(header + body)
print()

print('utf8')
# Unicode code space [0,1114111] or [0x000000, 0x10FFFF] < 21 bits
#  1 byte i7  0xxx xxxx
#  2 byte i11 110x xxxx  10xx xxxx
#  3 byte i16 1110 xxxx  10xx xxxx  10xx xxxx
#  4 byte i21 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx  # Unicode 
#  5 byte i26 1111 10xx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
#  6 byte i32 1111 11xx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx  10xx xxxx
body = '\n'.join(line(n) for n in [7, 11, 16, 21, 26, 32])
print(header + body)

Apr 30 '21 23:04 AaronGoldman

Heh, nice PR. Will look at it when the H-M stuff dies down, but honestly seems like the Right Thing To Do.

So consider this as an "approved" AA language change, just not acted-on yet.

Cliff

On 4/30/2021 4:51 PM, Aaron Goldman wrote:

Proposal |int0 = -1| and |int1 = {-1, 0}|

As we encode integers in different formats it makes sence to have types that can restrict the domain to a fixed number of bits These subsets of integers can be referred to as the n bit integers and come in signed and unsigned varieties. Some examples of integers sizes that may make sence are the powers of 2, the vint sizes, and the utf-8 sizes.

|raw | unsigned integer | signed integer | | name | min | max | name | min | max | | u0 | 0 | 0 | i0 | -1.0 | -1.0 | | u1 | 0 | 1 | i1 | -1 | 0 | | u2 | 0 | 3 | i2 | -2 | 1 | | u4 | 0 | 15 | i4 | -8 | 7 | | u8 | 0 | 255 | i8 | -128 | 127 | | u16 | 0 | 65535 | i16 | -32768 | 32767 | | u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 | | u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 | vint | unsigned integer | signed integer | | name | min | max | name | min | max | | u7 | 0 | 127 | i7 | -64 | 63 | | u14 | 0 | 16383 | i14 | -8192 | 8191 | | u21 | 0 | 2097151 | i21 | -1048576 | 1048575 | | u28 | 0 | 268435455 | i28 | -134217728 | 134217727 | | u35 | 0 | 34359738367 | i35 | -17179869184 | 17179869183 | | u42 | 0 | 4398046511103 | i42 | -2199023255552 | 2199023255551 | | u49 | 0 | 562949953421311 | i49 | -281474976710656 | 281474976710655 | | u56 | 0 | 72057594037927935 | i56 | -36028797018963968 | 36028797018963967 | | u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 | utf8 | unsigned integer | signed integer | | name | min | max | name | min | max | | u7 | 0 | 127 | i7 | -64 | 63 | | u11 | 0 | 2047 | i11 | -1024 | 1023 | | u16 | 0 | 65535 | i16 | -32768 | 32767 | | u21 | 0 | 2097151 | i21 | -1048576 | 1048575 | | u26 | 0 | 67108863 | i26 | -33554432 | 33554431 | | u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 | |

The table of n bit integers can be derived formulaically. for unsigned ints the min is |0| and the max is |(2 ^ n) - 1| we steal one of the positive numbers to have a representation for |0| leaving us with the range |[0, (2 ^ n) - 1]|

for the signed integers we first split the numbers in half leaving just |2 ^ (n-1)| numbers for each sign. once again we steal one of the positive numbers for the zero leaving us with the range |[-(2 ^ (n - 1)), (2 ^ (n - 1)) - 1]|

where this gets interesting is the case of |i1| witch can only 2 values once we divide the space in two there is only a single positive value and we steal it to make the 0 leaving us with the range |[-1, 0]| even weirder is the case of the stateless zero bit integers for u0 it is straightforward |2^0-1 = 0| so the only value is |0| and the range is |[0, 0]| for the i0 the naive formula would render the i0max as |1/2| and the min at |-1/2| theas are not integers and not equal thus not acceptable answers for the value of i0 so we extended the formula with a integer deviation by one just to force range back into the integers giving us the range of |[-1, -1]| witch is a single value needing no memory and is a signed integer thuse a acceptable value for |i0|. As a benofit it fits with the intuition of splitting the range between the positive and negative numbers then failing to steal the zero as there is no positive value to take.

#practical concerns

The fact that the. formula is not in and of itself a compelling reson for |i1| to be |[-1, 0]| but in the hardwere we used the fact that the first bit of a signed 2s complement number is set to test that the number is negative. By violating this property for |i1| we risk making it a special case that not only the language designer but also the users must deal with whenever that have a signed number that they wish to test the sign of. The case is less compelling for |i0| being |-1| as it has no first bit as it is a no memory type and would be used as part of constant propagation. That said I still think it might make generics a little cleaner to have u0 and i0 not be equal as a signed and unsigned value are only equal when the bit patterns match and and the first bit is not set on either number. As there is no first bit the first bits can't be |0| and the number cannot be equal.

#union types one nisity we get from |i0| = |-1| = |True| and |u0| = |0| = |False| is that |i1| is |i0| ∪ |u0| In general if we have a |iN| and a |uN| we need to go up a bit to represent is to |i(N+1)| this symatery is only preserved if |i0| and |u0| are different values.

This way all signed ints. and unsigned int can

|i0| = |-1| = |True| |u0| = |0| = |False| |i1| = |{-1, 0}| = |bool|

Python code for the table

iname = lambda n:'i' + str(n)

imin = lambda n:-(2**(n - 1))// 1

imax = lambda n: (2**(n - 1)- 1)// 1

uname = lambda n:'u' + str(n)

umin = lambda n:0

umax = lambda n: (2**n)- 1

line = lambda n:'| {:4} | {:3} | {:20} | {:4} | {:20} | {:19} |'.format(
 uname(n),umin(n),umax(n),iname(n),imin(n),imax(n))
header = """| unsigned integer | signed integer |

| name | min | max | name | min | max |

"""

print('raw')

0 byte i0

1 byte i1 0000 000x

1 byte i2 0000 00xx

1 byte i4 0000 xxxx

1 byte i8 xxxx xxxx

2 byte i16 xxxx xxxx xxxx xxxx

4 byte i32 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

8 byte i64 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

xxxx xxxx xxxx xxxx xxxx

body = '\n'.join(line(n)for n in [0,1,2,4,8,16,32,64])

print(header + body)

print()

print('vint')

Lo Hi LoHi

print(216 - 2 ** 10 - 210 + 2**20) # 1,112,064 max Unicode code

point

3 bytes of vint for any Unicode code point compaired to 4 byte utf-8

1 byte i7 0xxx xxxx

2 byte i14 10xx xxxx xxxx xxxx

3 byte i21 110x xxxx xxxx xxxx xxxx xxxx

4 byte i28 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx

5 byte i35 1111 0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

6 byte i42 1111 10xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

7 byte i49 1111 110x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

xxxx xxxx xxxx

8 byte i56 1111 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

xxxx xxxx xxxx xxxx xxxx

9 byte i64 1111 1111 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

xxxx xxxx xxxx xxxx xxxx xxxx xxxx

body = '\n'.join(line(n)for n in [7,14,21,28,35,42,49,56,64])

print(header + body)

print()

print('utf8')

Unicode code space [0,1114111] or [0x000000, 0x10FFFF] < 21 bits

1 byte i7 0xxx xxxx

2 byte i11 110x xxxx 10xx xxxx

3 byte i16 1110 xxxx 10xx xxxx 10xx xxxx

4 byte i21 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx # Unicode

5 byte i26 1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx

6 byte i32 1111 11xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx

body = '\n'.join(line(n)for n in [7,11,16,21,26,32])

print(header + body)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cliffclick/aa/issues/4, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL3JYEON2SC5GSJNYOO3E3TLM67JANCNFSM435V7VOQ.

May 01 '21 01:05 cliffclick