aa
aa copied to clipboard
n bit integer values
Proposal int0 = -1
and int1 = {-1, 0}
As we encode integers in different formats it makes sence to have types that can restrict the domain to a fixed number of bits These subsets of integers can be referred to as the n bit integers and come in signed and unsigned varieties. Some examples of integers sizes that may make sence are the powers of 2, the vint sizes, and the utf-8 sizes.
raw
| unsigned integer | signed integer |
| name | min | max | name | min | max |
| u0 | 0 | 0 | i0 | -1.0 | -1.0 |
| u1 | 0 | 1 | i1 | -1 | 0 |
| u2 | 0 | 3 | i2 | -2 | 1 |
| u4 | 0 | 15 | i4 | -8 | 7 |
| u8 | 0 | 255 | i8 | -128 | 127 |
| u16 | 0 | 65535 | i16 | -32768 | 32767 |
| u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 |
| u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 |
vint
| unsigned integer | signed integer |
| name | min | max | name | min | max |
| u7 | 0 | 127 | i7 | -64 | 63 |
| u14 | 0 | 16383 | i14 | -8192 | 8191 |
| u21 | 0 | 2097151 | i21 | -1048576 | 1048575 |
| u28 | 0 | 268435455 | i28 | -134217728 | 134217727 |
| u35 | 0 | 34359738367 | i35 | -17179869184 | 17179869183 |
| u42 | 0 | 4398046511103 | i42 | -2199023255552 | 2199023255551 |
| u49 | 0 | 562949953421311 | i49 | -281474976710656 | 281474976710655 |
| u56 | 0 | 72057594037927935 | i56 | -36028797018963968 | 36028797018963967 |
| u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 |
utf8
| unsigned integer | signed integer |
| name | min | max | name | min | max |
| u7 | 0 | 127 | i7 | -64 | 63 |
| u11 | 0 | 2047 | i11 | -1024 | 1023 |
| u16 | 0 | 65535 | i16 | -32768 | 32767 |
| u21 | 0 | 2097151 | i21 | -1048576 | 1048575 |
| u26 | 0 | 67108863 | i26 | -33554432 | 33554431 |
| u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 |
The table of n bit integers can be derived formulaically.
for unsigned ints the min is 0
and the max is (2 ^ n) - 1
we steal one of the positive numbers to have a representation for 0
leaving us with the range [0, (2 ^ n) - 1]
for the signed integers we first split the numbers in half leaving just 2 ^ (n-1)
numbers for each sign.
once again we steal one of the positive numbers for the zero
leaving us with the range [-(2 ^ (n - 1)), (2 ^ (n - 1)) - 1]
where this gets interesting is the case of i1
witch can only 2 values once we divide the space in two there is only a single positive value and we steal it to make the 0 leaving us with the range [-1, 0]
even weirder is the case of the stateless zero bit integers
for u0 it is straightforward 2^0-1 = 0
so the only value is 0
and the range is [0, 0]
for the i0 the naive formula would render the i0max as 1/2
and the min at -1/2
theas are not integers and not equal thus not acceptable answers for the value of i0 so we extended the formula with a integer deviation by one just to force range back into the integers giving us the range of [-1, -1]
witch is a single value needing no memory and is a signed integer thuse a acceptable value for i0
. As a benofit it fits with the intuition of splitting the range between the positive and negative numbers then failing to steal the zero as there is no positive value to take.
#practical concerns
The fact that the. formula is not in and of itself a compelling reson for i1
to be [-1, 0]
but in the hardwere we used the fact that the first bit of a signed 2s complement number is set to test that the number is negative. By violating this property for i1
we risk making it a special case that not only the language designer but also the users must deal with whenever that have a signed number that they wish to test the sign of.
The case is less compelling for i0
being -1
as it has no first bit as it is a no memory type and would be used as part of constant propagation. That said I still think it might make generics a little cleaner to have u0 and i0 not be equal as a signed and unsigned value are only equal when the bit patterns match and and the first bit is not set on either number. As there is no first bit the first bits can't be 0
and the number cannot be equal.
#union types
one nisity we get from i0
= -1
= True
and u0
= 0
= False
is that i1
is i0
∪ u0
In general if we have a iN
and a uN
we need to go up a bit to represent is to i(N+1)
this symatery is only preserved if i0
and u0
are different values.
This way all signed ints. and unsigned int can
i0
= -1
= True
u0
= 0
= False
i1
= {-1, 0}
= bool
Python code for the table
iname = lambda n: 'i' + str(n)
imin = lambda n: -(2**(n - 1)) // 1
imax = lambda n: (2**(n - 1) - 1) // 1
uname = lambda n: 'u' + str(n)
umin = lambda n: 0
umax = lambda n: (2**n) - 1
line = lambda n: '| {:4} | {:3} | {:20} | {:4} | {:20} | {:19} |'.format(
uname(n), umin(n), umax(n), iname(n), imin(n), imax(n))
header = """| unsigned integer | signed integer |
| name | min | max | name | min | max |
"""
print('raw')
# 0 byte i0
# 1 byte i1 0000 000x
# 1 byte i2 0000 00xx
# 1 byte i4 0000 xxxx
# 1 byte i8 xxxx xxxx
# 2 byte i16 xxxx xxxx xxxx xxxx
# 4 byte i32 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 8 byte i64 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
body = '\n'.join(line(n) for n in [0, 1, 2, 4, 8, 16, 32, 64])
print(header + body)
print()
print('vint')
# Lo Hi LoHi
# print(2**16 - 2 ** 10 - 2**10 + 2**20) # 1,112,064 max Unicode code point
# 3 bytes of vint for any Unicode code point compaired to 4 byte utf-8
# 1 byte i7 0xxx xxxx
# 2 byte i14 10xx xxxx xxxx xxxx
# 3 byte i21 110x xxxx xxxx xxxx xxxx xxxx
# 4 byte i28 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 5 byte i35 1111 0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 6 byte i42 1111 10xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 7 byte i49 1111 110x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 8 byte i56 1111 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
# 9 byte i64 1111 1111 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
body = '\n'.join(line(n) for n in [7, 14, 21, 28, 35, 42, 49, 56, 64])
print(header + body)
print()
print('utf8')
# Unicode code space [0,1114111] or [0x000000, 0x10FFFF] < 21 bits
# 1 byte i7 0xxx xxxx
# 2 byte i11 110x xxxx 10xx xxxx
# 3 byte i16 1110 xxxx 10xx xxxx 10xx xxxx
# 4 byte i21 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx # Unicode
# 5 byte i26 1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
# 6 byte i32 1111 11xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
body = '\n'.join(line(n) for n in [7, 11, 16, 21, 26, 32])
print(header + body)
Heh, nice PR. Will look at it when the H-M stuff dies down, but honestly seems like the Right Thing To Do.
So consider this as an "approved" AA language change, just not acted-on yet.
Cliff
On 4/30/2021 4:51 PM, Aaron Goldman wrote:
Proposal |int0 = -1| and |int1 = {-1, 0}|
As we encode integers in different formats it makes sence to have types that can restrict the domain to a fixed number of bits These subsets of integers can be referred to as the n bit integers and come in signed and unsigned varieties. Some examples of integers sizes that may make sence are the powers of 2, the vint sizes, and the utf-8 sizes.
|raw | unsigned integer | signed integer | | name | min | max | name | min | max | | u0 | 0 | 0 | i0 | -1.0 | -1.0 | | u1 | 0 | 1 | i1 | -1 | 0 | | u2 | 0 | 3 | i2 | -2 | 1 | | u4 | 0 | 15 | i4 | -8 | 7 | | u8 | 0 | 255 | i8 | -128 | 127 | | u16 | 0 | 65535 | i16 | -32768 | 32767 | | u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 | | u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 | vint | unsigned integer | signed integer | | name | min | max | name | min | max | | u7 | 0 | 127 | i7 | -64 | 63 | | u14 | 0 | 16383 | i14 | -8192 | 8191 | | u21 | 0 | 2097151 | i21 | -1048576 | 1048575 | | u28 | 0 | 268435455 | i28 | -134217728 | 134217727 | | u35 | 0 | 34359738367 | i35 | -17179869184 | 17179869183 | | u42 | 0 | 4398046511103 | i42 | -2199023255552 | 2199023255551 | | u49 | 0 | 562949953421311 | i49 | -281474976710656 | 281474976710655 | | u56 | 0 | 72057594037927935 | i56 | -36028797018963968 | 36028797018963967 | | u64 | 0 | 18446744073709551615 | i64 | -9223372036854775808 | 9223372036854775807 | utf8 | unsigned integer | signed integer | | name | min | max | name | min | max | | u7 | 0 | 127 | i7 | -64 | 63 | | u11 | 0 | 2047 | i11 | -1024 | 1023 | | u16 | 0 | 65535 | i16 | -32768 | 32767 | | u21 | 0 | 2097151 | i21 | -1048576 | 1048575 | | u26 | 0 | 67108863 | i26 | -33554432 | 33554431 | | u32 | 0 | 4294967295 | i32 | -2147483648 | 2147483647 | |
The table of n bit integers can be derived formulaically. for unsigned ints the min is |0| and the max is |(2 ^ n) - 1| we steal one of the positive numbers to have a representation for |0| leaving us with the range |[0, (2 ^ n) - 1]|
for the signed integers we first split the numbers in half leaving just |2 ^ (n-1)| numbers for each sign. once again we steal one of the positive numbers for the zero leaving us with the range |[-(2 ^ (n - 1)), (2 ^ (n - 1)) - 1]|
where this gets interesting is the case of |i1| witch can only 2 values once we divide the space in two there is only a single positive value and we steal it to make the 0 leaving us with the range |[-1, 0]| even weirder is the case of the stateless zero bit integers for u0 it is straightforward |2^0-1 = 0| so the only value is |0| and the range is |[0, 0]| for the i0 the naive formula would render the i0max as |1/2| and the min at |-1/2| theas are not integers and not equal thus not acceptable answers for the value of i0 so we extended the formula with a integer deviation by one just to force range back into the integers giving us the range of |[-1, -1]| witch is a single value needing no memory and is a signed integer thuse a acceptable value for |i0|. As a benofit it fits with the intuition of splitting the range between the positive and negative numbers then failing to steal the zero as there is no positive value to take.
#practical concerns
The fact that the. formula is not in and of itself a compelling reson for |i1| to be |[-1, 0]| but in the hardwere we used the fact that the first bit of a signed 2s complement number is set to test that the number is negative. By violating this property for |i1| we risk making it a special case that not only the language designer but also the users must deal with whenever that have a signed number that they wish to test the sign of. The case is less compelling for |i0| being |-1| as it has no first bit as it is a no memory type and would be used as part of constant propagation. That said I still think it might make generics a little cleaner to have u0 and i0 not be equal as a signed and unsigned value are only equal when the bit patterns match and and the first bit is not set on either number. As there is no first bit the first bits can't be |0| and the number cannot be equal.
#union types one nisity we get from |i0| = |-1| = |True| and |u0| = |0| = |False| is that |i1| is |i0| ∪ |u0| In general if we have a |iN| and a |uN| we need to go up a bit to represent is to |i(N+1)| this symatery is only preserved if |i0| and |u0| are different values.
This way all signed ints. and unsigned int can
|i0| = |-1| = |True| |u0| = |0| = |False| |i1| = |{-1, 0}| = |bool|
Python code for the table
iname = lambda n:'i' + str(n)
imin = lambda n:-(2**(n - 1))// 1
imax = lambda n: (2**(n - 1)- 1)// 1
uname = lambda n:'u' + str(n)
umin = lambda n:0
umax = lambda n: (2**n)- 1
line = lambda n:'| {:4} | {:3} | {:20} | {:4} | {:20} | {:19} |'.format(
uname(n),umin(n),umax(n),iname(n),imin(n),imax(n))
header = """| unsigned integer | signed integer |
| name | min | max | name | min | max |
"""
print('raw')
0 byte i0
1 byte i1 0000 000x
1 byte i2 0000 00xx
1 byte i4 0000 xxxx
1 byte i8 xxxx xxxx
2 byte i16 xxxx xxxx xxxx xxxx
4 byte i32 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
8 byte i64 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx
body = '\n'.join(line(n)for n in [0,1,2,4,8,16,32,64])
print(header + body)
print()
print('vint')
Lo Hi LoHi
print(216 - 2 ** 10 - 210 + 2**20) # 1,112,064 max Unicode code
point
3 bytes of vint for any Unicode code point compaired to 4 byte utf-8
1 byte i7 0xxx xxxx
2 byte i14 10xx xxxx xxxx xxxx
3 byte i21 110x xxxx xxxx xxxx xxxx xxxx
4 byte i28 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx
5 byte i35 1111 0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
6 byte i42 1111 10xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
7 byte i49 1111 110x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxx xxxx xxxx
8 byte i56 1111 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx
9 byte i64 1111 1111 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx
body = '\n'.join(line(n)for n in [7,14,21,28,35,42,49,56,64])
print(header + body)
print()
print('utf8')
Unicode code space [0,1114111] or [0x000000, 0x10FFFF] < 21 bits
1 byte i7 0xxx xxxx
2 byte i11 110x xxxx 10xx xxxx
3 byte i16 1110 xxxx 10xx xxxx 10xx xxxx
4 byte i21 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx # Unicode
5 byte i26 1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
6 byte i32 1111 11xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
body = '\n'.join(line(n)for n in [7,11,16,21,26,32])
print(header + body)
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cliffclick/aa/issues/4, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL3JYEON2SC5GSJNYOO3E3TLM67JANCNFSM435V7VOQ.