Unums.jl
Pick the specification
In Gustafson's book (The End of Error), he proposes the basic unum:
| signbit | exponent | fraction | ubit | esize-1 | fsize-1 |
Discuss: is there a better ordering to these fields to make common operations and compression/decompression more efficient and/or easier to implement?
In addition, he proposes a more complicated, fixed-width spec on p. 100:
| negative? | NaN? | Inf? | 0? | 2nd? | exponent | revealedbit | fraction | ubit | esize-1 | fsize-1 |
I think we should discuss a software implementation using fixed-width unums:
- Is this the best ordering of fields?
- Should we throw out the esize/fsize fields and instead have the spec define the number of bits for the exponent and the total size of the structure? (We would implicitly calculate the size of the fraction by filling in the rest of the available space). The advantage is more usable space for precision/accuracy. Are there disadvantages I'm not considering?
- Can we get rid of the revealed bit? I think we can... I get the impression he put that there just to fill in the 64th bit.
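Purely as an illustration of what compression/decompression against this layout involves (my sketch, not anything from the book): unpacking has to walk the fields from the right, since the fraction and exponent widths aren't known until the utag has been read. Here ess and fss stand for the environment's esizesize and fsizesize, i.e. the widths of the esize-1 and fsize-1 fields:

```julia
# Rough sketch: unpack the basic (variable-width) unum layout above from a
# single UInt64, assuming the unum is right-justified in the word with no
# garbage bits above it. ess/fss = widths of the esize-1 and fsize-1 fields.
function unpack_basic(u::UInt64, ess::Int, fss::Int)
    fsm1 = u & ((UInt64(1) << fss) - 1);        u >>= fss        # fsize - 1
    esm1 = u & ((UInt64(1) << ess) - 1);        u >>= ess        # esize - 1
    ubit = u & 1;                               u >>= 1
    frac = u & ((UInt64(1) << (fsm1 + 1)) - 1); u >>= (fsm1 + 1) # fsize fraction bits
    expo = u & ((UInt64(1) << (esm1 + 1)) - 1); u >>= (esm1 + 1) # esize exponent bits
    sign = u & 1
    sign, expo, frac, ubit, esm1, fsm1
end
```

That right-to-left dependency is part of why the field-ordering question above is interesting.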
I thought he represented NaN and Inf somehow using just the first format. I've got to finish reading his book to at least attempt to comment intelligently on this! :person_frowning:
The extra bits don't actually add information... they summarize the contents. Many of the Inf/NaN/negative/0 exceptional states require semi-complex logic involving all of the bits of the number. The proposal basically does this logic once and stores the result, so that in the future you can check a single bit to check for negativity, or NaN, etc. In many cases it has the potential for big savings (and simpler code?) but the savings might be reserved for a hardware implementation (in which it's easier to deal in single bits)
Ah, that would be fine for me; best if they are all in one place, so I just don't store them when I pack numbers and can recreate them upon unpacking (space in the DB or memory is the biggest determinant of performance for me).
I was also thinking about your decimal unum idea, and how they could be represented in printed form without losing information. Maybe something like:
unum"~-1.23400e-20"
with the ~ to represent the inexact bit, and the trailing 0s to give an indication of the size of the fraction? I'm sure I'm all wet, please let me know!
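If it helps to make the idea concrete: in Julia a unum"..." literal would just be a string macro, so a hypothetical sketch (the @unum_str name and the ~ convention are only my guess at the syntax, not anything that exists) could be:

```julia
# Hypothetical sketch: a string macro so that unum"..." parses the literal.
# The leading ~ would set the ubit; trailing zeros would hint at the intended
# fraction size. Here we just return the pieces instead of building a real unum.
macro unum_str(s)
    inexact = startswith(s, "~")
    numstr  = inexact ? s[2:end] : s
    :( (parse(Float64, $numstr), $inexact) )
end

unum"~-1.23400e-20"    # => (-1.234e-20, true)
```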
Thinking about this type:
abstract AbstractFixedUnum <: AbstractUnum
doc"""
A fixed-width unum. Parameters:
EBASE = base of exponent
NOTE: may ditch this param:
- defaults to 2?
- want to easily implement decimals... but this may be tough to implement well
ESZ = number of bits in the exponent field (fraction field fills all available space)
UINT = the underlying storage type
The format is as follows:
| exponent | fraction | NaN? | Inf? | zero? | firstInUBound? | negative? | inexact? |
"""
immutable FixedUnum{EBASE, ESZ, UINT} <: AbstractFixedUnum
bits::UINT
end
All fields to the left of the negative? field could potentially be reduced/compressed for serialization. Serialization would also require writing out the EBASE value. What do you think?
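For orientation, here's a rough sketch (not meant as the actual API) of where the fields would sit under that docstring layout, assuming 64-bit storage with ESZ exponent bits at the top and the six tag/summary bits at the bottom (inexact? in bit 0 up through NaN? in bit 5):

```julia
# Rough sketch of field positions for the docstring layout above (64-bit storage).
isinexact(bits::UInt64)          = bits & 1 == 1          # inexact? (the ubit), bit 0
isneg(bits::UInt64)              = (bits >> 1) & 1 == 1   # negative?, bit 1
exponentfield(bits::UInt64, esz) = bits >> (64 - esz)     # top esz bits
fractionfield(bits::UInt64, esz) = (bits >> 6) & ((UInt64(1) << (64 - esz - 6)) - 1)
```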
I think EBASE should just be a single bit, binary or decimal; no other bases are really worthwhile for this.
I see it's not in the number; something would need to restrict EBASE to 2 or 10. Could you have BinaryUnum{ESZ, UINT} & DecimalUnum{ESZ, UINT}?
What do you think about:
typealias Unum{ESZ, UINT} FixedUnum{2, ESZ, UINT}
typealias DecimalUnum{ESZ, UINT} FixedUnum{10, ESZ, UINT}
typealias Unum16 Unum{3, UInt16}
typealias Unum32 Unum{7, UInt32}
typealias Unum64 Unum{16, UInt64}
typealias Unum128 Unum{20, UInt128}
typealias DecimalUnum16 DecimalUnum{3, UInt16}
typealias DecimalUnum32 DecimalUnum{7, UInt32}
typealias DecimalUnum64 DecimalUnum{16, UInt64}
typealias DecimalUnum128 DecimalUnum{20, UInt128}
I can understand putting limits on what EBASE can be, but... do we gain anything beyond saving a couple bits in serialization?
I think it's just more complexity that simply isn't needed. I think the only reason to use a base other than 2 is for conversion issues with human input/output.
Are the 3, 7, 16, 20 bits for exponents what are recommended in the book? 16 - ubit - sign - 3 exp = 11 bits for fraction on Unum16? I'm not sure you need as many bits for decimals, 3 bits would be 10^-4 to 10^3, right? 3 bits with binary only shifting up or down by 2 bits, decimal, it's shifting by 3-4?
No those were thoughtless, arbitrary choices. I am open to better defaults.
A quick note... I'm going to remove the summary bits for now. I'm building it in a flexible enough way that luckily that's pretty easy.
I doubt anything you do is thoughtless or arbitrary! I hope you didn't think I was implying otherwise! (really need that "Scott's comments" repository for peer review before release! :grinning:)
There are many questions to answer here; let me chip away at them, one at a time.
The "revealed bit" is certainly the least important or least time-saving of all the summary bits in the unpacked form. In the "Exercise for the Reader" following the 64-bit unpacked form, I pose the puzzle of picking an unpacked format for a 32-bit fixed size, and that one does not store the "revealed bit." And I agree that the first implementation in Julia should be for an unpacked, fixed-size form.
Actually, since the subject started with why the unum fields are located where they are, let me provide my reasoning:
It makes it easier to see the float part of the unum if it is a set of contiguous fields, identical to where IEEE floats put them. So sign bit, exponent, and fraction need to be together and in that order.
If you make the sign bit the leftmost bit, especially in the expanded form, then testing for a negative value can use the same machine-level instructions that are used for floats.
The ubit has to go at the end of the fraction bits, because it is a monotonic extension of their meaning.
That leaves the esize and fsize fields. They should go after the ubit, both because that makes the utag a contiguous block of bits and because they can't go to the left of the sign bit or it spoils the close resemblance of a unum sign bit to a float sign bit. Now there really is no reason that
| esize – 1 | fsize – 1|
is preferable to
| fsize – 1 | esize – 1|,
except that the first way matches the order of the exponent and fraction for the float; exponent first, then the fraction.
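(As a small illustration of the sign-bit point: with the sign bit leftmost in a 64-bit word, the negativity test is literally the same top-bit test you'd use on an IEEE double stored in the same word.)

```julia
# sketch: both tests look at the same bit, so the same mask/instruction serves both
isneg_unum(bits::UInt64)  = bits >> 63 == 1
isneg_float(bits::UInt64) = signbit(reinterpret(Float64, bits))
```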
Next issue is Tom's suggestion that the esize and fsize fields be thrown out. I see no way to do that, but you are welcome to try! Each number needs its own description. If you try to get rid of them, it winds up a little like "block floating point" where a single exponent applies to a block of integers; that is barely different from fixed point arithmetic, and it died a quick death in the marketplace as soon as machines were available that stored the exponent for every number separately.
OK, good information. I am more interested in how I can pack unums most efficiently for storage, in such a way that it is easy also to unpack them reasonably rapidly into whatever form Tom's code wants. Another thing I'd been interested in is whether the same unum ideas could be applied to decimal floating point. For decades, I've worked with a decimal floating point format (simply a 64-bit 2's complement value with a 1-byte signed scale), which was easy to compact and easy to use for calculations (esp. back in the days when floating point hardware was either non-existent or varied from platform to platform). Your comments about the "memory wall" resonated strongly with me, because I'd been fighting the "disk wall" for ages, trying to come up with efficient ways of packing numbers and strings for storage.
John: can you expand on your "block floating point" comment? I don't quite understand the operation(s) that they are necessary for, but I certainly believe that I'm missing something.
The format I'm testing still allows for flexible exponent and fraction sizes on a per-unum basis (i.e. No need to change an "environment"). What am I missing?
John: Does the esize need to be stored as esize-1? If it were esize, you could have 0 exponent bits (i.e., for storing simple integers without using up another bit). (I'm not a mathematician, and I haven't gotten through all 400 pages yet, so forgive me if that's a stupid question!)
If you have 0 exponent bits, it breaks the formulas for the representation. You get things like 2 to the -1/2 power, for example. A real mess. It does NOT simply turn the float into an integer. Using 1 exponent bit acts exactly like an integer, and that is an amazing and non-obvious fact. If you have no exponent bit, then you have no way to represent NaN or ±∞. Even integers need a way of expressing those things, and especially (maxreal, ∞).
Single precision IEEE floats use 8 bits of exponent. That neatly fits in only 3 bits if binary 111 represents 8, not 7. So everything points to using an offset of 1, or -1, depending on how you look at it.
It is possible to have a fraction field of zero size, since the exponent still expresses a hidden bit of fraction. That's one of the Exercises for the Reader. But that makes it inconsistent with the esize field, and who really wants a numerical representation that has zero bits of fraction? The Warlpiri unums seem like the smallest possible useful set.
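(Spelling out the offset arithmetic: the field stores esize - 1, so an n-bit field encodes esize values 1 through 2^n.)

```julia
esize(stored) = stored + 1   # the field holds esize - 1
esize(7)                     # => 8: a stored 0b111 covers IEEE single precision's 8-bit exponent
```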
This is really interesting, and can be a great extension to integers! I'll probably add a new set of:
typealias IntUnum{NBITS} Unum{1,NBITS}
which would represent the full range of integers (minus 2 bits), and also the open intervals, NaN, and ±Inf. Pretty cool!
@JohnLGustafson Thanks for the info. I was used to the decimal numbers in M/Mumps (now Caché ObjectScript), which didn't represent Inf or NaN because those cases always raised a MaxNumber or DivideByZero exception. What are your thoughts on a decimal floating point version of unums? The idea of having the inexact bit seems perfectly applicable; I was trying to think of how to represent that as a string for input/output, maybe "-1.2345...e30" where the ... would represent the inexact bit.
I'll have to figure out what needs to be done to convert an exact integer unum to something like an Int64/UInt64, Int128/UInt128 or even BigInt, as long as that makes sense.
I could see allowing direct conversion to Integers when the IntUnum is exact, finite, (and positive for Unsigned), but requiring a call to round otherwise. (convert would throw an InexactError)
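Something like this toy sketch is what I have in mind (ToyIntUnum and the exact flag are stand-ins for illustration, not the real types):

```julia
immutable ToyIntUnum
    value::Int64
    exact::Bool            # the ubit: false means the true value lies in an open interval
end

import Base: convert, round
convert(::Type{Int64}, u::ToyIntUnum) = u.exact ? u.value : throw(InexactError())
round(u::ToyIntUnum) = u.value         # always returns a representative, even when inexact

convert(Int64, ToyIntUnum(3, true))    # => 3
round(ToyIntUnum(3, false))            # => 3
# convert(Int64, ToyIntUnum(3, false)) would throw an InexactError
```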
Yes, that's what I meant (of course, you expressed it better!)
I have thought a lot about a decimal float equivalent for unums, and I hope someone can come up with it. I did not see an easy path, especially since there are at least two competing schemes for encoding the decimal fraction with as little waste as possible. One difficulty is that of "wobbling precision"... the slope changes by a factor of ten instead of a factor of two when the exponent goes up or down, and that really does not play well with the economy of the ubox method.
The ability to avoid errors when translating human representation to and from internal computer representation is a big plus for decimal floats, of course. IBM is a strong proponent of decimal floats. The usual argument against decimal floats is that they are inherently at least twice as slow (on chip) as binary even after intense hardware optimization, but of course that is very much like what unums do; they require more gates and (we suspect) more time but get better answers.
Besides the wobbling precision problem, what bothers me about decimal floats is that very few floats come from or go to human-readable form! Think about all the floats that turn into sounds, pixels, motor positions, or other non-character representations. Think about how many floats within a program are internal to the calculation instead of being read in as character strings or displayed as character strings. It does seem like a dubious engineering choice to make fewer than 1% of the operations (decimal string I/O) more accurate at the expense of making over 99% of the operations twice as slow. But for some applications, it would be great to have the option.
For now, let's just concentrate on the binary unum idea. It's plenty of work as it is!
Yes my first pass will focus on binary unums.
Can you comment on potential pitfalls of this spec:
| exponent | fraction | signbit | ubit |
Questions:
- If we assume that the total size of each field will be constant for a given type, are there pitfalls to not including the esize/fsize fields? My implementation will not have an "environment"... you can create a 64-bit unum with an 8-bit exponent field and a 54-bit fraction field, while in the next command you could create a 32-bit unum with a 1-bit exponent field and a 29-bit fraction field. (See the sketch after this list.)
- Should we allow for "wasted space"? We can align the total bits to 32/64, but only use say 26 bits for the spec. Is this useful/necessary?
- I put the signbit next to the ubit, mainly to keep a consistent tag section when compressing for serialization or some other reason. It may also make some code a little more straightforward. Any issues there? I suppose it may make it more complicated on converting to/from floats and signed ints...
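Here's the sketch mentioned above: with this layout the fraction width falls straight out of the type parameters (assuming the Unum{ESZ,UINT} aliases from earlier in the thread, and only the sign bit and ubit as tag bits):

```julia
fracsize{ESZ, UINT}(::Type{Unum{ESZ, UINT}}) = 8 * sizeof(UINT) - ESZ - 2

fracsize(Unum16)   # => 11, matching the "16 - ubit - sign - 3 exp" arithmetic earlier
fracsize(Unum64)   # => 46 with the provisional ESZ = 16
```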
I re-read your comments and did some thinking, and I was wrong to get excited about the IntUnum concept. It would have to be a separate implementation, it couldn't use the same formula that unums use.
If I was to do this, I'd probably have another type parameter which effectively changes the unum formula so that the "fraction" is treated like an integer (no need for exponent), but we still have the ubit, so I believe we can represent infs/nans with the same rules. Then we get the IntUnum I was hoping for (basically a signed integer with the ubit). This IntUnum would be a completely different type, but we could likely reuse a lot of the code/framework for unums (thanks to julia!)
Also, I'm having an internal battle about the importance of the "size" fields. I think there are a lot of positives, at the expense of 5-10 lost bits. However, there may be certain pieces for which it's harder to generate optimized code. One big positive: there would be way fewer Julia types to generate code for, and probably better type stability in a lot of the algos. So... I think I'm putting them back in.
> what bothers me about decimal floats is that very few floats come from or go to human-readable form
@JohnLGustafson About decimal floats, it really is application specific. For the types of database-centric applications (in the healthcare / financial area mostly) that I worked on, almost nobody was doing anything more than +, -, *, divide, intdivide, modulo, round to a certain decimal place, and conversion back and forth between the internal numeric format(s) and strings (semantically, there actually was only one type: strings). Conversions, followed by additions, were the most common operations; there were occasional multiplications, roundings, etc. Since the numbers were not normalized, adding two numbers with the same scale was just a native integer operation, and was faster than using floating point, even with floating point hardware (when I wrote the decimal arithmetic package, it was for platforms like the 8088 (16-bit, no coprocessor), the IBM 370 (VM/CMS) (32-bit, with its own floating point format), the PDP-11, the VAX, and the 16-bit DG). Even when the scales were not equal, it was usually just x10 or x100 (done with bit shifts and adds). The floating point values are also heavily used as database keys, something a bit difficult IMO with IEEE floating point values, when you need to make sure that all numbers sort consistently (decimal floating point, integers, IEEE floats, and numeric strings [strings that are canonical numeric representations, too big to fit into internal "integer" or float types]).
The addition of your ubit would make things even better, IMO, because it would be possible to do arithmetic on numbers that I couldn't handle before (I didn't have an arbitrary precision decimal float back then), without having to go to arbitrary precision, and still keep track that the result is not exact. The ability to also distinguish both positive and negative underflows and overflows, without confusing the overflows with +Inf, -Inf, seems very useful to me also, compared to the IEEE decimal float standard.
I'm not trying to say that decimal floats (or decimal unums) should ever be used except in these sorts of specific applications, but that is a very large set of applications.
I guess I also like binary unums because, if you do need to do the sorts of fancy number crunching that most of the people here are interested in, I think you could do the conversion from a decimal unum to a binary unum and keep track of whether or not the conversion was exact.
Another thing I'm very interested in is how to convert both binary and decimal unums into keys that would be suitable for sorting correctly, and in such a way that from the key you can get back to the original representation. I'm not sure that is possible with unums, but I hope so. (I'd done that in the past, handling 64-bit + 1-byte-scale decimal floats, IEEE floats, and "canonic numeric" strings out to 510 digits, and it worked out quite well, but it was very tricky adding the IEEE floats, because if they couldn't be represented exactly as a decimal float, they had to generate a key that sorted correctly "in between" the decimal values.)
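For reference, the usual order-preserving key trick for plain IEEE doubles looks like the sketch below (NaNs set aside); whether something analogous and invertible can be done once the ubit and the variable-width utag are involved is exactly what I'm unsure about:

```julia
# Order-preserving key for IEEE doubles: negatives get all bits flipped,
# non-negatives just get the sign bit set, so the UInt64 keys sort in the
# same order as the floats (and the mapping is invertible).
function sortkey(x::Float64)
    u = reinterpret(UInt64, x)
    u >> 63 == 1 ? ~u : u | 0x8000000000000000
end

sortkey(-1.5) < sortkey(-0.25) < sortkey(0.0) < sortkey(2.0)   # => true
```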
Actually, I forgot: comparisons, especially equality, were even more frequent than any of those operations in the sort of applications our customers had. I optimized the h*ll out of those!
Tom, initially I indeed named my package Unums.jl. I renamed it to UnumTests.jl. My intention was to attempt a one-to-one translation of the existing Mathematica reference package, including unum fields and usage definitions, and to stay as close as possible to Appendices A and B in John's book as far as functions are concerned.