refmt
Unmarshalling without type info prefers int64.
This diff changes the obj package's use of int to int64. Using int is problematic because if one is building for 32-bit architectures as well as 64-bit, using int will result in different behavior when large numbers are handled on 32-bit architectures (they'll overflow when they would've been fine on a 64-bit build).
Since the only downside of using int64, as far as I can figure, is that it takes a few more characters on the source side -- 'int' already uses 64 bits of space on a 64-bit architecture anyway -- I guess I'll concede that it might as well be the default. The additional explicitness ensures we get the same behavior on 64-bit and 32-bit builds.
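For concreteness, here's a minimal sketch of the 32-bit hazard (plain Go, nothing refmt-specific):

    package main

    import (
        "fmt"
        "strconv"
    )

    func main() {
        // strconv.IntSize reports the width of 'int' on the current build:
        // 64 on a 64-bit build, 32 on e.g. GOARCH=386.
        fmt.Println("int width:", strconv.IntSize)

        // A value that fits in int64 but not in a 32-bit int:
        big := int64(5000000000)

        // Converting to int truncates on a 32-bit build and is fine on a
        // 64-bit one; int64 behaves identically everywhere.
        fmt.Println(int(big)) // differs between 32-bit and 64-bit builds
        fmt.Println(big)      // same on every build
    }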
Users might find it a bit odd that unmarshalling into an interface{} may result in a map[string]interface{} that's concretely populated with a bunch of int64 values, but it seems there's little choice. Making a choice between int and int64 configurable adds a lot of complexity (and probably a performance hit, if a minor one) for very little gain.
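To make the oddness concrete, here's the shape of what a caller sees (the map literal below just stands in for what a freeform unmarshal would produce; it's not an actual refmt call):

    package main

    import "fmt"

    func main() {
        // Illustrative stand-in for the result of unmarshalling into an
        // interface{} with no type info: numbers are int64, not int.
        doc := map[string]interface{}{
            "count": int64(42),
            "name":  "example",
        }

        switch n := doc["count"].(type) {
        case int64:
            fmt.Println("int64:", n) // this branch fires
        case int:
            fmt.Println("int:", n) // this one does not
        }
    }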
For comparison, it's worth noting that stdlib's json.Unmarshal behavior around numbers is much more fun than this: you get float64 by default (which does not losslessly convey the same range as int64, to my understanding)... unless you engage with Decoder.UseNumber, which is a whole other ball of very interesting wax. (I haven't actually ever seen the UseNumber system in use, so I'm not sure what to think of it; and either way, I'm defaulting to doubtful that its semantics would translate cleanly to a system that also handles CBOR.) So all in all, I'm not sure we get much useful guidance from observing that package.
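For reference, the stdlib behavior in question looks like this (standard encoding/json only; no refmt involved):

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        src := `{"big": 9007199254740993}` // 2^53 + 1: exact in int64, not in float64

        // Default: numbers land as float64, and this one gets rounded.
        var v1 map[string]interface{}
        json.Unmarshal([]byte(src), &v1)
        fmt.Printf("%T %v\n", v1["big"], v1["big"]) // float64, rounded down to ...992

        // With UseNumber you get json.Number back and convert it yourself.
        dec := json.NewDecoder(strings.NewReader(src))
        dec.UseNumber()
        var v2 map[string]interface{}
        dec.Decode(&v2)
        n, _ := v2["big"].(json.Number).Int64()
        fmt.Printf("%T %v\n", n, n) // int64, exact
    }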
As discussed previously in commit 36d31e7716ec3a8517035447a57dde1d88cd528e (which started the int consistency work, but used 'int', at the time with the reasoning "plainer is better"): it's important to address both integer size and signedness, because otherwise we would get some odd behaviors where CBOR unmarshal and JSON unmarshal behave differently when targeting an empty interface: JSON would yield int64, while CBOR would yield uint64, due to CBOR's Interesting (ahem) Opinions about signedness of numbers. That would make the test suites that check the obj package's consistency across codecs fail. This is why the obj package must ignore whether the token contains an int or a uint in order to have reasonable behavior.
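The shape of that normalization, as a sketch (the struct and field names here are made up for illustration; the real token type differs):

    package main

    import "fmt"

    // Hypothetical token shape, just to show the idea.
    type numToken struct {
        isUint bool
        i      int64
        u      uint64
    }

    // Regardless of whether the codec reported the number as signed or
    // unsigned, hand back int64; the conversion is bit-preserving.
    func asInt64(t numToken) int64 {
        if t.isUint {
            return int64(t.u)
        }
        return t.i
    }

    func main() {
        fromJSON := numToken{i: 42}               // JSON side reports signed
        fromCBOR := numToken{isUint: true, u: 42} // CBOR reports non-negative ints as unsigned
        fmt.Println(asInt64(fromJSON) == asInt64(fromCBOR)) // true: same value either way
    }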
This change only impacts situations when the obj package is working without any significant type information (as when filling an interface{} or map[string]interface{}, etc). If unmarshalling into a concrete type that uses some other numeric type like uint64, etc, those unmarshallings do not pass through this cast and are not affected by this change.
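As an illustration of the concrete-type case (using stdlib encoding/json here, since it behaves the same way on this particular point and keeps the example self-contained):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type record struct {
        Count uint64 `json:"count"`
    }

    func main() {
        // The declared field type wins; the int64 default discussed above
        // only applies when there's no type info to go on.
        var r record
        json.Unmarshal([]byte(`{"count": 18446744073709551615}`), &r)
        fmt.Printf("%T %v\n", r.Count, r.Count) // uint64 18446744073709551615
    }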
Note: unless I'm gravely mistaken, int64 and uint64 can be cast back and forth without loss of precision. Yes, one of them will show up negative, and the other won't... but the point remains: the actual bits won't change.
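Quick demonstration of that claim:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        u := uint64(math.MaxUint64) // 18446744073709551615

        // Reinterpreted as int64, the value prints as -1...
        i := int64(u)
        fmt.Println(i) // -1

        // ...but casting back recovers exactly the original value.
        fmt.Println(uint64(i) == u) // true: no bits lost
    }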
Therefore: it seems to me that the refmt library having a default behavior of producing (signed) int64 values when handling freeform objects for which we have no type info should be entirely safe, correct, and lossless, even if your value is in the highest ranges of uint64: you can just cast it back to the unsigned form before doing whatever math you like to do in your application.
If anyone has objections to this proposed behavior or sees flaws with this understanding of numeric precision bits, speak up. Ideally with a PR that handles this better. (Frankly, I don't handle the number 18,446,744,073,709,551,615 very often, so, while I'm trying to do right by whoever does, it's awfully hard to write code that one doesn't oneself have a usecase for or see exercised in the wild: I'm counting on someone with practical notes to chime in if I'm missing a big clue here.)