Support for 64-bit integers in JavaScript
I am currently trying to compute a value that is scattered around using an instance, similar to the following example:
```yaml
something:
  seq:
    - id: a
      type: u1
    - type: b24
    - id: b
      type: u4
  instances:
    fullthing:
      value: '(a<<32) + b'
```
However, this doesn't currently work, as the value is assumed to be a 32-bit integer and the calculation overflows. There is no way that I can see to tell it to use 64-bit arithmetic. Allowing `type: u8` or providing a way to "grow" types in this context would come in really handy for this use case.
It all depends on the target language. Such a problem does not exist in Python / Ruby / Perl, for example. I can guess that you're using C++?
`type: u8` won't really help here. In C++, having stuff like:

```cpp
uint32_t a;
uint32_t b;
uint64_t c = a << b;
```

would still be capped to 32 bits, as it does the shift first and only promotes the result to 64 bits afterwards.
I am using the WebIDE for the time being, a very handy tool for learning 👍 Given the point you raise, I guess a mechanism to promote integer types to longer ones would be the better call. Maybe something like `(a.to_u8 << 32) + b`?
WebIDE is JavaScript-based, and JavaScript does not support:
- integers more than 53 bits,
- bit shifts for integers more than 32 bits.
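Both limits are easy to demonstrate in plain JavaScript:

```js
// Integer precision ends at 2**53 - 1 (Number.MAX_SAFE_INTEGER):
console.log(Number.MAX_SAFE_INTEGER);                // 9007199254740991
console.log(9007199254740992 === 9007199254740993);  // true: precision is lost

// Bitwise shifts operate on 32-bit integers (shift counts are taken mod 32):
console.log(1 << 32);                                // 1, not 4294967296
```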
So supporting even 64-bit integers in JavaScript is yet another (much more complex) problem in itself. There are quite a few libraries that provide "BigInteger"-style support for arbitrary-sized integers in JS, but:
- all of them are non-standard
- there is no clear leader that everyone chooses; there are perhaps 3 to 5 popular libraries and several dozen less popular ones
- the vast majority of users would never actually need 64-bit integers, so there is no reason to impose an extra hard dependency on them
Here are the possible solutions as I see them:
- Continue using the current method, i.e. silently cast to 53 bits and back again. In my view this is guaranteed to create broken code if u8's are used. I can imagine a well-meaning programmer staring at a precision bug for weeks before figuring out that kaitai does not serialize 64-bit types accurately in JavaScript. ("Why did kaitai just arbitrarily change the number as it was loaded?")
- If the compiler is asked to emit 64-bit types, and the target language does not support 64-bit numbers with complete accuracy, refuse to compile the ksy file. This would make a few ksy files incompatible with JavaScript, which kinda destroys the cross-platform nature of kaitai. As u16's and u32's and whatever else come down the line, this problem will only grow more acute.
- Silently recast u8's to be byte arrays on JavaScript. I'm guessing this will introduce a lot of casting errors on JavaScript vs. other targets, so in turn I expect this will break some previously working ksy files.
- If the compiler is asked to emit 64-bit types, and the target language does not support 64-bit numbers with complete accuracy, emit a generated language file that contains a syntax error with the message "your ksy file contains a 64-bit integer type, and JavaScript (or whatever language) does not support 64-bit types, so some data will serialize inaccurately". And you can define a constant that lets you override the syntax error. This basically throws kaitai's problem into the user's lap and says "deal with this." Not the friendliest solution, but it lets the serialization happen and puts a big warning message in front of the user letting them know this is a bad idea.
- Include a bignum implementation in the generated JavaScript, iff there exist native 64-bit types in the schema. Note that if the code generated references no u8's or similar, bignum support does not need to be included at all. GreyCat's right that there are no standard bignum implementations for JavaScript; that means there's not likely to be one, so we should just bite the bullet and choose one. I suggest big.js, because it's the smallest implementation that does everything that's needed; it's 7.4kB minified. As far as I can tell, this is the only solution that gives correct results on JavaScript under all conditions.
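To make the big.js suggestion concrete, here is a minimal sketch (the byte values are arbitrary examples, and this is not actual generated code) of combining the two 32-bit halves of a u8 exactly:

```js
const Big = require('big.js');

// upper and lower 32 bits of a 64-bit value read from a big-endian stream
const hi = new Big(0xffffffff);
const lo = new Big(0xff001520);

// hi * 2**32 + lo, computed exactly instead of as a lossy double
const value = hi.times(4294967296).plus(lo);

console.log(value.toString());  // 18446744073692779808
```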
Also https://github.com/WebAssembly/design/blob/master/Semantics.md
> WebAssembly has the following value types: i64: 64-bit integer
> Continue using the current method, i.e. silently cast to 53 bits and back again. In my view this is guaranteed to create broken code if u8's are used. I can imagine a well-meaning programmer staring at a precision bug for weeks before figuring out that kaitai does not serialize 64-bit types accurately in JavaScript. ("Why did kaitai just arbitrarily change the number as it was loaded?")
I wanted to point out that this method actually works in a pretty large number of cases. For example, filesystem/archive implementations that use 8-byte pointers in many cases never reach full 64-bit precision, especially in a JS environment, which tends to lean towards experimental / development usage. It is unlikely that these pointers would ever go even past 32 bits, as even that would require a massive 4 GB file loaded into memory as a Uint8Array (or whatever the IDE uses).
And it's not completely "silent": we actually have that documented in JavaScript-specific notes.
As for other options that you've proposed, I would suggest a somewhat hybrid approach:
- When reading a `u8`/`s8` data type, check if it would actually fit a guaranteed 53-bit integer. If it won't, issue a clear runtime error, requesting one to compile with a certain option (like a CLI switch `--javascript-bignum`) to enable a bignum library.
- Implement bignum support, but only when explicitly requested with `--javascript-bignum`. For a start, let us really choose a single one and use that.
- WebIDE should silently compile with `--javascript-bignum` to spare newbies this headache.
On detecting JS overflow: ES6 mandates a safe range for integers via Number.MAX_SAFE_INTEGER and Number.MIN_SAFE_INTEGER: it's from -9007199254740991 to 9007199254740991 inclusive.
This means that the algorithm to determine 53-bit overflow would be as follows (given that b[0]..b[7] represent the 8 bytes of data that become the integer; let's say it's big-endian):

For s8:

- Determine the sign of the number to be read: `b[0] >= 0x80 ? negative : positive`
- If positive:
  - Max legal number is 9007199254740991 = `00 1f ff ff ff ff ff ff`
  - If `b[0] > 0`, then it's overflow
  - If `b[1] > 0x1f`, then it's overflow
  - Else it's safe
- If negative:
  - Min legal number is -9007199254740991 = `ff e0 00 00 00 00 00 01`
  - If `b[0] < 0xff`, then it's overflow
  - If `b[1] < 0xe0`, then it's overflow
  - If the whole representation is `ff e0 00 00 00 00 00 00` (i.e. `b[1] == 0xe0` and `b[2]`..`b[7]` are all 0), then it's overflow
  - Else it's safe

For u8:

Simpler, one branch of the previous algorithm:

- Max legal number is 9007199254740991 = `00 1f ff ff ff ff ff ff`
- If `b[0] > 0`, then it's overflow
- If `b[1] > 0x1f`, then it's overflow
- Else it's safe
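A minimal sketch of that check in JavaScript, assuming `b` is a Uint8Array holding the 8 big-endian bytes (the function names are illustrative, not part of the actual runtime):

```js
// true if the unsigned 64-bit value in b[0..7] exceeds 2**53 - 1
function u8Overflows(b) {
  return b[0] > 0x00 || b[1] > 0x1f;
}

// true if the signed 64-bit value in b[0..7] falls outside the safe range
function s8Overflows(b) {
  if (b[0] < 0x80) {
    // positive: same bound as the unsigned case
    return b[0] > 0x00 || b[1] > 0x1f;
  }
  // negative: min safe value is ff e0 00 00 00 00 00 01
  if (b[0] < 0xff || b[1] < 0xe0) {
    return true;
  }
  // ff e0 00 00 00 00 00 00 is one past the safe range
  return b[1] === 0xe0 && b.slice(2).every(x => x === 0);
}
```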
It's not a horrible idea, GreyCat, but I would merely point out that this could cause precision errors that only pop up in production and not earlier. Still, a clear and specific run-time error message is better than nothing. If people don't care about the precision, they could still disable the warning somehow and continue loading as before.
BTW, should `console.warn()` be considered the standard JS method for issuing a warning?
I would not recommend that. We should make it possible for the library user to override this behavior and e.g. throw an exception if they want.
I presume this can be done by adding a callback field to the KaitaiStream instance, which is called whenever a warning occurs; substreams should inherit this callback.
> I would not recommend that.
What exactly do you mean by "that"?
Using console.warn() to warn about if we parsed an integer which could not be represented in Javascript.
Or in general: sometimes you don't want to show the warning on the console, but rather send the warning/error to an error reporting / aggregation service, so I would say the standard way to emit a JS warning should be a configurable interface. Usually there is an `onerror` event handler on the main object, which fires when something happens.
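A minimal sketch of such a configurable hook, assuming a hypothetical `onWarning` field on KaitaiStream (this is not part of the current runtime API):

```js
class KaitaiStream {
  constructor(arrayBuffer) {
    this.buffer = arrayBuffer;
    // default behavior; substreams would copy this field from their parent
    this.onWarning = msg => console.warn(msg);
  }
}

const stream = new KaitaiStream(new ArrayBuffer(8));

// a user who prefers hard failures overrides the hook:
stream.onWarning = msg => { throw new Error(msg); };

// or forwards warnings to an error-aggregation service
// (reportToService is hypothetical):
stream.onWarning = msg => reportToService(msg);
```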
Repost of kaitai-io/kaitai_struct_javascript_runtime#13:
BigInts are now a thing in JS. They're supported in modern browsers and Node 10.4 and higher.
From MDN:
> BigInt is similar to Number in some ways, but also differs in a few key matters — it cannot be used with methods in the built-in Math object and cannot be mixed with instances of Number in operations; they must be coerced to the same type.
I think this would mean two runtimes and two compiler targets. Because the KS expression language allows math that gets translated into JS and all the read integers must be the same type, either BigInt or Number, no mixing allowed.
And on the compiler side, if there is an expression like `offset + 4`, we need to know at compile time whether to emit `4` (Number) or `4n` (BigInt).
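For example, the mixing restriction means generated code cannot simply add a Number literal to a BigInt value:

```js
const offset = 0x100000000n;  // a value read as a BigInt (larger than 2**32)

// offset + 4;                // TypeError: Cannot mix BigInt and other types
const next = offset + 4n;     // fine: both operands are BigInt

console.log(next);            // 4294967300n
```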
Another option, with one runtime and one target, would be to wrap everything that could turn into a number with a function that converts Numbers into BigInts when required. It seems like it wouldn't be very readable, though it might work with a very short name.
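A sketch of that wrapping idea, using a deliberately short, hypothetical helper name:

```js
// promote x to a BigInt only when the other operand is a BigInt
function n(x, other) {
  return typeof other === "bigint" ? BigInt(x) : x;
}

// generated code for `offset + 4` could then look like this:
const offset = 0x100000000n;
const result = offset + n(4, offset);  // 4294967300n
```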
Checking for overflows is a good short-term solution. It's definitely better than silently dropping precision. I'll prepare a PR.
In the long-term we should fully support 64-bit integers though. I think BigInts are the way to go because they are supported natively in modern environments, but they come with their own problems.
I think this issue is related to my problem:
This is an example KSY:
```yaml
meta:
  id: foo
  endian: be
seq:
  - id: meh
    size: 64
  - id: footer
    type: footer
types:
  footer:
    seq:
      - id: f0
        type: u8
      - id: f1
        type: u8
    instances:
      baz:
        type: u8
        pos: _root.footer.f0 - _root.footer.f1 + 16 # Add 16 for more visual results, no strict need to repro
```
I generated a sample data file with the following Python script:
```python
import sys

with open(sys.argv[1], "wb") as out:
    out.write(b"\x00" * 16)
    out.write(b"\x11" * 16)
    out.write(b"\x22" * 16)
    out.write(b"\x33" * 16)
    out.write(b"\xff\xff\xff\xff\xff\x00\x15\x20")
    out.write(b"\xff\xff\xff\xff\xff\x00\x15\x00")
```
If you try this in the WebIDE (or with the VSCode extension, as both rely on JS for parsing), you'll see that baz will point to the 0x11s instead of the 0x33s. Note that the f0 and f1 values are parsed to the same integer, while in the stream they are not the same. In practice this results in internal pointers being resolved incorrectly in the aforementioned IDEs, while working fine in Python, for example.
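The reason f0 and f1 come out equal is that both values exceed 2**53, and doubles of this magnitude are spaced 2**11 apart, so both readings round to the same Number. This is easy to check in any JS console:

```js
// the two 64-bit footer values from the sample file, as Number literals:
console.log(0xffffffffff001520 === 0xffffffffff001500);  // true: same double

// so baz's position becomes 0 + 16 = 16 (the 0x11s)
// instead of 0x20 + 16 = 48 (the 0x33s)
```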
(I have pointers with large values serialized, from which I have to strip the large base addresses to get in-stream offsets, just like in the provided minimal sample)
I don't know what the correct solution would be (I don't know too much JS), but I see this as a bug that results in inconsistent behavior across different target languages, and thus in bugs in IDEs too.
Edit: Interestingly, the Converter page of the WebIDE displays correct i64 values (that differ from the ones displayed in the Object Tree), so it seems the capability is there, although it may be some dependency that you don't want to integrate with the JS compiler?
Edit (workaround): A pretty obvious workaround is to define a custom 64-bit type and use it instead of built-ins, for example:
```yaml
address:
  seq:
    - id: high
      type: u4
    - id: low
      type: u4
```
Value instances can help with further processing.