grain icon indicating copy to clipboard operation
grain copied to clipboard

feat(stdlib): JSON parsing and printing.

Open cician opened this issue 2 years ago • 4 comments

This PR aims to include my JSON parser/printer (https://github.com/cician/grain-json) into stdlib. Hopefully for 0.5 release.

I'm opening this PR as a draft for now since I expect to still make some important changes once stack allocated chars and thick numbers are in. Strictly speaking no changes are really required after stack chars PR is merged, but code can be improved for clarity by using chars instead of code points as raw numbers. Thick numbers on the other hand are going to affect the behavior of number parsing.

cician avatar Feb 06 '22 22:02 cician

I like the API. The thing missing is a searchable data structure but I think you address that in "Since then I decided that it's OK for this API not to present a directly searchable data structure and that this job can be left to a higher level API built on top. No specific plans yet, suggestions welcome." which I agree with.

Great work.

marcusroberts avatar Mar 31 '22 16:03 marcusroberts

Are we waiting for BigInt to land before landing this?

Sorry I left everyone in the limbo for more than a month and haven't made any progress on this. Now that I have again some time, I should make it more clear and explicit what is actually the blocker, even if it ends up being a bit of a recap of what have already been said in the past on Discord.

The current plan for number parsing is:

Integer numbers in the range -2^63..2^63 are parsed into Simple/Int32/Int64 variants of Number without problem.

Greater integer numbers would require BigInt, unless we choose to parse these into 64 bit floats with precision loss, and even that has its limit (10^308). I don't think it would make sense with BigInt on the horizon. If we really want, we could ship with a version that returns error in these cases and later change it to parse into BigInt without changes to the API.

Since there's currently no plan for inclusion of something like BigDecimal into Grain (AFAIK), the only choice is to parse numbers with decimal digits as floats. As most JSON parsers do, although some give you a choice. If I recall correctly this was the consensus.

So the major blocker is that we don't have float parsing. What is missing is basically float parsing, minus the actual parsing (which is the easy part and already done). Building a binary float from decimal integer+fraction is easy on the surface, but hard to get right in respect to rounding and stable round-tripping. I think it's very important given the Grain's primary target (business logic and smart contracts). I was planning on porting some existing code to achieve this, but haven't even started and haven't looked at the matter since February.

cician avatar Apr 10 '22 19:04 cician

@cician gotcha. If you want to rebase off of the bigint branch you definitely can, and the format is pretty simple. But otherwise, yes, keep us updated on floats, and let us know if there's anything we can do to help.

ospencer avatar Apr 10 '22 20:04 ospencer

@cician thiccnums have landed if you want to update this with them!

phated avatar May 24 '22 21:05 phated

@cician are you still around to work on this? Otherwise I'll take your fine work and finish it off for 0.6

marcusroberts avatar Jan 11 '23 20:01 marcusroberts

Oscar reached out to me a few days ago saying he's willing to pick this up. I started to take a look again after I saw the float parsing PR merged, but haven't gotten far, so it's probably for the better you guys finish it, in order to merge it in time for v0.6. I leave below a few notes on different approaches I though of. Let me know if something is not clear here or in the parser code.

Currently the parser is structured internally for consuming a sequential stream of chars, one at a time. You can keep it this way for number parsing as well, but don't have to, since the API only accepts a String input. I've done it this way thinking of maybe extending the API in the future if we have a concept of data streams or something.

Solution A

Simple and dirty.

A valid JSON number token is a subset of valid inputs of Number.parseFloat. JSON doesn't allow NaNs, infinities or numbers with additional leading zeros, underscores etc.

So let the existing parser code in parseNumberValue function do the validation and just accumulate number token chars in a Buffer. Reuse the Buffer from JSONParserState already used for strings. At the end of the function just build a string from the buffer and call parseFloat. Remove the code that computes significand and exponent. Done.

This is the quickest solution to implement, but somewhat ugly and and slow. There's a cost of accumulating chars in the buffer, an allocation of a temporary String for each number, a temporary Bytes instance allocated by Buffer.toString (could be avoided btw) and redundant work of actual number parsing between the JSON parser and parseFloat. Also there's coupling with parsing logic of parseFloat, which is somewhat grain specific and not fully documented.

Solution B

Similar to solution A, but accumulate just the decimal digits in a Buffer and the decimal point offset. Use it to build the float like in the slow path of parseFloat.

Unintuitively this approach may actually be slower than solution A by skipping the fast path. Needs allocating an instance of the Decimal record with Bytes from the decimal buffer, but we skip allocating a String and we avoid coupling JSON parser with behavior of parseFloat. Building a Bytes instance could be avoided by allowing access to the raw Bytes instance backing the Buffer.

Solution C

Optimized parsing function similar to parseFloat with unsafe code, but tailored for JSON and working of a slice of memory instead of a string instance.

This would result in great performance, but other than using unsafe code, would probably require moving away from the idea of a sequential char stream for JSON, unless parseLongMantissa in the slow path can somehow resume without need to restart from the beginning of the string.

Solution D

A hybrid between solutions B and C as a compromise to keep sequential streaming logic. Fast path is tried by reimplementing parseFloatToParts directly inside parseNumberValue, but the buffer of chars is always filled from the beginning to be used in the slow path.

I was attempting solution B, but only gotten as far as filling the buffer of decimals and breaking up the parseLongMantissa function to have an overload that starts from a Decimal record instead of a String.

edit: typo

cician avatar Jan 15 '23 12:01 cician

Thanks @cician that's really useful and food for thought!

marcusroberts avatar Jan 15 '23 13:01 marcusroberts

Great lib, @cician! (I ended up passing development to jake because i'm new to grain)

JairusSW avatar Mar 04 '23 00:03 JairusSW

Looks like the tests are passing which is good. This should be all ready for review a note is I wrote a TODO for eventually switching to a streaming based float parsing implementation that works on the indivdual chars rather then storing it in a buffer, so I need to open an issue for that and reference the issue in the comment.

A question for @cician is under the toString functions you have a FIXME magic numbers but I do not know what magic number's you are talking about because from what i understand NaN and Infintiy would be those numbers but those are already properly handled.

spotandjake avatar Mar 31 '23 20:03 spotandjake

Looks like the tests are passing which is good. This should be all ready for review a note is I wrote a TODO for eventually switching to a streaming based float parsing implementation that works on the indivdual chars rather then storing it in a buffer, so I need to open an issue for that and reference the issue in the comment.

A question for @cician is under the toString functions you have a FIXME magic numbers but I do not know what magic number's you are talking about because from what i understand NaN and Infintiy would be those numbers but those are already properly handled.

By magic numbers i was just referring to the usage of numbers without proper named constants. For example 8 in WasmI32.load(ptr, 8n). The code would be more robust to changes and self documenting with a named constant. Something like WasmI32.load(ptr, _INT32_BOXED_VALUE_OFFSET).

cician avatar Apr 01 '23 15:04 cician

Looks like the tests are passing which is good. This should be all ready for review a note is I wrote a TODO for eventually switching to a streaming based float parsing implementation that works on the indivdual chars rather then storing it in a buffer, so I need to open an issue for that and reference the issue in the comment. A question for @cician is under the toString functions you have a FIXME magic numbers but I do not know what magic number's you are talking about because from what i understand NaN and Infintiy would be those numbers but those are already properly handled.

By magic numbers i was just referring to the usage of numbers without proper named constants. For example 8 in WasmI32.load(ptr, 8n). The code would be more robust to changes and self documenting with a named constant. Something like WasmI32.load(ptr, _INT32_BOXED_VALUE_OFFSET).

oh that makes a lot of sense thanks for the quick response on that.

spotandjake avatar Apr 01 '23 15:04 spotandjake

Those changes have been applied sorry about closing the pr accidently, I clicked the wrong button when trying to push my changes.

spotandjake avatar Apr 03 '23 14:04 spotandjake

I made those changes alex.

One thing is the memoryBase tests are now failing and i do not know why considering i didnt play with anything that i think touches that, heapStart is now off by a bit.

spotandjake avatar Apr 17 '23 01:04 spotandjake

I applied most of your suggestions oscar, I dont quite see a good alternative to the ptr thing I could make another function in the runtime named like getCodePointFromByteOffset to move it to the runtime but that doesnt seem like a good solution to me. and as per your ignore one, I just wanted clarification on where atm i put them at the end of all the return paths but i want to make sure that is correct.

spotandjake avatar Apr 26 '23 21:04 spotandjake

Which ptr thing?

ospencer avatar Apr 26 '23 21:04 ospencer

Which ptr thing?

nvm thought you meant line 1405, but you guys meant 1407. fixed that.

spotandjake avatar Apr 26 '23 21:04 spotandjake

Blocked by #1776

spotandjake avatar Aug 31 '23 16:08 spotandjake

This has been updated and the docs have been regened

spotandjake avatar Jan 26 '24 17:01 spotandjake

Made those changes

spotandjake avatar Jan 26 '24 21:01 spotandjake

Made most of those changes. Just waiting for a few of those comments to be addressed.

spotandjake avatar Jan 29 '24 20:01 spotandjake

I switched over to the varaint docs. Im not fully satisified with how the examples end up being formatted into the output though, but I think they add important context and shouldnt be removed. So I think we might need to workshop the layout there a bit.

spotandjake avatar Feb 01 '24 21:02 spotandjake

Following up here because it's closed and nested in a bunch of issues: https://github.com/grain-lang/grain/pull/1133#discussion_r1498568223

Should we open an issue, to refactor/rewrite this module? I think the current implementation is good enough to merge but I think it would be worth coming back and improving it later.

spotandjake avatar Feb 28 '24 00:02 spotandjake

Should we open an issue, to refactor/rewrite this module? I think the current implementation is good enough to merge but I think it would be worth coming back and improving it later.

No need to redo it now and no need to open an issue. Someone will need to work on it at some point and be confused and they'll just refactor it.

phated avatar Mar 01 '24 19:03 phated