wren
wren copied to clipboard
Add custom number parsing
Added a custom number parser so we don't have to rely on strtoll
or strtod
This PR will add support for parsing octal, binary, and hexadecimal number systems as well as provide support for separating digits with underscores for better organization. Because of this I will probably close #945 and #946.
I think parsing the numbers inside the VM is better than re-implementing strtod
and strtoll
because we already validate the number so why not parse it at the same time? There are also some slight complexities when using strtoll and strtod that Wren just doesn't use so this way is a tad bit simpler as well. Lastly the parser only parses the values to a double unlike strtoll that parses to a 64-bit integer then converts to a double which would allow for more (albeit slightly less accurate) digits in a hex, oct, and bin literal.
Now a couple of questions I have.
- I've noticed that the literal:
0x
is a valid in wren when compiling but is not valid when callingNum.fromString
. I decided to patch it so that the literal0x
is a compile time error to be more consistent. - When calling
Num.fromString
, an invalid parse returns null instead of throwing an error or printing what it has so far. I get the argument that it is simpler to handle null than to handle error but I also noticed that a literal that is too large throws an error (see this test). I decided to copy this behavior but in my opinion I think we should either return null on all invalid parses if we do it for simplicity or return errors explaining why the parse went wrong as to be helpful.
from a quick glance there seems to be some duplication for some of the logic, with this kind of nuance it would be ideal to share it. Have you tried to consolidate it to a shared function or maybe that wasn't feasible?
One thing I wasn't sure was whether ObjString*.value
is always null-terminated so there were plenty of added checks to verify that I'm not incrementing past the string provided (although I would actually have to test if the parser could have broken out of a non null terminated string though which I have not done so yet). If I can confirm that they are always null terminated, then the checks could be removed making the two functions more or less the same.
A quick look at the source seems to suggest it is...
https://github.com/wren-lang/wren/blob/main/src/vm/wren_value.c#L684
I haven't checked if every path goes via this one, but seems to be the case.
Edit: It should be noted though that ObjString* can contain null bytes within it, so it's context sensitive whether it matters or not that it's always null terminated. If it's dealing with user strings then it's not safe to assume that.
The other thing is where readNumber
takes a parser, it calls lexError
when it fails while Num.fromString
returns null when it fails. I could probably pass in a function or pass in a string pointer to get error messages if it is mandatory to merge them together.
readNumber
also doesn't check for a leading negative or prefix like Num.fromString
does as well but I guess this may be negligible.
If it's dealing with user strings then it's not safe to assume that.
I'll leave in the check for now at least until we can say for certain that it is null terminated. I guess I'm just a little too paranoid 😅
Unfortunately, parsing doubles is one of the hardest things in the universe 😄
Doing it correctly requires bigfloats, or at least bigints.
Although there are shortcuts, they never cover all of the cases.
As an example, the following number will throw an error (incorrectly) in your parser:
100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
I do support not using C's strtod()
, but don't roll your own.
We will be forced to do it to some extends. Considering all the features we want, it will be hard to find an implementation that has the correct licence, is correct and follow (or at minimum is adaptable to) all our requirements...
Really fast float parsing: https://github.com/fastfloat/fast_float But it's C++.
Well, if custom number parsing is so hard, why not just do #945 which everyone seems to think is a good idea and can be done with what we already have.
Personally, I'd be inclined to just put #946 on the back-burner for now. This feature was introduced recently into Go (along with binary and octal literals) but I've rarely used it. Most of the time where I'd want to use separators, it's for decimal numbers which are exact powers of ten so it's easier to just write them in the form 1eN
anyway.
As an example, the following number will throw an error (incorrectly) in your parser:
100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
hmm my build seems to parse that number successfully.
System.print(Num.fromString("100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"))
// 1e+269
System.print(100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
// 1e+269
results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using strtod
.
I do have a new idea to make one function that both Num.fromString
and readNumber
can use instead of writing duplicate code but I am also wondering in which file something like that belongs to. could anyone guide me to which file is the best place to have a number parser function for both wren_compiler.c
and wren_core.c
?
wren_utils.c
or maybe a new one should be fine to start with
results seem equivalent to browsers console. If I were to put 309 zeroes or more, then it will start spitting errors which is equivalent to how the compiler behaved when it was using
strtod
.
Missed what you do with the exponent. But you still have rounding problems. Try with 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316
, for example.
For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).
Missed what you do with the exponent. But you still have rounding problems. Try with
6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125E-316
, for example.
Again, this behaves just like how the old parsing method did using strtod
. strtod
sets errno
whenever the normalized result would be less than DBL_MIN
(assuming that the value was not meant to be zero). DBL_MIN
is defined as 2.2250738585072014e-308
even though doubles can technically be as small as 4.9406564584124654e-324
by impacting the mantissa. I chose to stick with this behavior because I wasn't sure if it was done this way for a specific reason but it isn't to hard to check if the results is zero while the value is not meant to be zero if we wish to have exponents less than -308
.
One potential issue I may be noticing is I don't actually check if the exponent will overflow as yet when adding or incrementing but I am planning to consolidate the two areas I parse numbers to one function so can add that in the next commit.
For deeper dive into this, I suggest you to read How strtod() Works (and Sometimes Doesn't).
For my previous PR I had to do some research as to how doubles work since I wasn't completely sure why my implementations were off. I did try creating an alternative to strtod
here which functions very close to strtod
but I had issues with the compiler complaining that const char
was used as a non constant. I don't think we use all of the features of strtod
though so I thought not trying to replicate strtod
would be simpler in the long run.
If you ignore overflows, try 6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125
.
If you ignore overflows, try
6.631236871469758276785396630275967243399099947355303144249971758736286630139265439618068200788048744105960420552601852889715006376325666595539603330361800519107591783233358492337208057849499360899425128640718856616503093444922854759159988160304439909868291973931426625698663157749836252274523485312442358651207051292453083278116143932569727918709786004497872322193856150225415211997283078496319412124640111777216148110752815101775295719811974338451936095907419622417538473679495148632480391435931767981122396703443803335529756003353209830071832230689201383015598792184172909927924176339315507402234836120730914783168400715462440053817592702766213559042115986763819482654128770595766806872783349146967171293949598850675682115696218943412532098591327667236328125
.
I'm not sure I follow what your trying to tell me.
This does print 6.6312368714698
on both this version of the parser and the original one. Plugging into the JS console, I managed to get some extra decimal places (6.631236871469758
) so I think this is an issue with converting number to string rather than parsing string to number. We aren't going to actually get all of the digits saved exactly as the mantissa can only store integers that are at most 53-bit in size. 10(2^(53-1))
. With this function log10(2^(53-1))
we can expect that at most 16 base 10 digits can be stored.
The goal of this PR is not to surpass the original method of parsing by accuracy but by convenience, although I do attempt to try to be as accurate as the original method by using long doubles where data could potentially be lost.
You need to print all of the digits (the problem is usually with the last binary one): try with printf()
's %a
specifier.
- The parser is now consolidated into one function
wrenParseNum
inwren_utils.c
. - Added comments to explain how the parsing works
- Added overflow detection (only tested if the exponent part of a scientific number will overflow. Checking if too many digits will overflow is somewhat impractical as it would require more than 2GBs of numbers)
Some changed behavior:
-
Num.fromString
No longer throws errors when the number would be too big or small. Instead it simply returnsNull
like it does for any other unsuccessful parse. - values are checked to see if they are infinity or if they are zero when the string provided is not zero. This is slightly different to how
strtod
setserrno
sincestrtod
checks if the value would be smaller than2.2250738585072014e-308
to determine it is too small whereas the value can be as small as4.9406564584124654e-324
before it is zero. This behavior seems more inline to how many other languages behave (although they also usually allow zero and infinity as values as well instead of throwing errors but this may be a discussion for a different issue). - Some error messages are changed to say that values are too big or to small instead of just saying it is too large. Also some values are slightly different in that they do not handle formatting like they originally did.