spicy
Internationalize the method which interprets the data as representing an ASCII-encoded number
https://www.fileformat.info/info/unicode/category/Nd/list.htm provides the data for internationalizing the to_uint([ inout base: uint<64> ]) → uint<64> method, which interprets the data as representing an ASCII-encoded number, so that it accepts every decimal digit that can be encoded in UTF-8 rather than only the ASCII decimal digits.
This is using strtoul internally. The man page says base can be 36 at most, so I think I'll just add a check for that cap.
Never mind, we're actually doing this ourselves, but capping the base still seems like a reasonable approach to me.
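For illustration, the base cap could look roughly like the following. This is a Python sketch rather than the runtime's own code; the name to_uint_ascii and the choice of ValueError are hypothetical and not part of Spicy's API.

```python
def to_uint_ascii(data: bytes, base: int = 10) -> int:
    """Interpret ASCII digits and letters in `data` as an unsigned integer."""
    # Digits 0-9 plus letters a-z give 36 distinct digit values, which is
    # where the strtoul-style cap of base 36 comes from.
    if not 2 <= base <= 36:
        raise ValueError(f"base must be in 2..36, got {base}")
    if not data:
        raise ValueError("empty input")

    value = 0
    # bytes.lower() only touches ASCII letters, so 'F' and 'f' are treated alike.
    for byte in data.lower():
        if ord("0") <= byte <= ord("9"):
            digit = byte - ord("0")
        elif ord("a") <= byte <= ord("z"):
            digit = byte - ord("a") + 10
        else:
            raise ValueError(f"not an ASCII digit or letter: {byte:#04x}")
        if digit >= base:
            raise ValueError(f"digit value {digit} is out of range for base {base}")
        value = value * base + digit
    return value
```

With this sketch, to_uint_ascii(b"12") gives 12 and to_uint_ascii(b"ff", 16) gives 255, while any base above 36 is rejected up front.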
I think this deserves to stay open, even if it gets deferred before being implemented. We can easily support to_uint taking arguments with digits from every script Unicode supports, just by putting in the starting code point of each sequence of ten digits from the table at the provided URL.
It would be fine to enforce base <= 10 when the argument contains characters from those international ranges, i.e. anything higher than U+007A 'z'.
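As a rough sketch of where those starting code points could come from, the table can also be derived from the Unicode database that ships with Python instead of being copied by hand. The helper name decimal_block_starts is hypothetical, and the result depends on the Unicode version of the Python build.

```python
import sys
import unicodedata


def decimal_block_starts() -> list:
    """Collect the code point of every 'zero' digit in category Nd."""
    starts = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Nd is the "decimal digit" general category; decimal() returns the
        # digit's value, so a value of 0 marks the start of a 0-9 run.
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch) == 0:
            starts.append(cp)
    return starts


BLOCK_STARTS = decimal_block_starts()
```

The resulting list is already sorted by code point and includes U+0030 (ASCII), U+0660 (Arabic-Indic), and U+07C0 (NKo), among others.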
Ok, reopening, but not yet convinced. :-)
Can you give me a code example of how this would be used?
build/bin/spicy-driver Issue_195_in_Nko.spicy < Issue_195_FakeForeign_input.txt
where the Spicy source looks like:
module Issue_195_in_Nko;
import spicy;
const Token = /[^ \t\r\n]+/;
const Integer = /[0-9]+/;
const FakeForeignInteger = /[0-9a-j]+/;
# const ForeignInteger = /[0-9\u07C0-\u07C9]+/;
const WhiteSpace = /[ \t]+/;
public type Requests = unit {
    var runningTotal: uint64;
    on %init {
        self.runningTotal = 0;
    }
    : (RequestLine(self))[];
    on %done {
        print "That summed to: %d" % (self.runningTotal,);
    }
};
type RequestLine = unit(parent: Requests) {
    %byte-order = Spicy::ByteOrder::Little;
    lineNo: Token;
    : WhiteSpace;
    ItemDesc: Token;
    : WhiteSpace;
    lineTotal: FakeForeignInteger;
    : /[\r\n]+/ ;
    on lineTotal {
        parent.runningTotal += self.lineTotal.to_uint();
    }
};
outputs
That summed to: 25
with Issue_195_Western_input.txt of
1 pencils 12
2 erasers 13
and I propose that with Issue_195_FakeForeign_input.txt of
b pencils bc
c erasers bd
conversion of decimal digit strings to numbers should be just as well supported. By incorporating the actual 0-9 ranges from https://www.fileformat.info/info/unicode/category/Nd/list.htm, which provides that internationalization data, Spicy's to_int() and to_uint() would no longer be Western-centric, but could handle digits expressed in any of the Unicode-supported scripts.
So this would work only for base <= 10, right? It would then look for the right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing?
So I see how this could be useful, but let's wait until we actually have a use case.
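The lookup described above could be sketched as follows, in Python for brevity. Everything here is an assumption for illustration: the names DIGIT_BLOCK_STARTS, digit_value and to_uint_i18n are hypothetical, the table is only a small excerpt of the Nd list, and the input is assumed to be already decoded from UTF-8 into code points.

```python
from bisect import bisect_right

# Hypothetical excerpt of the table: first code point of a few 0-9 runs.
# The list must stay sorted for the binary search below.
DIGIT_BLOCK_STARTS = [
    0x0030,  # ASCII digits
    0x0660,  # Arabic-Indic digits
    0x06F0,  # Extended Arabic-Indic digits
    0x07C0,  # NKo digits
    0x0966,  # Devanagari digits
]


def digit_value(ch: str) -> int:
    """Map one decimal digit character to its value 0..9."""
    cp = ord(ch)
    # Find the last block start that is <= cp.
    idx = bisect_right(DIGIT_BLOCK_STARTS, cp) - 1
    if idx < 0:
        raise ValueError(f"not a decimal digit: {ch!r}")
    # Subtract the block's starting code point to get the digit value.
    offset = cp - DIGIT_BLOCK_STARTS[idx]
    if offset > 9:
        raise ValueError(f"not a decimal digit: {ch!r}")
    return offset


def to_uint_i18n(text: str, base: int = 10) -> int:
    """Parse decimal digits from any listed block; only base <= 10 is allowed."""
    if not 2 <= base <= 10:
        raise ValueError(f"base must be in 2..10, got {base}")
    if not text:
        raise ValueError("empty input")
    value = 0
    for ch in text:
        digit = digit_value(ch)
        if digit >= base:
            raise ValueError(f"digit value {digit} is out of range for base {base}")
        value = value * base + digit
    return value
```

For example, to_uint_i18n("\u07C1\u07C2") (NKo digits one and two) gives 12, the same as to_uint_i18n("12"). A binary search over the block starts keeps each lookup cheap even once the full table is filled in.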
Robin wrote: ... So this would work only for base <= 10, right? It would then look for the right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing? So I see how this could be useful, but let's wait until we actually have a use case.
Your description there is exactly what I envision. I concur that doing this only for base <= 10 (plus handling a locale-appropriate negative indicator) delivers the vast majority of the value. And I am fine with waiting. I18n tickets are often left in that status for a while.