
Internationalize the method which interprets the data as representing an ASCII-encoded number

Open duffy-ocraven opened this issue 4 years ago • 8 comments

https://www.fileformat.info/info/unicode/category/Nd/list.htm provides the data needed to internationalize the to_uint([ inout base: uint<64> ]) → uint<64> method, which interprets the data as representing an ASCII-encoded number. With that table, the method could accept decimal digits from any script encoded in UTF-8, rather than only the ASCII decimal digits.

duffy-ocraven avatar Apr 15 '20 16:04 duffy-ocraven

This is using strtoul internally. The man page says base can be 36 max, so I think I'll just add a check for that cap.

rsmmr avatar Apr 16 '20 07:04 rsmmr

Never mind, we're actually doing this ourselves, but capping the base still seems like a reasonable approach to me.
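
For illustration only, here is a minimal Python sketch of an ASCII-only conversion with the base capped at 36, the same limit strtoul documents. The function name `to_uint_ascii` is hypothetical, and this is not Spicy's actual implementation.

```python
def to_uint_ascii(data: bytes, base: int = 10) -> int:
    """Interpret ASCII digits and letters in `data` as an unsigned integer.

    Hypothetical helper: a sketch of the capped-base idea, not Spicy's code.
    """
    # With digits 0-9 and letters a-z there are only 36 distinct digit values.
    if not 2 <= base <= 36:
        raise ValueError(f"base must be in 2..36, got {base}")
    if not data:
        raise ValueError("no digits to convert")

    value = 0
    for byte in data:
        if 0x30 <= byte <= 0x39:        # '0'..'9'
            digit = byte - 0x30
        elif 0x61 <= byte <= 0x7A:      # 'a'..'z'
            digit = byte - 0x61 + 10
        elif 0x41 <= byte <= 0x5A:      # 'A'..'Z'
            digit = byte - 0x41 + 10
        else:
            raise ValueError(f"not an ASCII digit or letter: {byte:#04x}")
        if digit >= base:
            raise ValueError(f"digit value {digit} is out of range for base {base}")
        value = value * base + digit
    return value
```

With this sketch, `to_uint_ascii(b"ff", 16)` yields 255, while any base above 36 is rejected before a single byte is examined.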

rsmmr avatar Apr 16 '20 09:04 rsmmr

I think this deserves to stay open, even if it gets deferred before being implemented. We can easily support to_uint taking arguments with content from all Unicode-supported languages, just by putting in the starting index of each sequence of ten digits from the table at the provided URL.

duffy-ocraven avatar Apr 21 '20 19:04 duffy-ocraven

It would be fine to enforce base <= 10 when the argument content has values coming from those international ranges, i.e. anything higher than U+007A 'z'.
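
A rough Python sketch of that rule, assuming the input has already been decoded to text. The names `max_base_for` and `check_base` are made up for this example.

```python
def max_base_for(text: str) -> int:
    """Return the largest base allowed for `text` under the proposed rule.

    Hypothetical helper: input that stays at or below 'z' keeps the usual
    cap of 36; anything above 'z' is assumed to come from a non-ASCII digit
    range and is limited to base 10.
    """
    return 36 if all(ch <= "z" for ch in text) else 10


def check_base(text: str, base: int) -> None:
    """Raise ValueError if `base` is not allowed for `text`."""
    limit = max_base_for(text)
    if not 2 <= base <= limit:
        raise ValueError(f"base must be in 2..{limit} for this input, got {base}")
```

Under this sketch, `check_base("ff", 16)` passes, while `check_base("\u07c1\u07c2", 16)` (two NKo digits) raises, because non-ASCII digits are only accepted for base <= 10.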

duffy-ocraven avatar Apr 21 '20 19:04 duffy-ocraven

Ok, reopening, but not yet convinced. :-)

Can you give me a code example how this would be used?

rsmmr avatar Apr 23 '20 11:04 rsmmr

build/bin/spicy-driver Issue_195_in_Nko.spicy < Issue_195_FakeForeign_input.txt
where the Spicy source looks like:

module Issue_195_in_Nko;
import spicy;

const Token      = /[^ \t\r\n]+/;
const Integer    = /[0-9]+/;
const FakeForeignInteger = /[0-9a-j]+/;
# const ForeignInteger = /[0-9\u07C0-\u07C9]+/;
const WhiteSpace = /[ \t]+/;

public type Requests = unit {
    var runningTotal: uint64;

    on %init {
        self.runningTotal = 0;
    }

    : (RequestLine(self))[];

    on %done {
        print "That summed to: %d" % (self.runningTotal,);
    }
};

type RequestLine = unit(parent: Requests) {
    %byte-order = spicy::ByteOrder::Little;

    lineNo:    Token;
    :          WhiteSpace;
    ItemDesc:  Token;
    :          WhiteSpace;
    lineTotal: FakeForeignInteger;
    :          /[\r\n]+/;

    # Add this line's total to the running sum kept by the parent unit.
    on lineTotal {
        parent.runningTotal += self.lineTotal.to_uint();
    }
};

outputs

That summed to: 25

with Issue_195_Western_input.txt of

1 pencils 12
2 erasers 13

and I propose that with Issue_195_FakeForeign_input.txt of

b  pencils   bc
c  erasers   bd

conversion of decimal digit strings to numbers is just as well-supported. By incorporating the actual 0-9 ranges from https://www.fileformat.info/info/unicode/category/Nd/list.htm, which provides that internationalization data, Spicy's to_int() and to_uint() would no longer be Western-centric, but could handle digits expressed in any of the Unicode-supported scripts.
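
As a hedged illustration of where such a table could come from: Python's standard `unicodedata` module exposes the same general category (Nd) and decimal values, so the starting code point of every digit block can be enumerated rather than copied by hand. The function name `nd_block_starts` is hypothetical, and the exact result depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata
from typing import List


def nd_block_starts() -> List[int]:
    """Collect the code point of every character in category Nd whose
    decimal value is 0, i.e. the first code point of each digit block."""
    starts = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch) == 0:
            starts.append(cp)
    return starts


if __name__ == "__main__":
    for start in nd_block_starts():
        # Print each block start with its Unicode name, e.g. NKO DIGIT ZERO.
        print(f"U+{start:04X} {unicodedata.name(chr(start))}")
```

The list includes U+0030 (ASCII) and U+07C0 (NKo), the range used in the commented-out ForeignInteger pattern above.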

duffy-ocraven avatar Apr 23 '20 23:04 duffy-ocraven

So this would work only for base <= 10, right? It would then look for the right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing?
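
The lookup described above could be sketched as follows in Python. This is an illustration under stated assumptions, not Spicy's implementation: the table `DIGIT_BLOCK_STARTS` holds only a handful of blocks picked for the example, the names `digit_value` and `to_uint` are hypothetical, and the input is assumed to be already decoded from UTF-8.

```python
import bisect

# Starting code points (the digit zero) of a few Unicode Nd blocks.
# Illustrative subset only; a real table would cover every block.
DIGIT_BLOCK_STARTS = [
    0x0030,  # DIGIT ZERO (ASCII)
    0x0660,  # ARABIC-INDIC DIGIT ZERO
    0x06F0,  # EXTENDED ARABIC-INDIC DIGIT ZERO
    0x07C0,  # NKO DIGIT ZERO
    0x0966,  # DEVANAGARI DIGIT ZERO
    0xFF10,  # FULLWIDTH DIGIT ZERO
]


def digit_value(ch: str) -> int:
    """Find the block containing `ch` and subtract the block's start."""
    cp = ord(ch)
    # Index of the last block start that is <= cp (list is sorted).
    i = bisect.bisect_right(DIGIT_BLOCK_STARTS, cp) - 1
    if i < 0 or cp - DIGIT_BLOCK_STARTS[i] > 9:
        raise ValueError(f"U+{cp:04X} is not in a known digit block")
    return cp - DIGIT_BLOCK_STARTS[i]


def to_uint(text: str, base: int = 10) -> int:
    """Convert a string of decimal digits from any listed block."""
    if not 2 <= base <= 10:
        raise ValueError(f"base must be in 2..10, got {base}")
    if not text:
        raise ValueError("no digits to convert")
    value = 0
    for ch in text:
        digit = digit_value(ch)
        if digit >= base:
            raise ValueError(f"digit value {digit} is out of range for base {base}")
        value = value * base + digit
    return value
```

For example, `to_uint("\u07c1\u07c2") + to_uint("\u07c1\u07c3")` sums the NKo spellings of 12 and 13 and gives 25, the same total as the Western input earlier in the thread.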

So I see how this could be useful, but let's wait until we actually have a use case.

rsmmr avatar Apr 24 '20 11:04 rsmmr

Robin wrote: ... So this would work only for base <= 10, right? It would then look for right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing? So I see how this could be useful, but let's wait until we actually have a use case.

Your description there is exactly what I envision. I concur that doing this only for base <= 10 (plus handling a locale-appropriate negative indicator) delivers the vast majority of the value. And I am fine with waiting. I18n tickets are often put in that status for a while.

duffy-ocraven avatar Apr 24 '20 16:04 duffy-ocraven