spicy Internationalize the method which interprets the data as representing an ASCII-encoded number

https://www.fileformat.info/info/unicode/category/Nd/list.htm provides the data needed to internationalize the to_uint([ inout base: uint<64> ]) → uint<64> method, which interprets the data as representing an ASCII-encoded number, so that it covers all Unicode decimal digits (encoded as UTF-8) rather than only the ASCII decimal digits.

Apr 15 '20 16:04 duffy-ocraven

This is using strtoul internally. The man page says base can be 36 max, so I think I'll just add a check for that cap.

Apr 16 '20 07:04 rsmmr

Never mind, we're actually doing this ourselves, but capping the base still seems like a reasonable approach to me.

Apr 16 '20 09:04 rsmmr
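
For readers following along, here is a minimal sketch of what capping the base could look like in a hand-rolled parser. It is written in Python for brevity and is not Spicy's actual implementation; parse_ascii_uint, MAX_BASE and _ALPHABET are hypothetical names, and the 2..36 range mirrors the limit documented for strtoul.

```python
import string

# Digit alphabet for bases up to 36: '0'-'9' followed by 'a'-'z'.
_ALPHABET = string.digits + string.ascii_lowercase
MAX_BASE = 36  # assumption: same upper bound that strtoul documents


def parse_ascii_uint(data: bytes, base: int = 10) -> int:
    """Hypothetical stand-in for to_uint(); ASCII digits and letters only."""
    if not 2 <= base <= MAX_BASE:
        raise ValueError(f"base must be between 2 and {MAX_BASE}, got {base}")
    if not data:
        raise ValueError("empty input")
    value = 0
    for byte in data:
        # Letters are matched case-insensitively; anything else yields -1.
        digit = _ALPHABET.find(chr(byte).lower())
        if digit < 0 or digit >= base:
            raise ValueError(f"invalid digit {chr(byte)!r} for base {base}")
        value = value * base + digit
    return value
```

With this check, a call such as parse_ascii_uint(b"ff", 16) is accepted, while any base above 36 is rejected before a single digit is read.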

I think this deserves to stay open, even if it gets deferred before being implemented. We could easily support to_uint taking arguments with digits from all Unicode-supported scripts, just by putting in the starting index of each sequence of ten digits from the table at the provided URL.

Apr 21 '20 19:04 duffy-ocraven

It would be fine to enforce base <= 10 when the argument content has values coming from those international ranges, i.e. anything higher than U+007A 'z'.

Apr 21 '20 19:04 duffy-ocraven
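
As a rough illustration of where those starting indices could come from without transcribing the table by hand, the sketch below derives them from Python's standard unicodedata module, used here as a stand-in for the linked list. decimal_block_starts is a hypothetical helper name, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata
from typing import List


def decimal_block_starts() -> List[int]:
    """Return every code point in category Nd whose decimal value is 0.

    Each such code point is assumed to be the first digit of a run of
    ten consecutive decimal digits (0 through 9).
    """
    starts = []
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, -1) == 0:
            starts.append(cp)
    return starts


STARTS = decimal_block_starts()

# Print the first few entries in U+XXXX notation for inspection.
print(", ".join(f"U+{cp:04X}" for cp in STARTS[:5]))
```

The resulting list is sorted by code point, so it could be pasted into a runtime as a constant table and searched with a binary search.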

Ok, reopening, but not yet convinced. :-)

Can you give me a code example how this would be used?

Apr 23 '20 11:04 rsmmr

build/bin/spicy-driver Issue_195_in_Nko.spicy < Issue_195_FakeForeign_input.txt
where the Spicy source looks like:

module Issue_195_in_Nko;
import spicy;

const Token      = /[^ \t\r\n]+/;
const Integer    = /[0-9]+/;
const FakeForeignInteger = /[0-9a-j]+/;
# const ForeignInteger = /[0-9\u07C0-\u07C9]+/;
const WhiteSpace = /[ \t]+/;

public type Requests = unit {
    var runningTotal: uint64;
    on %init {
        self.runningTotal = 0;
    }
    : (RequestLine(self))[];
    on %done {
        print "That summed to: %d" % (self.runningTotal,);
    }
};

type RequestLine = unit(parent: Requests) {
    %byte-order = spicy::ByteOrder::Little;
    lineNo:    Token;
    :          WhiteSpace;
    ItemDesc:  Token;
    :          WhiteSpace;
    lineTotal: FakeForeignInteger;
    :          /[\r\n]+/;
    on lineTotal {
        parent.runningTotal += self.lineTotal.to_uint();
    }
};

outputs

That summed to: 25

with Issue_195_Western_input.txt of

1 pencils 12
2 erasers 13

and I propose that with Issue_195_FakeForeign_input.txt of

b  pencils   bc
c  erasers   bd

the conversion of decimal digit strings to numbers should be just as well supported (here a-j stand in for a foreign script's digits 0-9, so bc and bd mean 12 and 13). By incorporating the actual 0-9 ranges from https://www.fileformat.info/info/unicode/category/Nd/list.htm, which provides that internationalization data, Spicy's to_int() and to_uint() would no longer be Western-centric, but could handle digits expressed in any of the Unicode-supported scripts.

Apr 23 '20 23:04 duffy-ocraven

So this would work only for base <= 10, right? It would then look for the right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing?

So I see how this could be useful, but let's wait until we actually have a use case.

Apr 24 '20 11:04 rsmmr
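
The lookup described above (find the block of ten code points, subtract its starting code point) can be sketched as follows. This is Python for illustration only, not Spicy runtime code; BLOCK_STARTS is a small hand-picked subset of the Nd table, and digit_value and to_uint_i18n are hypothetical names. The sketch also assumes the input is UTF-8 with the most significant digit first, and it does not reject numbers that mix scripts.

```python
from bisect import bisect_right

# Illustrative subset of block starts (first code point of each 0-9 run).
# The list must stay sorted for the binary search below.
BLOCK_STARTS = [
    0x0030,  # ASCII digits
    0x0660,  # Arabic-Indic digits
    0x06F0,  # Extended Arabic-Indic digits
    0x07C0,  # NKo digits
    0x0966,  # Devanagari digits
    0x09E6,  # Bengali digits
    0x0E50,  # Thai digits
    0xFF10,  # Fullwidth digits
]


def digit_value(ch: str) -> int:
    """Map one character to 0..9 by subtracting its block's start."""
    cp = ord(ch)
    i = bisect_right(BLOCK_STARTS, cp) - 1  # nearest block start <= cp
    if i < 0 or cp - BLOCK_STARTS[i] > 9:
        raise ValueError(f"U+{cp:04X} is not a known decimal digit")
    return cp - BLOCK_STARTS[i]


def to_uint_i18n(data: bytes, base: int = 10) -> int:
    """Hypothetical internationalized to_uint(), restricted to base <= 10."""
    if not 2 <= base <= 10:
        raise ValueError("only bases 2..10 are supported for Unicode digits")
    text = data.decode("utf-8")
    if not text:
        raise ValueError("empty input")
    value = 0
    for ch in text:
        digit = digit_value(ch)
        if digit >= base:
            raise ValueError(f"digit {digit} is out of range for base {base}")
        value = value * base + digit
    return value
```

For example, to_uint_i18n("\u07c1\u07c2".encode("utf-8")) treats the NKo digits one and two the same way as the ASCII string "12".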

Robin wrote: ... So this would work only for base <= 10, right? It would then look for the right block of 10 code points inside that table, and subtract the block's starting code point from the code points it's processing? So I see how this could be useful, but let's wait until we actually have a use case.

Your description there is exactly what I envision. I concur that doing this only for base <= 10 (plus handling a locale-appropriate negative indicator) delivers the vast majority of the value. And I am fine with waiting. I18n tickets are often put in that status for a while.

Apr 24 '20 16:04 duffy-ocraven
