UTF-8, UTF-16 and UTF-32 support

Open awvwgk opened this issue 3 years ago • 2 comments

String support in stdlib is currently limited to ASCII. @wclodius2 brought up the issue of supporting UTF-8, UTF-16 and UTF-32 as well:

FWIW for a "string type" to supplant the intrinsic character I would make the internal representation an integer array so that it is straightforward to extend it to represent UCS/Unicode. The integer type could be either INT8 if a UTF-8 representation is desired, INT16 for a UTF-16 representation, or INT32 for UTF-32. I would expect the UTF-32 representation would be the most straightforward to implement and best for East Asian ideographs, UTF-8 would be the most efficient for most European and Semitic languages, UTF-16 the most efficient for most of the rest of the world.

Originally posted by @wclodius2 in https://github.com/fortran-lang/stdlib/issues/334#issuecomment-798813426
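To make the integer-array idea concrete, here is a minimal sketch in Python (used only for brevity; this is not a proposed stdlib interface). It splits a string into its UTF-8, UTF-16 or UTF-32 code units, which correspond to the INT8, INT16 and INT32 arrays mentioned above. The helper names `code_units` and `storage_bytes` are hypothetical.

```python
import struct

# Per encoding form: code-unit width in bytes, Python codec name
# (little-endian, no byte-order mark), and struct format letter.
_FORMS = {
    "utf-8": (1, "utf-8", "B"),
    "utf-16": (2, "utf-16-le", "H"),
    "utf-32": (4, "utf-32-le", "I"),
}


def code_units(text, form):
    """Return the code units of `text` as a list of integers (hypothetical helper)."""
    width, codec, letter = _FORMS[form]
    raw = text.encode(codec)
    # "<" selects standard sizes: B = 1 byte, H = 2 bytes, I = 4 bytes.
    return list(struct.unpack("<%d%s" % (len(raw) // width, letter), raw))


def storage_bytes(text, form):
    """Bytes needed to store `text` as an array of code units in the given form."""
    return len(code_units(text, form)) * _FORMS[form][0]


print(code_units("\u00e9", "utf-8"))        # [195, 169]
print(code_units("\U0001F600", "utf-16"))   # [55357, 56832] (surrogate pair)
print(code_units("\U0001F600", "utf-32"))   # [128512]
```

Only UTF-32 gives exactly one array element per code point; UTF-8 and UTF-16 need variable-length sequences, so indexing by character means walking the array. For a rough feel of the storage trade-off, `storage_bytes("\u00e9t\u00e9", form)` gives 5, 6 and 12 bytes for UTF-8, UTF-16 and UTF-32, while `storage_bytes("\u6f22\u5b57", form)` gives 6, 4 and 8.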

Implementing to_title will require more than ASCII. Allowing more than just ASCII will require access to the Unicode Character Database (UCD), https://unicode.org/ucd/. This database will also be required for to_upper, to_lower, and reverse if more than ASCII is involved. The database consists of several tens of megabytes of files, http://www.unicode.org/Public/UCD/latest/. Including it in the Standard Library will be controversial, but requiring users to download and install it on their own will also be controversial. FWIW I have a couple of modules to process the more important files in the database.

Originally posted by @wclodius2 in https://github.com/fortran-lang/stdlib/issues/335#issuecomment-798815164
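A short illustration of why these operations need UCD data once input goes beyond ASCII. This is a sketch in Python, whose built-in `str` methods and `unicodedata` module ship with tables derived from the UCD. The helpers `ascii_to_upper` and `reverse_keeping_marks` are hypothetical names, and the reversal only handles combining marks rather than implementing full grapheme cluster segmentation.

```python
import unicodedata


def ascii_to_upper(text):
    """ASCII-only upper-casing: anything outside a-z passes through unchanged."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


def reverse_keeping_marks(text):
    """Reverse `text`, keeping combining marks attached to their base character."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "stra\u00dfe"                         # "strasse" written with U+00DF
print(ascii_to_upper(word) == "STRA\u00dfE")  # True: the sharp s is left untouched
print(word.upper() == "STRASSE")              # True: full mapping yields two letters

dz = "\u01c6"                                 # LATIN SMALL LETTER DZ WITH CARON
print(dz.upper() == "\u01c4")                 # True: uppercase form
print(dz.title() == "\u01c5")                 # True: titlecase form differs from uppercase

accented = "ne\u0301"                         # "n", "e", COMBINING ACUTE ACCENT
print(accented[::-1] == "\u0301en")           # True: naive reversal detaches the accent
print(reverse_keeping_marks(accented) == "e\u0301n")  # True
```

The cases show three distinct needs: case mappings that change string length, a titlecase mapping separate from uppercase, and per-character properties (here the combining class) for reversal.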

awvwgk, Mar 14 '21 09:03


Here's some context/rant; IDK, maybe it is worth your time:

I have a failed experiment where I tried to automatically fetch the Unicode name list and build it in CI. The idea didn't work because it generates some ridiculously long names like ARABIC_LIGATURE_LAM_WITH_ALEF_WITH_HAMZA_BELOW_ISOLATED_FORM (even in CamelCase or with compiler flags), and without them the file compilation is killed or takes an eternity either way. So I stopped when I figured that I would need to create a hash_map/database structure :wink:


I wonder if there is some standard C source for such information.

Here are some pointers: https://www.unicode.org/reports/tr21/tr21-3.html https://www.unicode.org/reports/index.html

14NGiestas, May 05 '22 16:05

FWIW my attempt at creating processing for Unicode used one-dimensional arrays. It makes use of the old saw, "there is nothing that can't be fixed by an additional level of indirection". One was an array of a derived type that includes the defined non-Unihan data. This has, I believe, just over 100,000 elements for the latest database. Another was an array of 17*2**16 int32 integers that, for the corresponding Unicode code point, either is an index to the corresponding element of the first array, or the value 0 if the code point does not have any non-Unihan data defined. The code point names were stored in a single long character string, and the derived type included indices to the starting and stopping character of its corresponding name. Most of the other attributes of the code points were represented by derived types whose only component was an eight-bit integer that served as an enumeration of the property values. This kept the storage required for the first array down to a few tens(?) of megabytes.
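The layout described above can be sketched roughly as follows. This is Python for illustration only and not the actual modules: the class and field names (`CodePointTable`, `Record`, `name_start`, `name_stop`, `category`) and the tiny `CATEGORIES` subset are assumptions. Slot 0 of the record list is left unused so that an index of 0 can mean "no data", mirroring the 1-based convention in the description.

```python
from array import array
from dataclasses import dataclass

MAX_CODE_POINTS = 17 * 2**16              # 17 planes of 65,536 code points each
CATEGORIES = ("Lu", "Ll", "Lo", "Nd")     # hypothetical subset of property values


@dataclass
class Record:
    name_start: int   # start offset of the name in the shared name string
    name_stop: int    # stop offset (exclusive) of the name
    category: int     # small integer enumeration of the general category


class CodePointTable:
    def __init__(self):
        # One integer per code point: 0 means no record, else an index into records.
        self.index = array("l", [0]) * MAX_CODE_POINTS
        self.records = [None]             # slot 0 is a placeholder, never looked up
        self.names = ""                   # all names concatenated into one string

    def add(self, code_point, name, category):
        start = len(self.names)
        self.names += name
        record = Record(start, len(self.names), CATEGORIES.index(category))
        self.records.append(record)
        self.index[code_point] = len(self.records) - 1

    def name(self, code_point):
        slot = self.index[code_point]     # first level of indirection
        if slot == 0:
            return None
        record = self.records[slot]       # second level: record to name offsets
        return self.names[record.name_start:record.name_stop]


table = CodePointTable()
table.add(0x41, "LATIN CAPITAL LETTER A", "Lu")
table.add(0xFEF9, "ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW ISOLATED FORM", "Lo")
print(table.name(0xFEF9))
print(table.name(0x42))                   # None: no record was added for U+0042
```

Because the names live in one string and are reached through integer offsets, even very long names are plain data rather than source-level identifiers.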

wclodius2, May 06 '22 00:05