ucs4/ISO10646 characters

Open zbeekman opened this issue 7 years ago • 8 comments

Hi Stefano,

Very cool project!

I noticed you were interested in adding a spinner (in the README.md TODO list) and I was thinking it would be very nice to have ISO 10646 character support, then you can have fancy characters in your spinners and progress bars like this:

[image: spinner-demo]

It's a bit of a pain though, because you may need to detect if ISO 10646 support is available on the "system" (compiler + machine) and then create overloaded interface wrappers that will accept ASCII, and convert to ISO 10646 characters and pass them to the routines. (Or you could just force everyone to pass in ISO 10646 chars but that may not be practical.)

Here is some code to declare an ISO 10646 variable:

integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(*, ucs4), intent(out) :: string
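
To put that declaration in context, here is a minimal sketch of a routine taking such a dummy argument. The names `bar_sketch`, `fill_bar` and `bar_demo` are hypothetical (they are not part of forbear), and the sketch assumes a compiler that supports ISO 10646; otherwise `ucs4` is negative and it will not compile. Re-opening `output_unit` with `encoding='UTF-8'` follows the example in the gfortran documentation and may behave differently with other compilers.

```fortran
module bar_sketch
   ! hypothetical module, for illustration only
   implicit none
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
contains
   subroutine fill_bar(string)
      ! fill the whole string with U+2588 (FULL BLOCK)
      character(*, ucs4), intent(out) :: string
      integer :: i

      do i = 1, len(string)
         string(i:i) = char(int(z'2588'), ucs4)
      end do
   end subroutine fill_bar
end module bar_sketch

program bar_demo
   use, intrinsic :: iso_fortran_env, only : output_unit
   use bar_sketch
   implicit none
   character(len=10, kind=ucs4) :: bar

   call fill_bar(bar)
   ! ask the runtime to encode this unit as UTF-8 before writing
   open(unit=output_unit, encoding='UTF-8')
   write(output_unit, '(A)') bar
end program bar_demo
```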

See also:

  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_macros.inc
  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_string_utilities.F90#L21-L59
  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_string_utilities.F90#L572-L714
  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_kinds.F90#L1-L36
  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_kinds.F90#L93-L132
  • https://github.com/jacobwilliams/json-fortran/blob/master/src/json_value_module.F90

Jun 07 '17 12:06 zbeekman

@zbeekman

Zaak, you have anticipated me: I was going to ask you and Jacob how to support non-ASCII characters...

Wonderful references!

Cheers

P.S. Today I cannot send you the log of the OpenCoarrays install script: on Friday I have a test for a stable position, fingers crossed!

Jun 07 '17 12:06 szaghi

@zbeekman

Zaak, what do you think about a generic interface along the following lines?

module char_module
   implicit none
   private
   public :: ascii
   public :: ucs4
   public :: echo

   integer, parameter :: ascii = selected_char_kind('ascii')
#ifdef UCS4
   integer, parameter :: ucs4  = selected_char_kind('iso_10646')
#else
   integer, parameter :: ucs4  = selected_char_kind('ascii')
#endif

   interface echo
      module procedure echo_ascii
#ifdef UCS4
      module procedure echo_ucs4
#endif
   endinterface echo

   contains
      subroutine echo_ascii(string)
         character(len=*, kind=ascii), intent(in) :: string

         print '(A)', 'I am echo_ascii'
         print '(A)', string
      endsubroutine echo_ascii

      subroutine echo_ucs4(string)
         character(len=*, kind=ucs4), intent(in) :: string

         print '(A)', 'I am echo_ucs4'
         print '(A)', string
      endsubroutine echo_ucs4
endmodule char_module

program test
   use char_module
   implicit none
   character(len=3, kind=ascii) :: string_ascii
   character(len=3, kind=ucs4)  :: string_ucs4

   string_ascii = 'abc' ; call echo(string_ascii)
   string_ucs4  = 'ABC' ; call echo(string_ucs4 )
endprogram test

Upon execution:

stefano@thor(06:17 AM Thu Jun 08)
~ 21 files, 28Mb
→ gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics

stefano@thor(06:17 AM Thu Jun 08)
~ 21 files, 28Mb
→ a.out 
I am echo_ascii
abc
I am echo_ascii
ABC

stefano@thor(06:17 AM Thu Jun 08)
~ 21 files, 28Mb
→ gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics -DUCS4

stefano@thor(06:17 AM Thu Jun 08)
~ 21 files, 28Mb
→ a.out 
I am echo_ascii
abc
I am echo_ucs4
ABC

Do you think such an approach could be viable?

Cheers

Jun 08 '17 04:06 szaghi

@zbeekman

Sorry... the following is more tailored to what I have in mind

module char_module
   implicit none
   private
   public :: ascii
   public :: ucs4
   public :: ck
   public :: convert

   integer, parameter :: ascii = selected_char_kind('ascii')
#ifdef UCS4
   integer, parameter :: ucs4  = selected_char_kind('iso_10646')
#else
   integer, parameter :: ucs4  = selected_char_kind('ascii')
#endif
   integer, parameter :: ck    = ucs4

   interface convert
      module procedure convert_from_ascii
#ifdef UCS4
      module procedure convert_from_ucs4
#endif
   endinterface convert

   contains
      function convert_from_ascii(string) result(conv)
         character(len=*, kind=ascii), intent(in) :: string
         character(len=len(string), kind=ck)      :: conv

         print '(A)', 'I am convert_from_ascii'
         conv = string
      endfunction convert_from_ascii

      function convert_from_ucs4(string) result(conv)
         character(len=*, kind=ucs4), intent(in) :: string
         character(len=len(string), kind=ck)     :: conv

         print '(A)', 'I am convert_from_ucs4'
         conv = string
      endfunction convert_from_ucs4
endmodule char_module

program test
   use char_module
   implicit none
   character(len=3, kind=ascii) :: string_ascii
   character(len=3, kind=ucs4)  :: string_ucs4

   string_ascii = 'abc' ; print '(A)', convert(string_ascii)
   string_ucs4  = 'ABC' ; print '(A)', convert(string_ucs4 )
endprogram test

Upon execution:

stefano@thor(06:35 AM Thu Jun 08)
~ 21 files, 28Mb
→ gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics

stefano@thor(06:36 AM Thu Jun 08)
~ 21 files, 28Mb
→ a.out 
I am convert_from_ascii
abc
I am convert_from_ascii
ABC

stefano@thor(06:36 AM Thu Jun 08)
~ 21 files, 28Mb
→ gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics -DUCS4

stefano@thor(06:36 AM Thu Jun 08)
~ 21 files, 28Mb
→ a.out 
I am convert_from_ascii
abc
I am convert_from_ucs4
ABC

Jun 08 '17 04:06 szaghi

@zbeekman

I realized that to make forbear UCS4-enabled I only have to catch the character kind in the initialize method, so a more specific approach could be:

module char_module
   implicit none
   private
   public :: ascii
   public :: ucs4
   public :: ck
   public :: initialize

   integer, parameter :: ascii = selected_char_kind('ascii')
#ifdef UCS4
   integer, parameter :: ucs4  = selected_char_kind('iso_10646')
#else
   integer, parameter :: ucs4  = selected_char_kind('ascii')
#endif
   integer, parameter :: ck    = ucs4

   contains
      subroutine initialize(input, output)
         class(*),                  intent(in)    :: input
         character(len=*, kind=ck), intent(inout) :: output

         select type(input)
         type is(character(len=*, kind=ascii))
            print '(A)', 'ascii input'
            output = input
#ifdef UCS4
         type is(character(len=*, kind=ucs4))
            print '(A)', 'ucs4 input'
            output = input
#endif
         class default
            error stop 'error: input must be of class character'
         endselect
      endsubroutine initialize
endmodule char_module

program test
   use char_module
   implicit none
   character(len=3, kind=ascii) :: string_ascii
   character(len=3, kind=ucs4)  :: string_ucs4
   character(len=3, kind=ck)    :: string_ck
   character(len=3)             :: string_nk
   character(len=3, kind=ck)    :: string

   string_ascii = 'abc'
   call initialize(input=string_ascii, output=string)
   print '(A)', string

   string_ucs4  = 'ABC'
   call initialize(input=string_ucs4, output=string)
   print '(A)', string

   string_ck  = 'aBc'
   call initialize(input=string_ck, output=string)
   print '(A)', string

   string_nk  = 'AbC'
   call initialize(input=string_nk, output=string)
   print '(A)', string

   call initialize(input=1, output=string)
endprogram test

Upon execution:

gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics
ascii input
abc
ascii input
ABC
ascii input
aBc
ascii input
AbC
ERROR STOP error: input must be of class character

gfortran -fcheck=all -W ucs4.F90 -std=f2008 -fall-intrinsics -DUCS4
ascii input
abc
ucs4 input
ABC
ucs4 input
aBc
ascii input
AbC
ERROR STOP error: input must be of class character

Cheers

Jun 08 '17 08:06 szaghi

Stefano, this looks perfect! The only reason we bothered with the complicated wrappers, etc. with JSON-Fortran was to try to eliminate redundant code as much as possible, because the library does some heavy text processing, parsing and manipulation.

The tricky part is how to handle user inputs. If at all possible, you should allow arbitrary user string inputs. If the only possible character user inputs are in an initialize method, then the way your last example is set up should work perfectly.

Also, FYI, I think conversion from ASCII to UCS4/ISO 10646 happens automatically on assignment. I'm not sure if this is part of the standard, or just common practice for compiler vendors. And, obviously, conversion from UCS4/ISO 10646 to ASCII is, in general, not safe since ASCII is a subset of UCS4/ISO 10646. One could create a routine to check if the UCS4 character exists in ASCII and then perform the conversion, throwing an error if the character in question was not in the ASCII set.
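
A minimal sketch of such a checking routine might look like the following. The names `ucs4_to_ascii_sketch`, `is_ascii` and `to_ascii` are hypothetical, and the sketch assumes the compiler supports both the ASCII and ISO 10646 kinds. It treats a character as ASCII when its code point, as returned by `ichar`, is at most 127.

```fortran
module ucs4_to_ascii_sketch
   ! hypothetical helper module, for illustration only
   implicit none
   private
   public :: ascii, ucs4, is_ascii, to_ascii

   integer, parameter :: ascii = selected_char_kind('ASCII')
   integer, parameter :: ucs4  = selected_char_kind('ISO_10646')
contains
   pure function is_ascii(string) result(ok)
      ! true if every code point of string is within the ASCII range 0..127
      character(len=*, kind=ucs4), intent(in) :: string
      logical :: ok
      integer :: i

      ok = .true.
      do i = 1, len(string)
         if (ichar(string(i:i)) > 127) then
            ok = .false.
            exit
         end if
      end do
   end function is_ascii

   function to_ascii(string) result(conv)
      ! convert to ASCII, stopping with an error if any character does not fit
      character(len=*, kind=ucs4), intent(in) :: string
      character(len=len(string), kind=ascii)  :: conv

      if (.not. is_ascii(string)) error stop 'error: string contains non-ASCII characters'
      conv = string
   end function to_ascii
end module ucs4_to_ascii_sketch

program to_ascii_demo
   use ucs4_to_ascii_sketch
   implicit none
   character(len=3, kind=ucs4) :: s

   s = ucs4_'abc'
   print '(A)', to_ascii(s)               ! safe: all characters are ASCII
   s(2:2) = char(int(z'2588'), ucs4)      ! U+2588 is outside the ASCII range
   print '(L1)', is_ascii(s)              ! calling to_ascii(s) here would stop with an error
end program to_ascii_demo
```

Whether to stop with an error or to substitute a placeholder character is a design choice; the sketch stops because silent substitution could hide bugs.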

Here is a relevant excerpt from MRC (Metcalf, Reid and Cohen, Modern Fortran Explained):

selected_char_kind (name) returns the kind value for the character set whose name is given by the character string name, or −1 if it is not supported (or if the name is not recognized). In particular, if name is

  • DEFAULT, the result is the kind of the default character type (equal to kind('A'));
  • ASCII, the result is the kind of the ASCII character type;
  • ISO_10646, the result is the kind of the ISO/IEC 10646 UCS-4 character type.

Other character set names are processor dependent. The character set name is not case sensitive (lower case is treated as upper case), and any trailing blanks are ignored.

Note that the only character set which is guaranteed to be supported is the default character set; a processor is not required to support ASCII or ISO 10646.
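
Based on that excerpt, a small probe program could report which of the three named character sets a given compiler supports. This is only a sketch: the program name `char_kind_probe` is hypothetical, and using its output to decide whether to define a macro such as UCS4 is an assumption about how a build could be organized.

```fortran
program char_kind_probe
   ! hypothetical probe, for illustration only
   implicit none
   character(len=9), parameter :: names(3) = ['DEFAULT  ', 'ASCII    ', 'ISO_10646']
   integer :: i, k

   do i = 1, size(names)
      ! selected_char_kind returns a negative value for an unsupported character set
      k = selected_char_kind(trim(names(i)))
      if (k < 0) then
         print '(A)', trim(names(i)) // ': not supported'
      else
         print '(A,I0)', trim(names(i)) // ': kind = ', k
      end if
   end do
end program char_kind_probe
```

Because the probe never declares a variable of the queried kind, it should still compile on a processor where ISO 10646 is missing.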

Jun 08 '17 13:06 zbeekman

Zaak, thank you very much for your insight, it is very appreciated.

> The tricky part is how to handle user inputs. If at all possible, you should allow arbitrary user string inputs. If the only possible character user inputs are in an initialize method, then the way your last example is set up should work perfectly.

Exactly, this is why I ended up with the last initialize toy example.

FYI, I am planning to add better support in FoBiS for introspective tests of the kinds a compiler supports, see this.

Edit: I just saw that you have already seen the FoBiS proposal...

Jun 08 '17 13:06 szaghi

@zbeekman

Dear Zaak, I added support for UCS4 and now forbear provides 40 different spinners. Others will be very easy to add, feel free to suggest new ones.

A taste

[image: taste]

I have only one concern for now: I added the spinners via a quick and dirty encoding in the sources, namely the forbear.F90 source contains Unicode characters... and I think this is illegal, although GNU gfortran does not complain... what do you think?

Cheers

Jun 12 '17 04:06 szaghi

> I have only one concern for now: I added the spinners via a quick and dirty encoding in the sources, namely the forbear.F90 source contains Unicode characters... and I think this is illegal, although GNU gfortran does not complain... what do you think?

I'm guessing it works because your terminal is UTF-8... not 100% sure. I think GFortran has a flag to specify special characters via '\uxxxx' etc...

Yes, I just checked man gfortran:

         -fbackslash
            Change the interpretation of backslashes in string literals from a single backslash character to "C-style" escape characters. The following combinations are expanded "\a", "\b", "\f", "\n", "\r", "\t",
            "\v", "\\", and "\0" to the ASCII characters alert, backspace, form feed, newline, carriage return, horizontal tab, vertical tab, backslash, and NUL, respectively.  Additionally, "\x"nn, "\u"nnnn and
            "\U"nnnnnnnn (where each n is a hexadecimal digit) are translated into the Unicode characters corresponding to the specified code points. All other combinations of a character preceded by \ are
            unexpanded.

This is probably a safer way to do this, but will be a pain to convert... I would assume Intel provides a similar flag, but can't confirm right now what it may be...
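
Another option that avoids both non-ASCII source text and compiler-specific flags might be to build the spinner frames from their code points with the `char` intrinsic. The sketch below is an illustration, not forbear's actual code: the program name and the choice of frames (U+25D0, U+25D3, U+25D1, U+25D2) are assumptions, and it requires a compiler that supports ISO 10646. Re-opening `output_unit` with `encoding='UTF-8'` follows the example in the gfortran documentation.

```fortran
program spinner_frames_sketch
   ! hypothetical example, for illustration only
   use, intrinsic :: iso_fortran_env, only : output_unit
   implicit none
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   ! code points of four half-filled circles, one per spinner frame
   integer, parameter :: code_points(4) = [int(z'25D0'), int(z'25D3'), &
                                           int(z'25D1'), int(z'25D2')]
   character(len=size(code_points), kind=ucs4) :: frames
   integer :: i

   ! the source file stays pure ASCII: each frame is built from its code point
   do i = 1, size(code_points)
      frames(i:i) = char(code_points(i), ucs4)
   end do

   ! ask the runtime to encode this unit as UTF-8 before writing
   open(unit=output_unit, encoding='UTF-8')
   do i = 1, len(frames)
      write(output_unit, '(A)') frames(i:i)
   end do
end program spinner_frames_sketch
```

The cost is readability: the frames are no longer visible in the source, so a comment naming each code point helps.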

Jun 12 '17 14:06 zbeekman