F0038 C2X testcase fails

Open rvjansen opened this issue 3 years ago • 1 comments

c2x in f0038, lib/rxfns/tests_functional/ts_c2x:

failed in test 1: c2x(0123x) = 30313233
failed in test 3: c2x( 101x )=  313031
failed in test 4: c2x( 0123456789abcdefx ) =  30313233343536373839616263646566
failed in test 5: c2x( FFFFx ) =  66666666
failed in test 6: c2x( FFFFFFFFx ) =  6666666666666666

This is what ooRexx does (no problem here):

➜  tests_functional git:(feature/f0038) ✗ rexx ts_c2x.rexx
     4 *-* errors=0
       >L>   "0"
       >>>   "0"
       >=>   ERRORS <= "0"
     8 *-* if c2x('0123'x) \= '0123'
       >L>   "?#"
       >A>   "?#"
       >F>   C2X => "0123"
       >L>   "0123"
       >O>   "\=" => "0"
       >>>   "0"
    14 *-* if c2x( '' )\= ''
       >L>   ""
       >A>   ""
       >F>   C2X => ""
       >L>   ""
       >O>   "\=" => "0"
       >>>   "0"
    19 *-* if c2x( '101'x )\= '0101'
       >L>   "??"
       >A>   "??"
       >F>   C2X => "0101"
       >L>   "0101"
       >O>   "\=" => "0"
       >>>   "0"
    24 *-* if c2x( '0123456789abcdef'x )\= '0123456789ABCDEF'
       >L>   "?#Eg????"
       >A>   "?#Eg????"
       >F>   C2X => "0123456789ABCDEF"
       >L>   "0123456789ABCDEF"
       >O>   "\=" => "0"
       >>>   "0"
    29 *-* if c2x( 'ffff'x )\= 'FFFF'
       >L>   "��"
       >A>   "��"
       >F>   C2X => "FFFF"
       >L>   "FFFF"
       >O>   "\=" => "0"
       >>>   "0"
    34 *-* if c2x( 'ffffffff'x )\= 'FFFFFFFF'
       >L>   "����"
       >A>   "����"
       >F>   C2X => "FFFFFFFF"
       >L>   "FFFFFFFF"
       >O>   "\=" => "0"
       >>>   "0"
    39 *-* return errors<>0
       >V>   ERRORS => "0"
       >L>   "0"
       >O>   "<>" => "0"
       >>>   "0"

rvjansen avatar Feb 10 '22 21:02 rvjansen

I think the problem arises from the incorrect translation of an 'nnnn'x hex value. This is part of the asm (ts_c2x):

* Line 7: {IF} c2x('0123'x) \= '0123'
   * Line 7: c2x('0123'x)
   load r2,1
   load r3,"0123"
   call r4,c2x(),r2
   sne r4,r4,"0123"
   brf l28iffalse,r4
   * Line 8: {THEN}
   * Line 8: errors=errors+1 

'0123'x is translated to load r3, "0123" instead of the appropriate hex value.

Do we already support 'nnnn'x notation?
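To make the expected translation concrete, here is a minimal C sketch of packing the digits of an 'nnnn'x literal into raw bytes, so that '0123'x yields the two bytes 01 23 rather than the four characters "0123". The function names are hypothetical and are not taken from the CREXX compiler; blanks between byte groups, which Rexx allows, are not handled. The left padding of an odd digit count follows the test expectation that c2x('101'x) is '0101'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: pack the digits of a Rexx hex literal (the text
 * between the quotes of 'nnnn'x) into raw bytes. An odd digit count is
 * padded with a leading zero, so "101" becomes the bytes 01 01.
 * Returns the number of bytes written, or -1 on a bad digit or when
 * out is too small.
 */
int hex_literal_to_bytes(const char *digits, unsigned char *out, size_t out_size)
{
    size_t n = strlen(digits);
    size_t nbytes = (n + 1) / 2;
    size_t i = 0, o = 0;

    assert(out != NULL);
    if (nbytes > out_size) return -1;

    if (n % 2 == 1) {                 /* odd count: first digit is a low nibble */
        int v = hex_digit_value(digits[i++]);
        if (v < 0) return -1;
        out[o++] = (unsigned char)v;
    }
    while (i < n) {                   /* remaining digits come in pairs */
        int hi = hex_digit_value(digits[i]);
        int lo = hex_digit_value(digits[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        out[o++] = (unsigned char)((hi << 4) | lo);
        i += 2;
    }
    return (int)o;
}
```

With this, the operand loaded into r3 for '0123'x would be the 2-byte string 01 23, and a byte-level c2x of it gives back "0123".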

Peter-Jacob avatar Mar 03 '22 10:03 Peter-Jacob

I am leaning toward the opinion that we should maybe change c2x to accept only one character as input, like NetRexx does. In classic Rexx the whole function is a can of worms with Unicode.

rvjansen avatar Jun 30 '25 20:06 rvjansen

Look, I’m retired from thinking — now I just code. No architecture, no strategy. Just keyboards and caffeine.

Tell me what you need, I’ll get it done. If it’s nonsense, I’ll let you know — gently, or with a loud laugh, depending on how bad it is.

Peter-Jacob avatar Jun 30 '25 21:06 Peter-Jacob

The conversion functions seem to have a general underlying issue: they operate at the byte level. However, CREXX strings are UTF-8 encoded, meaning a single character may consist of one or more bytes.

Take the REXX instruction: special_char = substr('René', 4, 1)

Here, we've extracted what appears to be one character (é), but in UTF-8, that character typically spans 2 bytes. This raises the question: what should we expect from the expression C2X(special_char)? Should it return the full hexadecimal representation of the character (i.e., both bytes), or just the first byte?

Fortunately, C2X doesn’t take a length parameter, which avoids additional complexity. Still, this example illustrates a broader concern: many conversion functions might not handle multi-byte characters correctly or consistently in a UTF-8 context.

These ambiguities are likely to surface repeatedly on many conversion functions, and I’m currently uncertain how to address them. It may require either refining the behaviour of these functions or introducing clearer documentation and conventions around UTF-8 handling.

Note: I have modified the HEXCHAR instruction to also retrieve the byte length of the UTF-8 character being processed. While this length is not currently used, it provides the necessary flexibility to make decisions about character representation—particularly for multi-byte UTF-8 characters.

string_set_byte_pos(op2R, op3R->int_value);
const char *end = utf8codepoint(op2R->string_value + op2R->string_pos, &ch);
/* byte length = pointer to the next character minus pointer to the current one */
int bytelen = (int)(end - (op2R->string_value + op2R->string_pos));
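As a self-contained illustration of the byte-length idea, the sketch below derives the length of a UTF-8 sequence from its lead byte and then expresses it as a pointer difference. Both function names are hypothetical stand-ins, not the CREXX or utf8.h API, and continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes in the UTF-8 sequence starting at s,
 * judged from the lead byte alone. Returns 0 for a continuation byte or an
 * invalid lead byte.
 */
size_t utf8_sequence_length(const char *s)
{
    unsigned char lead;

    assert(s != NULL);
    lead = (unsigned char)*s;
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or 11111xxx */
}

/*
 * Byte length of the character at byte offset pos, written as a pointer
 * difference: start of the next character minus start of the current one.
 */
size_t char_byte_length_at(const char *str, size_t pos)
{
    const char *start = str + pos;
    const char *end = start + utf8_sequence_length(start);
    return (size_t)(end - start);
}
```

For the string René, the character at byte offset 3 (é) has a length of 2, while the first three characters each have a length of 1.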

Peter-Jacob avatar Jul 01 '25 06:07 Peter-Jacob

Well, what would be really useful is to have a look at Josep Maria's Unicode additions to ooRexx. We have had endless (literally: circular) discussions in the ARB about Unicode. My point of view: Classic Rexx (level C) needs C2X etc. with string input (instead of only one character per call), and Classic Rexx programmers expect the unexpected results due to bytes, ASCII and EBCDIC.

For level B I would like to do what NetRexx did, which is to limit the input to C2X to one character, and to add the gist of what Josep Maria has done with C2U; the same goes for streams. Let's leave the current C2X as it is, because the errors seem to be corner cases.

rvjansen avatar Jul 01 '25 09:07 rvjansen

If you point me to Josep's additions, I can have a look!

Peter-Jacob avatar Jul 01 '25 09:07 Peter-Jacob

Attached is the baseline Unicode approach strategy for crexx ... after all the ARB discussions ...

adesutherland avatar Jul 01 '25 10:07 adesutherland

Adrian, do you have a link to that in .md?

Peter, https://www.rexxla.org/presentations/2024/2024-03-04-The-Unicode-Tools-Of-Rexx.pdf

René.

rvjansen avatar Jul 01 '25 11:07 rvjansen

but I read:

cRexx will not implement the c2u() function, as its functionality would be equivalent to cRexx's specific implementation of the c2x() function.

So we need to move the current C2X to the rxfnsc library and work on the crexx-specific version for level B.

I am certain there are some other things to consider.

best regards,

René.

rvjansen avatar Jul 01 '25 11:07 rvjansen

I have a spreadsheet

adesutherland avatar Jul 01 '25 11:07 adesutherland

That's also fine.

rvjansen avatar Jul 01 '25 12:07 rvjansen

This is an experiment on how we could handle C2X more generically by respecting the actual byte length of UTF-8 characters:

138: René:   52656ec3a9
138: Rene:   52656e65
138: Adrian: 41647269616e
138: Peter:  5065746572

As seen above, multi-byte characters like é are properly represented in their UTF-8 hex form (c3a9), resulting in an accurate byte-level view of the string.
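The byte-level view shown in the experiment can be reproduced with a few lines of C. This is only a sketch of the idea, not the CREXX implementation: the function name is hypothetical, the input is assumed to be a NUL-terminated UTF-8 string, and lowercase hex digits are used to match the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write the lowercase hex form of every byte of the
 * NUL-terminated string in to out. out must hold 2 * strlen(in) + 1 chars.
 * Returns the number of hex digits written.
 */
size_t bytes_to_hex(const char *in, char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t n = strlen(in);
    size_t i;

    assert(out_size >= 2 * n + 1);
    for (i = 0; i < n; i++) {
        unsigned char b = (unsigned char)in[i];
        out[2 * i]     = digits[b >> 4];    /* high nibble */
        out[2 * i + 1] = digits[b & 0x0F];  /* low nibble */
    }
    out[2 * n] = '\0';
    return 2 * n;
}
```

Called on the UTF-8 bytes of René, it produces 52656ec3a9: one pair of digits per byte, so é contributes the two pairs c3 and a9.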

Limitation: This approach is currently not reversible via X2C, since the output lacks information on character boundaries — meaning X2C(C2X("René")) does not reliably reconstruct the original string.

Proposal: When in UTF-8 mode, we could represent each character using a fixed-width 4-byte layout — the maximum number of bytes required to encode any valid UTF-8 character.

This ensures consistency in hex output (e.g., for C2X) and simplifies processing, at the cost of some padding for characters that use fewer bytes.

Benefit: In this setup, we remain backward compatible with the X2C function, since each character occupies exactly 4 bytes in the hex stream. This makes it possible to reliably reconstruct the original string from the hex representation, including for multi-byte characters.
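One possible reading of the fixed-width layout is sketched below in C. The proposal does not say how shorter characters are padded, so padding with leading zero bytes is an assumption here, as are the function names c2x_fixed and x2c_fixed. The input is assumed to be NUL-terminated UTF-8, which also means the NUL character itself cannot be represented in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CELL_DIGITS 8   /* 4 bytes per character, 2 hex digits per byte */

/* Sequence length from the UTF-8 lead byte; 1 for anything unexpected. */
static size_t seq_len(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

static int nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Encode: each character becomes one 8-digit cell, left-padded with "00". */
size_t c2x_fixed(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789abcdef";
    size_t n = strlen(in), i = 0, o = 0;

    assert(out_size >= 1);
    while (i < n) {
        size_t len = seq_len((unsigned char)in[i]);
        size_t pad, k;
        if (len > n - i) len = n - i;            /* truncated sequence */
        assert(o + CELL_DIGITS + 1 <= out_size);
        for (pad = 4 - len; pad > 0; pad--) {    /* padding bytes first */
            out[o++] = '0';
            out[o++] = '0';
        }
        for (k = 0; k < len; k++) {
            unsigned char b = (unsigned char)in[i + k];
            out[o++] = hex[b >> 4];
            out[o++] = hex[b & 0x0F];
        }
        i += len;
    }
    out[o] = '\0';
    return o;
}

/*
 * Decode: read 8-digit cells, drop the leading zero bytes of each cell and
 * emit the rest. Returns bytes written, or -1 on malformed input or when
 * out is too small.
 */
int x2c_fixed(const char *hexstr, char *out, size_t out_size)
{
    size_t n = strlen(hexstr), i, o = 0;

    if (n % CELL_DIGITS != 0) return -1;
    for (i = 0; i < n; i += CELL_DIGITS) {
        int started = 0;
        size_t k;
        for (k = 0; k < CELL_DIGITS; k += 2) {
            int hi = nibble(hexstr[i + k]);
            int lo = nibble(hexstr[i + k + 1]);
            int b;
            if (hi < 0 || lo < 0) return -1;
            b = (hi << 4) | lo;
            if (b == 0 && !started) continue;    /* skip padding byte */
            started = 1;
            if (o + 1 >= out_size) return -1;    /* keep room for the NUL */
            out[o++] = (char)b;
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (int)o;
}
```

Under these assumptions René encodes to 00000052000000650000006e0000c3a9, and decoding that string yields the original five bytes. The cost is visible in the output: 32 hex digits instead of 10.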

This proposal is relatively simple to implement and provides a clear path to consistent and reversible character conversion in UTF-8 mode. Please let me know what you think and how we should proceed.

Peter-Jacob avatar Jul 01 '25 17:07 Peter-Jacob