perl5 Add new API function utf8_to

This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if warnings are enabled or not, and no code in core actually takes that into account

If warnings are enabled:

A zero return can mean both success or failure

    Hence a zero return must be disambiguated.  Success would come
    from the next character being a NUL.

If failure, <retlen> will be -1, so can't be used to find where to
start parsing again.

If disabled:

Both the return and <retlen> will be usable values, but the return
of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
or it could mean that that was the next character in the input and
was successfully decoded.

utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.

The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.

There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.

Aug 25 '24 22:08 khwilliamson

After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement?

(I'm not suggesting that you do this Tony, just wondering if we can and should)

Aug 29 '24 01:08 karenetheridge

It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:

*** WARNING: my_sprintf *** It's safer to use my_snprintf instead

Aug 29 '24 02:08 khwilliamson

The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. uvchr apparently came about as a result of supporting non-ASCII systems, and I presumed was chosen to somehow note the fluidity of the underlying character set. I thought that it had been long enough since utf8_to_uv' had been in use that it could safely be reused again. But I came up with an alternative name in the meantime utf8_to_cp (We would want to make an inversecp_to_utf8. I think cp` is in common enough usage these days that the name would be self-explanatory to modern programmers.

But I don't know. I'm opening this up for discussion

Sep 02 '24 04:09 khwilliamson

If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp".

We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace.

Maybe utf8_decode_uvchr(), but that has it's own potential for confusion.

Maybe utf8_to_uvchr4() to emphasize that this is an alternative to utf8_to_uvchr().

Sep 04 '24 02:09 tonycoz

I'm then inclined to go with utf8_to_uv

Sep 04 '24 19:09 khwilliamson

I changed the name of perl_utf8_to_uv to extended_utf8_to_uv because of likely confusion

Oct 22 '24 14:10 khwilliamson

Except for the documentation nits which aren't hard rejections, I'm otherwise happy with this.

Oct 29 '24 22:10 tonycoz

In testing this, I found a few bugs, which I have force-pushed corrections for. The compare button above hides most irrelevant code changes. Most of the changes were to the pod, either better wording or I found I didn't fully understand how things actually worked.

Concerning the pod concerns. I realized that valid_to_uvchr_buf is an internal function, so its pod needs to stay in perlintern and not be combined with the others. I still believe the pod for the functions these new ones are preferred over should remain combined with these, bug be afterwards. perlapi is not just a reference document, but a teaching one; and I want to persuade people that it is worthwhile to make the effort to convert from the old-style to the new. It's like Raku documentation assuming that the reader is converting from Perl, so goes out of its way to highlight the differences. And, if you aren't interested in seeing the pod for the old, it comes at the end of the group, but before the signatures for all of the items in it. Just stop reading at that point; or if you are interested in the signatures, their appearance is quite distinct from the remainder of the pod, so it is easy to just glance over it to see them.

I'm still working on the tests, which have shown other, pretty obscure bugs in the base code of this function. More pull requests to come on that.

Nov 25 '24 20:11 khwilliamson

perl5
perl5 copied to clipboard

Add new API function utf8_to_uv()

perl5 perl5 copied to clipboard

Add new API function utf8_to_uv()

perl5
perl5 copied to clipboard