
Combine all forms of is_utf8_string()

Open khwilliamson opened this issue 1 year ago • 1 comments

khwilliamson avatar Jun 21 '24 18:06 khwilliamson

To facilitate review, I'm posting the part of pod/perlapi.pod pertaining to is_utf8_string and variants when built with this branch:

    "is_utf8_string"
    "is_utf8_string_loc"
    "is_utf8_string_loclen"
    "is_strict_utf8_string"
    "is_strict_utf8_string_loc"
    "is_strict_utf8_string_loclen"
    "is_c9strict_utf8_string"
    "is_c9strict_utf8_string_loc"
    "is_c9strict_utf8_string_loclen"
    "is_utf8_string_flags"
    "is_utf8_string_loc_flags"
    "is_utf8_string_loclen_flags"
        These each return TRUE if the first "len" bytes of string "s" form a
        valid UTF-8 string for varying degrees of strictness, FALSE
        otherwise. If "len" is 0, it will be calculated using strlen(s)
        (which means that if you use this option, "s" can't have embedded
        "NUL" characters and has to have a terminating "NUL" byte). Note
        that all characters being ASCII constitute 'a valid UTF-8 string'.

        Some of the functions also return information about the string.
        Those that have the suffix "_loc" in their names have an extra
        parameter, "ep". If that is not NULL, the function stores into it
        a pointer to where parsing of "s" stopped. If the function is
        returning TRUE, this will be a pointer to the byte immediately after
        the end of "s". If FALSE, it will be the location of the first byte
        that fails the criteria.

        The functions that instead have the suffix "_loclen" have a second
        extra parameter, "el". They act as the plain "_loc" functions do
        with their "ep" parameter, but if "el" is not NULL, the functions
        store into it the number of UTF-8 encoded characters found at the
        point where parsing stopped. If the function is returning TRUE, this
        will be the full count of the UTF-8 characters in "s"; if FALSE, it
        will be the count before the first invalid one.
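
        The "ep" and "el" behavior described above can be sketched with a
        small standalone model. This is only an illustration, not Perl's
        implementation: "toy_utf8_loclen" is a hypothetical name, it uses
        plain C types instead of U8 and STRLEN, and it checks only the
        lead-byte/continuation-byte structure of 1- to 4-byte sequences
        (no overlong, surrogate, or range checks).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "_loclen" calling convention.
 * NOT Perl's validator: structural checks only, 1- to 4-byte forms. */
static bool toy_utf8_loclen(const unsigned char *s, size_t len,
                            const unsigned char **ep, size_t *el)
{
    if (len == 0)
        len = strlen((const char *) s);  /* mirrors the documented len == 0 rule */

    const unsigned char *p   = s;
    const unsigned char *end = s + len;
    size_t chars = 0;
    bool ok = true;

    while (p < end) {
        /* Expected sequence length, derived from the lead byte. */
        size_t need = (*p < 0x80)           ? 1
                    : ((*p & 0xE0) == 0xC0) ? 2
                    : ((*p & 0xF0) == 0xE0) ? 3
                    : ((*p & 0xF8) == 0xF0) ? 4
                    : 0;  /* stray continuation byte or unsupported lead */

        if (need == 0 || (size_t) (end - p) < need) { ok = false; break; }

        size_t i;
        for (i = 1; i < need; i++)
            if ((p[i] & 0xC0) != 0x80)
                break;                   /* not a continuation byte */
        if (i < need) { ok = false; break; }

        p += need;
        chars++;
    }

    if (ep) *ep = p;      /* one past the end on success, else first bad byte */
    if (el) *el = chars;  /* characters accepted before parsing stopped */
    return ok;
}
```

        In this model, the five bytes 'a', 0xC3, 0xA9, 0xFF, 'z' make the
        function return false with "*ep" pointing at the 0xFF byte and
        "*el" set to 2, matching the FALSE case described above.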

        "is_utf8_string" (and "is_utf8_string_loc" and
        "is_utf8_string_loclen") consider Perl's extended UTF-8 to be valid.
        That means that code points above Unicode, surrogates, and
        non-character code points are all considered valid by these functions.
        Problems may arise in interchange with non-Perl applications, or
        (unlikely) between machines with different word sizes.

        "is_strict_utf8_string" (and "is_strict_utf8_string_loc" and
        "is_strict_utf8_string_loclen") consider only Unicode-range (0 to
        0x10FFFF) code points to be valid, with the surrogates and
        non-character code points invalid. This level of strictness is what
        is safe to accept from outside sources that use Unicode rules.

        The forms whose names contain "c9strict" conform to the level of
        strictness given in Unicode Corrigendum #9
        <http://www.unicode.org/versions/corrigendum9.html>. This means
        Unicode-range code points including non-character ones are
        considered valid, but not the surrogates. This level of strictness
        is considered safe for cooperating components that know how the
        other components handle non-character code points.
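
        The three strictness levels can be compared at the code point
        level with a rough sketch. The names "toy_level" and
        "toy_cp_acceptable" are hypothetical and are not part of the Perl
        API; the sketch classifies already-decoded code points rather than
        UTF-8 bytes, and it assumes every value is acceptable under Perl's
        extended UTF-8, ignoring any platform-dependent upper limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three documented levels of strictness. */
typedef enum { LEVEL_EXTENDED, LEVEL_C9STRICT, LEVEL_STRICT } toy_level;

/* UTF-16 surrogates: U+D800 .. U+DFFF. */
static bool cp_is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Unicode non-characters: U+FDD0 .. U+FDEF, plus the last two code
 * points of each plane (U+nFFFE and U+nFFFF). */
static bool cp_is_nonchar(uint64_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Code points beyond the Unicode range. */
static bool cp_is_above_unicode(uint64_t cp)
{
    return cp > 0x10FFFF;
}

static bool toy_cp_acceptable(uint64_t cp, toy_level level)
{
    switch (level) {
    case LEVEL_EXTENDED:   /* assumption: everything is accepted */
        return true;
    case LEVEL_C9STRICT:   /* non-characters allowed, surrogates not */
        return !cp_is_surrogate(cp) && !cp_is_above_unicode(cp);
    case LEVEL_STRICT:     /* Unicode range, minus surrogates and non-characters */
        return !cp_is_surrogate(cp) && !cp_is_above_unicode(cp)
            && !cp_is_nonchar(cp);
    }
    return false;
}
```

        Under this model U+FFFE is accepted at the "c9strict" level but
        rejected at the strict level, while U+D800 and 0x110000 are
        accepted only at the extended level.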

        The forms whose names contain "_flags" allow you to customize the
        acceptable level of strictness. They have an extra parameter,
        "flags" to indicate the types of code points that are acceptable. If
        "flags" is 0, they give the same results as "is_utf8_string" (and
        kin); if "flags" is "UTF8_DISALLOW_ILLEGAL_INTERCHANGE", they give
        the same results as "is_strict_utf8_string" (and kin); and if
        "flags" is "UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE", they give the
        same results as "is_c9strict_utf8_string" (and kin). Otherwise
        "flags" may be any combination of the "UTF8_DISALLOW_*foo*" flags
        understood by "utf8n_to_uvchr", with the same meanings.

        It's better to use one of the non-"_flags" functions if they give
        you the desired strictness, as those have a better chance of being
        inlined by the C compiler.

        See also "is_utf8_invariant_string" and
        "is_utf8_fixed_width_buf_flags".

            bool  is_utf8_string                (const U8 *s, STRLEN len)
            bool  is_utf8_string_loc            (const U8 *s,
                                                 const STRLEN len,
                                                 const U8 **ep)
            bool  is_utf8_string_loclen         (const U8 *s, STRLEN len,
                                                 const U8 **ep, STRLEN *el)
            bool  is_strict_utf8_string         (const U8 *s, STRLEN len)
            bool  is_strict_utf8_string_loc     (const U8 *s, STRLEN len,
                                                 const U8 **ep)
            bool  is_strict_utf8_string_loclen  (const U8 *s, STRLEN len,
                                                 const U8 **ep, STRLEN *el)
            bool  is_c9strict_utf8_string       (const U8 *s, STRLEN len)
            bool  is_c9strict_utf8_string_loc   (const U8 *s, STRLEN len,
                                                 const U8 **ep)
            bool  is_c9strict_utf8_string_loclen(const U8 *s, STRLEN len,
                                                 const U8 **ep, STRLEN *el)
            bool  is_utf8_string_flags          (const U8 *s, STRLEN len,
                                                 const U32 flags)
            bool  is_utf8_string_loc_flags      (const U8 *s, STRLEN len,
                                                 const U8 **ep,
                                                 const U32 flags)
            bool  is_utf8_string_loclen_flags   (const U8 *s, STRLEN len,
                                                 const U8 **ep, STRLEN *el,
                                                 const U32 flags)

jkeenan avatar Jun 23 '24 20:06 jkeenan