perl5
Combine all forms of is_utf8_string()
To facilitate review, I'm posting the part of pod/perlapi.pod pertaining to is_utf8_string and variants when built with this branch:
"is_utf8_string"
"is_utf8_string_loc"
"is_utf8_string_loclen"
"is_strict_utf8_string"
"is_strict_utf8_string_loc"
"is_strict_utf8_string_loclen"
"is_c9strict_utf8_string"
"is_c9strict_utf8_string_loc"
"is_c9strict_utf8_string_loclen"
"is_utf8_string_flags"
"is_utf8_string_loc_flags"
"is_utf8_string_loclen_flags"
These each return TRUE if the first "len" bytes of string "s" form a
valid UTF-8 string for varying degrees of strictness, FALSE
otherwise. If "len" is 0, it will be calculated using strlen(s),
which means that "s" can't contain embedded "NUL" characters and
must end with a terminating "NUL" byte. Note
that all characters being ASCII constitute 'a valid UTF-8 string'.
Some of the functions also return information about the string.
Those that have the suffix "_loc" in their names have an extra
parameter, "ep". If that is not NULL, the function stores into it
the location of how far it got in parsing "s". If the function is
returning TRUE, this will be a pointer to the byte immediately after
the end of "s". If FALSE, it will be the location of the first byte
that fails the criteria.
The functions that instead have the suffix "_loclen" have a second
extra parameter, "el". They act as the plain "_loc" functions do
with their "ep" parameter, but if "el" is not null, the functions
store into it the number of UTF-8 encoded characters found at the
point where parsing stopped. If the function is returning TRUE, this
will be the full count of the UTF-8 characters in "s"; if FALSE, it
will be the count before the first invalid one.
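The "ep" and "el" contract described above can be sketched in standalone C. This is a hypothetical stand-in, not Perl's implementation: the names sketch_char_len and sketch_is_utf8_loclen are invented for illustration, plain unsigned char and size_t replace U8 and STRLEN, and the validator only understands standard UTF-8 up to U+10FFFF (it rejects overlong forms and surrogates, and knows nothing about Perl's extended UTF-8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length (1 to 4) of the well-formed character
 * starting at p, or 0 if the bytes before `end` do not start one.
 * Assumes p < end. */
static size_t sketch_char_len(const unsigned char *p, const unsigned char *end)
{
    size_t need, i;
    unsigned long cp;

    if (p[0] < 0x80) return 1;                      /* ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { need = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; }
    else return 0;                                  /* bad start byte */

    if ((size_t)(end - p) < need) return 0;         /* truncated */
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;        /* not a continuation */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (need == 2 && cp < 0x80) return 0;           /* overlong forms */
    if (need == 3 && cp < 0x800) return 0;
    if (need == 4 && cp < 0x10000) return 0;
    if (cp > 0x10FFFF) return 0;                    /* above Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate */
    return need;
}

/* Hypothetical analogue of a "_loclen" function: returns true if s is
 * entirely valid; stores how far parsing got in *ep and how many
 * characters were seen in *el when those pointers are non-NULL. */
static bool sketch_is_utf8_loclen(const unsigned char *s, size_t len,
                                  const unsigned char **ep, size_t *el)
{
    const unsigned char *p = s;
    const unsigned char *end;
    size_t count = 0;
    bool ok = true;

    assert(s != NULL);
    if (len == 0) len = strlen((const char *)s);    /* needs a NUL terminator */
    end = s + len;

    while (p < end) {
        size_t n = sketch_char_len(p, end);
        if (n == 0) { ok = false; break; }          /* p is the first bad byte */
        p += n;
        count++;
    }
    if (ep) *ep = p;        /* one past the end on success */
    if (el) *el = count;    /* characters before the stopping point */
    return ok;
}
```

On failure, a caller can report the byte offset as `ep - s` and the number of good characters as `el`; on success, `ep` equals `s + len`.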
"is_utf8_string" (and "is_utf8_string_loc" and
"is_utf8_string_loclen") consider Perl's extended UTF-8 to be valid.
That means that code points above Unicode, surrogates, and
non-character code points are all considered valid by these functions.
Problems may arise in interchange with non-Perl applications, or
(unlikely) between machines with different word sizes.
"is_strict_utf8_string" (and "is_strict_utf8_string_loc" and
"is_strict_utf8_string_loclen") consider only Unicode-range (0 to 0x10FFFF)
code points to be valid, with the surrogates and non-character code
points invalid. This level of strictness is what is safe to accept
from outside sources that use Unicode rules.
The forms whose names contain "c9strict" conform to the level of
strictness given in Unicode Corrigendum #9
<http://www.unicode.org/versions/corrigendum9.html>. This means
Unicode-range code points including non-character ones are
considered valid, but not the surrogates. This level of strictness
is considered safe for cooperating components that know how the
other components handle non-character code points.
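To make the three strictness levels concrete, here is a minimal sketch that classifies a single, already-decoded code point. It is an illustration only: the enum and function names are hypothetical, and it says nothing about how Perl actually implements the checks. The code-point ranges used (surrogates U+D800..U+DFFF, non-characters U+FDD0..U+FDEF plus the last two code points of each plane, and anything above U+10FFFF) are the standard Unicode definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three levels described in the text. */
typedef enum {
    SKETCH_EXTENDED,   /* like is_utf8_string: everything accepted      */
    SKETCH_C9STRICT,   /* like is_c9strict_utf8_string                  */
    SKETCH_STRICT      /* like is_strict_utf8_string                    */
} sketch_level;

static bool sketch_is_surrogate(unsigned long cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

static bool sketch_is_above_unicode(unsigned long cp)
{
    return cp > 0x10FFFF;
}

/* U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane. */
static bool sketch_is_nonchar(unsigned long cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* True if cp is acceptable at the given level. */
static bool sketch_cp_acceptable(unsigned long cp, sketch_level level)
{
    switch (level) {
    case SKETCH_EXTENDED:
        return true;
    case SKETCH_C9STRICT:
        return !sketch_is_surrogate(cp) && !sketch_is_above_unicode(cp);
    case SKETCH_STRICT:
        return !sketch_is_surrogate(cp) && !sketch_is_above_unicode(cp)
            && !sketch_is_nonchar(cp);
    }
    assert(0 && "unknown level");
    return false;
}
```

For example, U+FFFE passes at the C9 level but fails at the strict level, while U+D800 and 0x110000 pass only at the extended level.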
The forms whose names contain "_flags" allow you to customize the
acceptable level of strictness. They have an extra parameter,
"flags" to indicate the types of code points that are acceptable. If
"flags" is 0, they give the same results as "is_utf8_string" (and
kin); if "flags" is "UTF8_DISALLOW_ILLEGAL_INTERCHANGE", they give
the same results as "is_strict_utf8_string" (and kin); and if
"flags" is "UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE", they give the
same results as "is_c9strict_utf8_string" (and kin). Otherwise
"flags" may be any combination of the "UTF8_DISALLOW_*foo*" flags
understood by "utf8n_to_uvchr", with the same meanings.
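The idea of a "flags" bitmask selecting which classes of code point to reject can be sketched as follows. Everything here is an assumption made for illustration: the SKETCH_DISALLOW_* names and their bit values are invented and are not Perl's UTF8_DISALLOW_* constants, and the sketch assumes the two composite flags behave like bitwise ORs of per-class flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-class flags; the values are arbitrary. */
#define SKETCH_DISALLOW_SURROGATE 0x1u
#define SKETCH_DISALLOW_NONCHAR   0x2u
#define SKETCH_DISALLOW_SUPER     0x4u   /* above U+10FFFF */

/* Hypothetical composites mirroring the two named strictness levels. */
#define SKETCH_DISALLOW_C9 \
    (SKETCH_DISALLOW_SURROGATE | SKETCH_DISALLOW_SUPER)
#define SKETCH_DISALLOW_INTERCHANGE \
    (SKETCH_DISALLOW_C9 | SKETCH_DISALLOW_NONCHAR)

/* True if the decoded code point cp is acceptable under `flags`.
 * flags == 0 accepts everything, like the plain is_utf8_string level. */
static bool sketch_cp_allowed(unsigned long cp, unsigned int flags)
{
    /* Reject bits this sketch does not know about. */
    assert((flags & ~SKETCH_DISALLOW_INTERCHANGE) == 0);

    if ((flags & SKETCH_DISALLOW_SURROGATE) && cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    if ((flags & SKETCH_DISALLOW_SUPER) && cp > 0x10FFFF)
        return false;
    if ((flags & SKETCH_DISALLOW_NONCHAR)
        && ((cp >= 0xFDD0 && cp <= 0xFDEF)
            || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE)))
        return false;
    return true;
}
```

The benefit of a mask over fixed variants is that it can express combinations none of the named functions offer, such as rejecting only non-characters while still accepting surrogates.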
It's better to use one of the non-"_flags" functions if they give
you the desired strictness, as those have a better chance of being
inlined by the C compiler.
See also "is_utf8_invariant_string",
"is_utf8_fixed_width_buf_flags",
    bool  is_utf8_string(const U8 *s, STRLEN len)
    bool  is_utf8_string_loc(const U8 *s,
                             const STRLEN len,
                             const U8 **ep)
    bool  is_utf8_string_loclen(const U8 *s, STRLEN len,
                                const U8 **ep, STRLEN *el)
    bool  is_strict_utf8_string(const U8 *s, STRLEN len)
    bool  is_strict_utf8_string_loc(const U8 *s, STRLEN len,
                                    const U8 **ep)
    bool  is_strict_utf8_string_loclen(const U8 *s, STRLEN len,
                                       const U8 **ep, STRLEN *el)
    bool  is_c9strict_utf8_string(const U8 *s, STRLEN len)
    bool  is_c9strict_utf8_string_loc(const U8 *s, STRLEN len,
                                      const U8 **ep)
    bool  is_c9strict_utf8_string_loclen(const U8 *s, STRLEN len,
                                         const U8 **ep, STRLEN *el)
    bool  is_utf8_string_flags(const U8 *s, STRLEN len,
                               const U32 flags)
    bool  is_utf8_string_loc_flags(const U8 *s, STRLEN len,
                                   const U8 **ep,
                                   const U32 flags)
    bool  is_utf8_string_loclen_flags(const U8 *s, STRLEN len,
                                      const U8 **ep, STRLEN *el,
                                      const U32 flags)