POSIX::strftime() returns empty string if format contains UTF-8 characters
Module: POSIX
Description The following script prints an empty string instead of the second formatted time:
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX;
use utf8;
my $a = strftime("abc %T", localtime);
my $b = strftime("абв %T", localtime);
print "[$a]\n";
print "[$b]\n";
Result:
[abc 16:42:40]
[]
No problems with perl 5.36.
Steps to Reproduce Execute the script.
Expected behavior Result should be:
[abc 16:42:40]
[абв 16:42:40]
Perl configuration
Summary of my perl5 (revision 5 version 40 subversion 2) configuration:
Platform:
osname=freebsd
osvers=14.2-release-p3
archname=amd64-freebsd-thread-multi
uname='freebsd 142amd64-default-job-01 14.2-release-p3 freebsd 14.2-release-p3 amd64 '
config_args='-Darchlib=/usr/local/lib/perl5/5.40/mach -Dcc=cc -Dcf_by=mat [email protected] -Dcf_time=Sun Apr 13 13:07:13 UTC 2025 -Dinc_version_list=none -Dlibperl=libperl.so.5.40.2 -Dman1dir=/usr/local/lib/perl5/5.40/perl/man/man1 -Dman3dir=/usr/local/lib/perl5/5.40/perl/man/man3 -Dprefix=/usr/local -Dprivlib=/usr/local/lib/perl5/5.40 -Dscriptdir=/usr/local/bin -Dsitearch=/usr/local/lib/perl5/site_perl/mach/5.40 -Dsitelib=/usr/local/lib/perl5/site_perl -Dsiteman1dir=/usr/local/lib/perl5/site_perl/man/man1 -Dsiteman3dir=/usr/local/lib/perl5/site_perl/man/man3 -Dusenm=n -Duseshrplib -sde -Ui_iconv -Ui_malloc -Uinstallusrbinperl -Alddlflags=-L/wrkdirs/usr/ports/lang/perl5.40/work/perl-5.40.2 -L/usr/local/lib/perl5/5.40/mach/CORE -lperl -Dshrpldflags=$(LDDLFLAGS:N-L/wrkdirs/usr/ports/lang/perl5.40/work/perl-5.40.2:N-L/usr/local/lib/perl5/5.40/mach/CORE:N-lperl) -Wl,-soname,$(LIBPERL:R) -Doptimize=-O2 -pipe -fstack-protector-strong -fno-strict-aliasing -Dusedtrace -Ui_gdbm -Dusemultiplicity=y -Duse64bitint -Dusemymalloc=n -Dusethreads=y'
hint=recommended
useposix=true
d_sigaction=define
useithreads=define
usemultiplicity=define
use64bitint=define
use64bitall=define
uselongdouble=undef
usemymalloc=n
default_inc_excludes_dot=define
Compiler:
cc='cc'
ccflags ='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
optimize='-O2 -pipe -fstack-protector-strong -fno-strict-aliasing '
cppflags='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
ccversion=''
gccversion='FreeBSD Clang 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)'
gccosandvers=''
intsize=4
longsize=8
ptrsize=8
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=16
longdblkind=3
ivtype='long'
ivsize=8
nvtype='double'
nvsize=8
Off_t='off_t'
lseeksize=8
alignbytes=8
prototype=define
Linker and Libraries:
ld='cc'
ldflags ='-pthread -Wl,-E -fstack-protector-strong -L/usr/local/lib'
libpth=/usr/lib /usr/local/lib /usr/lib/clang/18/lib
libs=-ldl -lm -lcrypt -lutil
perllibs=-ldl -lm -lcrypt -lutil
libc=
so=so
useshrplib=true
libperl=libperl.so.5.40.2
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs
dlext=so
d_dlsymun=undef
ccdlflags=' -Wl,-R/usr/local/lib/perl5/5.40/mach/CORE'
cccdlflags='-DPIC -fPIC'
lddlflags='-shared -L/usr/local/lib/perl5/5.40/mach/CORE -lperl -L/usr/local/lib -fstack-protector-strong'
Characteristics of this binary (from libperl):
Compile-time options:
HAS_LONG_DOUBLE
HAS_STRTOLD
HAS_TIMES
MULTIPLICITY
PERLIO_LAYERS
PERL_COPY_ON_WRITE
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_SIPHASH13
PERL_HASH_USE_SBOX32
PERL_MALLOC_WRAP
PERL_OP_PARENT
PERL_PRESERVE_IVUV
PERL_USE_SAFE_PUTENV
USE_64_BIT_ALL
USE_64_BIT_INT
USE_ITHREADS
USE_LARGE_FILES
USE_LOCALE
USE_LOCALE_COLLATE
USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC
USE_LOCALE_TIME
USE_PERLIO
USE_PERL_ATOF
USE_REENTRANT_API
Built under freebsd
@INC:
/usr/local/lib/perl5/site_perl/mach/5.40
/usr/local/lib/perl5/site_perl
/usr/local/lib/perl5/5.40/mach
/usr/local/lib/perl5/5.40
No repro here:
$ perl bug.pl
[abc 17:38:42]
[абв 17:38:42]
But that's with PERL_UNICODE=SAL in the environment. Without that, the results I'd expect to see are:
$ ( unset PERL_UNICODE; perl bug.pl )
[abc 17:39:39]
Wide character in print at bug.pl line 11.
[абв 17:39:39]
(Because STDOUT has no Unicode-aware output encoding layer by default.)
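A minimal sketch of that point, assuming UTF-8 is the desired output encoding: once a handle has an encoding layer, wide characters are encoded to bytes and the "Wide character" warning goes away. The example writes to an in-memory handle (`$buf` and `$fh` are names made up for illustration) so the bytes can be inspected; a real script would more likely call `binmode(STDOUT, ":encoding(UTF-8)")`.

```perl
use strict;
use warnings;

# In-memory handle with an explicit UTF-8 output layer; $buf collects the bytes.
my $buf = '';
open(my $fh, '>:encoding(UTF-8)', \$buf)
    or die "Can't open in-memory handle: $!";

# "\x{430}\x{431}\x{432}" is the text "абв" written with escapes.
# The layer encodes it, so no "Wide character in print" warning is issued.
print {$fh} "[\x{430}\x{431}\x{432}]\n";
close($fh) or die "Can't close in-memory handle: $!";

# Each Cyrillic letter takes two bytes in UTF-8; brackets and newline take one each.
printf "%d bytes written\n", length $buf;
```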
Do you have any interesting environment variables that might affect the result?
Please try this:
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX;
use utf8;
use open ":locale";
setlocale(LC_TIME(), "C");
my $a = strftime("abc %T", localtime);
my $b = strftime("абв %T", localtime);
print "[$a]\n";
print "[$b]\n";
Ah. I can reproduce it with the original code using LC_TIME=C perl bug.pl, so the issue is definitely locale-related.
Bisecting using LC_TIME=C ../perl5/Porting/bisect.pl --start=v5.38.0 --end=v5.40.0 -e 'use POSIX qw(strftime); strftime("\x{100}", localtime) ne "" or die' leads to commit 492c28719682673e6c19378edf6824d79784b33e. @khwilliamson ?
bad - non-zero exit from ./perl -Ilib -e use POSIX qw(strftime); strftime("\x{100}", localtime) ne "" or die
492c28719682673e6c19378edf6824d79784b33e is the first bad commit
commit 492c28719682673e6c19378edf6824d79784b33e
Author: Karl Williamson <[email protected]>
Date: Wed Apr 26 15:36:54 2023 -0600
Implement sv_strftime_tm and sv_strftime_ints
These two functions are designed to free the caller from having to know
anything about the intricacies of handling UTF-8 in using strftime(), as
they take SV inputs and return an SV with the UTF-8 flag appropriately
set.
They differ only in that one takes a bunch of integer arguments that
define the various components of the time; and the other takes a pointer
to a struct tm.
The POSIX implementation of strftime is converted to use these.
This is not a bug.
Only ASCII characters are legal in the C locale. You are calling strftime with illegal inputs.
If you had checked $! after finding that the return value is empty, you would have found it set to "Invalid argument".
If you change the locale to C.UTF-8, where those characters are actually legal, it outputs
[abc 13:27:36]
[абв 13:27:36]
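A sketch of switching LC_TIME only around the call, assuming the platform provides a locale named C.UTF-8 (not every system does, and setlocale() returns undef for an unknown name). The helper name `strftime_in_locale` is made up for illustration and is not part of POSIX.pm.

```perl
use strict;
use warnings;
use utf8;
use POSIX qw(strftime setlocale LC_TIME);

# Hypothetical helper: format a time with LC_TIME temporarily set to $locale,
# then restore the previous setting. Returns undef if the locale is unavailable.
sub strftime_in_locale {
    my ($locale, $format, @tm) = @_;
    my $previous = setlocale(LC_TIME);    # one-argument form queries the current value
    return unless defined setlocale(LC_TIME, $locale);
    my $out = strftime($format, @tm);
    setlocale(LC_TIME, $previous) if defined $previous;
    return $out;
}

binmode(STDOUT, ":encoding(UTF-8)");
my $formatted = strftime_in_locale("C.UTF-8", "абв %T", localtime);
print defined $formatted
    ? "[$formatted]\n"
    : "C.UTF-8 is not available on this system\n";
```

Restoring the previous value keeps the change from leaking into later code that formats dates for the user's own locale.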
If the called function were a Perl construct, it would have raised a warning. Since it is a POSIX function, it follows the POSIX convention of reporting errors in errno, which Perl exposes as $!.
If any action were to result from this ticket, I think it would be to stress in the POSIX man page that failures of these functions don't tend to generate warnings, so look at $! instead. That can already be inferred from the text, but an explicit call-out could well be in order.
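A minimal sketch of the "look at $!" advice, not taken from the POSIX.pm documentation. The wrapper name `strftime_checked` is hypothetical. It consults $! only when the result is empty, because a successful call may still leave a stale errno behind. On a Perl that predates the bisected commit (5.36, for example), the second format is expected to succeed rather than report an error.

```perl
use strict;
use warnings;
use utf8;
use POSIX qw(strftime setlocale LC_TIME);

# Hypothetical wrapper: returns the formatted string plus an error message.
# The message is empty unless strftime() returned "" and errno was set.
sub strftime_checked {
    my ($format, @tm) = @_;
    $! = 0;                                  # clear any stale errno first
    my $out = strftime($format, @tm);
    my $err = '';
    if ((!defined $out || $out eq '') && $! != 0) {
        $err = "$!";                         # e.g. "Invalid argument"
    }
    return ($out, $err);
}

setlocale(LC_TIME, "C");
binmode(STDOUT, ":encoding(UTF-8)");
for my $format ("abc %T", "абв %T") {
    my ($out, $err) = strftime_checked($format, localtime);
    print $err ne '' ? "strftime failed: $err\n" : "[$out]\n";
}
```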
FreeBSD manual:
The format string consists of zero or more conversion specifications and ordinary characters. All ordinary characters are copied directly into the buffer.
GNU manual:
The characters of ordinary character sequences (including the null byte) are copied verbatim from format to s.
There is nothing about rejecting some characters according to LC_TIME.
If you think that LC_TIME now should affect not only conversion specifications but also ordinary characters, then this should be described in perldelta as an incompatible change.
Only ASCII characters are legal in the C locale. You are calling strftime with illegal inputs.
Do you have a reference for that? Because I tried it in C and it worked fine:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <locale.h>
int main(void) {
if (!setlocale(LC_TIME, "C")) {
fprintf(stderr, "Can't set locale LC_TIME=C\n");
return EXIT_FAILURE;
}
time_t t = time(NULL);
if (t == -1) {
fprintf(stderr, "Can't get current time: %s\n", strerror(errno));
return EXIT_FAILURE;
}
struct tm *pt = localtime(&t);
if (!pt) {
fprintf(stderr, "Can't localtime(): %s\n", strerror(errno));
return EXIT_FAILURE;
}
char buf[200];
size_t n = strftime(buf, sizeof buf, "абв %T", pt);
printf("[%.*s]\n", (int)n, buf);
return 0;
}
Result:
$ ./a.out
[абв 08:06:57]
(And even if C isn't aware of UTF-8, these are clearly bytes outside the ASCII range.)
tl;dr I can safely revert to the old behavior for this case, where the locale is C. For other locales, though, malformed UTF-8 could result. Neither the POSIX nor the C standard points this out, so both are defective.
It can be argued that абв aren't ordinary characters because they aren't characters at all in the C locale. It could also be argued that the bytes that comprise them are actually individual characters in the C locale and that they can be passed through strftime, no harm done. But presuming strftime() will pass these through is skating close to undefined behavior IMO.
The man pages of various platforms aren't the final arbiter of how things should work (although we may need special behavior for non-conforming implementations). For how things should work, first turn to the POSIX standard. It says:
The application shall ensure that the format is a character string,
Implied is that it is a legal character string in the given locale. (An ordinary character is a character that isn't a format specification, as far as I can tell.)
But the POSIX standard also says
For strftime(): The functionality described on this reference page is aligned with the ISO C standard. Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2024 defers to the ISO C standard
The C23 standard is little changed in this regard from C99. It says
The format shall be a multibyte character sequence, beginning and ending in its initial shift state.
"Multibyte" is defined in a way I find confusing, but I believe it comes down to something that can be passed to a libc function via char * as opposed to wchar_t *.
Suppose the locale is fr_FR.iso88591, French, and we want to get the month of January. It is "janvier", all ASCII. Adding that to абв won't cause any malformed UTF-8 to be generated, so there isn't a problem.
But if the desired month is February, the answer is "febrièr". The penultimate letter of that is U+00E8 (LATIN SMALL LETTER E WITH GRAVE). In this locale, strftime() will return it as a single byte, \xe8, and add it blindly to the rest of the format, yielding a mixture of legal UTF-8 bytes and one illegal single byte.
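The mixing problem can be sketched without installing any locale, by building by hand the bytes such a call would produce. This illustrates the argument only and is not the code in locale.c; the helper name `is_valid_utf8` is made up, and the month strings are the ones used in the comment above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# True if $bytes is well-formed UTF-8 (strict decoding, source left untouched).
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $decoded = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $decoded ? 1 : 0;
}

# The literal part of the format: the UTF-8 bytes of "абв ".
my $literal = encode("UTF-8", "\x{430}\x{431}\x{432} ");

# Month names as an ISO 8859-1 locale would return them: one byte per letter.
my $ascii_month  = "janvier";
my $latin1_month = "febri\xE8r";    # \xE8 is U+00E8 as a single ISO 8859-1 byte

for my $month ($ascii_month, $latin1_month) {
    my $mixed = $literal . $month;
    my $hex   = join " ", map { sprintf "%02X", ord($_) } split //, $mixed;
    printf "%-9s %s\n", is_valid_utf8($mixed) ? "valid" : "malformed", $hex;
}
```

The ASCII month leaves the byte string decodable as UTF-8. The lone E8 byte does not, because in UTF-8 it announces a three-byte sequence that never arrives.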
Reverting this to the old behavior for the C locale seems reasonable to me.
There should probably be a warning for the unsafe cases that return an empty string.
This will have to wait for next cycle. That’s unfortunate, but this problem was not introduced during this cycle, and we switched to timeboxed releases to avoid dragging out the release by delaying for just one more would-be-good-to-ship-ASAP change appearing at the last minute, every time the last minute is reached.
@ap, AFAIK the problem was introduced in this cycle
It's not from this development cycle. The bug report was against 5.40.2 and the bisected commit was already in v5.39.2.
This will have to wait for next cycle. That’s unfortunate, but this problem was not introduced during this cycle, and we switched to timeboxed releases to avoid dragging out the release for just one more would-be-good-to-ship-ASAP change, repeatedly.
The current dev cycle began here:
commit f22a16ecf4821b7e93d2569f630817a2631fddd9 (HEAD, tag: v5.40.0)
Author: Graham Knop <[email protected]>
AuthorDate: Sun Jun 9 16:01:09 2024 +0200
Commit: Graham Knop <[email protected]>
CommitDate: Sun Jun 9 16:01:09 2024 +0200
I.e., June 9 2024
The breaking commit was actually committed on August 17 2024.
commit 492c28719682673e6c19378edf6824d79784b33e
Author: Karl Williamson <[email protected]>
AuthorDate: Wed Apr 26 15:36:54 2023 -0600
Commit: Karl Williamson <[email protected]>
CommitDate: Thu Aug 17 09:36:52 2023 -0600
Implement sv_strftime_tm and sv_strftime_ints
Or, as indicated via git describe, v5.39.1-353-g492c287196.
So this is a problem that appeared within the current dev cycle and, per policy and best practice, must be addressed before the next production release.
2023 < 2024
You are correct.