perl5 POSIX::strftime() returns empty string if format contains UTF-8 characters


Module: POSIX

Description: The following script prints an empty string instead of the second formatted time:

#!/usr/bin/env perl

use strict;
use warnings;
use POSIX;
use utf8;

my $a = strftime("abc %T", localtime);
my $b = strftime("абв %T", localtime);
print "[$a]\n";
print "[$b]\n";

Result:

[abc 16:42:40]
[]

No problems with perl 5.36.

Steps to Reproduce: Execute the script.

Expected behavior: The second line should contain the formatted time as well:

[abc 16:42:40]
[абв 16:42:40]

Perl configuration

Summary of my perl5 (revision 5 version 40 subversion 2) configuration:

  Platform:
    osname=freebsd
    osvers=14.2-release-p3
    archname=amd64-freebsd-thread-multi
    uname='freebsd 142amd64-default-job-01 14.2-release-p3 freebsd 14.2-release-p3 amd64 '
    config_args='-Darchlib=/usr/local/lib/perl5/5.40/mach -Dcc=cc -Dcf_by=mat [email protected] -Dcf_time=Sun Apr 13 13:07:13 UTC 2025 -Dinc_version_list=none -Dlibperl=libperl.so.5.40.2 -Dman1dir=/usr/local/lib/perl5/5.40/perl/man/man1 -Dman3dir=/usr/local/lib/perl5/5.40/perl/man/man3 -Dprefix=/usr/local -Dprivlib=/usr/local/lib/perl5/5.40 -Dscriptdir=/usr/local/bin -Dsitearch=/usr/local/lib/perl5/site_perl/mach/5.40 -Dsitelib=/usr/local/lib/perl5/site_perl -Dsiteman1dir=/usr/local/lib/perl5/site_perl/man/man1 -Dsiteman3dir=/usr/local/lib/perl5/site_perl/man/man3 -Dusenm=n -Duseshrplib -sde -Ui_iconv -Ui_malloc -Uinstallusrbinperl -Alddlflags=-L/wrkdirs/usr/ports/lang/perl5.40/work/perl-5.40.2 -L/usr/local/lib/perl5/5.40/mach/CORE -lperl -Dshrpldflags=$(LDDLFLAGS:N-L/wrkdirs/usr/ports/lang/perl5.40/work/perl-5.40.2:N-L/usr/local/lib/perl5/5.40/mach/CORE:N-lperl) -Wl,-soname,$(LIBPERL:R) -Doptimize=-O2 -pipe  -fstack-protector-strong -fno-strict-aliasing  -Dusedtrace -Ui_gdbm -Dusemultiplicity=y -Duse64bitint -Dusemymalloc=n -Dusethreads=y'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    optimize='-O2 -pipe -fstack-protector-strong -fno-strict-aliasing '
    cppflags='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='FreeBSD Clang 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-pthread -Wl,-E  -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib /usr/lib/clang/18/lib
    libs=-ldl -lm -lcrypt -lutil
    perllibs=-ldl -lm -lcrypt -lutil
    libc=
    so=so
    useshrplib=true
    libperl=libperl.so.5.40.2
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='  -Wl,-R/usr/local/lib/perl5/5.40/mach/CORE'
    cccdlflags='-DPIC -fPIC'
    lddlflags='-shared  -L/usr/local/lib/perl5/5.40/mach/CORE -lperl -L/usr/local/lib -fstack-protector-strong'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_STRTOLD
    HAS_TIMES
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_REENTRANT_API
  Built under freebsd
  @INC:
    /usr/local/lib/perl5/site_perl/mach/5.40
    /usr/local/lib/perl5/site_perl
    /usr/local/lib/perl5/5.40/mach
    /usr/local/lib/perl5/5.40

May 19 '25 13:05 vvv2542

No repro here:

$ perl bug.pl
[abc 17:38:42]
[абв 17:38:42]

But that's with PERL_UNICODE=SAL in the environment. Without that, the results I'd expect to see are:

$ ( unset PERL_UNICODE; perl bug.pl )
[abc 17:39:39]
Wide character in print at bug.pl line 11.
[абв 17:39:39]

(Because STDOUT has no Unicode-aware output encoding layer by default.)
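As an illustration of the point about output layers (a minimal sketch, not taken from the original script): pushing an encoding layer onto a filehandle makes Perl encode wide characters instead of warning. The helper name `encoded_output` is hypothetical, and `:encoding(UTF-8)` is an assumption; `use open ":locale"` would follow the locale instead.

```perl
use strict;
use warnings;
use utf8;    # the source file itself is assumed to be saved as UTF-8

# Hypothetical helper: print $text through a UTF-8 layer into an in-memory
# buffer and return the raw bytes that the layer produced.
sub encoded_output {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Same idea for the real STDOUT: push an encoding layer before printing,
# so the print below does not trigger "Wide character in print".
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";
print "[абв]\n";
printf "%d characters became %d bytes\n",
    length("абв"), length(encoded_output("абв"));
```

Setting PERL_UNICODE=SAL in the environment has a similar effect on the standard handles without touching the script.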

Do you have any interesting environment variables that might affect the result?

May 19 '25 15:05 mauke

Please try this:

#!/usr/bin/env perl

use strict;
use warnings;
use POSIX;
use utf8;
use open ":locale";

setlocale(LC_TIME(), "C");
my $a = strftime("abc %T", localtime);
my $b = strftime("абв %T", localtime);
print "[$a]\n";
print "[$b]\n";

May 19 '25 15:05 vvv2542

Ah. I can reproduce it with the original code using LC_TIME=C perl bug.pl, so the issue is definitely locale-related.

May 19 '25 16:05 mauke

Bisecting using LC_TIME=C ../perl5/Porting/bisect.pl --start=v5.38.0 --end=v5.40.0 -e 'use POSIX qw(strftime); strftime("\x{100}", localtime) ne "" or die' leads to commit 492c28719682673e6c19378edf6824d79784b33e. @khwilliamson ?

bad - non-zero exit from ./perl -Ilib -e use POSIX qw(strftime); strftime("\x{100}", localtime) ne "" or die
492c28719682673e6c19378edf6824d79784b33e is the first bad commit
commit 492c28719682673e6c19378edf6824d79784b33e
Author: Karl Williamson <[email protected]>
Date:   Wed Apr 26 15:36:54 2023 -0600

    Implement sv_strftime_tm and sv_strftime_ints
    
    These two functions are designed to free the caller from having to know
    anything about the intricacies of handling UTF-8 in using strftime(), as
    they take SV inputs and return an SV with the UTF-8 flag appropriately
    set.
    
    They differ only in that one takes a bunch of integer arguments that
    define the various components of the time; and the other takes a pointer
    to a struct tm.
    
    The POSIX implementation of strftime is converted to use these.

May 19 '25 17:05 mauke

This is not a bug.

Only ASCII characters are legal in the C locale. You are calling strftime with illegal inputs.

If you had checked $! after finding that the return value is empty, you would have found that it is set to "Invalid argument".

If you change the locale to C.UTF-8, where those characters are actually legal, it outputs

[abc 13:27:36]
[абв 13:27:36]

If the called function were a Perl construct, it would have raised a warning. But since it is a POSIX function, it uses the POSIX API, which is to return errors in errno, which is accessed in Perl via $!.

If any action were to result from this ticket, I think it would be to stress in the POSIX man page that warnings don't tend to be generated by failures of these functions, so look at $! instead. That can be inferred from that text already, but an explicit call-out could well be in order.
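A minimal sketch of what checking $! could look like in practice. The helper name `strftime_checked` is hypothetical, and treating "empty result plus non-zero errno" as failure is an assumption of this sketch, not documented POSIX module behavior. Whether the non-ASCII call fails at all depends on the perl version and locale; it reportedly does on the 5.40.2 build above and does not on 5.36.

```perl
use strict;
use warnings;
use utf8;
use POSIX qw(strftime setlocale LC_TIME);

# Hypothetical helper: returns (formatted string, error text or undef).
sub strftime_checked {
    my ($format, @tm) = @_;
    local $! = 0;                            # clear any stale errno first
    my $out = strftime($format, @tm) // '';  # normalize undef to ''
    # Assumption: an empty result with errno set counts as a failure.
    my $err = ($out eq '' && $! != 0) ? "$!" : undef;
    return ($out, $err);
}

binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";
setlocale(LC_TIME, "C");

for my $format ("abc %T", "абв %T") {
    my ($out, $err) = strftime_checked($format, localtime);
    print defined $err ? "strftime failed: $err\n" : "[$out]\n";
}
```

The error text is copied out before the helper returns because `local $!` restores the previous errno value at scope exit.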

May 19 '25 21:05 khwilliamson

FreeBSD manual:

The format string consists of zero or more conversion specifications and ordinary characters. All ordinary characters are copied directly into the buffer.

GNU manual:

The characters of ordinary character sequences (including the null byte) are copied verbatim from format to s.

There is nothing about rejecting some characters according to LC_TIME.

If you think that LC_TIME now should affect not only conversion specifications but also ordinary characters, then this should be described in perldelta as an incompatible change.

May 19 '25 22:05 vvv2542

> Only ASCII characters are legal in the C locale. You are calling strftime with illegal inputs.

Do you have a reference for that? Because I tried it in C and it worked fine:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <locale.h>

int main(void) {
    if (!setlocale(LC_TIME, "C")) {
        fprintf(stderr, "Can't set locale LC_TIME=C\n");
        return EXIT_FAILURE;
    }

    time_t t = time(NULL);
    if (t == -1) {
        fprintf(stderr, "Can't get current time: %s\n", strerror(errno));
        return EXIT_FAILURE;
    }

    struct tm *pt = localtime(&t);
    if (!pt) {
        fprintf(stderr, "Can't localtime(): %s\n", strerror(errno));
        return EXIT_FAILURE;
    }

    char buf[200];
    size_t n = strftime(buf, sizeof buf, "абв %T", pt);

    printf("[%.*s]\n", (int)n, buf);
    return 0;
}

Result:

$ ./a.out 
[абв 08:06:57]

(And even if C isn't aware of UTF-8, these are clearly bytes outside the ASCII range.)

May 20 '25 06:05 mauke

tl;dr: I can safely revert to the old behavior for this case, where the locale is C. For other locales, though, malformed UTF-8 could result, which neither the POSIX nor the C standard points out; hence they are defective.

It can be argued that абв aren't ordinary characters because they aren't characters at all in the C locale. It could also be argued that the bytes that comprise them are actually individual characters in the C locale and that they can be passed through strftime, no harm done. But presuming strftime() will pass these through is skating close to undefined behavior IMO.

The man pages of various platforms aren't the final arbiter of how things should work. (Although we may need special behavior for non-conforming implementations.) For how things should work, first turn to the POSIX standard. It says:

The application shall ensure that the format is a character string,

Implied is that it is a legal character string in the given locale. (An ordinary character is a character that isn't a format specification, as far as I can tell.)

But the POSIX standard also says

For strftime(): The functionality described on this reference page is aligned with the ISO C standard. Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2024 defers to the ISO C standard

The C23 standard is little changed in this regard from C99. It says

The format shall be a multibyte character sequence, beginning and ending in its initial shift state.

"Multibyte" is defined in a way that I find confusing. But I believe it comes down to something that is passable to a libc function via char * as opposed to wchar_t *.

Suppose the locale is fr_FR.iso88591 (French) and we want the name of the month of January. It is "janvier", all ASCII. Appending that to абв won't cause any malformed UTF-8 to be generated, so there isn't a problem.

But if the desired month is February, the answer is "febrièr". The penultimate letter of that is U+00E8 (LATIN SMALL LETTER E WITH GRAVE). In this locale, strftime() will return that as a single byte, \xe8, and add it blindly to the rest of the format, yielding a mixture of legal UTF-8 bytes and one illegal single byte.
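The mixing problem can be sketched without any locale support installed, by concatenating the bytes by hand (a minimal illustration; the helper name `is_valid_utf8` is hypothetical, and the month strings are simply the ones used in the example above, not output from a real strftime() call):

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;                   # utf8::decode works in place
    return utf8::decode($copy) ? 1 : 0;  # false on malformed input
}

# UTF-8 bytes of "абв " (U+0430 U+0431 U+0432, then a space).
my $format_bytes = "\xD0\xB0\xD0\xB1\xD0\xB2 ";

# "janvier" is pure ASCII; the other month name carries a lone
# ISO-8859-1 byte \xE8, as a Latin-1 locale would return it.
my $january  = $format_bytes . "janvier";
my $february = $format_bytes . "febri\xE8r";

printf "january:  %s\n", is_valid_utf8($january)  ? "valid UTF-8" : "malformed";
printf "february: %s\n", is_valid_utf8($february) ? "valid UTF-8" : "malformed";
```

The second string is rejected because \xE8 is a UTF-8 lead byte that must be followed by continuation bytes, and the ASCII "r" that follows is not one.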

May 21 '25 17:05 khwilliamson

Reverting this to the old behavior for the C locale seems reasonable to me.

There should probably be a warning for the unsafe cases that return an empty string.

May 22 '25 16:05 haarg

This will have to wait for next cycle. That’s unfortunate, but this problem was not introduced during this cycle, and we switched to timeboxed releases to avoid dragging out the release by delaying for just one more would-be-good-to-ship-ASAP change appearing at the last minute, every time the last minute is reached.

May 22 '25 16:05 ap

@ap, AFAIK the problem was introduced in this cycle

May 22 '25 19:05 khwilliamson

It's not from this development cycle. The bug report was against 5.40.2 and the bisected commit was already in v5.39.2.

May 22 '25 20:05 mauke

> This will have to wait for next cycle. That’s unfortunate, but this problem was not introduced during this cycle, and we switched to timeboxed releases to avoid dragging out the release for just one more would-be-good-to-ship-ASAP change, repeatedly.

The current dev cycle began here:

commit f22a16ecf4821b7e93d2569f630817a2631fddd9 (HEAD, tag: v5.40.0)
Author:     Graham Knop <[email protected]>
AuthorDate: Sun Jun 9 16:01:09 2024 +0200
Commit:     Graham Knop <[email protected]>
CommitDate: Sun Jun 9 16:01:09 2024 +0200

I.e., June 9 2024

The breaking commit was actually committed on August 17 2024.

commit 492c28719682673e6c19378edf6824d79784b33e
Author:     Karl Williamson <[email protected]>
AuthorDate: Wed Apr 26 15:36:54 2023 -0600
Commit:     Karl Williamson <[email protected]>
CommitDate: Thu Aug 17 09:36:52 2023 -0600

    Implement sv_strftime_tm and sv_strftime_ints

Or, as indicated via git describe, v5.39.1-353-g492c287196.

So this is a problem that appeared within the current dev cycle and, per policy and best practice, must be addressed before the next production release.

May 22 '25 22:05 jkeenan

2023 < 2024

May 22 '25 22:05 mauke

> 2023 < 2024

You are correct.

May 22 '25 22:05 jkeenan