problem-solving
Datish should have a `.fmt` command and its formatter should be deprecated
Formatting for Datish is currently done via a :formatter named argument at creation that affects its stringification. This makes Date and DateTime unique among Raku built-ins in that they cannot be trusted to have consistent stringification given their contents. There is no way to quickly format a Datish object on an ad hoc basis.
I haven't formed an opinion about whether Datish should have a .fmt command but wanted to provide a (partial?) solution for
There is no way to quickly format a Datish object on an ad hoc basis.
In most cases, .clone seems like it would fill that role. Here's an example (building on the :formatter docs):
my $dt = Date.new: '2015-12-31', :formatter{sprintf "%02d/%02d/%04d", .month, .day, .year };
say $dt.Str; # OUTPUT: «12/31/2015»
say $dt.clone(:formatter{sprintf "%02d/%02d/%04d", .day, .month, .year }); # OUTPUT: «31/12/2015»
How close does that come to addressing the use case for .fmt?
Proposal
As I mentioned on a Rakudo issue, the current formatting of Datish is problematic for several reasons. Consequently, I propose the following
- Deprecate the :formatter option
- Add a method .fmt(Str) that takes in a standard formatting string. POSIX / ISO C have a long-extant format used in strftime that has a number of very common (if technically non-standard) extensions found in GNU and elsewhere.
Rationale for deprecating :formatter
Synopsis 32 (“Temporal”) shows that the root of Date and DateTime was the Perl module DateTime. The formatter seems to have been added during construction in v0.23, whereas previously it required an explicit call to .strftime. This formatter was no doubt enthusiastically taken into Raku during the development phase as a way to show off first-class functions, but the formatter model has substantial drawbacks.
At present, it is an immutable attribute that can only be generated at object creation. Because it is used for .Str, this also means that other modules that may request a Datish object cannot trust the .Str output to be consistent. While cloning with a new formatter is possible, that creates an unfortunate identity crisis. The following is problematic:
my $a = Date.new: :2021year, :1month, :1day, :formatter{'A'};
my $b = $a.clone: :formatter{'B'};
my $s = Set.new: $a, $b;
my $t = Set.new: $a.Str, $b.Str;
say "There are {+$s} unique dates in the set";
say "There are {+$t} unique dates in the set";
# OUTPUT: "There are 1 unique dates in the set"
# OUTPUT: "There are 2 unique dates in the set"
Since Date and DateTime are value types, Set clobbers one of them. But ignoring it also can create surprises as only one formatter will "win" when creating, e.g., a set (or whenever an implementation decided to point them to the same memory location). OTOH, the different string output will mean that
my $a = Date.new: :2021year, :1month, :1day, :formatter{'A'};
my $b = $a.clone: :formatter{'B'} ;
my %h;
my %i{Any};
%h{$a} = 'A'; say %h{$b};
%i{$a} = 'A'; say %i{$b};
# OUTPUT: (Any)
# OUTPUT: A
This is completely unintuitive for a value type. For this reason, the formatter attribute is problematic.
Rationale for a .fmt method
If the formatter is removed, then, of course, there should be a replacement. Many other objects in Raku have a .fmt method (the vast majority of common use ones, in fact, as it exists for, e.g., Cool and List, among others). Naturally, then, we should apply a .fmt method to Datish to allow for easy formatting. It would fit within the norm of how Raku objects work, unlike the former method which was unique.
If we are going to have a format method, there are two possible ways. One is to, like with the old :formatter, take in a Callable. I would advise against this as it would be (a) unique and (b) a callable in such a circumstance could be even more shortly called using .&{…} syntax. Thus, a plain old Str() is probably the best, as .&{…} can be used as an alternate (the docs might even mention such a possibility).
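As a minimal sketch of the .&{…} alternative mentioned above, the following formats an existing Date on an ad hoc basis without touching its :formatter. It uses only sprintf and the accessors already shown in the earlier example; the variable name is illustrative.

```raku
# Ad hoc formatting with .&{…}: the block receives the Date as its topic ($_),
# so .day, .month and .year are called on it without cloning the object.
my $date = Date.new: 2015, 12, 31;

my $formatted = $date.&{ sprintf "%02d/%02d/%04d", .day, .month, .year };
say $formatted;   # 31/12/2015
```

The Date itself is left unchanged, so its .Str output stays the default ISO 8601 form.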
If a Str is used for formatting, the question then is which formatting string. The original Perl DateTime module had used .strftime() for formatting and used the POSIX format. In 2008 (shortly before the creation of S32), it gained the format_cldr() routine, which is much more powerful for producing localized output based on Unicode's CLDR skeleton format. Which should we use? It's not clear from S32, but in Datish's source code, we can see that at some time early in development, strftime was pulled out into a module. I am confident that that is because it provided localized output, which is not appropriate for core.
Since localization is out of the purview of core Raku, I propose using the POSIX format, and limiting it to the POSIX locale (which is, more or less, en-US). CLDR's format is incredibly powerful, but beyond the scope of core and lacks desirable features such as padding that is in POSIX. As well, since the POSIX format is used in ISO C's strftime, its format options are considered fairly baseline and likely to be used in other data formats/protocols. For localized formatting, modules like Intl::Format::DateTime exist.
Another oddity of the :formatter option that came up during discussions on IRC: it doesn't survive round tripping with .raku:
my $a = Date.new: :2021year, :1month, :1day, :formatter{'Modified format'};
say $a;
say $a.raku.EVAL;
# OUTPUT: "Modified format"
# OUTPUT: "2021-01-01"
This is no doubt due to the fact that the formatter is a Callable which doesn't survive serialization, but it's definitely a silent gotcha: (* + 1).raku.EVAL.(1) would complain about executing stub code, but here there is no such complaint. By switching the paradigm for Datish formatting, it is far less likely that users will end up in such a round trip gotcha.
FWIW, as the creator of the Perl 5 DateTime library, I totally agree with @alabamenhu here. The formatter constructor argument was a poor design choice on my part.
I have created a branch with working code and so can have a more concrete proposal. Documentation and extensive test files to be included in roast exist in the working module. This has brought up a few questions that need to be resolved, mainly to what degree we should diverge from the POSIX standard and align ourselves with other common features of implementations, as well as how to handle undefined behavior. I will detail the issues and then summarize the questions/issues at hand. (sorry in advance for the length, but there are a lot of little things).
Modifiers
POSIX defines only two modifiers, + (“add literal + if greater than minimum width”) and 0 (“pad to minimum width with zeros”). Many other strftime implementations define other operators which add little overhead but greater compatibility. They are _ (space padding), - (no padding), ^ (uppercase), and # (flip case). The upper case is generally only used with names of days/months, and the flip case is almost exclusive to lowercasing timezones. I have implemented all of these with almost no extra overhead. Thus, I feel it's useful to include all of them, but because the flip case is almost only used for lowercasing timezones, I'd recommend generalizing to lowercasing if implemented.
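To make the modifier semantics concrete, here is a rough sketch of how a single modifier could be applied to an already-rendered field. The helper name pad-field and its signature are hypothetical and are not taken from the branch; the + modifier is left out, and # is treated as plain lowercasing, which is the generalization suggested above rather than the traditional flip-case behavior.

```raku
# Hypothetical helper (not the code from the branch): apply one modifier
# to a field value, given the minimum width for that field.
sub pad-field(Str() $value, Str $modifier, Int $width --> Str) {
    given $modifier {
        when '0' { '0' x ($width - $value.chars) ~ $value }  # zero padding
        when '_' { ' ' x ($width - $value.chars) ~ $value }  # space padding
        when '-' { $value }                                  # no padding at all
        when '^' { $value.uc }                               # uppercase
        when '#' { $value.lc }                               # assumption: lowercase, not flip case
        default  { $value }
    }
}

say pad-field(5, '0', 2);       # 05
say pad-field(5, '_', 2).raku;  # " 5"
say pad-field('Sun', '^', 3);   # SUN
```

Because x with a non-positive count yields an empty string, values that are already at least as wide as the minimum width pass through unchanged.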
Localizers
E and O don't do anything for the POSIX locale which is what we use. Some non-localized implementations do not recognize them at all and will not produce formatting when they are used. My implementation recognizes, but ignores them, and I'd recommend the same.
Non-POSIX formatters
There are a few formatters included that are not standard in POSIX, but appear nonetheless. Chiefly they are the fractional seconds, %L (milliseconds), %f (microseconds), %N (nanoseconds). The latter, borrowed from Perl, has a slight variation on the minimum width parameter, in that it interprets it as a precision (such that %2N is centiseconds). I highly recommend that these be included, as POSIX has no functionality for fractional seconds. While just %N should be sufficient, Ruby, Python, Perl, etc, support the %L and %f variants, and no implementation that I could find used those format codes for anything else.
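The precision interpretation can be sketched as follows. The sub name fractional-digits is hypothetical, and truncating (rather than rounding) the fraction is an assumption; the text above does not say which the branch does. Under this reading, %L corresponds to a precision of 3, %f to 6, %N to 9, and %2N to 2.

```raku
# Hypothetical helper: render the fractional part of a seconds value
# as exactly $precision digits (truncating, which is an assumption).
sub fractional-digits(Real $second, Int $precision = 9 --> Str) {
    my $fraction = $second - $second.floor;
    my $digits   = ($fraction * 10 ** $precision).floor.Str;
    '0' x ($precision - $digits.chars) ~ $digits   # keep leading zeros
}

say fractional-digits(5.123456, 3);  # 123        (like %L)
say fractional-digits(5.123456, 6);  # 123456     (like %f)
say fractional-digits(5.123456);     # 123456000  (like %N)
say fractional-digits(5.123456, 2);  # 12         (like %2N)
```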
%R is a shorthand for %H:%M, and I'm ambivalent on including. %P, %k, and %l are all shorthands for things that can be done with modifier codes, but are nonetheless common in many implementations (for instance, Ruby or GNU libc).
I'm less inclined to include %+, which is used in some implementations as a shorthand for a much longer formatter sequence and has mostly been supplanted by %c. It is implemented, but because of the overlap with the modifier code, it requires a modifier to be co-present in order to be used, e.g. %-+. This is confusing, and I'd recommend removing it from the proposal (and would be happy to do so).
Beyond Dateish
By defining .fmt on Dateish, we are forced to accept a design-by-contract dilemma: formatting codes are specified for units like hours, minutes, etc, but these do not exist on Dateish attributes, rather on the class DateTime. From an implementation standpoint, this is easily handled: we check if the attribute is available, and use it if so, and fall back to a default value if not (effectively .?minute // 0).
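The fallback described above can be illustrated with a small sketch. The sub name minute-or-default is hypothetical; the .? safe-call operator and // defined-or are standard Raku.

```raku
# .?minute calls the method only if the object has one (otherwise it yields Nil),
# and // supplies the default when the result is undefined.
sub minute-or-default($moment) { $moment.?minute // 0 }

say minute-or-default(Date.new(2021, 1, 1));   # 0
say minute-or-default(
    DateTime.new(:2021year, :1month, :1day, :12hour, :34minute)
);                                             # 34
```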
The problem is that this presupposes the existence of those attributes/method names. The non–design-by-contract solution would be to require classes that do .fmt to implement their own version of .fmt. I don't really like the idea of duplicating the code for Date and DateTime, with the former filling in default values for certain formatters. This is probably the largest architectural hurdle I see. If we are willing to do a design-by-contract solution, I would recommend that we reserve a few attribute names and encourage subclasses to use them for interoperability so that if POSIX defines a new formatter, those classes are readily opted in. (primarily, I'm thinking era, quarter, and cycle).
Timezones
I'd like to propose a small, semi-related change (not currently in the branch I implemented). The formatting code %Z requires a text version of a timezone. Both .timezone and .offset are integers. In hindsight, .timezone should probably have been a (possibly empty) string and .offset an integer. Moving forward, this could be remedied in a backwards compatible manner by redefining .timezone as an IntStr (or as an Int allowing for IntStr), with the string set to the timezone name if it can be obtained (via, for instance, struct tm.tm_zone in BSD/GNU, or a TZ environment variable), and otherwise stringifying to the number's digits. Just as POSIX recognizes some timezone information isn't always available, we could do the same, while still enabling it when it is.
Right now, in my implementation of .fmt, I test .timezone.Str to see if there is an alpha character, and handle it accordingly (my DateTime::Timezones overrides .timezone to provide an IntStr).
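A sketch of that check, assuming a hypothetical helper zone-name: an IntStr still behaves as an Int numerically, while its string half carries the name. Returning an empty string when no name is available is an assumption here, chosen to mirror POSIX's "no bytes" behavior for %Z.

```raku
# Hypothetical helper: return the timezone name if the value carries one,
# or an empty string when it only stringifies to digits.
sub zone-name($timezone --> Str) {
    my $text = $timezone.Str;
    $text ~~ /<alpha>/ ?? $text !! ''
}

my $named = IntStr.new(-18000, 'EST');   # numeric offset plus a name
say $named + 0;                          # -18000
say zone-name($named);                   # EST
say zone-name(-18000).raku;              # ""
```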
Undefined behavior
Like many implementations, I treat the modifier, minimum width, localization code, and format code as four independent elements. In POSIX, not all modifiers are valid for all format codes, and not all format codes take minimum widths. Frankly, it's easier to treat them as independent (both from an implementation POV and a user POV), even if they don't always make sense (uppercasing the year, for instance). This effectively means each format code has a default min-width and modifier, which would need to be documented (and I have done this).
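Treating the four elements independently might look like the following sketch. The sub parse-directive and the exact character sets are assumptions for illustration, not the parser used in the branch; it accepts at most one modifier and only alphabetic format codes.

```raku
# Hypothetical parser: split one directive into modifier, minimum width,
# localization code (E or O), and format code, each matched independently.
sub parse-directive(Str $directive) {
    $directive ~~ /
        ^ '%'
        $<modifier> = [ <[ _ 0 \+ \- \^ \# ]>? ]
        $<width>    = [ \d* ]
        $<locale>   = [ <[EO]>? ]
        $<code>     = [ <[A..Za..z]> ]
        $
    / or return Nil;

    %( modifier => ~$<modifier>, width  => ~$<width>,
       locale   => ~$<locale>,   code   => ~$<code> )
}

say parse-directive('%_6a');   # modifier _, width 6, code a
say parse-directive('%Ey');    # locale E, code y
say parse-directive('%010Y');  # modifier 0, width 10, code Y
```

Absent elements come back as empty strings, which is where the per-code default modifier and width would be filled in.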
Because the padding and capitalization modifiers function independently, and the use of two modifiers is undefined behavior, we could make them independent, such that %_^6a might produce SUN. I'm personally ambivalent.
If the format code is invalid, the entire formatting string is included verbatim. sprintf will error if an invalid code is used, but at least one implementation of strftime keeps the whole format string (e.g. Today is %q. would print as Today is %q.), another just removes the percent sign (Today is q.), and I'm sure I saw one that just blanked the whole block (Today is .). There are good reasons for all four options, all are easily implemented. I currently keep the whole format string.
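The four behaviors can be compared side by side with a toy renderer. Everything here is hypothetical: the render sub, the :on-invalid option and its values, and the fixed %codes table stand in for a real implementation purely to show the differences.

```raku
# Toy table of known codes with fixed values (illustrative only).
my %codes = Y => '2021', m => '01', d => '01';

# Hypothetical renderer: :on-invalid selects how unknown codes are handled.
sub render(Str $format, Str :$on-invalid = 'keep' --> Str) {
    $format.subst(/ '%' (<[A..Za..z]>) /, -> $/ {
        my $code = ~$0;
        if    %codes{$code}:exists   { %codes{$code} }  # known code
        elsif $on-invalid eq 'keep'  { '%' ~ $code }    # keep the directive verbatim
        elsif $on-invalid eq 'strip' { $code }          # drop only the percent sign
        elsif $on-invalid eq 'blank' { '' }             # drop the whole directive
        else  { die 'Invalid format code %' ~ $code }   # sprintf-like error
    }, :g)
}

say render('Today is %q.');                      # Today is %q.
say render('Today is %q.', :on-invalid<strip>);  # Today is q.
say render('Today is %q.', :on-invalid<blank>);  # Today is .
say render('%Y-%m-%d');                          # 2021-01-01
```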
The method name
Thanks to @codesections , another question might be the name. .fmt goes with other objects, but I was recently reminded elsewhere that we do like to have different names for different things. I'm totally ambivalent on this.
Summary
- Which modifier codes should we implement?
  + and 0 are mandatory, _ and - are extremely common, ^ moderately common, # less common.
- If we implement #, do we maintain functionality as flip case, or generalize to lower case?
  The only formatters providing text are names of days/months (by default titlecase) and timezones (per POSIX, could be an abbreviation like CST, could be a full name like America/New_York).
- How do we handle E and O?
  POSIX defines them, they just have no effect in the POSIX locale.
- Which non-POSIX format codes should we include?
  The nonstandards are %L, %f, %N (fractional seconds), %R, %+ (shorthand for expanded formats), and %P, %k, %l (minor but common variations on extant tags).
- How do we handle behavior undefined by POSIX?
  Bad format codes and modifiers/widths on tags not specified to be modified/padded.
- How do we handle the design-by-contract dilemma?
- What should the method name be?
  Will .fmt be confusing?
There are no right or wrong answers IMO to any of these, but if .fmt is added into Raku, we should standardize it for Rakudo and any other future implementations.