Extend the `ToDatetime` transformer so that it can take a list of datetime formats
Right now, the `ToDatetime` transformer converts strings to datetimes either by guessing the format with pandas' datetime parsing, or by using a single format provided by the user.
In some cases, the user may not know the exact format in use, but still knows that a specific column contains a time series, or knows that only a limited number of formats are possible. I am thinking of a situation where the user has to go through many datasets and convert the datetimes in each of them.
It should not be too difficult to extend ToDatetime so that it can take a list of formats rather than only one, then go through each of them to try to parse.
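To make the idea concrete, here is a minimal sketch of the "try each format in turn" logic. It uses only the standard library rather than skrub or pandas, and `parse_with_formats` is a hypothetical helper name, not an existing API; the assumption is that a format is accepted only if it parses every value in the column.

```python
from datetime import datetime


def parse_with_formats(values, formats):
    """Return (parsed_values, format) for the first format that parses every value.

    Hypothetical helper for illustration only; not part of skrub or pandas.
    """
    for fmt in formats:
        try:
            # strptime raises ValueError as soon as one value does not match
            return [datetime.strptime(v, fmt) for v in values], fmt
        except ValueError:
            continue  # this format failed on some value, try the next one
    raise ValueError(f"none of the formats {formats!r} matched all values")


column = ["31/12/2023", "01/02/2024"]
parsed, used_format = parse_with_formats(column, ["%Y-%m-%d", "%d/%m/%Y"])
```

Here the first format is rejected and the second one is used for the whole column. An actual implementation in `ToDatetime` would presumably rely on pandas or polars parsing instead of `strptime`.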
The case where I have a list of formats (not just one) but the default list used by pandas is not adequate sounds a bit too niche to warrant the added complexity IMO, and it is easily solved by chaining several `ToDatetime` transformers:
# try first format1 then format2 and raise if both fail
[OnEachColumn(ToDatetime(format=format1), allow_reject=True), OnEachColumn(ToDatetime(format=format2), allow_reject=False)]
# or with expressions
e.skb.apply(ToDatetime(format=format1), allow_reject=True).skb.apply(ToDatetime(format=format2), allow_reject=False)
The reason I thought of this was to address the case in which datetimes use locale-specific formats, e.g. French day/month names, which I don't think are parsed properly by pandas' default parsing.
(My actual idea was to use some kind of locale parsing method to translate the names directly, but that turned out to be more complicated than I thought.)
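For illustration, the "translate the names" idea could be sketched roughly as below. This is a standard-library sketch, not something skrub or pandas provides: `FRENCH_TO_ENGLISH` and `translate_month_names` are hypothetical names, only month names are handled, and it assumes the process runs with the default C/English `LC_TIME` setting so that `%B` expects English month names.

```python
import re
from datetime import datetime

# Hypothetical lookup table, months only; day names would need the same treatment
FRENCH_TO_ENGLISH = {
    "janvier": "January", "février": "February", "mars": "March",
    "avril": "April", "mai": "May", "juin": "June",
    "juillet": "July", "août": "August", "septembre": "September",
    "octobre": "October", "novembre": "November", "décembre": "December",
}

# Match whole words only, so that e.g. "mai" is not replaced inside "mais"
_MONTH_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FRENCH_TO_ENGLISH)) + r")\b",
    flags=re.IGNORECASE,
)


def translate_month_names(text):
    """Replace French month names with their English equivalents."""
    return _MONTH_PATTERN.sub(
        lambda m: FRENCH_TO_ENGLISH[m.group(0).lower()], text
    )


# Assumes LC_TIME is the default C locale, where %B matches English names
translated = translate_month_names("14 juillet 2023")
parsed_date = datetime.strptime(translated, "%d %B %Y")
```

A hand-written table like this does not scale to many locales, abbreviations, or day names, which is part of why the approach gets complicated quickly.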
I like the first solution you suggested, I think it should be added to the user guide.
> The reason I thought of this was to address the case in which datetimes use locale-specific formats, e.g. French day/month names, which I don't think are parsed properly by pandas' default parsing.
I see, that makes sense. Would those be parseable by pandas `to_datetime` if the right format is provided? In any case, allowing a list of formats makes sense; my question was more about what proportion of users would need it, but your guess is as good as mine 😅
In the skrub meeting we discussed the problem and decided that it would be useful to have a simpler way to do this directly in the `ToDatetime` transformer, rather than having to chain several transformers after wrapping each of them in `OnEachColumn`.
The "locale-driven" parsing would also be interesting, but for a separate PR and low prio
SkrubxWIDSML sprint: I will work on this issue