SDV
`DatetimeFormatter`: When a `ValueError` occurs, the fallback `pd.to_datetime` call can fail due to a format mismatch
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: 1.21
- Python version: 3.8 / 3.12
- Operating System:
Error Description
As seen in this workflow, this error originates in the `DatetimeFormatter` class that we have in `data_processing`.
When a `ValueError` is raised there, we fall back to `pd.to_datetime` without passing the datetime format that was already provided. The fallback then depends on pandas inferring the format, which is not always accurate, as the example below shows.
Therefore, we should make this more robust by:
- Trying to cast the data with the provided format.
- Converting the already parsed datetime to a string.
- If the 'default conversion' of pandas fails, applying the format again in a safer way with `errors='coerce'`, so that no `ValueError` is raised.
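The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual SDV implementation: the function name `format_datetime_column` and its arguments are hypothetical, and it assumes the formatter's job is to return the column as strings in the provided format.

```python
import pandas as pd


def format_datetime_column(data, datetime_format):
    """Hypothetical helper: parse `data` and return it as formatted strings."""
    try:
        # Step 1: parse using the format that was already provided.
        parsed = pd.to_datetime(data, format=datetime_format)
    except ValueError:
        try:
            # Step 2: fall back to the default pandas conversion.
            parsed = pd.to_datetime(data)
        except ValueError:
            # Step 3: re-apply the provided format, but coerce values that
            # do not match to NaT instead of raising a ValueError.
            parsed = pd.to_datetime(data, format=datetime_format, errors='coerce')

    # Convert the parsed datetimes back to strings in the provided format.
    return parsed.dt.strftime(datetime_format)


clean = pd.Series(["31 May 2021", "02 Apr 2021"])
dirty = pd.Series(["31 May 2021", "02 Apr 2021", "not a date"])

formatted_clean = format_datetime_column(clean, "%d %b %Y")
formatted_dirty = format_datetime_column(dirty, "%d %b %Y")
```

With this ordering, values that match the provided format survive even when other values in the column are invalid. The invalid ones end up as missing values rather than aborting the whole conversion.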
Steps to reproduce
import pandas as pd

series = pd.Series(["31 May 2021", "02 Apr 2021"])
pd.to_datetime(series)
...
ValueError: time data "02 Apr 2021" doesn't match format "%d %B %Y", at position 1. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
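The format in the message, `%d %B %Y`, appears to be inferred from the first value: "May" is spelled the same as an abbreviated and as a full month name, so the inferred format expects full month names and then rejects "Apr". For comparison, here is a minimal sketch in which the same series parses once the abbreviated-month format is passed explicitly. It assumes `%d %b %Y` is the format the user provided and an English locale for month names.

```python
import pandas as pd

series = pd.Series(["31 May 2021", "02 Apr 2021"])

# Passing the format explicitly skips format inference entirely.
parsed = pd.to_datetime(series, format="%d %b %Y")

# Render as ISO dates to make the parsed values easy to check.
iso_dates = parsed.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```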