Cannot use ISO 8601 date format
Hi, I have a schema that contains a field 'date' which also specifies an ISO 8601 date format:
{
  "fields": [
    {
      "name": "date",
      "type": "date",
      "description": "Week in ISO-8601 Format",
      "format": "%G-W%V"
    },
...
When doing validation on a data file, I get the following error.
{'type': 'type-error',
'title': 'Type Error',
'description': 'The value does not match the schema '
'type and format for this field.',
'message': 'Type error in the cell "2024-W40" in row '
'"250" and field "date" at position "1": '
'type is "date/%G-W%V"',
'tags': ['#table', '#row', '#cell'],
'note': 'type is "date/%G-W%V"',
'cells': ['2024-W40',
...
It works when I use a non-ISO date format such as "%Y-W%W", but using this would be semantically wrong.
Am I missing something here or is this a known issue? Is there a workaround?
Thank you and kind regards Simon
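To illustrate why "%Y-W%W" is not a drop-in substitute for "%G-W%V", here is a minimal sketch comparing the two numbering schemes around a year boundary. The chosen date is only an example; the ISO label is built from date.isocalendar() rather than strftime, since "%G" and "%V" in strftime depend on the platform's C library.

```python
from datetime import date

d = date(2024, 12, 30)  # a Monday close to the year boundary

# ISO 8601 week date: the week belongs to the year that contains its Thursday.
iso_year, iso_week, _ = d.isocalendar()
iso_label = f"{iso_year}-W{iso_week:02d}"

# C-style week number: weeks start on Monday and are counted within the calendar year.
c_label = d.strftime("%Y-W%W")

print(iso_label)  # 2025-W01
print(c_label)    # 2024-W53
```

For most of 2024 the two labels happen to coincide (1 January 2024 is a Monday), which is probably why the non-ISO format appears to work on this dataset.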
Thanks for the report. I can reproduce this. Looking under the hood, the relevant call is: datetime.strptime(cell, self.format).date()
This call fails with your inputs:
from datetime import datetime
datetime.strptime("2024-W40", "%G-W%V").date()
ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u')
This seems to be a limitation of datetime as mentioned in this SO question
Excerpts :
The parsing in datetime is limited. Look at module dateutil
It looks like dateutil is already a dependency from frictionless, so I don't see any drawback to using dateutil instead.
Just to be sure: for your use case, would it be a problem if "2024-W40" were stored internally as the first day of that week? I guess that if it is only for validation, it should not matter.
Hi, thanks for looking into it! Some more background: this is a dataset that is made available as Open Data, which is why I would like to stick with the existing representation. We also want to use Frictionless to describe its structure and make the schema publicly available as well. In the description, we write that the date is in ISO format, but ideally I would encode that information in the format property as well, so that it's semantically correct. Wouldn't it be possible for frictionless to use dateutil.parser.isoparse() in cases when an ISO date format is used? Thank you Simon
You mean, if the format is set to "default" ?
I'm going to need a little time to think before replying, while I parse the XML documentation linked in the table schema specification myself!
The current code already uses dateutil in this case, with additional asserts and a comment, but I do not know whether something has changed between v1 and v2:
if self.format == "default":
    # Guard against shorter formats supported by dateutil
    assert cell[16] == ":"
    assert len(cell) >= 19
    cell = platform.dateutil_parser.isoparse(cell)
Hi Pierre, I mean not only if it's set to 'default' but if the pattern is an ISO-compliant date pattern ("%G-W%V" is ISO-compliant in my opinion as it represents this ISO format: https://en.wikipedia.org/wiki/ISO_week_date).
I looked a little bit closer into this issue :
- dateutil will not be useful here. What you suggest is not possible, as you would not want to validate a different ISO format than the one you specify in your table schema. From what I understand, dateutil is not based on format strings but on sophisticated heuristics; the aim is to be "forgiving with regards to unlikely input formats" (source), and we seek quite the opposite.
- The SO question linked above suggests using another third-party lib, but replacing the stdlib with an unknown new dependency is really not an option here.
- So what would be left would be a little hack : detect this specific format string, add "-1" to the data, "-%u" to the format to validate it (but it would not work for any variation).
However, all things considered, I'll say that this isn't a bug but a feature : "%G-W%V" is actually not a valid date format, as it indicates an entire week. I have no access to the ISO8601 standard to check whether it makes such a distinction. However strptime does and the error message is explicit about this : the format should contain a weekday directive.
Additionally, the Table Schema specification is very explicit about this:
follow the syntax of standard Python / C strptime.
So if you need to validate this format, I would suggest you preprocess your data before validation (adding -1 after each week date), and validate your column with %G-W%V-%u. Would this work for you?
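The preprocessing step suggested above could look roughly like this. It is a minimal sketch using only the standard library; the helper name parse_iso_week is hypothetical and not part of frictionless.

```python
from datetime import date, datetime


def parse_iso_week(cell: str) -> date:
    """Parse a 'YYYY-Www' string by pinning it to the Monday of that week.

    Hypothetical helper: appends the weekday '-1' so that strptime gets the
    weekday directive it requires alongside %G and %V.
    """
    return datetime.strptime(f"{cell}-1", "%G-W%V-%u").date()


print(parse_iso_week("2024-W40"))  # 2024-09-30
```

A value that does not match the pattern (for example "2024-40") raises ValueError, so the same helper can double as a validity check during preprocessing.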
@pierrecamilleri, @SimonScholler
For an ISO 8601 date/datetime, you can use (date|datetime).fromisoformat(...):
>>> import datetime as dt
>>> dt.date.fromisoformat("2024-W40")
datetime.date(2024, 9, 30)
>>> dt.datetime.fromisoformat("2024-W40")
datetime.datetime(2024, 9, 30, 0, 0)
As mentioned in the docs, the inverse of this is date.isocalendar():
>>> import datetime as dt
>>> dt.date.fromisoformat("2024-W40").isocalendar()
datetime.IsoCalendarDate(year=2024, week=40, weekday=1)
Finally, bringing it all back together in datetime.strptime(...):
>>> import datetime as dt
>>> year, week, weekday = dt.date.fromisoformat("2024-W40").isocalendar()
>>> dt.datetime.strptime(f"{year}-W{week}-{weekday}", "%G-W%V-%u").date()
datetime.date(2024, 9, 30)
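Note that fromisoformat only accepts week strings such as "2024-W40" from Python 3.11 onwards. As a hedged alternative for older interpreters, the sketch below checks the exact "YYYY-Www" shape with a regular expression and then builds the date with date.fromisocalendar (available since Python 3.8). The names ISO_WEEK and iso_week_to_date are hypothetical.

```python
import re
from datetime import date

# Strict 'YYYY-Www' shape: four-digit year, literal '-W', two-digit week.
ISO_WEEK = re.compile(r"([0-9]{4})-W([0-9]{2})")


def iso_week_to_date(cell: str) -> date:
    """Return the Monday of the ISO week described by 'YYYY-Www'."""
    match = ISO_WEEK.fullmatch(cell)
    if match is None:
        raise ValueError(f"not an ISO week string: {cell!r}")
    year, week = map(int, match.groups())
    # fromisocalendar raises ValueError for weeks that do not exist in that year.
    return date.fromisocalendar(year, week, 1)


print(iso_week_to_date("2024-W40"))  # 2024-09-30
```

Unlike a lenient parser, this rejects any other ISO variant (for example "2024-W40-1" or "2024W40"), which matches the goal of validating exactly the format declared in the schema.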