hepcrawl
loader: digest "all" possible date formats
Loader should include some normalization routine to handle dates in different formats.
Expected Behavior
Such a normalization routine would be called for each date field in the record to ensure that the data fit the schema, for example "2017 Sep 1" -> "2017-09-01", "2017-Sep-1" -> "2017-09-01", "2017 Sep-Oct" -> "2017", "01.09.2017" -> "2017-09-01".
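To make the expected mappings concrete, here is a minimal sketch using only the standard library. The function name `normalize_simple_date` and the format list are hypothetical, not part of hepcrawl, and the dotted format is assumed to be day-first. It does not try to handle ranges such as "2017 Sep-Oct".

```python
from datetime import datetime

# Hypothetical (input format, output format) pairs; not taken from hepcrawl.
# %b matches abbreviated month names in the current locale (English assumed).
_FORMATS = [
    ("%Y %b %d", "%Y-%m-%d"),   # "2017 Sep 1"
    ("%Y-%b-%d", "%Y-%m-%d"),   # "2017-Sep-1"
    ("%d.%m.%Y", "%Y-%m-%d"),   # "01.09.2017", assumed to be day-first
    ("%Y-%m-%d", "%Y-%m-%d"),   # already normalized
    ("%Y %b", "%Y-%m"),         # incomplete date: year and month only
    ("%Y", "%Y"),               # incomplete date: year only
]


def normalize_simple_date(text):
    """Return the date in schema form, keeping only the precision given."""
    cleaned = " ".join(text.split())  # collapse stray whitespace
    for in_fmt, out_fmt in _FORMATS:
        try:
            parsed = datetime.strptime(cleaned, in_fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime(out_fmt)
    raise ValueError("Unknown date format: %r" % text)
```

A spider could then pass every date field through this one function instead of carrying its own per-publisher parsing code.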
Current Behavior
I have to admit, I do not know to what extent it is already implemented in hepcrawl. In the harvesting-kit each publisher program has its own normalization code. At DESY we have a hand-written function which tries to catch most of the cases.
Context
We will have to write a lot of spiders. It would save time if we could just map the date fields without thinking about the format.
There now is a date util, in particular normalize_date, that can be used to normalize any (incomplete) date:
In [1]: from inspire_utils.date import normalize_date
In [2]: normalize_date("2017 Sep 1")
Out[2]: '2017-09-01'
In [3]: normalize_date("2017-Sep-1")
Out[3]: '2017-09-01'
In [4]: normalize_date("2017 Sep-Oct")
[...]
ValueError: Unknown string format
In [5]: normalize_date("01.09.2017")
Out[5]: '2017-01-09'
Date ranges are not supported yet. Are they a common occurrence? If so, we need to extend the utils to understand them. Also, the last case is interpreted wrongly (month-first instead of day-first), but it is ambiguous, so we would need to make a choice here. Do you think your interpretation is more common?
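If ranges do turn out to be common, one possible approach is to reduce a range to the precision both ends share, which matches the "2017 Sep-Oct" -> "2017" expectation from the issue. The sketch below is only an illustration: `collapse_month_range` is a hypothetical helper, not part of inspire_utils, and it assumes English month names and the "YYYY Mon-Mon" shape.

```python
import re

# English month abbreviations mapped to month numbers (assumption: English input).
_MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Matches "2017 Sep-Oct", "2017-Sep/Oct" or "2017 September - October".
_MONTH_RANGE = re.compile(
    r"^\s*(\d{4})[ -]([A-Za-z]{3})[a-z]*\s*[-/]\s*([A-Za-z]{3})[a-z]*\s*$"
)


def collapse_month_range(text):
    """Return the part of a 'YYYY Mon-Mon' range shared by both ends.

    Returns None when the text is not a month range, so the caller can
    fall back to the regular single-date normalization.
    """
    match = _MONTH_RANGE.match(text)
    if match is None:
        return None
    year, start, end = match.groups()
    start_num = _MONTHS.get(start.lower())
    end_num = _MONTHS.get(end.lower())
    if start_num is None or end_num is None:
        return None  # three letters that are not a month abbreviation
    if start_num == end_num:
        return "%s-%02d" % (year, start_num)  # both ends in the same month
    return year  # months differ, so only the year is shared
```

A caller could try this helper first and hand the string to normalize_date only when it returns None, which would avoid the ValueError shown above for the range case.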
@fschwenn did you see my question about date ranges?