robotoff
Expiration date format
I noticed that in revision 29, robotoff added the expiration date 14/06/2019. In the JSON file for my uploaded picture, the date is written as 14.06.2019 (dd.mm.yyyy). Clearly, there's some kind of processing going on. In my opinion, the date should either be written in the format used by the language of the uploaded picture, or normalized to ISO 8601, so that consumers don't need to guess which number is the day and which is the month. I prefer ISO 8601 for all languages.
Indeed, there is preprocessing going on: first to check that the candidate is really a date, and then to normalize the date format. However, only dates of the format %d/%m/%Y are currently recognized, so mm.dd.yyyy dates are not recognized (unless the month value is also valid as a day and vice versa).
See https://github.com/openfoodfacts/robotoff/blob/master/robotoff/insights/ocr/expiration_date.py for more info. I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.
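A minimal sketch of the validity check described above, assuming it can be approximated with `datetime.strptime` and the `%d/%m/%Y` format. The helper name `is_valid_date` is hypothetical and is not taken from the robotoff source; see the linked file for the real implementation.

```python
from datetime import datetime


def is_valid_date(candidate: str, fmt: str = "%d/%m/%Y") -> bool:
    """Return True if `candidate` parses as a real calendar date under `fmt`.

    Hypothetical helper for illustration only, not robotoff's actual code.
    """
    try:
        datetime.strptime(candidate, fmt)
    except ValueError:
        return False
    return True


print(is_valid_date("14/06/2019"))  # day 14, month 6
print(is_valid_date("06/14/2019"))  # rejected: there is no month 14
print(is_valid_date("06/07/2019"))  # accepted, but read as 6 July, not 7 June
```

The last call shows the ambiguity: when both numbers are 12 or lower, a month-first date still parses, just with day and month swapped.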
The regex full_digits_long does match 14.06.2019 if I test it: https://regexr.com/4br3a Otherwise, robotoff wouldn't have added 14/06/2019 as the product's date.
Yes indeed, as I said above robotoff matches dates of format dd.mm.yyyy
Sorry, I must've misread. In that case, robotoff shouldn't replace the separators.
I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.
I usually do use ISO when editing manually, because it's the only consistent syntax. 😁 But yes, there may be differences. There's an old openfoodfacts-server issue about normalizing the date, but work on it hasn't started yet.
Yes indeed, as I said above robotoff matches dates of format dd.mm.yyyy
In https://de.openfoodfacts.org/produkt/4311501619872/harzer-minis-gut-gunstig?rev=20 the expiration date was updated to 06/07/2019 based on the text 06.07.19 in this image: https://de.openfoodfacts.org/images/products/431/150/161/9872/4.jpg The product's main language is German, and dd/mm/yyyy is not a known date pattern in Germany.
I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb82a832498b29f676d917f58ba1c26a13f. Thanks for the report!
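As a rough illustration of this kind of normalization (not the code from the commit itself), the sketch below accepts a day-first date with `.`, `/` or `-` as separator and returns the ISO 8601 form. The pattern `DATE_RE` and the function `normalize_to_iso` are assumptions made for this example.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: two-digit day, two-digit month, four-digit year,
# separated by ".", "/" or "-". Not robotoff's actual regex.
DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")


def normalize_to_iso(text: str) -> Optional[str]:
    """Return the first day-first date found in `text` as YYYY-MM-DD, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        # date() rejects impossible dates such as 31 February.
        return date(year, month, day).isoformat()
    except ValueError:
        return None


print(normalize_to_iso("14.06.2019"))
print(normalize_to_iso("14/06/2019"))
```

Both separators yield the same ISO string, so the stored value no longer depends on how the date was printed on the package.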
Regarding the other issue you mention, I think we could choose the detection pattern based on the language detected in the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.
I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb.
Looks good, thank you!
Regarding the other issue you mention, I think we could change the detection pattern given the detected language on the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.
That's probably a good idea. Wikipedia lists several interesting date patterns that could be used for parsing.
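One way the language-dependent idea could look, as a sketch only: a small table maps a detected language to the date formats tried for it. The table `FORMATS_BY_LANG`, its language keys, and the function `parse_date` are illustrative assumptions and cover just three conventions.

```python
from datetime import date, datetime
from typing import Optional

# Illustrative mapping from detected language to strptime formats.
# A real table would need many more languages and patterns.
FORMATS_BY_LANG = {
    "de": ["%d.%m.%Y", "%d.%m.%y"],     # German: day.month.year
    "fr": ["%d/%m/%Y", "%d/%m/%y"],     # French: day/month/year
    "en-US": ["%m/%d/%Y", "%m/%d/%y"],  # US English: month/day/year
}


def parse_date(text: str, lang: str) -> Optional[date]:
    """Try only the formats registered for `lang`; return None if none fit."""
    for fmt in FORMATS_BY_LANG.get(lang, []):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


print(parse_date("06.07.19", "de"))       # parsed as 6 July 2019
print(parse_date("06/07/2019", "de"))     # None: slashes are not tried for German
print(parse_date("06/07/2019", "en-US"))  # parsed as 7 June 2019
```

The same string gives different results depending on the detected language, which is the point: the ambiguity is resolved by context instead of by a single hard-coded pattern.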