Expiration date format

Open hangy opened this issue 5 years ago • 7 comments

I noticed that in revision 29, robotoff added the expiration date 14/06/2019. In the JSON file for my uploaded picture, the date is written as 14.06.2019 (dd.mm.yyyy). Clearly, there's some kind of processing going on. In my opinion, the date should either be written in a format matching the language of the uploaded picture, or be normalized to ISO 8601, so that consumers don't need to play a guessing game about which digit is the day and which is the month. I prefer ISO 8601 for all languages.

Apr 07 '19 15:04 hangy

Indeed, there is some preprocessing going on: first to check that the candidate is really a date, and then to normalize the date format. However, only dates in the %d/%m/%Y format are currently recognized, so mm.dd.yyyy dates are not recognized (unless the month value is also valid as a day, and vice versa).

See https://github.com/openfoodfacts/robotoff/blob/master/robotoff/insights/ocr/expiration_date.py for more info. I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.
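
As a rough illustration of the two steps described above (checking that the candidate is a real date, then normalizing it), here is a minimal Python sketch. The regex and the name `parse_candidate` are assumptions made for this example; they are not the actual code in `expiration_date.py`.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical pattern: two-digit day and month, four-digit year,
# separated by "/", "." or "-". Not the actual robotoff regex.
CANDIDATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")


def parse_candidate(text: str) -> Optional[date]:
    """Parse the first candidate in text as a day-first date.

    Returns None if there is no candidate or if it is not a valid date.
    """
    match = CANDIDATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    try:
        # strptime rejects impossible values such as month 14.
        return datetime.strptime(f"{day}/{month}/{year}", "%d/%m/%Y").date()
    except ValueError:
        return None
```

Under these assumptions, `14.06.2019` parses as 14 June 2019 and `06.14.2019` is rejected because 14 is not a valid month, while an ambiguous value such as `06.07.2019` is silently read as 6 July.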

Apr 08 '19 08:04 raphael0202

The regex full_digits_long does match 14.06.2019 if I test it: https://regexr.com/4br3a Otherwise, robotoff wouldn't have added 14/06/2019 as the product's date.

Apr 08 '19 09:04 hangy

Yes indeed, as I said above, robotoff matches dates of format dd.mm.yyyy.

Apr 08 '19 09:04 raphael0202

Sorry, I must've misread. In that case, robotoff shouldn't replace the separators.

> I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.

I usually do use ISO when editing manually, because it's the only consistent syntax. 😁 But yes, there may be differences. There's an old openfoodfacts-server issue about normalizing the date, but work on it hasn't been started yet.

Apr 08 '19 16:04 hangy

> Yes indeed, as I said above, robotoff matches dates of format dd.mm.yyyy.

In https://de.openfoodfacts.org/produkt/4311501619872/harzer-minis-gut-gunstig?rev=20 the expiration date was updated to 06/07/2019 based on the text 06.07.19 in this image: https://de.openfoodfacts.org/images/products/431/150/161/9872/4.jpg The product's main language is German, and dd/mm/yyyy is not a known date pattern in Germany.

Jun 29 '19 10:06 hangy

I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb82a832498b29f676d917f58ba1c26a13f. Thanks for the report!
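
For readers unfamiliar with the target format, a minimal sketch of this kind of normalization might look as follows. The helper name `to_iso_8601` and its default format are assumptions for illustration; this is not the code from that commit.

```python
from datetime import datetime


def to_iso_8601(raw: str, fmt: str = "%d/%m/%Y") -> str:
    """Parse raw using the given strptime format and return YYYY-MM-DD."""
    # date.isoformat() always produces the ISO 8601 calendar date form.
    return datetime.strptime(raw, fmt).date().isoformat()
```

For example, `to_iso_8601("14/06/2019")` yields `2019-06-14`, which is unambiguous regardless of the reader's locale.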

Regarding the other issue you mention, I think we could change the detection pattern depending on the language detected in the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.
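
One possible shape for this idea, sketched in Python under the assumption that a language code for the image is already available from some upstream detection step. The mapping `FORMATS_BY_LANG`, the fallback list, and the function `parse_for_language` are all hypothetical and only illustrate the proposal.

```python
from datetime import date, datetime
from typing import Dict, List, Optional

# Hypothetical, illustrative mapping from a detected language code to the
# strptime formats accepted for that language.
FORMATS_BY_LANG: Dict[str, List[str]] = {
    "de": ["%d.%m.%Y", "%d.%m.%y"],
    "fr": ["%d/%m/%Y", "%d/%m/%y"],
}
# Fallback used when the language is unknown (also an assumption).
DEFAULT_FORMATS: List[str] = ["%d/%m/%Y", "%d.%m.%Y"]


def parse_for_language(raw: str, lang: str) -> Optional[date]:
    """Try only the formats allowed for lang; return None if none match."""
    for fmt in FORMATS_BY_LANG.get(lang, DEFAULT_FORMATS):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None
```

With this mapping, a slash-separated candidate is ignored when the image is detected as German, while `06.07.19` is still accepted and read as 6 July 2019.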

Nov 15 '19 19:11 raphael0202

> I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb.

Looks good, thank you!

> Regarding the other issue you mention, I think we could change the detection pattern depending on the language detected in the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.

That's probably a good idea. Wikipedia lists several interesting date patterns that could be used for parsing.

Nov 18 '19 10:11 hangy