OpeningHoursParser icon indicating copy to clipboard operation
OpeningHoursParser copied to clipboard

Be (even) more lenient

Open simonpoole opened this issue 9 years ago • 15 comments

Potential further improvements could be made by consuming:

  • [x] times with missing trailing zeros
  • [x] times with unusual separators (mainly ".",) (there are slightly less than 1000 such OH specs in the test data)
  • [x] times with "h" as separator, see above
  • [ ] 4 digit times
  • [x] PH/SH in weekday list (173 occurrences)
  • [x] utf white space
  • [x] "to" instead of "-"
  • [x] times specified with am/pm
  • [ ] interpret "wd" as weekdays
  • [x] translations of "to"

See https://github.com/simonpoole/OpeningHoursParser/blob/master/test-data/oh.txt-fail for list of strings that currently fail in non-strict mode.

simonpoole avatar Jun 29 '16 20:06 simonpoole

Sounds good. However note that this is not covered by the spec. Refer to opening_hours.js which supports all of those things but will give proper warnings.

ypid avatar Jun 29 '16 20:06 ypid

https://github.com/simonpoole/OpeningHoursFragment/issues/14

simonpoole avatar Jul 20 '17 21:07 simonpoole

Should this include a closing time of 00:00 as well as 24:00? I've noticed that in a number of nodes and I'm not sure if that is an error in the data, or if this parser isn't handling it properly.

TannerYoung avatar Oct 23 '19 14:10 TannerYoung

My comment from https://josm.openstreetmap.de/ticket/18140#comment:17

Concerning "Wd": `Wd is used as a symbol for describing the syntax in ​https://wiki.openstreetmap.org/wiki/Key:opening_hours, but it is not an allowed value. Also, "wd" is negligible concerning the usage statistics: ​https://taginfo.openstreetmap.org/keys/opening_hours#values

simon04 avatar Mar 03 '20 19:03 simon04

In principle Wd could be supported in a non-conflicting way, but it would need a clear definition (and even if that boils down to "the days considered as work days is defined by local convention".

Because now the problem is that the only way to produce a OH grammar conforming output would be to transform it to a list of normal weekdays, so say for DACH:

Wd -> Mo,Tu,We,Th,Fr

or would that be

Wd -> Mo,Tu,We,Th,Fr,Sa

? :-)

See also https://en.wikipedia.org/wiki/Workweek_and_weekend

simonpoole avatar Mar 11 '20 11:03 simonpoole

Some ideas:

  • [x] also support em dash (—)
  • [ ] some values are written like this Mo-Fr 10:00-12:00 Sa 12:00-14:00 (missing separator). Supporting will be problematic in the rare case where it would actually make a difference if either the , or ; makes a difference - where the weekdays collide. So proper handling of it would make it necessary to analyze after the parsing if that is the case, so might be more trouble than it is worth
  • [ ] there are some values with fully-written weekday names, "Monday" etc.
  • [ ] same for Spanish etc. ~~-a is used I think in almost all romanic languages, might be worth to include, f.e. Lunes a Jueves 08:00-16:00~~

westnordost avatar Aug 22 '20 23:08 westnordost

https://github.com/simonpoole/OpeningHoursParser/commit/732edc67d642f28e01658d711abea39b2a5d1bb5 adds some translations of "to" for use in non-strict mode.

I've had a look at supporting 4 digit times, but it is a bad case of diminishing returns, ~10 changes in 160'000 examples and two of those would be regressions because of mix up with years. As this would need some fiddling around with semantic lookup ahead to fix, I think we shouldn't do this.

simonpoole avatar Jun 19 '21 18:06 simonpoole

What do you mean with

~10 changes in 160'000 examples

?

westnordost avatar Jun 19 '21 18:06 westnordost

What do you mean with

~10 changes in 160'000 examples

?

It changed the parsing of ~10 of the values from our test set (there are more values that use 4 digit times but they continue to fail for other reasons).

simonpoole avatar Jun 19 '21 18:06 simonpoole

Ah well, isn't the problem with the test set, that it only includes unique values? For example, the opening hours Mo-Fr 09:00-17:00 are used 15513 times. So it should be 15513 times more important to be able to parse that than for example Mo-su 09:00-22:00 (used 1 time).

In other words, the question is, how many opening hours use 4 digits?

westnordost avatar Jun 19 '21 18:06 westnordost

Ah well, isn't the problem with the test set, that it only includes unique values? For example, the opening hours Mo-Fr 09:00-17:00 are used 15513 times. So it should be 15513 times more important to be able to parse that than for example Mo-su 09:00-22:00 (used 1 time).

In other words, the question is, how many opening hours use 4 digits.

I don't think your argument really holds water, because in the end we are testing constructs, not individual values. And I would say that it is just as important to parse Mo-Fr 09:00-17:00 correctly as it is Mo-su 09:00-22:00.

My argument was that handling 4 digit times causes issues with -legit- values with year values, and that handling illegal values in this case doesn't seem to be worth the effort because the gains are likely to be very small.

simonpoole avatar Jun 19 '21 18:06 simonpoole

Let's say there is a city that has 15514 shops and all of these have the opening hours Mo-Fr 09:00-17:00, except one, which has Mo-su 09:00-22:00. For an app that is being used in that city to display the opening hours, it would certainly be more important that the former opening hours are parsed correctly than the latter.

The test set does not test constructs, but unique taggings. This is insofar a difference as testing for the construct, you'd need to test times like 0600-1200, 0601-1200, 0602-1200, 0603-1200 etc.

westnordost avatar Jun 19 '21 18:06 westnordost

If you want to argue semantics, then lets say "constructs in use", the goal was and is to parse all legal constructs. Given that enumerating all of them is not possible, testing all that are actually used (plus additional specific tests) is a reasonable place holder. Weighting is irrelevant as not parsing a legal construct is a failure regardless of it it occurs once or 1'000'000 times.

But in any case I might actually add this in, because I've found some more cases which can be handled by this which might make the effort of avoiding trashing years worth while.

simonpoole avatar Jun 19 '21 19:06 simonpoole

I created this Kotlin script that parses every single instance of an opening hours value in OSM and reports the result. In Java, the code would look very similar. Maybe you are interested in this:

https://gist.github.com/westnordost/b43d2e5100e913cdd2eb8d2d2eb2d296

With the current version of OpeningHoursParser, it outputs

.....
535447 opening hours
Of these, 97.508272% are supported.

westnordost avatar Jun 19 '21 20:06 westnordost

I just noticed that the retrieved number of opening hours strings by that script is less than one fourth of the actual count of opening hours tagged. Probably just a temporary bug/issue with sophox.

westnordost avatar Jun 19 '21 21:06 westnordost