extraction-framework icon indicating copy to clipboard operation
extraction-framework copied to clipboard

FlagTemplateParser is widely flawed

Open Nono314 opened this issue 10 years ago • 8 comments

I've just written unit tests for the FlagTemplateParser and many of them miserably fail.

Here are things not working as advertised:

  • {{flag|...}} template with a country code as 1st parameter does not work since it checks the length of the template name instead of the property value
  • 2-letter codes for NL are not supported as claimed because they're looked up in a map that only has 3-letter codes

And other important limitations:

  • the template documentation states that IOC codes are supported along to ISO codes, but they're missing from config, including very common ones such as GER for Germany or SUI for Switzerland
  • there's no support for template i18n as in other parsers: eg frwiki uses {{drapeau|...}} and eswiki uses {{bandera|..}
  • there's also no support for most major language in the config file maybe due to that comment. It however means there could easily be at least a basic support instead of using english country names.

I'm currently working on fixing most of these.

Nono314 avatar Mar 08 '15 22:03 Nono314

The main points above have been addressed by #362

Further improvements with other languages support can be easily achieved by using the Locale class to build an automatically-generated map as done for English (for French there were around 15% entries differing between the locale generated label and the actual country page title, mostly casing and accented characters, essentially affecting minor territories).

Nono314 avatar Mar 12 '15 08:03 Nono314

One more point: fr (and also bg if I'm correct) very often uses a template whose name is the full country name, ie {{Afghanistan}} instead of {{AFG}}. This could be easily handled but would require to instantiate the map right on the beginning, not just on demand when an appropriate template has been found. Not sure what the impact is performance-wise? I guess it was significant since there was this change. @jimkont any idea about this?

Nono314 avatar Mar 12 '15 22:03 Nono314

hmm, the code base is so old, with so many contributors over time and I cannot really tell about the initialization overhead. This could be measured with a quick benchmark

jimkont avatar Mar 16 '15 08:03 jimkont

I have done some benchmarking and the overhead of instantiating the static map, though noticeable, is completely dwarfed by the overhead of the dynamic creation from java locale in en (which is about 25 - 30x when parsing100000 templates).

So, I think:

  • I will enable the early instantiation only where it is needed, ie in fr
  • it may be a good idea to use a static map for en as for other languages

Nono314 avatar Mar 19 '15 20:03 Nono314

What if we keep a config cache for flags of size e.g. 1? We never use different lang configs for the same language?

jimkont avatar Mar 19 '15 20:03 jimkont

Hmm, not sure to understand what you mean here... Where would that cache sit then?

Nono314 avatar Mar 19 '15 23:03 Nono314

I'm on mobile and this is on the to of my head but as a static variable to avoid re initialization

jimkont avatar Mar 20 '15 05:03 jimkont

Oh I see. I better understand the part about different lang configs now. But the framework is supposed to be thread safe and on the server there may be several users testing mappings in different languages at the same time.

Actually the whole config class is static (they all are). But since the front end is a method, it gets called over and over. So I just added an internal mutable map that is used as cache for the code-generated language setup. Now it's just reading the template property that has an overhead :)

Nono314 avatar Mar 20 '15 22:03 Nono314