Improve title-based category search
- Chose a picture
- Entered file name "Nintendo sign at Tokyo branch in Taito"
- The proposed categories did not contain anything related to Nintendo or Taito.
- Waiting some time does not change the proposed categories.
That's because the API searches for "Nintendo sign at Tokyo branch in Taito" instead of "Nintendo" and "sign" and "Tokyo" and "branch" and "Taito".
We would need to split the title into words, remove grammar words such as "the", "is", "first", "to", "into" (these seem to be called stop words), or more generally all short words, and then perform a search for each remaining, seemingly relevant word. This is harder for languages written without spaces (like Japanese), but most file names are in space-separated languages, so for now that's not a big problem.
A bigger problem is that most titles are not in English, which means we would first have to guess the language, and then remove stop words in the context of that language. To avoid making the app too big, we could write a multi-language extractor (for instance using NLTK) and host it on a Wikimedia server.
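A minimal sketch of the splitting and filtering step described above, in Python. The stop word list here is a tiny hand-written assumption for illustration, and `extract_keywords` and `min_length` are hypothetical names, not anything from the app. Each returned word would then be sent to the category search separately.

```python
import re

# Illustrative, deliberately tiny English stop word list (an assumption,
# not a complete list; a real one would come from a library or data file).
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "on", "of", "to", "into", "first", "and", "or"}


def extract_keywords(title, min_length=3):
    """Split a title into words and keep those likely to be useful search terms."""
    # Runs of letters and digits; punctuation, spaces and underscores act as separators.
    words = re.findall(r"[^\W_]+", title)
    keywords = []
    seen = set()
    for word in words:
        key = word.lower()
        # Drop short words, stop words, and case-insensitive repeats.
        if len(word) < min_length or key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(word)
    return keywords


print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
# ['Nintendo', 'sign', 'Tokyo', 'branch', 'Taito']
```

The length filter alone already removes "at" and "in" here, which is the "all small words" shortcut; the stop word set catches longer grammar words such as "into".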
> we would have to first guess the language

Do we? Since we are not just receiving a string from an unknown source but also (kind of) have access to the input side, isn't it possible to retrieve and use the locale/keyboard information? It doesn't always work, though - I could more or less enter Spanish with an English keyboard if I were too lazy to switch (and if I knew how to write in Spanish :)). But I believe it often does.
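To illustrate the locale idea, here is a rough Python sketch that picks a stop word list from a locale tag and falls back to English when the language is not covered. The per-language lists are tiny made-up samples, and `STOP_WORDS_BY_LANGUAGE`, `language_from_locale` and `stop_words_for` are hypothetical names. How the app would actually obtain the locale or keyboard language is not shown.

```python
# Hypothetical, deliberately tiny stop word lists keyed by ISO 639-1 language code.
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "to", "into"},
    "es": {"el", "la", "los", "las", "de", "en", "del", "y"},
    "fr": {"le", "la", "les", "de", "du", "des", "et"},
}


def language_from_locale(locale_tag, default="en"):
    """Reduce a tag such as 'es_ES' or 'fr-CA' to a supported language code."""
    language = locale_tag.replace("-", "_").split("_")[0].lower()
    # Unsupported languages fall back to the default instead of failing.
    return language if language in STOP_WORDS_BY_LANGUAGE else default


def stop_words_for(locale_tag):
    """Return the stop word set to use for a title typed under the given locale."""
    return STOP_WORDS_BY_LANGUAGE[language_from_locale(locale_tag)]


print(language_from_locale("es_ES"))  # es
print(language_from_locale("ja_JP"))  # en (no Japanese list in this sketch)
```

Since the locale can be wrong, as in the lazy-keyboard case above, this would only be a hint, not a replacement for checking the results.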
Another problem is translation - titles may not be in English, while most categories are in English. In theory the search can find the right category by looking at the category's multilingual descriptions, but in reality not many categories have any non-English descriptions. In future this might be resolved by https://meta.wikimedia.org/wiki/Community_Tech/Allow_categories_in_Commons_in_all_languages but for now we'd have to translate (although, as in Nicolas's example, sometimes company names and place names might not need translation/transliteration).
How about searching in English, then in the app's locale, and showing the results of both?
> How about searching in English, then in the app's locale, and showing the results of both?
This sounds like a pragmatic way forward.
Chris.
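For what it's worth, the "search in English, then in the app's locale, show both" idea could look roughly like the Python sketch below. The `search` callable is a stand-in for whatever API call the app really makes, `search_both_languages` is a hypothetical name, and the canned results in `fake_search` are made up purely so the example runs.

```python
from typing import Callable, Iterable, List


def search_both_languages(
    query: str,
    app_language: str,
    search: Callable[[str, str], Iterable[str]],
) -> List[str]:
    """Search in English first, then in the app's language, and merge the results."""
    # Avoid searching twice when the app itself is set to English.
    languages = ["en"] if app_language == "en" else ["en", app_language]
    merged = []
    seen = set()
    for language in languages:
        for category in search(query, language):
            # Keep the first occurrence of each category, preserving order.
            if category not in seen:
                seen.add(category)
                merged.append(category)
    return merged


def fake_search(query, language):
    """Stand-in backend with made-up results, for illustration only."""
    canned = {
        "en": ["Nintendo", "Signs in Tokyo"],
        "ja": ["Signs in Tokyo", "Taito, Tokyo"],
    }
    return canned.get(language, [])


print(search_both_languages("Nintendo Tokyo", "ja", fake_search))
# ['Nintendo', 'Signs in Tokyo', 'Taito, Tokyo']
```

English results come first and duplicates are dropped, so a category found by both searches is only shown once.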
Doesn't the app use the built-in search from the website? Or does it implement its own search logic?
The website's current search engine is MediaSearch. This didn't exist until 2020. I believe the app uses a different API endpoint for search. I don't know if MediaSearch has an API that an app can call.