Autocomplete: Multi-lang search (based on user's lang)

Open Joxit opened this issue 5 years ago • 19 comments

Transcription of https://github.com/pelias/api/issues/127#issuecomment-490002821

What is this for ?

We want Pelias to answer queries written in languages other than English. For example, a Dutch speaker searching for Parijs (Paris in Dutch) should get Parijs, Frankrijk.

What should we do ?

  • [x] Use multi_match in autocomplete queries (done in #1300)
  • [ ] ~Should search in name.$LANG index with higher boost~
  • [ ] ~Should search in name.default index with standard boost~
  • [ ] ~Should search in name.en index as fallback with lower boost (when $LANG is not en)~
  • [x] Should return name.$LANG when available, default otherwise (done in #1301)
  • [ ] Find a way to match parent names, e.g. Parijs, Frankrijk.
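
As a rough illustration of the first checklist item, the sketch below builds a `multi_match` clause that targets both the default field and the field for the user's language. The helper name `buildNameMultiMatch` and its parameters are hypothetical and not taken from the Pelias codebase; the `phrase.default` / `phrase.$LANG` field names, the `peliasQuery` analyzer, and the `boost` and `slop` values mirror the debug queries quoted later in this thread.

```javascript
// Hypothetical helper, not the actual Pelias implementation.
// Builds a multi_match clause over the default phrase field and,
// when a language is given, the matching language-specific field.
function buildNameMultiMatch(text, lang) {
  const fields = ['phrase.default'];
  if (lang && lang !== 'default') {
    fields.push(`phrase.${lang}`);
  }
  return {
    multi_match: {
      type: 'phrase',
      query: text,
      fields,
      analyzer: 'peliasQuery',
      boost: 1,
      slop: 3
    }
  };
}

// Example: a Dutch user typing "Parijs"
console.log(JSON.stringify(buildNameMultiMatch('Parijs', 'nl'), null, 2));
```

With `lang` set to `nl`, the clause searches `phrase.default` and `phrase.nl`; without a language it only searches `phrase.default`.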

Some use cases

| Text | Lang | Result | Status |
| --- | --- | --- | --- |
| Parijs | nl | Parijs, Frankrijk (whosonfirst:locality:101751119 Paris) | KO |
| Londre | fr | Londres, Angleterre, Royaume-Uni (whosonfirst:locality:101750367 London) | KO |
| ブラジル | ja | ブラジル (whosonfirst:country:85633009 Brazil) | OK |

cc @mihneadb Have you started working on it ? I can take the task if you want :smile:

Joxit avatar May 14 '19 03:05 Joxit

@Joxit I haven't, I was waiting for some clarifications in the other issue and I started working on some other stuff now. So sure, go ahead and take it, thanks a ton for doing this! :) 🥂

mihneadb avatar May 15 '19 07:05 mihneadb

It seems that in order to validate this issue, all the importers must support the multi-lang index.

At this time, only OSM supports it.

  • WOF will be supported with pelias/whosonfirst#446.
  • Geonames needs the alternateNamesV2 file to add multi-lang support (do we want that?).
  • OpenAddresses and Polylines: unavailable.

I think the most important importer is WOF, since city/country search is the most common use case of the geocoder.

Joxit avatar May 20 '19 13:05 Joxit

@Joxit FWIW there seems to be some level of support for that already - when looking for something, passing e.g. lang=en or lang=ru yields the same name but the city name is translated.

https://pelias.github.io/compare/#/v1/autocomplete%3Flang=en&text=red%20square%20moscow vs https://pelias.github.io/compare/#/v1/autocomplete%3Flang=ru&text=red%20square%20moscow (see label)

I thought that data was based on WOF.

mihneadb avatar May 20 '19 13:05 mihneadb

Yes, this is done by pelias/placeholder, which is a middleware that translates Elasticsearch responses for the user (using WOF ids). This issue is about ES requests, not responses. That means that when you use lang=ru and search for red square Москва, you will not find the correct venue (geonames:venue:6295575).

The data is present in WOF but not indexed in ES; only the default name and the English variant are currently indexed. That's why I opened pelias/whosonfirst#446 :smile:

Joxit avatar May 20 '19 13:05 Joxit

Gotcha now, thanks! About that, I'm thinking we should also return Кра́сная пло́щадь if someone searches red square lang=ru, would you agree? I'm thinking this should be easier to achieve - building on what you pointed out about the middleware. I can make a PR if so.

mihneadb avatar May 20 '19 13:05 mihneadb

I think the API can return the name.{lang} index when it's available in OSM, but for Geonames, it will be a bit more tricky because we do not use it anywhere. Maybe this can be added in placeholder ? But we will have conflicts with WOF data...

Joxit avatar May 20 '19 14:05 Joxit

I was thinking about it at a higher level. Simplest seems to me to update geojsonify here: https://github.com/pelias/api/blob/master/helper/geojsonify.js#L55-L60

Instead of going for default, prioritize req.lang?
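
A minimal sketch of that idea, assuming the document's `name` property is an object keyed by language code with a `default` entry. The function `pickName` is hypothetical and is not the code in geojsonify.js; it only illustrates preferring the requested language and falling back to the default.

```javascript
// Hypothetical sketch: prefer name[lang], fall back to name.default.
// Assumes `name` looks like { default: 'Paris', nl: 'Parijs' }.
function pickName(name, lang) {
  if (!name) {
    return undefined;
  }
  // Only accept own properties so keys like "constructor" never match.
  const hasLang = Boolean(lang) && Object.prototype.hasOwnProperty.call(name, lang);
  const value = hasLang ? name[lang] : name.default;
  // Assumption: a value may be a string or an array of aliases;
  // in the array case, use the first alias.
  return Array.isArray(value) ? value[0] : value;
}

console.log(pickName({ default: 'Paris', nl: 'Parijs' }, 'nl')); // Parijs
console.log(pickName({ default: 'Paris', nl: 'Parijs' }, 'ja')); // Paris
```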

mihneadb avatar May 20 '19 14:05 mihneadb

Hi everyone! Any update on this one? LMK if I can help some way.

mihneadb avatar Aug 01 '19 09:08 mihneadb

Hi @mihneadb, unfortunately, it's me who's the blocker here. I would like to land https://github.com/pelias/api/pull/1287 before merging this (it's a complex change, but I'm planning on doing the final testing and merging next week).

It's really not ideal to hold back another PR, especially a community contribution, but it makes sense for us in this case because the PR I linked is a massive refactoring of how autocomplete queries are generated.

We are sometimes a little over-cautious with merging big PRs but it's our responsibility to ensure compatibility and reliability for organisations running Pelias in a production environment with user-facing traffic.

missinglink avatar Aug 01 '19 12:08 missinglink

Oh actually I thought this was another PR, but the same still applies to this one ;)

missinglink avatar Aug 01 '19 12:08 missinglink

@missinglink thanks for the transparency! Looking forward to using the new parser! :)

mihneadb avatar Aug 01 '19 12:08 mihneadb

@missinglink Any news on this?

slvlirnoff avatar Sep 18 '19 06:09 slvlirnoff

I've been sick this week but releasing the new parser is a top priority.

missinglink avatar Sep 19 '19 09:09 missinglink

Hi,

I came across an issue related to this today. I was looking for Edo Tokyo Museum and could not find any result. I realized that I had to search for 江戸東京博物館 in order to find it.

Any ETA for this feature?

bboure avatar Oct 05 '20 14:10 bboure

Hi @bboure, this part of the feature is already live if you are running your query with lang=en. I found a difference in the ES query between the English version and the Kanji version.

{
  "constant_score": {
    "filter": {
      "multi_match": {
        "type": "cross_fields",
        "query": "Museum",
        "fields": [
          "parent.country.ngram^1",
          "parent.dependency.ngram^1",
          "parent.macroregion.ngram^1",
          "parent.region.ngram^1",
          "parent.county.ngram^1",
          "parent.localadmin.ngram^1",
          "parent.locality.ngram^1",
          "parent.borough.ngram^1",
          "parent.neighbourhood.ngram^1",
          "parent.locality_a.ngram^1",
          "parent.region_a.ngram^1",
          "parent.country_a.ngram^1",
          "name.default^1.5"
        ],
        "analyzer": "peliasQuery"
      }
    }
  }
}

In the must clause, name.en^1.5 is missing.

What is still missing now is multi-lang support in the parent hierarchy.

Joxit avatar Oct 05 '20 15:10 Joxit

@Joxit Thanks for reaching back.

Adding lang=en does not work either, though. The query does not include name.en^1.5.

https://pelias.github.io/compare/#/v1/autocomplete?layers=venue&lang=en&text=Edo+Tokyo+Museum&debug=1

Am I doing something wrong?

bboure avatar Oct 05 '20 15:10 bboure

Interestingly, searching for Edo Tokyo Museum, Tokyo works.

It has to do with how the query is built.

Edo Tokyo Museum, Tokyo:

"must": [
  {
    "multi_match": {
      "type": "phrase",
      "query": "edo Tokyo Museum",
      "fields": [
        "phrase.default",
        "phrase.en"
      ],
      "analyzer": "peliasQuery",
      "boost": 1,
      "slop": 3
    }
  },
  {
    "multi_match": {
      "type": "cross_fields",
      "query": "Tokyo",
      "fields": [
        "parent.country.ngram^1",
        "parent.dependency.ngram^1",
        "parent.macroregion.ngram^1",
        "parent.region.ngram^1",
        "parent.county.ngram^1",
        "parent.localadmin.ngram^1",
        "parent.locality.ngram^1",
        "parent.borough.ngram^1",
        "parent.neighbourhood.ngram^1",
        "parent.locality_a.ngram^1",
        "parent.region_a.ngram^1",
        "parent.country_a.ngram^1",
        "name.default^1.5"
      ],
      "analyzer": "peliasAdmin"
    }
  }
],

The full text falls into the peliasQuery analyzer here, and Tokyo into peliasAdmin

Edo Tokyo Museum:

"must": [
  {
    "multi_match": {
      "type": "phrase",
      "query": "edo Tokyo",
      "fields": [
        "phrase.default",
        "phrase.en"
      ],
      "analyzer": "peliasQuery",
      "boost": 1,
      "slop": 3
    }
  },
  {
    "constant_score": {
      "filter": {
        "multi_match": {
          "type": "cross_fields",
          "query": "Museum",
          "fields": [
            "parent.country.ngram^1",
            "parent.dependency.ngram^1",
            "parent.macroregion.ngram^1",
            "parent.region.ngram^1",
            "parent.county.ngram^1",
            "parent.localadmin.ngram^1",
            "parent.locality.ngram^1",
            "parent.borough.ngram^1",
            "parent.neighbourhood.ngram^1",
            "parent.locality_a.ngram^1",
            "parent.region_a.ngram^1",
            "parent.country_a.ngram^1",
            "name.default^1.5"
          ],
          "analyzer": "peliasQuery"
        }
      }
    }
  }
],

Here, Museum is split off into a second clause, which is missing name.en^1.5.

bboure avatar Oct 05 '20 15:10 bboure

Don't worry, I'm working on a fix, I will publish something tonight or tomorrow.

Yes, in autocomplete, the last token can be either part of the subject (the venue) or part of the hierarchy. That's why we are using a cross_fields query with both parent.* and name.default.
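
To make that concrete, here is a rough sketch of such a clause for the last token, extended with the language-specific name field that was reported missing above. The helper `buildLastTokenClause` and the two constant names are hypothetical; the parent field list, boosts, and analyzer are copied from the debug output earlier in this thread, and appending `name.$LANG^1.5` is an assumption about what a fix could look like, not the actual patch.

```javascript
// Parent hierarchy fields as they appear in the debug output above.
const PARENT_LAYERS = [
  'country', 'dependency', 'macroregion', 'region', 'county',
  'localadmin', 'locality', 'borough', 'neighbourhood'
];
const PARENT_ABBREVIATIONS = ['locality_a', 'region_a', 'country_a'];

// Hypothetical helper, not the actual Pelias implementation.
// The last token may belong to the venue name or to the hierarchy,
// so it is matched across parent.* fields and the name fields.
function buildLastTokenClause(token, lang) {
  const fields = [...PARENT_LAYERS, ...PARENT_ABBREVIATIONS]
    .map((layer) => `parent.${layer}.ngram^1`);
  fields.push('name.default^1.5');
  // Assumption: also search the name in the user's language.
  if (lang && lang !== 'default') {
    fields.push(`name.${lang}^1.5`);
  }
  return {
    constant_score: {
      filter: {
        multi_match: {
          type: 'cross_fields',
          query: token,
          fields,
          analyzer: 'peliasQuery'
        }
      }
    }
  };
}

console.log(JSON.stringify(buildLastTokenClause('Museum', 'en'), null, 2));
```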

Joxit avatar Oct 05 '20 15:10 Joxit

Great, thanks!

bboure avatar Oct 05 '20 16:10 bboure