
Make better use of spoken language data in WhosOnFirst

Open ellenhp opened this issue 1 year ago • 6 comments

At a bare minimum, spoken language data should inform the dictionary choice used for generating all the abbreviation permutations in airmail_indexer.

I also want to find a way to use it to stem languages correctly. Once focus point queries are supported (currently we only have bounding box queries), we can look up in WOF the spoken languages at the focus point and in surrounding areas, and use stemmers for those languages. Doing this will involve splitting out the fields we use by language. Currently there's only one field, "content", but eventually we'll need more for handling matches that need to get boosted. Outside the scope of this issue, but those boosted fields may need a per-language version as well. I'm thinking we can use lingua-rs to pick the top 5 candidate languages for every query, and then search against those fields in a disjunction, using stemmers as appropriate.
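To make the shape of that idea concrete, here's a minimal sketch of the per-language disjunction. The field names (`content_de`, `content_en`, ...), the toy language detector, and the query structure are all assumptions for illustration; airmail's actual schema and lingua-rs's API look different.

```python
# Hypothetical sketch: route a query to per-language content fields.
# The field naming scheme and the detector below are illustrative
# assumptions, not airmail's real schema or lingua-rs.

def detect_top_languages(query, max_langs=5):
    """Stand-in for a real detector like lingua-rs: return ISO codes
    from naive textual cues, falling back to English."""
    cues = {"straße": "de", "rue": "fr", "calle": "es"}
    hits = [code for cue, code in cues.items() if cue in query.lower()]
    return (hits + ["en"])[:max_langs]

def build_disjunction(query, languages):
    """Build one clause per language-specific field, OR'd together,
    each clause tagged with the stemmer to apply."""
    return {
        "disjunction": [
            {"field": f"content_{lang}", "query": query, "stemmer": lang}
            for lang in languages
        ]
    }

q = build_disjunction("hauptstraße 5", detect_top_languages("hauptstraße 5"))
```

The point is just that detection happens once per query, and each candidate language contributes one clause against its own stemmed field.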

There will be a performance cost to this of course, but the lack of stemmers is really disappointing because with lenient mode off (no prefix queries allowed) I can't search for "mighty-o donut" if the POI is called "mighty-o donuts". When I briefly had stemming working on a feature branch it was so cool to watch things like "tow truck" match "XYZ towing company". That's the kind of thing that I think airmail needs to really stand out, even if it has to be disabled for remote indexes.
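For anyone unfamiliar with why stemming produces matches like these, here is a deliberately tiny suffix-stripping stemmer; it is purely illustrative (not the Snowball/Porter stemmers a real search stack would use, and not the feature-branch implementation mentioned above), but it shows how "tow truck" can hit "XYZ towing company" once both sides are reduced to stems.

```python
# Toy suffix-stripping stemmer, purely illustrative. A real index would
# use a proper stemmer (e.g. Snowball) per language.

SUFFIXES = ("ing", "ers", "er", "s")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            return token[: -len(suf)]
    return token

def stemmed_match(query, name):
    """True if any stemmed query token appears among stemmed name tokens."""
    q = {toy_stem(t) for t in query.lower().split()}
    n = {toy_stem(t) for t in name.lower().split()}
    return bool(q & n)
```

With this, "towing" stems to "tow" and "donuts" stems to "donut", so both of the failing examples above would match without any prefix queries.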

ellenhp avatar Feb 25 '24 18:02 ellenhp

Hey Ellen, just discovered Airmail and I'm so happy to see some open source geocoder development with modern tech (Rust, tantivy, range requests)!

Just my two cents for what you describe.

Semantic search for POIs

it was so cool to watch things like "tow truck" match "XYZ towing company"

This is a perfect example of where semantic search could come in handy. If you used multilingual embeddings, query latency would be really low for any language too, and you could forget about the quirks of stemming etc. To give you an example of how fast this can work (in a distributed setup too, but with flatgeobuf and inference in the frontend), have a look at this research demo: https://huggingface.co/datasets/do-me/overture-places (paper under review). Happy to share my experiences; I'm building more stuff like this at the moment.

Cologne Phonetics

better use of spoken language

German street names are a mess. We often need to match street names as people heard them to how they are actually spelled. So we had the idea of using Cologne Phonetics. However, it didn't quite work out as planned, as it's not designed for matching. You can find the long story here: https://github.com/provinzkraut/cologne_phonetics/issues/3

What worked for us instead was a custom, simplified logic where we substituted certain consonants.
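A minimal sketch of that kind of consonant-substitution normalizer might look like the following. The substitution table here is my own illustrative guess, not do-me's actual rules, and real German matching needs more care (umlauts, vowel variants, compound splitting):

```python
# Illustrative consonant-substitution normalizer in the spirit described
# above. The RULES table is an assumption for demonstration, not the
# actual rule set used in production.

RULES = [
    ("sch", "s"),  # collapse the sch- cluster
    ("ph", "f"),   # ph sounds like f
    ("th", "t"),
    ("dt", "t"),   # Schmidt / Schmit
    ("ck", "k"),
    ("tz", "z"),
    ("ß", "ss"),
    ("v", "f"),    # v and f are often indistinguishable when spoken
    ("w", "v"),
]

def normalize(name):
    """Apply the substitutions left to right; order matters
    (v -> f must run before w -> v)."""
    s = name.lower()
    for old, new in RULES:
        s = s.replace(old, new)
    return s
```

Two spellings that sound alike then normalize to the same key, e.g. `normalize("Vogelweg") == normalize("Fogelweg")`, which is enough for exact-match lookup after normalization.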

I'm not quite sure how far you'd like to dive into the nitty-gritty details of every single language out there, but I'm also happy to share some experiences if it helps.

do-me avatar Feb 10 '25 16:02 do-me

Thanks for the feedback! Semantic search is something I'm actively working on. Demo available at https://categorical.airmail.rs

It ends up being more complicated than just using embeddings, for a variety of reasons; namely, we still need to support focus point queries, so going from the text query to a sparse vector search that works directly with Tantivy is ideal. I'm working with my employer on releasing more about this.

ellenhp avatar Feb 10 '25 17:02 ellenhp

I'm really interested in your demo, but unfortunately I haven't been able to get it to work. Do you do dense vector search? And if so, have you found a way to get good performance out of a global index? Both computationally, but also (more interesting to me) in subjective quality.

ellenhp avatar Feb 11 '25 16:02 ellenhp

The demo works best on Chrome or Firefox with a fast internet connection, as at the moment it loads a 500 MB model (bge-m3) into the frontend once. So it will take a bit of time to load initially, but it's then cached in the browser for subsequent runs. Let me know whether it works for you; I also want to record some demo videos soon! If you're interested in these kinds of things you can also have a look at the version fueled by Instagram data for Bonn (Germany, https://github.com/do-me/semantic-hexbins), but that's a little off-topic :D.

However, the approach that I chose in the Overture demo is slightly different. I'm indexing spatially, i.e. the frontend can load a subset of the data from HF very fast via a bounding box with flatgeobuf. So this is probably not exactly what you're looking for unless you operate on city-scale areas.

I think what you're interested in for fast global POI search is something that I'm actually building at the moment. The idea is simple:

  1. use all the fantastic open POI datasets like OSM, WOF, Foursquare, Overture etc. and somehow harmonize them; in the beginning the names of the places might be sufficient, but usually you also have a sophisticated category system attached
  2. ingest everything in a vector db and host it online
  3. I know that e.g. Qdrant has very low latency as well as bbox (and polygon) filters, so that might be an idea

I'm cooking something up at the moment with a serverless approach where I'm hosting the index on s3. Any client can connect to this index and retrieve the top-k rows with very low latency. I think this might come very close to the spirit of airmail. Will ping you once I have the demo ready.
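The core retrieval step in a setup like that is just top-k nearest neighbors over an embedding matrix. Here's a brute-force numpy sketch of that step over a locally cached index; the real serverless version would page rows in from s3 and use an ANN structure rather than scanning everything, so treat this as a reference implementation of the scoring only:

```python
import numpy as np

# Brute-force top-k cosine search over an in-memory embedding matrix.
# Stands in for the s3-hosted index described above; a production setup
# would use an ANN index and fetch rows remotely.

def top_k(index, query, k=3):
    """Return (row ids, cosine scores) of the k best matches."""
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = index @ query          # cosine similarity after normalization
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Tiny demo: a query that is a noisy copy of row 42 should rank row 42 first.
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))
query = index[42] + 0.01 * rng.normal(size=64)
ids, scores = top_k(index, query)
```

Everything downstream (bbox filters, reranking) composes on top of this step.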

do-me avatar Feb 11 '25 17:02 do-me

I've thought about this at length. I suspect you could augment the embedding vector with a normalized 3D cartesian representation of the POI's position on earth (augmenting with polar coordinates would be a bad idea, since angles wrap around), then augment your query vector with the focus point, or (0, 0, 0) if there is no focus point (you can actually scale focus point strength continuously this way), re-normalize it, throw it into Qdrant, and see what comes up. Qdrant might also support geo filters, but that's not an ideal way to do focus point searches unless you do reranking manually.
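A small numpy sketch of that augmentation, under the assumptions stated in the comment above (the `strength` knob and the zero-vector convention for "no focus point" are part of the proposal, not an established recipe):

```python
import numpy as np

# Sketch of the augmentation idea above: append a unit 3D cartesian
# position to the embedding, re-normalize, and expose a strength knob
# for the focus point. The weighting scheme is an assumption.

def latlon_to_xyz(lat_deg, lon_deg):
    """Unit vector on the sphere; continuous everywhere, unlike lat/lon."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

def augment(embedding, lat=None, lon=None, strength=1.0):
    """Concatenate position (or zeros if no focus point) and re-normalize,
    so the result is usable with cosine/dot-product search."""
    pos = np.zeros(3) if lat is None else strength * latlon_to_xyz(lat, lon)
    v = np.concatenate([embedding, pos])
    return v / np.linalg.norm(v)
```

With `strength=0` (or no focus point) the spatial dims contribute nothing to the dot product, and increasing `strength` continuously trades topic relevance against proximity.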

That said, I'm personally trying to get sparse search working because IMO transformers feel like the wrong model architecture to generate embeddings of POIs.

ellenhp avatar Feb 11 '25 20:02 ellenhp

Combined embeddings

I had this exact idea as well, but in my setup I want to use the factor of time as well (that's something I'm exploring at the moment, to see whether it actually leads to something useful). Note that for any distance metric you'll have large biases when, e.g., you have 256 dims of "topic" vector and then augment it with 3 dims of x, y, z coordinates.

That's the reason why in my setup I'm experimenting with dimensionality reduction methods to give the same "weight" to both the topic and the space. An ideal vector would then have only, e.g., [3 dims of topic vector + 3 dims of coordinates].

In my first tests though, simply using cosine distance for instance, I found the results hard to justify. E.g. something right at your current location but only loosely related might show up: you're looking for restaurants but find a bar right where you are. On the other hand, something quite far away with a perfect match might be in the top results too. In that case, which is better?

One could obviously tweak the distance function to weight the coordinates (or the topic dims) more heavily, but then what's the point of concatenating topic and coordinate dims into one vector in the first place? IMO the "Google Maps way" makes sense: use the current map view as the bbox and retrieve all results in it. Prioritize the displayed POIs by relevance/similarity to the user query at lower zoom levels, and only show less related POIs when zooming in.
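One way to make that trade-off explicit instead of burying it in concatenated dims is to score topic and proximity separately and blend them with a named weight. This is a sketch of that alternative, with illustrative parameter values (the exponential decay and the `alpha`/`scale_km` defaults are assumptions, not tuned numbers):

```python
import numpy as np

# Score topic similarity and geographic proximity separately, then blend
# with an explicit weight: alpha=1 is pure topic relevance, alpha=0 pure
# proximity. All constants here are illustrative assumptions.

def blended_score(topic_sim, dist_km, alpha=0.7, scale_km=5.0):
    """topic_sim in [0, 1] (e.g. cosine similarity), dist_km >= 0."""
    proximity = np.exp(-dist_km / scale_km)  # 1 at the focus, decays with distance
    return alpha * topic_sim + (1 - alpha) * proximity
```

The restaurant-vs-bar dilemma above then becomes a tunable choice: with a topic-heavy `alpha=0.7`, a perfect match 20 km away outranks a loose match at your location, while a proximity-heavy `alpha=0.2` flips the ranking, and nothing about the stored vectors has to change.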

Transformers

Why do you think transformers aren't suitable for this job? Off the top of my head, the only argument against them at the moment is embedding model size, which usually starts at around 23 MB. Have a look at this dataset where I'm using Minish Lab's multilingual static embeddings. This example performs remote semantic search but is slow (if you download the index locally it will be fairly fast): https://huggingface.co/datasets/do-me/foursquare_places_100M#remote-semantic-search

Vector DBs

Qdrant is a good option if you have a little more budget for extra performance. I'm using LanceDB at the moment for index hosting, e.g. on cheaper s3 (query times in the hundreds of milliseconds). Locally I get something like ~5 ms on my system for inference and querying. See this example index for Italy with Foursquare places: https://huggingface.co/datasets/do-me/foursquare_places_100M/blob/main/README.md#semantic-search


I'm currently doing some calculations on how much it would cost to host a public s3 (or similar) instance for all the large POI databases combined. If it's not too much, I'm considering hosting a public instance myself. Finding sponsors would be the next step, I guess.

LanceDB is absolutely great, but it requires a DB connection, so you cannot simply query static files. I'm exploring ways to use static files for this purpose, if you're interested: https://github.com/do-me/flatgeobuf-vectordb

do-me avatar Feb 12 '25 08:02 do-me