General Questions
Hi Andy, thanks for maintaining this repo, it's very useful! This issue might be better suited for a Discussions section, as it contains some open-ended questions - would you mind enabling Discussions (and converting this issue if so)?
I just discovered this repo and had a quick peek at what is running under the hood.
- ElasticSearch vs OpenSearch:
Would you mind replacing it with OpenSearch, the OSS fork, for license reasons? E.g. Photon also made the switch (https://github.com/komoot/photon) and it seemed fairly friction-free.
- ElasticSearch vs in RAM:
Without having looked at the Python code: why use Elasticsearch at all? All of Geonames is only around 700 MB, so you could keep it in RAM: https://huggingface.co/datasets/do-me/Geonames
- Geonames:
Geonames is full of errors (https://huggingface.co/datasets/do-me/Geonames#quality) and might not be the most reliable data source. Why not use Nominatim or Photon?
- Just to ping you: this recent paper used a very similar approach to this repo, with a spaCy model but with Nominatim: https://www.nature.com/articles/s41597-025-05422-w
- I recently tried a large-ish scale extraction of around 1M entities with a GLiNER-based model instead and got promising results. Did you benchmark the spaCy model against newer NER models (see Hugging Face, lots of progress happening)? Also, quality-wise I got pretty good results with small-ish LLMs (<600M params)!
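The in-RAM idea above could look roughly like the following. This is a minimal sketch under stated assumptions: the toy dictionary stands in for the full ~700 MB Geonames dump loaded into memory, and stdlib `difflib` stands in for a proper fuzzy matcher.

```python
from difflib import get_close_matches

# Toy stand-in for the Geonames dump held in RAM: name -> (lat, lon).
# In practice you would load the full dump into a dict or DataFrame.
gazetteer = {
    "berlin": (52.52437, 13.41053),
    "barcelona": (41.38879, 2.15899),
    "bern": (46.94809, 7.44744),
}

def geocode(query: str, cutoff: float = 0.8):
    """Fuzzy lookup against the in-memory gazetteer."""
    matches = get_close_matches(query.lower(), gazetteer, n=1, cutoff=cutoff)
    return gazetteer[matches[0]] if matches else None

print(geocode("Berlim"))  # typo still resolves to Berlin: (52.52437, 13.41053)
```

For 1M+ lookups, `difflib` would be too slow; something like a trigram index or RapidFuzz would be the realistic choice, but the memory footprint argument is the same.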
Not a critique, just genuinely curious about your choices, as I'm digging into how to most efficiently analyze around 100M newspaper articles for Geospatial Semantic Search:
- Demo: https://do-me.github.io/semantic-news-mapper/
- Paper: https://link.springer.com/article/10.1007/s41651-025-00232-5
All the best, Dominik
Hey Dominik,
Thanks for these comments and suggestions. This code all grew out of some specific use cases, so there's some path dependency baked in that we could probably revisit.
- Elasticsearch vs. OpenSearch: this is an interesting idea and one I'd be happy to review a PR for. We're pinned at an older version of ES, which I think is around when OpenSearch was forked, so it's possible it won't be that hard.
- Why Elasticsearch at all: this was mostly so that we could have fuzzy search and filtering (e.g., by country) built in without having to roll our own implementation. We also use Elasticsearch for a related project to index a Wikipedia dump, so it was easy to just reuse the same ES instance.
- Geonames: I've been pretty happy with the quality, and it's also easy for us to go in and correct, e.g., missing alternative spellings when we need them. I looked into Nominatim a while ago and remember thinking at the time that we didn't need the street-level granularity it has, but maybe there's a way to just get Geonames-type info from it. Since we're not doing address-level geocoding, just populated places and above, we didn't need that full dump.
- I read the Kriesch and Losacker paper and thought it was interesting!
- spaCy vs. newer NER model: this is something I'm definitely interested in looking into. I've noticed some issues with spaCy NER, though the spaCy transformer model is definitely better than the older models. This new version of Mordecai uses the contextual embeddings of the model pretty heavily, but those should also be available with the newer NER models.
- Efficiently doing this for 100M articles: this is also a place where this repo could use some work. There are definitely some efficiencies we could get, especially by parallelizing the ES queries and batching some of the model inference. I've been focused on other code recently, but a PR to implement this would be great! @andybega has been doing some great work recently to get the CI set up and working well, which should make contributions easier.
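For reference, the fuzzy-search-plus-country-filter combination that keeps us on Elasticsearch looks roughly like this as a query body (the index and field names here are illustrative, not necessarily the exact schema in this repo):

```python
import json

# Illustrative ES query body: fuzzy match on the place name, filtered by
# country code - the combination we get "for free" from Elasticsearch.
query = {
    "query": {
        "bool": {
            "must": {
                "match": {"name": {"query": "Berlim", "fuzziness": "AUTO"}}
            },
            "filter": {"term": {"country_code3": "DEU"}},
        }
    }
}
# Then, with an elasticsearch.Elasticsearch client:
# hits = es.search(index="geonames", body=query)
print(json.dumps(query, indent=2))
```

The `filter` clause runs in non-scoring filter context, so the country restriction is cheap and cacheable while the fuzzy `match` does the ranking.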
Thanks for the detailed answer! I'll be looking into this for my PhD in the next months, so I might come back with some more details about my findings. From what I've found so far for my specific use case of news articles: geocoding is not a bottleneck at all. Also, I get around 300 RPS on a self-hosted Photon instance with unstructured fuzzy search (on a beefy Hetzner server), which was more than sufficient. It also works with typos.
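For anyone reproducing that setup: the unstructured fuzzy query is a single GET against Photon's `/api` endpoint. A minimal sketch, assuming a self-hosted instance on its default port 2322:

```python
from urllib.parse import urlencode

# Default port for a self-hosted Photon instance; adjust host/port as needed.
base = "http://localhost:2322/api"
params = {"q": "Berlim", "limit": 1}  # typo tolerated by Photon's fuzzy search
url = f"{base}?{urlencode(params)}"
print(url)  # → http://localhost:2322/api?q=Berlim&limit=1
# urllib.request.urlopen(url) then returns a GeoJSON FeatureCollection.
```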
The biggest bottlenecks for me are instead:
- high-quality, reliable NER
- understanding the primary subject location of an article when multiple entities are mentioned. I'm not quite sure about the current state of the art in this direction.
In my simple tests I got decent results with a simple heuristic:
- using only the first N words of an article and
- ranking all found location entities by counts
Of course this is nothing you can rely on for high-quality results, which is why I am looking into small long-context LLMs with structured JSON outputs.
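The two-part heuristic above can be sketched in a few lines (the toy NER function here is a hypothetical stand-in for whatever model - GLiNER, spaCy, etc. - produces the location entities):

```python
from collections import Counter

def primary_location(text: str, extract_locations, n_words: int = 100):
    """Heuristic: take the first n_words of the article, then rank the
    location entities found there by mention count."""
    head = " ".join(text.split()[:n_words])
    counts = Counter(extract_locations(head))
    return counts.most_common(1)[0][0] if counts else None

# Toy stand-in for a real NER model:
def toy_ner(text):
    known = {"Geneva", "Syria"}
    return [tok.strip(".,") for tok in text.split() if tok.strip(".,") in known]

article = "Talks in Geneva over the war in Syria stalled as Syria rejected the draft."
print(primary_location(article, toy_ner))  # → Syria
```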
Anyway, I'm happy to test mordecai3 and your model in NGEC-2025! :)
I've had a lot of success using small LLMs with structured JSON output (e.g., Qwen3-4B or even 0.6B) and I think that's a great idea! (As an aside, I lived in Nikolassee as a kid.)
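The structured-output setup could look like the following. The schema is a hypothetical example, not something from this repo; most local inference stacks accept a JSON schema like this for constrained decoding.

```python
import json

# Hypothetical response schema to constrain a small LLM's output: every
# mentioned place, plus the model's pick for the article's focus location.
schema = {
    "type": "object",
    "properties": {
        "locations": {"type": "array", "items": {"type": "string"}},
        "primary_location": {"type": "string"},
    },
    "required": ["locations", "primary_location"],
}

# What a conforming completion might look like:
completion = '{"locations": ["Geneva", "Syria"], "primary_location": "Syria"}'
parsed = json.loads(completion)
assert set(parsed) == set(schema["required"])
print(parsed["primary_location"])  # → Syria
```

Constrained decoding guarantees parseable output even from sub-1B models, which sidesteps most of the reliability issues of free-form extraction.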
The "primary" location thing is tricky. There's some work on it (below), plus I think CLIFF-CLAVIN also tries to do this. My main concern with identifying the "focus" location is that I don't always think it's a well-formed question (is an international summit in Geneva about the war in Syria "about" Geneva or Syria?). But this is mostly a concern with event data: the location of an event in a story clearly doesn't equal the story's "focus location", but that might be a problem if you're not doing event extraction.
@inproceedings{imani2017focus,
  author = {Imani, Maryam Bahojb and Chandra, Swarup and Ma, Samuel and Khan, Latifur and Thuraisingham, Bhavani},
  booktitle = {2017 {IEEE} International Conference on Big Data (Big Data)},
  organization = {IEEE},
  pages = {1956--1964},
  title = {Focus location extraction from political news reports with bias correction},
  year = {2017}}
@article{lee2018lost,
  author = {Lee, Sophie J and Liu, Howard and Ward, Michael D},
  journal = {Political Science Research and Methods},
  pages = {1--18},
  title = {Lost in Space: Geolocation in Event Data},
  year = {2018}}