Build a Database with Known Entities

wolfv opened this issue on Mar 19, 2016 • 6 comments

It could be cool to have scripts that build a database of known entities from Wikidata.

E.g. one could use a SPARQL query like this (it can be executed at https://query.wikidata.org):

SELECT ?subj ?label (GROUP_CONCAT(DISTINCT ?altLabel; separator=", ") AS ?altlabels)
WHERE
{
    ?subj wdt:P31 wd:Q215380 . # subject -> is instance of -> BAND
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
        ?subj rdfs:label ?label . # gather label
        ?subj skos:altLabel ?altLabel # and alt labels (separated by comma)
    }
}
GROUP BY ?subj ?label
LIMIT 100

to select all bands in Wikidata. Those entities could then be stored in a trie (as is done currently), and the trie nodes could hold the query entity (e.g. Q215380) as well as the subject identifier (for example, Dire Straits is wd:Q50040).
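
A rough sketch of what such a fetcher could look like (the endpoint URL and result shape follow the Wikidata Query Service; the requests usage and function name are only illustrative, not adapt code):

# Illustrative sketch: run the query above against the Wikidata SPARQL endpoint
# and collect label -> (type QID, subject QID) pairs for later trie construction.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def fetch_entities(sparql_query, type_qid="Q215380"):
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": sparql_query, "format": "json"},
        headers={"User-Agent": "adapt-entity-fetcher-example"},
    )
    resp.raise_for_status()
    entities = {}
    for row in resp.json()["results"]["bindings"]:
        label = row["label"]["value"]
        subject_qid = row["subj"]["value"].rsplit("/", 1)[-1]   # e.g. "Q50040"
        entities[label] = (type_qid, subject_qid)
        for alt in row.get("altlabels", {}).get("value", "").split(", "):
            if alt:
                entities[alt] = (type_qid, subject_qid)
    return entities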

For intent matching, those entities could be used as additional information for the probabilities (e.g. optional(Adapt.MusicEntity)).
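
For example, a minimal sketch of the intent side, assuming adapt's IntentDeterminationEngine/IntentBuilder API (the intent and entity names are made up):

# Illustrative sketch: register fetched band names as an optional MusicEntity.
# The intent/entity names are invented; only the adapt API calls are intended to be real.
from adapt.engine import IntentDeterminationEngine
from adapt.intent import IntentBuilder

engine = IntentDeterminationEngine()
engine.register_entity("play", "PlayVerb")
for label in ("Dire Straits", "The Blues Brothers"):   # would come from the Wikidata dump
    engine.register_entity(label, "MusicEntity")

play_music = IntentBuilder("PlayMusicIntent") \
    .require("PlayVerb") \
    .optionally("MusicEntity") \
    .build()
engine.register_intent_parser(play_music)

for intent in engine.determine_intent("play some Dire Straits"):
    print(intent)   # contains 'MusicEntity': 'Dire Straits' when the band is recognized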

wolfv avatar Mar 19 '16 13:03 wolfv

I very much like the idea of this, but I'm not convinced the code/data for so many domains should be rolled up into one super pip install. Probably we want additional projects (like adapt-data-music), and possibly language-specific versions of each. These data sets may be very large, and we want to be respectful of resources on dev boxes as well as end-user devices. I'd be happy to create an adapt-data-music-en repo for you to start playing in, and I'll see if I can find some time to make an adapt-data-weather-en repo to act as an example.

clusterfudge avatar Mar 19 '16 23:03 clusterfudge

True. But when fetching the entities from Wikidata, there could just be scripts that operate on the SPARQL endpoint and generate the dictionaries. Plus, there could be a script (or even a function) in adapt that downloads pre-built dictionaries (maybe domain-specific ones if they become too large). That's the way NLTK does it (it has an nltk.download() function).
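
Something like this hypothetical downloader sketch (the helper name, cache path, and hosting URL are all placeholders, not anything that exists in adapt):

# Hypothetical sketch, analogous in spirit to nltk.download(): fetch a pre-built
# dictionary once and cache it locally instead of re-querying Wikidata.
import os
import urllib.request

DATA_DIR = os.path.expanduser("~/.adapt/data")
BASE_URL = "https://example.org/adapt-data"   # placeholder host for pre-built tries

def download(domain, language="en", force=False):
    """Fetch e.g. ('music', 'en') -> ~/.adapt/data/music-en.marisa, unless already cached."""
    os.makedirs(DATA_DIR, exist_ok=True)
    filename = "%s-%s.marisa" % (domain, language)
    target = os.path.join(DATA_DIR, filename)
    if force or not os.path.exists(target):
        urllib.request.urlretrieve("%s/%s" % (BASE_URL, filename), target)
    return target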

wolfv avatar Mar 20 '16 08:03 wolfv

You can check out a working prototype here: https://github.com/wolfv/adapt/tree/feature-numbers-dates/adapt/tools

I've added the entity_fetcher script and a trie of almost all musicians and bands in Wikidata. The trie is built using marisa-trie, which I think is really good and fast. The entire trie is only 1.6 MB :)
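
For reference, a sketch of how such a trie could be built and persisted with marisa-trie's BytesTrie (the payload encoding is arbitrary and not necessarily what the prototype does):

# Sketch: persist label -> (type QID, subject QID) pairs in a marisa-trie
# BytesTrie, packing the two QIDs into the byte payload.
import marisa_trie

def build_trie(entities, path="music-en.marisa"):
    # entities: {label: (type_qid, subject_qid)}, e.g. {"Dire Straits": ("Q215380", "Q50040")}
    pairs = [(label, ("%s|%s" % (type_qid, subject_qid)).encode("utf-8"))
             for label, (type_qid, subject_qid) in entities.items()]
    trie = marisa_trie.BytesTrie(pairs)
    trie.save(path)
    return trie

def load_trie(path="music-en.marisa"):
    trie = marisa_trie.BytesTrie()
    trie.load(path)
    return trie

# load_trie()["Dire Straits"]  ->  [b"Q215380|Q50040"]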

wolfv avatar Mar 20 '16 10:03 wolfv

I had completely forgotten about NLTK's data management model! I definitely like that; we'd want to come up with a standardized way and location for storing the data so that it can be cached locally (as opposed to re-running queries unnecessarily).

As for marisa-trie: that looks like a pretty rockin' trie implementation, but it's missing one major feature of the adapt trie: gather. At least, that appears to be the case from my cursory reading of the marisa-trie Python wrapper. I'm not gonna lie, that is some brutally dense code, and having been out of C++ for 5 years (and never having written Cython bindings), I can't make any true claim of understanding it.

I can, however, explain my code! The purpose of gather is to allow us to make N passes over an utterance for entity tagging (one pass per token), as opposed to doing a full n-gram expansion of the utterance (which blows up quickly with utterance length). Maybe there's a clever way to reimplement (or reverse) that logic so we can use a standard trie implementation but keep the performance characteristics? I'm open to suggestions.
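
For contrast, the n-gram expansion that gather avoids looks roughly like this (illustrative only):

# Naive alternative to gather: materialize every contiguous token span of the
# utterance and look each one up individually in the entity store.
def all_ngrams(tokens):
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

print(all_ngrams("play some dire straits".split()))
# 10 candidate spans for a 4-token utterance; gather instead walks the trie once
# from each start position and stops as soon as no stored key can still match.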

clusterfudge avatar Mar 20 '16 17:03 clusterfudge

Good to hear! Yes, definitely, my idea would be to have a download option that pulls the data from somewhere other than Wikidata, because hitting their server with these queries all the time would be quite expensive.

Hmm, if I understand the gather functionality correctly, then my idea would be the following:

Split all names into tokens (e.g. "Blues Brothers" -> "Blues", "Brothers"), append the entity ID to each token ("Blues" -> 123, "Brothers" -> 123), and afterwards take the intersection of the entity IDs stored under "Blues" and "Brothers" to work out that "Blues Brothers" belong together.
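
A minimal sketch of that intersection idea (toy IDs, plain Python sets rather than the adapt trie):

# Sketch: index each token of every entity name against the entity's ID, then
# intersect the per-token ID sets to find entities containing all query tokens.
from collections import defaultdict

def build_token_index(entities):
    # entities: {entity_id: name}, e.g. {123: "The Blues Brothers", 456: "Dire Straits"}
    index = defaultdict(set)
    for entity_id, name in entities.items():
        for token in name.lower().split():
            index[token].add(entity_id)
    return index

def candidates(tokens, index):
    """IDs of entities whose name contains every one of the given tokens."""
    id_sets = [index.get(token.lower(), set()) for token in tokens]
    return set.intersection(*id_sets) if id_sets else set()

index = build_token_index({123: "The Blues Brothers", 456: "Dire Straits"})
print(candidates(["Blues", "Brothers"], index))   # -> {123}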

But on a related note, I think that 'in' queries, even with full n-gram expansion, are so cheap with marisa-trie that it doesn't really matter.

Another option might be to use trie.has_keys_with_prefix(u'fo') to iteratively build up the n-gram expansion, as sketched below.
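
A sketch of that prefix-pruned expansion, assuming a marisa_trie.Trie keyed on lower-cased, space-joined names:

# Sketch: grow each candidate span token by token and use has_keys_with_prefix
# to stop as soon as no stored name can still match, instead of expanding blindly.
import marisa_trie

names = marisa_trie.Trie([u"dire straits", u"the blues brothers", u"blur"])

def matches(tokens):
    for start in range(len(tokens)):
        phrase = u""
        for end in range(start, len(tokens)):
            phrase = tokens[end] if end == start else phrase + u" " + tokens[end]
            if not names.has_keys_with_prefix(phrase):
                break   # no stored name starts with this span; try the next start
            if phrase in names:
                yield (start, end + 1, phrase)

print(list(matches(u"play some dire straits please".split())))
# -> [(2, 4, u'dire straits')]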

Let me know if this made sense :) However, it will probably be a bit harder to implement the matching with edit distance, I guess...

wolfv avatar Mar 20 '16 22:03 wolfv

FYI: I still think this is a really interesting idea! I don't believe there's been a ton of progress, but I may revive it in a post-1.0 world. Thanks!

clusterfudge avatar Apr 11 '21 06:04 clusterfudge