webchem

get_wdid() searches all of wikidata, not just chemicals

Open Aariq opened this issue 5 years ago • 7 comments

Currently get_wdid() searches more than just chemicals:

 get_wdid("Horse", verbose = FALSE)
       id match distance query
1 Q869595 Horse        0 Horse

This might be a problem for anything that is both a chemical and something else, especially acronyms like DDT, which returns wdids for "Duffy's Tavern Airport" and "Dark Dance Treffen".

However, there is a note in the code that suggests it may be possible to narrow the search:

#! Use SPARQL to search of chemical compounds (P31)?! For a finer / better search?

SPARQL is used in wd_ident() and that's all I know about it!

Apr 15 '20 20:04 Aariq

related to #82

Apr 15 '20 20:04 Aariq

Indeed, I saw the comment about SPARQL also a while ago and started working on functions to improve the wikidata query. I am almost done and will push a PR next week.

Apr 16 '20 06:04 andschar

Wonderful! I'm concurrently working on a PR to standardize the input and output of all the get_() functions, and unfortunately I think get_wdid() is one of the functions whose code I changed the most (https://github.com/Aariq/webchem/tree/git-consistency).* Maybe take a look and see if you'd rather I go first with my PR?

*"git" was a typo in the branch name. It's supposed to be "get-consistency".

Apr 16 '20 15:04 Aariq

Yes, go ahead, and once your PR is merged I'll change the code within the function, leaving the standardized structure intact.

Apr 16 '20 16:04 andschar

PR #242 is now merged

Apr 28 '20 15:04 Aariq

Great! I will file a PR this or next week as suggested above.

Apr 29 '20 14:04 andschar

Hi @andschar, how's the work on this coming along? As a Wikidata editor, I think I could help out a bit with this one if it's not solved yet.

I mostly wanted to chime in to say that searching by item name with "standard" SPARQL is not particularly efficient and would probably time out a lot; see this for reference.

That being said, there is a workaround which uses a mashup of SPARQL and the MediaWiki API, for example:

SELECT ?item ?itemLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
      wikibase:api "EntitySearch";
      mwapi:search "pyridine";
      mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
  ?item wdt:P31 wd:Q11173 # Guarantees items are 'instances of' a chemical compound
}

Results for this query

The query above searches all item names and aliases for the string "pyridine" and keeps only results that are "instances of" (P31) "chemical compound" (Q11173), which could help filter out unwanted results.
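
For what it's worth, here is a minimal R sketch of how a query like this could be parameterized and sent. The function names `build_wd_query()` and `search_wd_chemicals()` are hypothetical and not part of webchem, and using httr and jsonlite against the public https://query.wikidata.org/sparql endpoint is only an assumption about how one might run it, not a description of how get_wdid() works.

```r
# Hypothetical helper: build the SPARQL/MWAPI query for one search term.
# `class_qid` defaults to Q11173 ("chemical compound"), as in the query above.
build_wd_query <- function(term, language = "en", class_qid = "Q11173") {
  stopifnot(is.character(term), length(term) == 1, !is.na(term))
  stopifnot(grepl("^Q[0-9]+$", class_qid))
  # Refuse characters that would break out of the SPARQL string literal
  if (grepl('"', term, fixed = TRUE) || grepl("\\", term, fixed = TRUE)) {
    stop("`term` must not contain double quotes or backslashes")
  }
  sprintf(
    'SELECT ?item ?itemLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
      wikibase:api "EntitySearch";
      mwapi:search "%s";
      mwapi:language "%s".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "%s". }
  ?item wdt:P31 wd:%s .
}',
    term, language, language, class_qid
  )
}

# Hypothetical helper: send the query and return ids and labels.
# Assumes the httr and jsonlite packages are installed; needs network access.
search_wd_chemicals <- function(term, language = "en") {
  res <- httr::GET(
    "https://query.wikidata.org/sparql",
    query = list(query = build_wd_query(term, language), format = "json"),
    httr::user_agent("webchem-issue-sketch")  # placeholder user agent
  )
  httr::stop_for_status(res)
  out <- jsonlite::fromJSON(httr::content(res, as = "text", encoding = "UTF-8"))
  b <- out$results$bindings
  data.frame(
    id = sub(".*/", "", b$item$value),  # strip the entity URI down to the QID
    label = b$itemLabel$value,
    stringsAsFactors = FALSE
  )
}
```

With network access, `search_wd_chemicals("DDT")` should then only consider items that are instances of chemical compound; which items actually come back depends on how they are modelled on Wikidata.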

jvfe avatar Oct 02 '20 18:10 jvfe