
nif-file with taIdentRef to wikidata showing bad results

Open Linux249 opened this issue 7 years ago • 6 comments

Is it possible for GERBIL to handle NIF datasets with taIdentRef URLs pointing to Wikidata (e.g. https://www.wikidata.org/wiki/Q65)?

I get really bad results from GERBIL for all annotators with my own (directly uploaded) dataset, regardless of whether I run it locally or online (demo).

Here is a small example dataset (3 docs with 10 annotations each): A3A10.txt. Results: http://gerbil.aksw.org/gerbil/experiment?id=201704120006

We also have Wikipedia URLs identifying the annotations (https://en.wikipedia.org/wiki/Los_Angeles), and with those we got better results. Same data with different taIdentRef (Wikipedia) URLs: A3A10_wiki.txt. Results: http://gerbil.aksw.org/gerbil/experiment?id=201704120004

Linux249 avatar Apr 12 '17 11:04 Linux249

I assume that the bad results for the Wikidata URIs are caused by GERBIL's default configuration, which is not set up for Wikidata.

Could you please open src/main/properties/gerbil.properties and

  1. add the line org.aksw.gerbil.evaluate.DefaultWellKnownKB=https://www.wikidata.org/wiki/ to tell GERBIL that Wikidata is a known KB, and
  2. add the line org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=www.wikidata.org to enable sameAs retrieval for Wikidata URIs.
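Putting the two steps together, the relevant additions to src/main/properties/gerbil.properties would look like this (a sketch showing only the added lines; the file's other properties are omitted):

```properties
# Treat Wikidata as a known knowledge base
org.aksw.gerbil.evaluate.DefaultWellKnownKB=https://www.wikidata.org/wiki/
# Enable HTTP-based sameAs retrieval for Wikidata URIs
org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=www.wikidata.org
```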

Does this fix your problem?

MichaelRoeder avatar Apr 12 '17 12:04 MichaelRoeder

Thanks, but no, it doesn't: the results are exactly the same.

I saw these warnings in the logs, so I manually added the missing files (the gerbil.properties file tells me which ones).

2017-04-12 13:19:20,349 [localhost-startStop-1] WARN [org.aksw.gerbil.dataset.check.impl.FileBasedCachingEntityCheckerManager] - <Couldn't read the cache file. Trying the temporary file...>
2017-04-12 13:19:20,349 [localhost-startStop-1] WARN [org.aksw.gerbil.dataset.check.impl.FileBasedCachingEntityCheckerManager] - <Couldn't read cache from files. Creating new empty cache.>
2017-04-12 13:19:20,856 [localhost-startStop-1] WARN [org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever] - <Couldn't read the cache file. Trying the temporary file...>
2017-04-12 13:19:20,856 [localhost-startStop-1] WARN [org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever] - <Couldn't read cache from files. Creating new empty cache.>

After that I ran the dataset with DBpedia URLs again and recorded everything with "./start.sh > gerbil.log": gerbil_workAround1.txt

Linux249 avatar Apr 12 '17 13:04 Linux249

No, the warnings printed because of missing cache files do not cause these problems, and you don't have to create any of these files. It is enough to make sure that the directory for these files exists and that the program can write to it. However, while these warnings are unrelated to your problem, another important part of the log you attached seems to explain why the configuration above does not solve it.

Unfortunately, it seems there is no quick solution for your problem. Retrieving owl:sameAs links works fine for other KBs that point to Wikidata, e.g., in the DBpedia -> Wikidata direction. However, Wikidata does not send the requested RDF triples but HTML pages that cannot be parsed by our RDF library. As a result, GERBIL cannot find a path from the Wikidata URIs in your dataset to the DBpedia / Wikipedia URIs returned by the systems. I created a new issue for that problem in #195
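As an aside, Wikidata does expose machine-readable RDF through its Special:EntityData endpoint, so one possible workaround on the retriever side would be to rewrite entity page URIs to their RDF counterparts before fetching. The sketch below is a hypothetical helper (not part of GERBIL) showing that URL rewrite; the function name and default format are assumptions:

```python
def wikidata_rdf_url(entity_uri: str, fmt: str = "ttl") -> str:
    """Map a Wikidata entity page URI to its Special:EntityData RDF URL.

    Example: https://www.wikidata.org/wiki/Q65
          -> https://www.wikidata.org/wiki/Special:EntityData/Q65.ttl
    """
    # Take the last path segment as the entity ID (e.g. "Q65").
    qid = entity_uri.rstrip("/").rsplit("/", 1)[-1]
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.{fmt}"

print(wikidata_rdf_url("https://www.wikidata.org/wiki/Q65"))
# -> https://www.wikidata.org/wiki/Special:EntityData/Q65.ttl
```

Fetching the rewritten URL (or sending an Accept: text/turtle header) should yield parseable triples instead of HTML.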

On the other hand, the sameAs retrieval should be performed for the results generated by the annotators as well, which would also solve the problem. In the following, we should focus on the fact that this does not seem to work correctly in your case. @TortugaAttack Does the DBpedia sameAs index include Wikidata URIs as objects of the owl:sameAs triples? Could you please test this locally and check what is going wrong?
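To make the matching logic concrete: conceptually, two URIs count as the same entity if their sameAs closures overlap. The following is a simplified, hypothetical sketch of that idea (it is not GERBIL's actual implementation; the link map and function names are made up for illustration):

```python
def expand(uri, same_as):
    """Return the sameAs closure of a URI, given a symmetric link map."""
    seen, stack = {uri}, [uri]
    while stack:
        u = stack.pop()
        for v in same_as.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def uris_match(gold_uri, system_uri, same_as):
    """Two URIs match if their sameAs closures share at least one URI."""
    return bool(expand(gold_uri, same_as) & expand(system_uri, same_as))

# Toy link data; real links would come from the sameAs index / HTTP retrieval.
raw_links = {
    "http://dbpedia.org/resource/Los_Angeles": [
        "https://www.wikidata.org/wiki/Q65",
        "https://en.wikipedia.org/wiki/Los_Angeles",
    ],
}
# owl:sameAs is symmetric, so build links in both directions.
same_as = {}
for s, objs in raw_links.items():
    for o in objs:
        same_as.setdefault(s, set()).add(o)
        same_as.setdefault(o, set()).add(s)

print(uris_match("https://www.wikidata.org/wiki/Q65",
                 "http://dbpedia.org/resource/Los_Angeles", same_as))
# -> True
```

If the closure of a dataset's Wikidata URI never reaches the DBpedia URI returned by an annotator (because one retrieval direction fails), the match is missed and scores drop, which is consistent with the bad results reported above.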

MichaelRoeder avatar Apr 12 '17 14:04 MichaelRoeder

On the other hand, the sameAs retrieval should be performed for the results generated by the annotators as well, which would also solve the problem. In the following, we should focus on the fact that this does not seem to work correctly in your case.

That's exactly what I thought and why I'm confused about the results.

@TortugaAttack Does the DBpedia sameAs index include Wikidata URIs as objects of the owl:sameAs triples? Could you please test this locally and check what is going wrong?

The DBpedia sameAs data includes Wikidata URIs: our own annotator disambiguates to Wikidata, and we test it with GERBIL. You can also see it here: http://dbpedia.org/page/Los_Angeles

Linux249 avatar Apr 12 '17 15:04 Linux249

Yes, I know that DBpedia has these URIs. The question is whether the index created by @TortugaAttack still contains this data.

MichaelRoeder avatar Apr 12 '17 15:04 MichaelRoeder

I will check them to be absolutely sure, but I am 100% certain they are in it. If DBpedia has an owl:sameAs link from the resource to the Wikidata URI, it must be in the index.

TortugaAttack avatar Apr 13 '17 14:04 TortugaAttack