
owl:sameAs via HDT

Open RicardoUsbeck opened this issue 7 years ago • 11 comments

After talking to Wouter Beek (@wouterbeek) at ISWC (https://rdfhdt.github.io/ISWC2017/), I came to the conclusion that we could use Wouter's owl:sameAs list and query these links faster via HDT instead of using Lucene.

RicardoUsbeck avatar Nov 19 '17 19:11 RicardoUsbeck

Where is the owl:sameAs list? (Or do I have to create it myself?)

TortugaAttack avatar Nov 19 '17 19:11 TortugaAttack

Wouter can probably provide it.

RicardoUsbeck avatar Nov 19 '17 22:11 RicardoUsbeck

@RicardoUsbeck I have a list of 558,943,116 owl:sameAs triples / explicit identity pairs that were extracted from a LOD Laundromat crawl in 2015. (The number of unique terms involved in at least one explicit identity pair is 179,739,567.) I can send you the list in private ([email protected]). The list is not yet published online because I cannot 100% guarantee its correctness yet. I intend to publish this list in a proper way to the wider community before the end of this year.

wouterbeek avatar Nov 19 '17 22:11 wouterbeek

Sounds very good! :+1: We need to make sure that we don't get wrong sameAs connections, like the DBpedia <-> data.nytimes links that connected Japan to Armenia :smile: But I assume that all of us would be interested in figuring out how good these links are :wink:

MichaelRoeder avatar Nov 21 '17 09:11 MichaelRoeder

@MichaelRoeder I will have to disappoint you then: the largest cluster of owl:sameAs IRIs has size 177,794. This includes not only Japan and Armenia, but also all other countries in the world, Albert Einstein, and the empty string :-P

wouterbeek avatar Nov 21 '17 09:11 wouterbeek

That is not really disappointing. I think we simply have to find a way to

  1. identify these faulty clusters,
  2. figure out which of the links in the cluster are wrong (i.e., which links connect two correct clusters creating one large faulty cluster) and remove these links to fix the clusters.

I know that this is not easy to do in an automatic way. The easiest way is to do step 1 and remove all wrong clusters. However, step 2 sounds interesting from a research point of view :wink:
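As a rough illustration of step 1 only, here is a minimal Python sketch that groups identity pairs into clusters with a union-find structure and flags clusters above a size threshold. The function names, the example IRIs, and the threshold are all hypothetical and not taken from the sameAs list or from Gerbil; cluster size is just a crude heuristic for "probably faulty", not a correctness check.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's cluster, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def identity_clusters(pairs):
    """Group terms into clusters, treating each (a, b) pair as a sameAs link."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters
    clusters = defaultdict(set)
    for term in parent:
        clusters[find(parent, term)].add(term)
    return list(clusters.values())


def suspicious_clusters(clusters, max_size):
    """Assumption: a cluster larger than max_size is worth manual inspection."""
    return [c for c in clusters if len(c) > max_size]


# Hypothetical pairs: one wrong link merges the Japan and Armenia clusters.
pairs = [
    ("ex:Japan", "nyt:Japan"),
    ("nyt:Japan", "ex:Armenia"),  # the faulty link
    ("ex:Armenia", "dbp:Armenia"),
    ("ex:Einstein", "wd:Einstein"),
]
clusters = identity_clusters(pairs)
flagged = suspicious_clusters(clusters, max_size=3)
```

Step 2 (finding the individual wrong link inside a flagged cluster) is not covered here; a size threshold can only tell you which clusters to look at.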

MichaelRoeder avatar Nov 21 '17 09:11 MichaelRoeder

@MichaelRoeder The owl:sameAs resources are here: https://sameas.cc/

Let me know in case you encounter issues (the resource is still quite new, so I'm expecting there are some). Also, we hope to update this resource once we have a new LOD Cloud crawl.

wouterbeek avatar Mar 02 '18 13:03 wouterbeek

@wouterbeek Thanks a lot. Very interesting service :+1:

MichaelRoeder avatar Mar 02 '18 13:03 MichaelRoeder

Hey, I'm starting to clean up old issues and finally wanted to include this :) @wouterbeek However, the sameas.cc site seems to be down. Has it moved, or is the service closed?

TortugaAttack avatar Jul 02 '20 09:07 TortugaAttack

Hi @TortugaAttack , the site is still there: https://www.sameas.cc It is maintained by @raadjoe and myself. Feel free to contact us if there are any issues.

PS: In the meantime we have extended our work on owl:sameAs in MetaLink, published at ESWC 2020. Take a look at https://krr.triply.cc/krr/metalink, it may be useful for Gerbil as well.

wouterbeek avatar Jul 02 '20 10:07 wouterbeek

Ah perfect, thanks a lot (I tried it without the www). I will read through that, thank you!

TortugaAttack avatar Jul 02 '20 10:07 TortugaAttack