
Images dataset contains wrong triples

Open jlareck opened this issue 3 years ago • 18 comments

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/ If the issue persists, please post the link from your browser here:

https://dbpedia.org/page/Borysthenia_goldfussiana
https://dbpedia.org/page/Ingoldiomyces
There are more triples in the DBpedia snapshot 2021-09 that contain this issue.

Error Description

Please state the nature of your technical emergency:

It looks like ImageExtractorNew produces triples for Wikipedia pages that don't contain images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana doesn't contain any image, but ImageExtractorNew produced a triple with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg for it. The same issue occurs with the page https://en.wikipedia.org/wiki/Ingoldiomyces: it doesn't contain any picture, but ImageExtractorNew also produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

  • Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please provide query or link
  • Dumps: dumps are managed by the Databus. Please provide artifact & version or download link
  • DIEF: you ran the software and the error occurred then, please include all necessary information such as the extractor or log. If you had problems running the software use another issue template

This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala

Details

please post the details

Wrong triples RDF snippet

<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .

<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .
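
A check for triples like the two above could be sketched roughly as follows. This is only an illustration, not part of the extraction framework: the helper names `depicted_file` and `is_suspicious` are hypothetical, and the set of files actually embedded in the article is assumed to be obtained separately (for example from the MediaWiki API with `action=query&prop=images`).

```python
import re
from urllib.parse import unquote

# Matches a simple N-Triples line whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$")
DEPICTION = "http://xmlns.com/foaf/0.1/depiction"
FILEPATH = "Special:FilePath/"


def depicted_file(line):
    """Return (subject IRI, file name) for a foaf:depiction triple, else None."""
    match = TRIPLE.match(line.strip())
    if not match or match.group(2) != DEPICTION:
        return None
    subject, _, obj = match.groups()
    if FILEPATH not in obj:
        return None
    # Normalise the file name: decode percent-escapes, underscores become spaces.
    name = unquote(obj.split(FILEPATH, 1)[1]).replace("_", " ")
    return subject, name


def is_suspicious(line, files_on_page):
    """True if the depicted file is not among the files embedded in the article.

    files_on_page is assumed to hold normalised file names (spaces, no 'File:' prefix).
    """
    parsed = depicted_file(line)
    if parsed is None:
        return False
    return parsed[1] not in files_on_page
```

For the Borysthenia_goldfussiana triple, the article embeds no files, so calling `is_suspicious` with an empty set flags the triple.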

Expected / corrected RDF outcome snippet

Triples of this kind must be removed.


Example DBpedia resource URL(s)


Other

jlareck avatar Dec 02 '21 11:12 jlareck

I have an extensive sample set that we can use to test when this issue is resolved /jay gray

jaygray0919 avatar Jan 07 '22 19:01 jaygray0919

@jaygray0919 Could you please send this sample set? It looks like I resolved the issue, but I am not sure it is completely fixed (at least the produced dataset no longer contains the <http://dbpedia.org/resource/Borysthenia_goldfussiana> and <http://dbpedia.org/resource/Ingoldiomyces> triples, but it would be good to check for other wrong triples).

jlareck avatar Jan 12 '22 08:01 jlareck

@jlareck try using this: https://afdsi.com/sparql-species/#/specierch/gold i can explain the app if you are interested /jay

jaygray0919 avatar Jan 12 '22 16:01 jaygray0919

@jlareck this also worked well 6 months ago, but is now very slow/unresponsive: https://afdsi.com/search-dbpedia-tv-shows/?#genre=&language=&country=& It seems/feels like the parser is 'in a twist'. Do you see any obvious reasons for its sluggishness? /jay

jaygray0919 avatar Jan 12 '22 16:01 jaygray0919

@jlareck anything we can do to help out here? if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries /jay

jaygray0919 avatar Jan 19 '22 18:01 jaygray0919

try using this: https://afdsi.com/sparql-species/#/specierch/gold i can explain the app if you are interested /jay

Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset of the upcoming release. As far as I can see, some of the wrong images are no longer extracted, but there are still some triples that contain images not related to the wiki page. So the image extractor that produces the data is only partially fixed.

if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries

Could you please provide more details about what you want to do?

jlareck avatar Jan 19 '22 19:01 jlareck

The url Species is one of our DBpedia/SPARQL applications. To reprise the above: "the content ain't right". Previously, when it "was right", the images for the queries were 100% correct (we checked extensively over a year ago - zero errors). Our request: restore the last good version. Now, we're not so naive as to think that's easy; since the last solid data set, many changes have been applied. But the bottom line: DBpedia content has been corrupted. While we can determine that item images are corrupt, there may be other errors that also crept in somewhere during an update. It's highly unlikely that only the image files are fubar - my guess is that there are problems with other item properties. An indicator is the performance problems we see with another SPARQL application - TV Shows. A year ago, this app worked very well. It is now very slow and produces irregular results. We're far more concerned with Species than TV Shows and are willing to "pitch in" and find the last good dataset (the version with uncorrupted image property values). Does that make sense? Anything short-term we can do to restore an uncorrupted dataset?

jaygray0919 avatar Jan 19 '22 21:01 jaygray0919

Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am identifying those triples now, and we will try to fix the image extraction before the next release.

jlareck avatar Jan 24 '22 09:01 jlareck

Got it. Then we'll be happy to work with you to incrementally identify misaligned images in the next release. Then you can use that list to correct a subsequent release. In the meantime, the link we shared above will display - for biologics - misaligned images. It's a one-at-a-time process, but it might help you identify patterns that we cannot easily see (e.g. a consistent pairing of biologics/non-biologics). For example, there is a high concentration of military weapons in our biologic queries.

jaygray0919 avatar Jan 24 '22 12:01 jaygray0919

Hi @jaygray0919, could you please check more images on your website to see whether any are still incorrect? It seems to me that I fixed the image extraction and all images should now be correct. Thank you

jlareck avatar Feb 10 '22 08:02 jlareck

Hello @jlareck - will do; will report back today/tomorrow. Thank you for doing this work.

jaygray0919 avatar Feb 10 '22 13:02 jaygray0919

Previous errors that have been corrected:
https://afdsi.com/sparql-species/#/specierch/gold
https://afdsi.com/sparql-species/#/specierch/green
https://afdsi.com/sparql-species/#/specierch/taurus

Small problems: https://afdsi.com/sparql-species/#/specierch/red Feredayia graminosa

I'll look for other errors later today

jaygray0919 avatar Feb 10 '22 13:02 jaygray0919

foaf:depiction

https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png

https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

https://dbpedia.org/page/Golden_volute https://commons.wikimedia.org/wiki/Special:Redirect/file/Iredalina_mirabilis.jpg

https://dbpedia.org/page/Pictured_rove_beetle https://commons.wikimedia.org/wiki/Special:Redirect/file/thinopinus_pictus.jpg

https://dbpedia.org/page/Tenthredo_amoena https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_amoena.jpg

https://dbpedia.org/page/Tenthredo_crassa https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_crassa-001.jpg

jaygray0919 avatar Feb 10 '22 17:02 jaygray0919

Small problems: https://afdsi.com/sparql-species/#/specierch/red Feredayia graminosa

Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa ; this article contains 3 images. Since the current version of the image extraction extracts all pictures from a wiki page and produces multiple triples with foaf:depiction, I think you could show all of those pictures on your website rather than only one. Otherwise, if you want to show only the first picture from the wiki page, you can try to use dbo:thumbnail instead of foaf:depiction .

foaf:depiction

https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png

https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

Regarding this, I think it is one more issue in the image extraction that I didn't notice before; this one is related to creating incorrect links to Wikimedia images.

jlareck avatar Feb 10 '22 18:02 jlareck

Unfortunately, your (sensible) exception handling is difficult to implement. We 'grab' the first instance and do not iterate on subsequent instances. And dimensions for dbo:thumbnail do not look good on desktop (they are passable on mobile, but we need to keep it simple).

Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.). If you can correct the null values, that will further improve the display. Bottom line: queries are dramatically improved; thank you for that. /jay

jaygray0919 avatar Feb 10 '22 20:02 jaygray0919

@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there? I think this is important knowledge for users to understand the difference between foaf:depiction, dbo:thumbnail and foaf:thumbnail. For me it is confusing; I had to look in the code to get an impression, which is not good...

@jaygray0919 thanks for testing and finding issues. But I do not understand your issue with multiple images; there seems to be no complexity in that, right? Just write the SPARQL query so that only one image is returned? Or use the thumbnail and cut off the size parameter at the end?
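
Both suggestions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the app or the framework: the function names are hypothetical, `SAMPLE` with `GROUP BY` is one standard SPARQL 1.1 way to keep a single value per resource, and the thumbnail URL is assumed to carry its size as a query string such as `?width=300`.

```python
from urllib.parse import urlsplit, urlunsplit


def one_image_query(resource_iris):
    """Build a SPARQL query that returns one foaf:depiction value per resource."""
    values = " ".join(f"<{iri}>" for iri in resource_iris)
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?s (SAMPLE(?img) AS ?image)\n"
        f"WHERE {{ VALUES ?s {{ {values} }} ?s foaf:depiction ?img }}\n"
        "GROUP BY ?s"
    )


def strip_size(thumbnail_url):
    """Drop the query string (assumed to hold the size, e.g. '?width=300')."""
    parts = urlsplit(thumbnail_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


if __name__ == "__main__":
    print(one_image_query(["http://dbpedia.org/resource/Feredayia_graminosa"]))
    print(strip_size(
        "http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg?width=300"
    ))
```

Note that `SAMPLE` picks an arbitrary value, not necessarily the first image on the page; if the first image matters, dbo:thumbnail with the size removed is the closer fit.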

JJ-Author avatar Feb 11 '22 09:02 JJ-Author

@JJ-Author I'll revisit the SPARQL query, which has some age to it. When doing the original engineering, we did not see or foresee the need to test for more than one image; our single select on foaf:depiction worked 100% of the time. However, it will be much more difficult to read multiple properties and test for multiple images. Based on @jlareck's corrections, we're at ~90% of our previous results, which is acceptable. I'm reluctant to make an isolated change to a large program at this time. When we do reopen the beast, we'd like to add new features like autosuggest to limit the scope of the query. The current version hits DBpedia fairly hard, and we'd like to implement a more refined query. We'd also like to introduce a "You also may be interested in" using a reasoner (which, of course, adds back complexity). Bottom line: we'd like to help improve data quality thru testing, but postpone changes to the app until we have a new plan.

jaygray0919 avatar Feb 11 '22 21:02 jaygray0919

@JJ-Author I made a pull request with the documentation for the image dataset: https://github.com/dbpedia/marvin-config/pull/4 . Could you please check it?

jlareck avatar Feb 13 '22 19:02 jlareck