[HTML] Add Schema.org and other inline rdf support

Open danbri opened this issue 4 years ago • 12 comments

A great many pages contain RDF data via Schema.org (in microdata, json-ld, rdfa). There are also other vocabularies which use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?

danbri avatar Nov 29 '21 12:11 danbri

Currently, it is only generating an RDF-like view of the DOM tree.

In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images).

In the case of HTML, SA could generate additional named graphs with extracted metadata. These should include:

  • RDFa
  • Microdata
  • Microformats
  • Others?

We could use http://any23.apache.org -- other ideas?

enridaga avatar Nov 30 '21 10:11 enridaga

Thanks. You might look at https://github.com/wbsg-uni-mannheim/WDCFramework/blob/master/pom.xml since they extract these formats and seem to build upon any23.

Named graphs make sense to distinguish the different syntax sources.

UK Guardian newspaper pages are usually good if you want to find examples of json-ld and microdata in the same page. Or at least used to be.

danbri avatar Nov 30 '21 11:11 danbri

This relates to #13

luigi-asprino avatar Dec 07 '21 16:12 luigi-asprino

With dcc589e (https://github.com/SPARQL-Anything/sparql.anything/commit/dcc589e8cfffe681014ea883def4ab8b4b5481ab), SA can extract metadata from HTML pages. This feature relies on Any23. By default, Any23 emits quads that use the URL of the page as the graph URI, so at the moment the content extracted by SA and by Any23 ends up in the same graph. The feature is enabled with the option html.metadata=(true/false), which is false by default. Of course, we can discuss the best way to serve the Any23-extracted content; this is just a tentative implementation of the feature.

luigi-asprino avatar Dec 11 '21 08:12 luigi-asprino

That's fantastic - nice work!

danbri avatar Dec 11 '21 15:12 danbri

Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction.
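
To make the idea concrete, here is a minimal sketch of what per-extractor graph names could look like. It assumes a naming scheme of "page URL plus a fragment carrying an extractor label". The function name, the fragment convention and the labels are all hypothetical; nothing in this thread says which scheme SA will adopt.

```python
def graph_name(page_url, extractor_label):
    # Hypothetical scheme: <page URL without any fragment>#<extractor label>.
    base = page_url.split("#", 1)[0]
    return base + "#" + extractor_label


page = "https://www.imdb.com/title/tt1160419/"

# The labels below are illustrative only, not Any23's actual extractor names.
names = [graph_name(page, label) for label in ("rdfa", "microdata", "json-ld")]

for name in names:
    print(name)
```

With a scheme like this, the DOM view could stay in the graph named after the page URL while each embedded syntax gets its own graph.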

enridaga avatar Dec 13 '21 09:12 enridaga

Any23 should use the HTTP client of SA.

Any23.setHTTPClient

However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

enridaga avatar Dec 13 '21 09:12 enridaga

Any23 should use the HTTP client of SA.

Any23.setHTTPClient

However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

However, I would prefer to just pass an InputStream to Any23, really.

enridaga avatar Dec 13 '21 09:12 enridaga

cool i do see the embedded json-ld (which uses schema.org) from IMDB now.

curl --silent 'http://localhost:3000/sparql.anything'  \
-H 'Accept: text/csv' \
--data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
select *
# construct {?s ?p ?o}
WHERE {
service <x-sparql-anything:>{
    fx:properties fx:location "https://www.imdb.com/title/tt1160419/" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "true" .
    graph ?g {?s ?p ?o .}
}
}'

yields:

s,p,o,g

...
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/url,https://www.imdb.com/title/tt1160419/,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/title,Dune (2021) - IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/description,"Dune: Directed by Denis Villeneuve. With Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa. Feature adaptation of Frank Herbert's science fiction novel about the son of a noble family entrusted with the protection of the most valuable asset and most vital element in the galaxy.",https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
...

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).
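
The "count the distinct graphs" check can be sketched on the client side as follows. This is only an illustration: it assumes the CSV result has the s,p,o,g header shown above, the helper name quads_per_graph is made up, and the two sample rows are copied from the output in this thread.

```python
import csv
import io


def quads_per_graph(csv_text):
    # Count result rows per graph name, using the "g" column of the CSV.
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["g"]] = counts.get(row["g"], 0) + 1
    return counts


# Two rows taken from the output in this thread: one OpenGraph, one schema.org.
sample = """s,p,o,g
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
"""

counts = quads_per_graph(sample)
print(len(counts), "distinct graph(s)")
```

With the current behaviour both rows share the page URL as graph name, so the count cannot tell embedded RDF apart from the DOM view; with separate named graphs it could.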

justin2004 avatar Dec 16 '21 23:12 justin2004

oops, i missed them in the snippet but they are there.

EDIT

here they are

s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b2,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b3,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/aggregateRating,_:b4,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
...

justin2004 avatar Dec 16 '21 23:12 justin2004

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).

Yes, this is the plan

enridaga avatar Dec 17 '21 09:12 enridaga