Consider a mechanism for including content from surrounding document

Open danbri opened this issue 8 years ago • 2 comments

This is a "what do you think about this" enquiry rather than a fully polished proposal.

At Google we have come to recommend JSON-LD within HTML for structured data (schema.org etc.), especially around search features, in (mild) preference to Microdata and RDFa.

All these formats have different strengths. JSON-LD is very self-contained, which can make publication and maintenance easier. However, sometimes we want the JSON-LD to say things about chunks of markup in the surrounding document that contains it. Extracting those pieces of the document is also (we suggest) likely to be a common task, and perhaps worth some kind of standardization support.

Our in-house experiments in this direction use the draft SpeakableSpecification type from schema.org, which defines 3 ways of pointing into the surrounding document content: IDs, 'xpath' and 'cssSelector' properties. While the type name comes from a particular application domain, the selection mechanism is cross-domain, and we thought it was potentially something a future JSON-LD version might define.

Although downstream processors can use these to extract from the page and (in some sense) attach to the JSON-LD data, we think it might be useful for the JSON-LD specification to talk about the details of this rather than leave it entirely to applications.

There are a few different ways this could be handled, but the rough idea is that vocabularies defining types like SpeakableSpecification might mention some well-known datatypes corresponding to something like 'Xpath' and 'CssSelector', and the extraction of the corresponding content could be an optional extra service provided by JSON-LD 1.x parsers. Thanks for any thoughts on this.

(working assumptions: some paths would never match, or match multiple times; and that we are only talking about the current containing document)
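
As a rough illustration of the extraction service described above, here is a minimal Python sketch using only the standard library. The page, the `speakable` values and the `extract_selected` helper are all hypothetical, and nothing here is part of any JSON-LD specification. It assumes a well-formed XHTML page and the limited XPath subset that `xml.etree.ElementTree` supports (relative paths only); a `cssSelector` equivalent would need a third-party library. Each path maps to a list, since a path may match nothing or match several times.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical page; well-formed XHTML so the stdlib XML parser can read it.
PAGE = """<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "WebPage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "xpath": [".//h1", ".//p[@class='summary']", ".//aside"]
  }
}
</script>
</head>
<body>
  <h1>Lecture 12</h1>
  <p class="summary">First summary.</p>
  <p class="summary">Second summary.</p>
</body></html>"""


def extract_selected(page: str) -> dict[str, list[str]]:
    """Map each xpath in the embedded JSON-LD to the text of every match."""
    root = ET.fromstring(page)
    results: dict[str, list[str]] = {}
    for script in root.iter("script"):
        if script.get("type") != "application/ld+json":
            continue
        data = json.loads(script.text)
        paths = data.get("speakable", {}).get("xpath", [])
        if isinstance(paths, str):  # a single path is allowed as a bare string
            paths = [paths]
        for path in paths:
            # Only the current containing document is searched.
            matches = root.findall(path)
            results[path] = ["".join(el.itertext()).strip() for el in matches]
    return results


if __name__ == "__main__":
    for path, texts in extract_selected(PAGE).items():
        print(path, texts)
```

In this sketch the third path matches nothing and yields an empty list, while the second matches twice; how a processor should attach such results back to the JSON-LD data is exactly the open question.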

danbri • May 23 '17 14:05

@danbri This comes down to a best practice rather than any specification language. IMHO, the best way to do this is to combine JSON-LD and RDFa. When parsed as part of a script element, the base URI for the JSON-LD is the same as that of the enclosing HTML document. Working off of one of the schema.org examples:

<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@id": "",
  "@type": "WebPage",
  "contentLocation": "#this"
}
</script>
</head>
<body resource="#this" vocab="http://schema.org/" typeof="WebPageElement">
  <h1 property="name">Lecture 12: Graphs, networks, incidence matrices</h1>
  <p property="description">These video lectures of Professor Gilbert
    Strang teaching 18.06 were  recorded in Fall 1999 and do not
    correspond precisely to the current  edition of the textbook.</p>
  <div property="publisher" typeof="CollegeOrUniversity">
    <h4 class="footer">About <span property="name">MIT OpenCourseWare</span></h4>
  </div>
  <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/deed.en_US"><img
    src="/images/cc_by-nc-sa.png" alt="Creative Commons logo with terms BY-NC-SA." /></a>
</body>
</html>

This generates the following Turtle from my distiller (it also passes through the Linter):

@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<> a schema:WebPage;
   schema:contentLocation "#this";
   rdfa:usesVocabulary schema: .

<#this> a schema:WebPageElement;
   schema:description """These video lectures of Professor Gilbert
    Strang teaching 18.06 were  recorded in Fall 1999 and do not
    correspond precisely to the current  edition of the textbook.""";
   schema:license <http://creativecommons.org/licenses/by-nc-sa/3.0/us/deed.en_US>;
   schema:name "Lecture 12: Graphs, networks, incidence matrices";
   schema:publisher [
     a schema:CollegeOrUniversity;
     schema:name "MIT OpenCourseWare"
   ] .

This mechanism uses the fact that URIs in the JSON-LD and HTML+RDFa are the same, and the JSON-LD can "point" to the HTML by referencing the same URI.
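
The shared-URI idea can be checked with ordinary relative reference resolution. The following is a small sketch, assuming the document is served from `http://example.org/` (the `@base` in the Turtle output above); the `resolve` helper is only illustrative.

```python
from urllib.parse import urljoin

# Assumed document URL, taken from the @base in the Turtle output.
DOCUMENT_URL = "http://example.org/"


def resolve(reference: str, base: str = DOCUMENT_URL) -> str:
    """Resolve a relative reference against the enclosing document's URL."""
    return urljoin(base, reference)


page_iri = resolve("")          # the JSON-LD "@id": ""
element_iri = resolve("#this")  # the RDFa resource="#this"

if __name__ == "__main__":
    print(page_iri)
    print(element_iri)
```

Note that in the Turtle above the `contentLocation` value appears as the plain string "#this" rather than as `<#this>`. Writing the value as `{"@id": "#this"}` in the JSON-LD would be one way to have it treated as a reference that resolves to the same URI as the RDFa resource.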

Alternatively, if you want to point to the actual HTML markup, you can always use rdf:HTML as the datatype within RDFa and do both:

<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@id": "",
  "@type": "WebPage",
  "contentLocation": "#this"
}
</script>
</head>
<body resource="#this" vocab="http://schema.org/" typeof="WebPageElement" property="text" datatype="rdf:HTML">
  <h1 property="name">Lecture 12: Graphs, networks, incidence matrices</h1>
  <p property="description">These video lectures of Professor Gilbert
    Strang teaching 18.06 were  recorded in Fall 1999 and do not
    correspond precisely to the current  edition of the textbook.</p>
  <div property="publisher" typeof="CollegeOrUniversity">
    <h4 class="footer">About <span property="name">MIT OpenCourseWare</span></h4>
  </div>
  <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/deed.en_US"><img
    src="/images/cc_by-nc-sa.png" alt="Creative Commons logo with terms BY-NC-SA." /></a>
</body>
</html>

This simply adds property="text" datatype="rdf:HTML" to the body element.

The problem with IDs, xpath and cssSelector is that they do not result in any semantic content.

Alternatively, you may want to look at work done in Web Annotations which may be closer to what you're looking for.

gkellogg • May 23 '17 17:05

Piggybacking on the RDFa spec is a reasonable point of view. There are two things that make this impractical:

  1. Most data providers are using JSON-LD for these situations because they don't have access to their primary content. Otherwise, they would consider using RDFa in the first place. Typically their article HTML comes from a CMS that does not allow added markup, but they still want to specify a section to refer to.
  2. RDFa and Microdata markup parsing (HTML -> Triple) is very inflexible as specified. It calls for textContent serialization. Most consumers of the data in practice actually want a more browser-like innerText behavior.
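
The textContent versus innerText difference in point 2 can be illustrated with a small Python sketch. It is only an approximation: real innerText depends on CSS layout, whereas this version simply treats a fixed, assumed set of tags as block boundaries and collapses whitespace. The fragment and helper names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; a browser decides this from CSS instead.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "li", "br"}
BREAK = "\x00"  # internal marker for a block boundary


class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.raw = []       # textContent-like: every text node, verbatim
        self.rendered = []  # innerText-like: text plus block-boundary markers

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.rendered.append(BREAK)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.rendered.append(BREAK)

    def handle_data(self, data):
        self.raw.append(data)
        self.rendered.append(data)


def _collect(html: str) -> TextCollector:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser


def text_content(html: str) -> str:
    """Concatenate all text nodes with no separators or normalization."""
    return "".join(_collect(html).raw)


def inner_text_like(html: str) -> str:
    """Collapse whitespace within blocks and put each block on its own line."""
    segments = "".join(_collect(html).rendered).split(BREAK)
    lines = [" ".join(segment.split()) for segment in segments]
    return "\n".join(line for line in lines if line)


FRAGMENT = (
    "<div><h4>About <span>MIT OpenCourseWare</span></h4>"
    "<p>Recorded in\n    Fall  1999.</p></div>"
)

if __name__ == "__main__":
    print(repr(text_content(FRAGMENT)))
    print(repr(inner_text_like(FRAGMENT)))
```

With this fragment, the textContent-style result runs the heading straight into the paragraph ("...OpenCourseWareRecorded in...") and keeps the source indentation, while the innerText-style result separates the two blocks and normalizes the spacing.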

rrlevering • Sep 28 '20 14:09