data-model-spec A Triple which is not a Quad

Open awwright opened this issue 5 years ago • 22 comments

It came to my attention in #124 that we don't really have a way of talking about triples without implying that they're part of a graph. Since #124 is about a slightly different issue (if triple should be aliased to quad), I'd like to separately raise adding a Triple interface.

I think it's important to have separate Triple and Quad instances, because they're not the same thing. A Triple is an axiomatic statement; a Quad additionally signifies a Triple exists in a single graph. But sometimes I want to be able to talk about an RDF statement without implying membership in a graph.

So far we've supposed the DefaultGraph should be sufficient if graph membership is unimportant—just treat it as extraneous information. Perhaps we add the requirement that RDF sources add configuration options on how to generate graph names. But this is a workaround; it adds additional complexity to many components of an ecosystem that could be dispensed with entirely.

For example, suppose I parse two Turtle documents and want to test if they're isomorphic. What does this mean if I'm returned a Dataset, without any interface-level guarantee all the triples will be in a single graph? Confusing Quad for Triple muddies the semantics of RDF, which does not define interpretations/entailment over anything other than a single graph. RDF uniquely identifies statements by (subject,predicate,object), and this triple is the same triple even if present in multiple graphs. But the current implementation considers them to be different quads; so there is no way to test for triple-equality.

Adding a graph property immediately doubles the memory requirements to have a fully indexed RDF store. For applications that don't need a graph property—such as testing isomorphism or entailment—this can be quite significant.

URIs/IRIs are supposed to be universal, and so this adds a requirement that each component agree on how to name graphs & treat graph names. While this shouldn't be a foreign concept to RDF developers, a fourth dimension of IRI to maintain is not insignificant, and in my experience working with RDF, not typically necessary; as a result, we now have to decide how to configure a parser that should be zero-configuration.

Sometimes I want to be able to hold multiple graphs in memory without naming them. What are the semantics of having two Quad stores with different information for the same graphs? It's probably possible to figure out, but it's not immediately apparent to me.

It appears to me that Quad stores and named graphs were invented for applications that can't store graphs without names; for example SPARQL, where the graph name is an alternative to a file on the filesystem. But we don't have this limitation in ECMAScript, and I don't think we should limit the data interface to things describable over SPARQL.

For some perspective: Presently I'm working on an application that uses and produces RDFa data. (In the future, it'll do the same with JSON-LD and JSON Hyper-schema.) It uses datasets and quads to identify which RDFa document makes which statements. This is done with a library I've maintained, itself derived from webr3's work.

First, I want the application to manage the namespace for the graphs, as opposed to libraries I call out to. I've tried managing the data a few different ways, and I've simply found it's simpler if I work with Triples when I'm dealing with graphs, and Quads in a single case where I'm aggregating all the information together or querying it.

Second, several of the document operations demand the use of Triple, because I have a Graph implementation that provides useful methods that only make sense when defined over graphs: things like unions, merges, equality/isomorphism testing, and so on. We're defining an OO interface, and so I would like to define methods that are defined over a graph and not an entire dataset.

Additionally, I've been considering adding these methods to Triple, because a Triple can be considered a singleton Graph. But a Triple is not a Quad, since a Quad implies two pieces of axiomatic information (both a statement and its membership in a single graph), and sometimes these methods are only defined over one or the other, not both.

I hope this makes a convincing point; I'm happy to answer any questions or consider any feedback. Thanks!

Jan 23 '19 18:01 awwright

Issues in brief:

I should be able to compare two triples for equality regardless of which graphs they're a member of, if any;
I should be able to define methods that only make sense over Graphs and Triples (itself either as a triple or as a singleton graph)

Jan 23 '19 18:01 awwright

I should be able to compare two triples for equality regardless of which graphs they're a member of, if any

If you have a 'triple in some graph', I think you already have a quad there. Any application can implement functions that compare equality based only on s, p, o and ignore the graph component. I don't understand how you imagine comparing

two triples for equality regardless of which graphs they're a member of

Once again, a quad means a triple in a graph (named or default).

RDF uniquely identifies statements by (subject,predicate,object), and this triple is the same triple even if present in multiple graphs.

I understand that sometimes you see a need to consider only the s, p, o part of the quad. If I understand you correctly, you would like to have an interface that represents only s, p, o (aka Triple). Does that mean an instance of Triple would never equal an instance of Quad, even if the Quad has graph set to an instance of DefaultGraph?

In practice, that seems to mean that a Turtle parser would emit instances of Triple, while a TriG parser would emit instances of Quad, some of them with graph set to an instance of DefaultGraph. If we get them as representations returned from dereferencing the same IRI, I don't think that aligns very well with https://www.w3.org/TR/rdf11-concepts/#section-dataset-conneg

Sometimes I want to be able to hold multiple graphs in memory without naming them. What are the semantics of having two Quad stores with different information for the same graphs? It's probably possible to figure out, but it's not immediately apparent to me.

You can do it by having two different datasets, each with its own distinct default graph. Since blank node labels stay scoped to a dataset, one has to be careful when merging data from two different datasets. If you have two distinct graphs from the same dataset, at least one of them needs to be a named graph.

Adding a graph property immediately doubles the memory requirements to have a fully indexed RDF store.

@jacoscaz could you please comment on that, based on your experience with node-quadstore?

Jan 23 '19 19:01 elf-pavlik

If you have 'triple in some graph', I think you already have a quad there.

A Quad specifically means a Triple is a member of a named graph (potentially including a single default graph with no URI). I scarcely use named graphs.

Imagine I parse two identical Turtle files with a single triple, and a TriG file representing this:

a.ttl

<http://example.com/thing> a <http://example.com/Thing> .

b.ttl

<http://example.com/thing> a <http://example.com/Thing> .

dataset.trig

GRAPH <http://example.org/a.ttl> {
   <http://example.com/thing> a <http://example.com/Thing> .
}
GRAPH <http://example.org/b.ttl> {
   <http://example.com/thing> a <http://example.com/Thing> .
}

In the Turtle case, Quad#equals returns true. In the TriG case, Quad#equals returns false. Why define an equals function at all, if it isn't going to follow RDF semantics for strict identity of statements?

Since there's no concept of a Triple, our current Quad#equals is useless for comparing statements in multiple Datasets (or even between different graphs). The only place I use Quad#equals is in my Dataset implementation, inside individual instances.
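
For illustration, here is a minimal sketch of the behaviour described above. The `namedNode` and `quad` helpers are hypothetical stand-ins written for this example, not taken from any library; they only assume that the quad's `equals` compares all four components, graph included.

```javascript
// Hypothetical stand-ins for RDF/JS terms; not taken from any real library.
const namedNode = (value) => ({
  termType: 'NamedNode',
  value,
  equals(other) {
    return !!other && other.termType === this.termType && other.value === this.value;
  },
});

// A quad whose equals() compares subject, predicate, object AND graph.
const quad = (subject, predicate, object, graph) => ({
  subject, predicate, object, graph,
  equals(other) {
    return !!other &&
      this.subject.equals(other.subject) &&
      this.predicate.equals(other.predicate) &&
      this.object.equals(other.object) &&
      this.graph.equals(other.graph);
  },
});

const thing = namedNode('http://example.com/thing');
const type = namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#type');
const Thing = namedNode('http://example.com/Thing');

// The same statement as it appears in the two named graphs of dataset.trig.
const inA = quad(thing, type, Thing, namedNode('http://example.org/a.ttl'));
const inB = quad(thing, type, Thing, namedNode('http://example.org/b.ttl'));

// Graph-blind comparison of the statement itself.
const sameStatement = (a, b) =>
  a.subject.equals(b.subject) &&
  a.predicate.equals(b.predicate) &&
  a.object.equals(b.object);

console.log(inA.equals(inB));         // false: the graph names differ
console.log(sameStatement(inA, inB)); // true: subject, predicate and object match
```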

If we get them as representations returned from dereferencing the same IRI, I don't think that aligns very well with https://www.w3.org/TR/rdf11-concepts/#section-dataset-conneg

Why should parsers for different media types emit the same thing? Nobody ever complained that JSON parsers can't return a DOM because HTML does.

The paragraph you reference specifically suggests how to convert the two data models between one another for compatibility, an admission they're not the same thing:

If an RDF dataset is returned and the consumer is expecting an RDF graph, the consumer is expected to use the RDF dataset's default graph.

You can do it by having two different datasets, each having its distinct default graph.

This is not intuitive. The point of having a Dataset is to manage multiple graphs; if you add "... unless you need multiple unnamed graphs, then use multiple Datasets" now I have to decide what to do in the case two datasets have conflicting information about the same named graph.

Jan 23 '19 22:01 awwright

@awwright This spec is not about covering everything for every library that does RDF; it's about a common set of interfaces most of us can agree on. That requires keeping it open enough that custom features can be added while staying spec compliant, and without (big) performance drawbacks.

I think what you want to do is possible in a spec compliant way. Maybe it must be implemented in a not so nice way, but it's possible. E.g. you can make a tripleEquals(a, b) function that compares only the SPO part of a quad. Also, the spec doesn't forbid implementing a Graph with indexes only for SPO. If you also add graph = new DefaultGraph() to the prototype of Quad, it requires close to zero additional memory.
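
A minimal sketch of the prototype idea, assuming hand-written `DefaultGraph` and `Quad` classes (both hypothetical here; a real implementation would come from an RDF/JS data factory). Quads created without a graph store no own `graph` property and fall back to one shared instance on the prototype.

```javascript
// Hypothetical minimal classes written for this example only.
class DefaultGraph {
  constructor() {
    this.termType = 'DefaultGraph';
    this.value = '';
  }
  equals(other) {
    return !!other && other.termType === 'DefaultGraph';
  }
}

class Quad {
  constructor(subject, predicate, object, graph) {
    this.subject = subject;
    this.predicate = predicate;
    this.object = object;
    // Only store an own `graph` property when a graph is actually given.
    if (graph) this.graph = graph;
  }
}
// Shared fallback: every quad without an own `graph` reads this single instance.
Quad.prototype.graph = new DefaultGraph();

const s = { termType: 'NamedNode', value: 'http://example.com/thing' };
const p = { termType: 'NamedNode', value: 'http://example.com/p' };
const o = { termType: 'Literal', value: 'x' };

const t = new Quad(s, p, o);
const u = new Quad(s, p, o);

console.log(t.graph.termType);                                 // DefaultGraph
console.log(Object.prototype.hasOwnProperty.call(t, 'graph')); // false
console.log(t.graph === u.graph);                              // true: one shared instance
```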

It looks like there is a consensus that triple will be removed (#124). If there is any chance to convince people of your point, I think code examples would be the best option with a comparison of doing it the spec way vs. your way.

Jan 24 '19 14:01 bergos

@bergos That's a fair point, but then the question becomes why Quad? I see evidence that Triple is somewhat simpler, more intuitive, has more uses, and so is more likely to see adoption.

Up until this one, every RDF library has exposed Triples, and now that we've had some implementation experience, what evidence do we have so far that Quads produces a more successful API?

And consider Quad#equals, which has little need for standardization (I can't use it to compare the atoms of the statement, it's not necessary for interoperability, and the few libraries that need it can just implement their own, yes?)

But that's assuming it's a one-or-the-other proposition. We can have both; it's the natural progression of standards to expand in scope as different implementations realize they're implementing the same features.

Jan 24 '19 19:01 awwright

Parsers for JSON-LD, TriG or N-Quads require having a Quad. I don't know what an option/compromise for Triples handling Quads could look like without being a Quad. The other way round, using DefaultGraph for the graph, looks like much less of a compromise. Also, @timbl said there is always a 4th property for a statement (graph, or why in rdflib.js).

Jan 24 '19 21:01 bergos

I think handling #117 cleanly would be affected by having another option of an undefined graph. In practice, one would have to use either triples or quads and never mix them, to avoid unpredictable behavior.

And consider Quad#equals, which has little need for standardization (I can't use it to compare the atoms of the statement, it's not necessary for interoperability, the few libraries that need can just implement their own, yes?)

I am considering proposing not to include Quad#equals or Term#equals in the spec.

Jan 24 '19 22:01 elf-pavlik

In #153 we removed the Triple alias and the DataFactory.triple() method, while recommending that a triple be represented as a quad with graph set to an instance of DefaultGraph.

#154 shows further motivation for it: Source#match() will query the 'union' (all graphs) when null or undefined is passed as the graph argument. To query just the 'default graph', one can pass an instance of DefaultGraph as the graph argument. Allowing an undefined graph on Quad, or having a Triple with graph: undefined, wouldn't work well with Source#match, since one couldn't query for just those.
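
A rough sketch of those matching semantics, using a synchronous array filter as a stand-in for Source#match (which returns a stream in the spec). The `term`, `sameTerm` and `match` helpers are assumptions made for this example, and terms are compared only by `termType` and `value`.

```javascript
// Hypothetical plain-object terms; only termType and value are compared
// (literal datatype and language are ignored in this simplification).
const term = (termType, value) => ({ termType, value });
const sameTerm = (a, b) => a.termType === b.termType && a.value === b.value;

const defaultGraph = term('DefaultGraph', '');
const g1 = term('NamedNode', 'http://example.org/g1');
const s = term('NamedNode', 'http://example.com/s');
const p = term('NamedNode', 'http://example.com/p');
const o = term('Literal', 'o');

const quads = [
  { subject: s, predicate: p, object: o, graph: defaultGraph },
  { subject: s, predicate: p, object: o, graph: g1 },
];

// Synchronous stand-in for Source#match: null or undefined is a wildcard.
function match(quads, subject, predicate, object, graph) {
  const ok = (pattern, actual) => pattern == null || sameTerm(pattern, actual);
  return quads.filter((q) =>
    ok(subject, q.subject) && ok(predicate, q.predicate) &&
    ok(object, q.object) && ok(graph, q.graph));
}

console.log(match(quads, s, null, null, null).length);         // 2: union of all graphs
console.log(match(quads, s, null, null, defaultGraph).length); // 1: default graph only
```

With a third state such as graph: undefined on the quad itself, there would be no pattern left to select only those quads, since undefined in the pattern already means "any graph".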

Mar 04 '19 21:03 elf-pavlik

@awwright Quad is the more generic class of Triple. I think it would be easy to support your idea of Triple via subclassing Quad in some impl, but getting something like this into the spec is likely much harder. However, I do think it would be less cumbersome to propose (yet another) term type, e.g., 'NoGraph' (kinda like how we currently use 'DefaultGraph'), as opposed to introducing a new class Triple. This would ostensibly allow all the features you describe without any special handling in existing implementations, while also allowing for certain optimizations depending on the method/use-case. I'm sure people here will have qualms with the idea of proposing a new term type but if supporting Triples the way you describe them is a priority this might be the more compatible approach.

Mar 04 '19 21:03 blake-regalia

@blake-regalia What's the behavior for equals then? Two triples are the same Triple regardless of which graphs they're found in. Do we change the behavior of equals to match this?

What about implementations that rely on using equals to sort and do strict equality of the graph property? Does the function become non-commutative?

Mar 04 '19 22:03 awwright

Two triples are the same Triple regardless of which graphs they're found in. Do we change the behavior of equals to match this?

If you 'find it in a graph' then you have a Quad. As I understand @blake-regalia, a Triple would always have graph set to the suggested NoGraph; as soon as we have graph set to a NamedNode or DefaultGraph (or a BlankNode in generalized RDF), we have a Quad.

BTW, personally I don't see a need for the suggested Triple and NoGraph, and I think we might just be struggling with different ways of looking at the same thing.

Mar 04 '19 22:03 elf-pavlik

If you 'find it in a graph' then you have a Quad.

A Quad is an assertion that a given Triple exists in a given Graph. It makes sense to talk about two Quads I find and ask if they're the same Triple!

Mar 04 '19 22:03 awwright

// Compares only subject, predicate and object; the graph component is ignored.
function equalTriples(some, other) {
  return some.subject.equals(other.subject) &&
         some.predicate.equals(other.predicate) &&
         some.object.equals(other.object)
}

Mar 04 '19 22:03 elf-pavlik

@elf-pavlik Sure, but you also have to explain why we instead have a thoroughly useless Quad#equals function that doesn't follow RDF semantics at all.

Mar 04 '19 22:03 awwright

For illustration, here's an example of a problem I just ran into: I've got a Dataset of RDF statements, organized by source file. I aggregate this dataset into a single graph, and use this graph to build a search index, sitemaps, tables of data, and other queries across the whole collection. Doing this in a Dataset is possible. (Dataset/Quad is, indeed, a superset of Graph/Triple.)

However, it's clunky: In some cases, I would find the same RDF statement serialized to my Turtle file multiple times. I have to map/reduce the Dataset to another Dataset, changing the graph property to a constant. I have to write an assertion to check that the data I pull out of this aggregate dataset has the correct graph property (to protect against future changes). And having this in a separate Dataset sort of defies the point of a Dataset (which is to store multiple graphs).

Some of the statements will be found in multiple graphs, but Quad doesn't have a mechanism to specify more than one graph. So instead I use a Triple. And if (for some reason) I need to determine what graphs the triple is found in, I can query the Dataset's SPOG index.
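
As a sketch of that aggregation, assuming a dataset represented as a plain array of quad-like objects (the `nn`, `key` and `tripleKey` helpers are hypothetical and only handle named nodes properly): distinct statements are collected once, and the graphs each one was seen in are kept alongside.

```javascript
// Hypothetical plain-object terms and quads; a real Dataset would supply these.
const nn = (value) => ({ termType: 'NamedNode', value });
const key = (t) => `${t.termType}|${t.value}`;
// Simplified key: ignores literal datatype and language.
const tripleKey = (q) => [q.subject, q.predicate, q.object].map(key).join(' ');

const s = nn('http://example.com/thing');
const p = nn('http://www.w3.org/1999/02/22-rdf-syntax-ns#type');
const o = nn('http://example.com/Thing');
const o2 = nn('http://example.com/Other');

const dataset = [
  { subject: s, predicate: p, object: o, graph: nn('http://example.org/a.ttl') },
  { subject: s, predicate: p, object: o, graph: nn('http://example.org/b.ttl') },
  { subject: s, predicate: p, object: o2, graph: nn('http://example.org/b.ttl') },
];

// Collapse the dataset to distinct statements and remember where each was seen.
const graphsByTriple = new Map();
for (const q of dataset) {
  const k = tripleKey(q);
  if (!graphsByTriple.has(k)) {
    graphsByTriple.set(k, {
      subject: q.subject, predicate: q.predicate, object: q.object, graphs: [],
    });
  }
  graphsByTriple.get(k).graphs.push(q.graph.value);
}

const triples = [...graphsByTriple.values()];
console.log(triples.length);    // 2 distinct statements from 3 quads
console.log(triples[0].graphs); // the two source graphs of the first statement
```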

If I get a Quad, I don't know if the graph property is significant or not. If I get a Quad in the default graph, can I assume future Quads will also be in the default graph? Or will I have to add code to handle different graphs?

To safely process a Quad I always have to handle all four properties. But the application doesn't always require this, and sometimes the semantics are undefined or under-constrained. The solution here seems to be to throw if the Dataset defines more than one graph. I find this dubious.

Mar 11 '19 21:03 awwright

@awwright similar to my point here: https://github.com/rdfjs/data-model-spec/issues/159#issuecomment-469442618

Mar 11 '19 21:03 namedgraph

If the point of the RDF/JS spec is to form consensus around the API, why is it going against what established RDF APIs have been doing for 20 years? Is it a case of NIH syndrome?

Take RDF4J, Jena, ruby-rdf: every single one of them contains an abstraction for graph and triple. That is because RDF 1.0 only standardized those.

Datasets and quads came much later, with SPARQL. Eventually they made it into RDF 1.1.

So if a developer is familiar with RDF at all, there is a much bigger chance s/he is familiar with triples and not quads. And this API does not even contain such terms. Why alienate and confuse potential users?

Mar 11 '19 21:03 namedgraph

For the sake of argument, here's a couple considerations:

It might be the case a different API would bring more success to RDF usage by applications. Now that we have some implementation experience, how does this theory bear out?
Many people only need an interface for data exchange that plays to the unique qualities of ECMAScript. But if a media type like Turtle is too unwieldy, why not specify a vocabulary like JSON-LD? (And in any event, I'm here because I want a standard API to manipulate a data structure the same way the DOM API manipulates XML (or compatible) documents.)

Mar 11 '19 22:03 awwright

why is it going against what established RDF APIs have been doing for 20 years?

RDF 1.1, TriG and N-Quads all have a 2014 release date. I think APIs started 20 years ago might not have taken Datasets and named graphs into account.

Let's think of this simple experiment: let's serve exactly the same representation for application/n-triples and application/n-quads, and similarly for text/turtle and application/trig. Actually, this should even work if we respond with the same N-Triples based content for each media type above.

What graph will the parser assign when parsing

_:b0 <http://schema.org/jobTitle> "Professor" .
_:b0 <http://schema.org/name> "Jane Doe" .
_:b0 <http://schema.org/telephone> "(425) 123-4567" .
_:b0 <http://schema.org/url> <http://www.janedoe.com> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

served with content type application/n-quads or application/trig?

clue

https://www.w3.org/TR/n-quads/#rdf-dataset-construction

This RDF triple is added to the graph labeled by the production graphLabel, if no graphLabel is present the triple is added to the RDF datasets default graph.

https://www.w3.org/TR/trig/#output-graph

The state curGraph is initially unset. It records the label of the graph for triples produced during parsing. If undefined, the default graph is used.

Mar 12 '19 17:03 elf-pavlik

@elf-pavlik is that a trick question? The default graph. What does that prove?

I think a more relevant experiment is reading such data, then taking it from the default graph and storing it into a named graph, whose name is most likely the URI the data was read from. Am I supposed to iterate the quads to change the graph component to do that?

This just goes against working with graphs as units (and triples as their constituents). That is important because currently Linked Data is graph-based, not quad-based.

Mar 17 '19 18:03 namedgraph

I think a more relevant experiment is reading such data, then taking it from the default graph and storing it into named graph, which name is most likely the URI the data was read from. Am I supposed to iterate the quads to change the graph component to do that?

Thinking about the immutability conversation in #81, changing the graph component doesn't sound like the way to go. I think one would either use a transform stream that creates a copy of each quad with a different graph, or, in the same way one can give a baseIRI to the parser, provide some kind of nameForDefaultGraph. That way quads could have that IRI instead of the default graph from the beginning. I'll create an issue for that in the stream-spec repo.
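
A minimal sketch of the copy-instead-of-mutate idea, using a generator as a simplified stand-in for a transform stream. The name `withNamedDefaultGraph` and the plain-object quads are assumptions for this example; `nameForDefaultGraph` is not an existing parser option.

```javascript
// Hypothetical plain-object terms written for this example only.
const defaultGraph = { termType: 'DefaultGraph', value: '' };
const nn = (value) => ({ termType: 'NamedNode', value });

// Yields a copy of each default-graph quad with the given graph;
// quads already in another graph pass through, and inputs stay untouched.
function* withNamedDefaultGraph(quads, graphName) {
  for (const q of quads) {
    yield q.graph.termType === 'DefaultGraph' ? { ...q, graph: graphName } : q;
  }
}

const parsed = [
  {
    subject: nn('http://example.com/s'),
    predicate: nn('http://example.com/p'),
    object: nn('http://example.com/o'),
    graph: defaultGraph,
  },
];
const source = nn('http://example.org/a.ttl');
const renamed = [...withNamedDefaultGraph(parsed, source)];

console.log(renamed[0].graph.value);   // http://example.org/a.ttl
console.log(parsed[0].graph.termType); // DefaultGraph (original quad unchanged)
```

With class-based quads, the object spread would drop prototype methods, so a real version would more likely build the copy through the data factory.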

Mar 18 '19 17:03 elf-pavlik

Something I either didn't see or forgot to mention:

Quad is the more generic class of Triple.

This is not true because Quad always makes an assertion that some graph contains some triple; Triple does not do this. Therefore, the two classes have disjoint semantics.

Apr 25 '22 23:04 awwright
