SPARQL Substitution on INSERT does not preserve Blank Node IDs

Open HolgerKnublauch opened this issue 6 months ago • 8 comments

Version

5.4

What happened?

We are switching from initialBindings to substitution in preparation for initialBindings removal in Jena 6.

There is a difference in behavior that seems to be a show stopper for us. See the following test case. The expectation is that when I pass a blank node as binding, it should use exactly that blank node, but it seems to create a fresh blank node.

package org.topbraid.jenax.model;

import static org.junit.jupiter.api.Assertions.*;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.Syntax;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.Var;
import org.apache.jena.sparql.engine.binding.Binding;
import org.apache.jena.sparql.engine.binding.BindingFactory;
import org.apache.jena.sparql.exec.UpdateExec;
import org.apache.jena.update.UpdateFactory;
import org.apache.jena.update.UpdateRequest;
import org.junit.jupiter.api.Test;

class SubstitutionBlankNodeTest {

	@Test
	void test() {
		
		String updateString = 
				"PREFIX : <http://example.org/>\n" +
				"INSERT { $node :p :o }\n" +
				"WHERE { }\n";
		
		UpdateRequest request = UpdateFactory.create(updateString, Syntax.syntaxARQ);
		Model model = ModelFactory.createDefaultModel();
		Dataset dataset = DatasetFactory.create(model);
		DatasetGraph dsg = dataset.asDatasetGraph();
		Node node = NodeFactory.createBlankNode();
		Binding binding = BindingFactory.binding(Var.alloc("node"), node);
		UpdateExec.newBuilder().
				dataset(dsg).
				update(request).
				substitution(binding).  // Fails
				// initialBinding(binding). // Works
				build().
				execute();
        
		int tripleCount = model.getGraph().find(node, Node.ANY, Node.ANY).toList().size();
		
		assertEquals(1, tripleCount);
	}
}

Relevant output and stacktrace


Are you interested in making a pull request?

None

HolgerKnublauch avatar Jun 20 '25 07:06 HolgerKnublauch

would this not be considered a "normal" expectation for blank nodes? I presume you are assigning a different resource to the binding but not in the sense of owl:sameAs?

neumarcx avatar Jun 20 '25 09:06 neumarcx

I assume this is a simplified example but I think the reason it doesn't work is because ARQ is implementing the SPARQL Update spec strictly, more specifically Section 3.1.3 DELETE/INSERT which says:

Blank nodes that appear in an INSERT clause operate similarly to blank nodes in the template of a CONSTRUCT query, i.e., they are re-instantiated for any solution of the WHERE clause; refer to Templates with Blank Nodes in SPARQL Query 1.1 and to the formal semantics of DELETE/INSERT below for details

In the substitution case your update ends up being the following:

PREFIX : <http://example.org/>
INSERT { _:someBlankNode :p :o }
WHERE { }

So _:someBlankNode is used to generate a fresh blank node for each solution from the WHERE clause.

Assuming this is some kind of Jena-backed storage, you may be able to use ARQ's blank node URI form, i.e. NodeFactory.createURI("_:" + blankNode.getBlankNodeLabel()), and pass that in the binding.

rvesse avatar Jun 20 '25 10:06 rvesse

It worked well with initialBindings. If pre-binding blank nodes is not supported in UPDATEs then I guess the API should throw an error.

I hope it's just an oversight and the old behavior can be implemented for substitution too, as it otherwise significantly reduces the utility of SPARQL in Jena. In many use cases it is unpredictable whether a subject is an IRI or a blank node. If SPARQL can only be used for the former, I (and our customers) would need to replace all usages of SPARQL in programs and fall back to an API such as the Graph API where blank nodes have predictable identities.

HolgerKnublauch avatar Jun 20 '25 13:06 HolgerKnublauch

I guess the API should throw an error.

It has been deprecated since Jena 5.0.0.

https://github.com/apache/jena/issues/2028

afs avatar Jun 20 '25 14:06 afs

Yes, initialBindings has been long deprecated. But I think substitution(binding) as used in my test case should throw an exception if it contains a blank node, assuming this is no longer supported.

HolgerKnublauch avatar Jun 20 '25 14:06 HolgerKnublauch

Assuming this is some kind of Jena-backed storage, you may be able to use ARQ's blank node URI form, i.e. NodeFactory.createURI("_:" + blankNode.getBlankNodeLabel()), and pass that in the binding.

It must be a Jena database because UpdateExec.newBuilder().dataset(dsg).

substitution works uniformly for all triple stores (local and remote, Jena and non-Jena) including reparsing.

afs avatar Jun 20 '25 14:06 afs

The initialBinding/substitute issue here seems to be more about whether the output contains the same blank node as an RDF term. If it goes through a result set syntax, then it won't be.

Initial binding in local execution is a limitation affecting all query execution and hinders development of the query execution.

A Jena API problem is that UpdateExec.newBuilder() returns an UpdateExecDatasetBuilder, not an UpdateExecBuilder (same for QueryExec).

Saying what the user wants directly - UpdateExec.dataset(dsg),QueryExec.dataset(dsg) - gives more flexibility.

UpdateExecDatasetBuilder.create() returns the specifically local builder. UpdateExecDataset.newBuilder() is missing; QueryExecDataset.newBuilder() exists.

afs avatar Jun 20 '25 15:06 afs

the Graph API where blank nodes have predictable identities.

They do in the Model API as well.

UpdateTransformOps / QueryTransformOps is also a possible approach.

afs avatar Jun 20 '25 15:06 afs

Did some further experiments, and noticed that (with substitution) SELECT is still working as before (with initialBindings), while CONSTRUCT also seems to create fresh blank nodes.

import static org.junit.jupiter.api.Assertions.*;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionDatasetBuilder;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.QuerySolutionMap;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Literal;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.vocabulary.RDFS;
import org.junit.jupiter.api.Test;

public class BindingQueryTest {
	
	private static boolean SUBSTITUTION = true;

	private QueryExecution create(Query query, Dataset dataset, QuerySolution initialBinding) {
		QueryExecutionDatasetBuilder builder = QueryExecution
				.create()
				.dataset(dataset)
				.query(query);
		if(SUBSTITUTION) {
			builder = builder.substitution(initialBinding);
		}
		else {
			builder = builder.initialBinding(initialBinding);
		}
		return builder.build();
	}

	
	@Test
	public void testSelect() {
		Model model = ModelFactory.createDefaultModel();
		Dataset dataset = DatasetFactory.create(model);
		Resource blank1 = model.createResource();
		Resource blank2 = model.createResource();
		Literal label1 = model.createTypedLiteral("One");
		blank1.addLiteral(RDFS.label, label1);
		blank2.addLiteral(RDFS.label, "Two");

		QuerySolutionMap binding = new QuerySolutionMap();
		binding.add("node", blank1);
		
		{
			Query query = QueryFactory.create("CONSTRUCT { $node ?p ?label } WHERE { $node ?p ?label }");
			QueryExecution qexec = create(query, dataset, binding);
			Model result = qexec.execConstruct();
			assertEquals(1, result.size());
			Statement s = result.listStatements().next();
			assertEquals(blank1, s.getSubject());
			assertEquals(label1, s.getObject());
		}

		{
			Query query = QueryFactory.create("SELECT $node ?label WHERE { $node ?p ?label }");
			QueryExecution qexec = create(query, dataset, binding);
			ResultSet rs = qexec.execSelect();
			assertTrue(rs.hasNext());
			QuerySolution s = rs.next();
			assertEquals(label1, s.get("label"));
			assertEquals(blank1, s.get("node"));
			assertFalse(rs.hasNext());
		}
	}
}

What is the reason why SELECT would still work as before while CONSTRUCT/INSERT produce new blank nodes?

Also, are there technical reasons why the old behaviour couldn't be restored for substitution too? At least as a flag?

This is critical for the continued use of SPARQL in our stack. We may not be able to upgrade to 6.0 and may instead have to maintain our own fork of Jena.

HolgerKnublauch avatar Jun 23 '25 03:06 HolgerKnublauch

What is the reason why SELECT would still work as before while CONSTRUCT/INSERT produce new blank nodes?

As I already pointed out, the SPARQL specification defines the semantics of blank nodes in templates, used in CONSTRUCT and INSERT, specifically such that blank nodes are treated as placeholders used to generate new blank nodes for each solution.

Whereas for a SELECT it depends where you inserted the blank node. In your example you put it in the project expressions, which happens to work (though honestly I'm surprised it does; that may actually be a bug in substitution). If you had put it in the BGP then it would most likely have been treated as a variable and probably not given you the result you wanted/expected.

Also, are there technical reasons why the old behaviour couldn't be restored for substitution too?

Yes, as @afs already said:

Initial binding in local execution is a limitation affecting all query execution and hinders development of the query execution.

It's a feature that only works for local execution and mostly does so by subverting SPARQL semantics hence why you are able to get some of the behaviours you consider "expected".

Legacy features in the project are maintenance overhead, the project is driven entirely by volunteer effort and has no paid contributors. While some of us have from time to time been afforded some small fraction of our $dayjob to make contributions it has been an incredibly small fraction of our time, and usually only to address specific issues pertinent to our employers. Volunteers cannot maintain every feature forever, especially when a feature hinders evolution of the project that volunteers actually want/have the time and energy to spend on it.

Whereas the new approach, substitution, aims to be standards-compliant and works across all kinds of query execution, including remote.

We may not be able to upgrade to 6.0 and may instead have to maintain our own fork of Jena.

That seems somewhat of an extreme reaction.

Many of Jena's APIs, including query execution, are inherently designed to be extensible. If you really cannot live without initial bindings then you would likely be better off simply providing a custom query engine for your use case that provides that feature, lifting the deprecated code from Jena as needed.

Custom query engines are pretty lightweight to maintain and keep up to date with Jena releases. I maintained one for many years in a previous job where the SPARQL engine needed specific algebra optimisations applying, custom algebra for some operators etc.

rvesse avatar Jun 23 '25 09:06 rvesse

Use:

        PREFIX : <http://example.org/>
        INSERT { ?N :p :o }
        WHERE  { BIND ( $node AS ?N ) }

afs avatar Jun 23 '25 09:06 afs

Thanks @afs for the work-around, which seems to work.

But why does that work if the explanation is that INSERT should always produce new blank nodes? This doesn't look consistent.

(Also it wouldn't really solve our problem that basically all SPARQL queries that may take a blank node now need to be rewritten).

@rvesse yes I do consider behavior "expected" when it was working like that for 15+ years. The new semantics have no conceivable advantages. If people want fresh blank nodes, they don't need to use pre-binding. If pre-binding doesn't preserve the blank nodes in the output, it becomes rather useless.

HolgerKnublauch avatar Jun 25 '25 08:06 HolgerKnublauch

if the explanation is that INSERT should always produce new blank nodes?

New blank nodes are created, on each use of the template, for each blank node written in the INSERT or CONSTRUCT template. Here the blank node is the value of a variable, not a blank node written into the template.

        INSERT { ?N :p :o }

does not have a blank node written in it.

Substitution uses syntax rewrite on the AST.
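
To make the distinction concrete, here is a toy model in plain Java. It is not Jena code: the class `TemplateSketch`, its methods, and the string encoding of terms are all made up for illustration. It only mimics the rule that a blank node written in a template is relabelled for every solution, while a variable takes its value from the solution unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of INSERT/CONSTRUCT template instantiation. Not Jena code. */
public class TemplateSketch {

    private static int counter = 0;

    /** Returns a blank node label that has not been handed out before. */
    public static String freshBlankNode() {
        return "_:fresh" + counter++;
    }

    /**
     * Instantiates one triple template once per solution.
     * Terms are plain strings: "_:x" is a blank node written in the template,
     * "?x" is a variable, anything else is used as-is.
     */
    public static List<String[]> instantiate(String[] template,
                                             List<Map<String, String>> solutions) {
        List<String[]> triples = new ArrayList<>();
        for (Map<String, String> solution : solutions) {
            // Written blank nodes are relabelled afresh for every solution.
            Map<String, String> relabel = new HashMap<>();
            String[] triple = new String[template.length];
            boolean complete = true;
            for (int i = 0; i < template.length; i++) {
                String term = template[i];
                if (term.startsWith("_:")) {
                    triple[i] = relabel.computeIfAbsent(term, k -> freshBlankNode());
                } else if (term.startsWith("?")) {
                    // A variable takes its value from the solution unchanged,
                    // even if that value is a blank node.
                    triple[i] = solution.get(term.substring(1));
                    complete &= triple[i] != null;
                } else {
                    triple[i] = term;
                }
            }
            if (complete) {
                triples.add(triple); // triples with unbound variables are dropped
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        // WHERE { } yields exactly one solution with no bindings.
        Map<String, String> empty = new HashMap<>();
        // After substitution, the blank node is written into the template.
        String[] substituted = {"_:b0", ":p", ":o"};
        System.out.println(String.join(" ", instantiate(substituted, List.of(empty)).get(0)));

        // With BIND($node AS ?N), the blank node arrives as a variable value.
        Map<String, String> bound = new HashMap<>();
        bound.put("N", "_:b0");
        String[] viaVariable = {"?N", ":p", ":o"};
        System.out.println(String.join(" ", instantiate(viaVariable, List.of(bound)).get(0)));
    }
}
```

In this model the first triple gets a subject label other than `_:b0`, while the second keeps `_:b0`, which is the difference between the substituted template and the BIND form.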

afs avatar Jun 25 '25 08:06 afs

Ok, so if the blank node binding works in the WHERE clause, would it be possible to leave blank nodes in the INSERT/DELETE/CONSTRUCT parts as variables and then insert the pre-bound values as variables into the result set that feeds into the quads? In what cases would it ever make sense to pre-bind some blank node into an INSERT/DELETE/CONSTRUCT if it will have a different value in the end?

HolgerKnublauch avatar Jun 25 '25 09:06 HolgerKnublauch

Ok, so if the blank node binding works in the WHERE clause, would it be possible to leave blank nodes in the INSERT/DELETE/CONSTRUCT parts as variables and then insert the pre-bound values as variables into the result set that feeds into the quads?

IIUYC: that is what

        PREFIX : <http://example.org/>
        INSERT { ?N :p :o }
        WHERE  { BIND ( $node AS ?N ) }

is doing.

afs avatar Jun 25 '25 12:06 afs

Yes, but this requires changes to the SPARQL queries to basically reassign the pre-bound variables, and we and our customers have a large number of those.

HolgerKnublauch avatar Jun 25 '25 13:06 HolgerKnublauch

Your product controls the execution so run a transform as has already been mentioned.

TopQuadrant customers have a contract with TopQuadrant.

afs avatar Jun 25 '25 13:06 afs

Ok thanks, I have spent this morning adjusting our query executions to modify the queries and updates, inserting the extra BINDs at the end (just before the last }), and replacing the variables in the INSERT/DELETE/CONSTRUCT templates. It seems to work.
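
The rewrite described above can be sketched with plain string handling. This is an illustrative guess at the approach, not the code actually used: the class `PreBindRewriteSketch`, the method `rewrite`, and the `_out` suffix are hypothetical. It assumes a single operation with one `WHERE { ... }` clause whose closing brace is the last `}` in the string, and that the word WHERE does not occur earlier in a prefix, IRI, or literal. A robust version would work on the parsed query, for example via the UpdateTransformOps / QueryTransformOps route mentioned earlier in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive string-level sketch of the BIND rewrite; assumptions are listed above. */
public class PreBindRewriteSketch {

    /**
     * Replaces the pre-bound variable in the template part (everything before
     * WHERE) with a new variable and appends a BIND for it just before the
     * last closing brace of the WHERE clause.
     */
    public static String rewrite(String sparql, String varName) {
        Matcher where = Pattern.compile("\\bWHERE\\b", Pattern.CASE_INSENSITIVE).matcher(sparql);
        int lastBrace = sparql.lastIndexOf('}');
        if (!where.find() || lastBrace < where.end()) {
            throw new IllegalArgumentException("Expected a WHERE { ... } clause");
        }
        String outVar = "?" + varName + "_out"; // hypothetical naming scheme

        String templates = sparql.substring(0, where.start());
        String pattern = sparql.substring(where.start(), lastBrace);
        String rest = sparql.substring(lastBrace);

        // Match both ?name and $name, but not longer names such as ?nameX.
        String rewrittenTemplates = templates.replaceAll(
                "[?$]" + Pattern.quote(varName) + "\\b",
                Matcher.quoteReplacement(outVar));

        // The WHERE clause keeps $name so that substitution still applies there.
        return rewrittenTemplates + pattern
                + " BIND($" + varName + " AS " + outVar + ") " + rest;
    }

    public static void main(String[] args) {
        String update = "PREFIX : <http://example.org/>\n"
                + "INSERT { $node :p :o }\n"
                + "WHERE { }\n";
        System.out.println(rewrite(update, "node"));
    }
}
```

Applied to the update from the original test case, the template subject becomes `?node_out` and the WHERE clause gains `BIND($node AS ?node_out)`, which has the same shape as the work-around posted earlier.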

Having said this, I still believe the current implementation in Jena should do this under the hood automatically and expect other users to be equally surprised about this issue.

In any case, from my perspective the ticket could be closed as I have a work-around, and I thank you for the guidance on that.

HolgerKnublauch avatar Jun 26 '25 10:06 HolgerKnublauch