Performance issues when federating over multiple SPARQL endpoints
Issue type:
- :snail: Performance issue
Description:
I'm running the following command to create a federated endpoint:
comunica-sparql-http -w4 -t300 sparql@http://localhost:8081/sparql sparql@http://localhost:8082/sparql sparql@http://localhost:8083/sparql sparql@http://localhost:8084/sparql
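For reference, a roughly equivalent programmatic setup looks like the following (a minimal sketch, assuming @comunica/query-sparql v2.x; untested here):

```ts
// Minimal sketch: federate the same four endpoints programmatically with @comunica/query-sparql.
import { QueryEngine } from '@comunica/query-sparql';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  const bindingsStream = await engine.queryBindings(
    'SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100',
    {
      sources: [
        { type: 'sparql', value: 'http://localhost:8081/sparql' },
        { type: 'sparql', value: 'http://localhost:8082/sparql' },
        { type: 'sparql', value: 'http://localhost:8083/sparql' },
        { type: 'sparql', value: 'http://localhost:8084/sparql' },
      ],
    },
  );
  bindingsStream.on('data', (bindings) => console.log(bindings.toString()));
}

main().catch(console.error);
```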
Here are some useful metrics for each endpoint, obtained by running the following SPARQL query directly against each one (responses have content type application/sparql-results+json;charset=UTF-8):
SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100
endpoint | time | response size |
---|---|---|
1 | 161ms | 38.8 KB |
2 | 53.4s | 38.7 KB |
3 | 753ms | 38.5 KB |
4 | 227ms | 30.2 KB |
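(A measurement like this can be reproduced with a small script along the following lines; illustrative only, using Node 18's built-in fetch.)

```ts
// Illustrative timing check against a single endpoint (Node 18+, built-in fetch).
const query = 'SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100';

async function timeEndpoint(endpoint: string): Promise<void> {
  const start = Date.now();
  const response = await fetch(`${endpoint}?query=${encodeURIComponent(query)}`, {
    headers: { accept: 'application/sparql-results+json' },
  });
  const body = await response.text();
  console.log(`${endpoint}: ${Date.now() - start} ms, ${body.length} bytes`);
}

timeEndpoint('http://localhost:8081/sparql').catch(console.error);
```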
Running the query on the Comunica endpoint (http://localhost:3000/sparql), it takes about 2.57 min and returns nothing.
In the logs, I'm able to see the following:
Server running on http://localhost:3000/sparql
Server worker (79250) running on http://localhost:3000/sparql
Server worker (79248) running on http://localhost:3000/sparql
Server worker (79249) running on http://localhost:3000/sparql
Server worker (79247) running on http://localhost:3000/sparql
[200] POST to /sparql
Requested media type: application/sparql-results+json
Received query query: SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100
Worker 79250 got assigned a new query (0).
<--- Last few GCs --->
[79250:0x158040000] 158323 ms: Scavenge 4020.7 (4123.6) -> 4018.2 (4125.9) MB, 9.6 / 0.0 ms (average mu = 0.540, current mu = 0.479) task;
[79250:0x158040000] 158351 ms: Scavenge 4022.8 (4125.9) -> 4020.0 (4143.4) MB, 12.3 / 0.0 ms (average mu = 0.540, current mu = 0.479) task;
[79250:0x158040000] 163267 ms: Mark-sweep 4031.7 (4143.6) -> 4023.1 (4148.9) MB, 4863.0 / 0.0 ms (average mu = 0.254, current mu = 0.052) task; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 0x104811448 node::Abort() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
2: 0x10481162c node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
3: 0x104977fac v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
4: 0x104b367a0 v8::internal::EmbedderStackStateScope::EmbedderStackStateScope(v8::internal::Heap*, v8::internal::EmbedderStackStateScope::Origin, cppgc::EmbedderStackState) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
5: 0x104b351c4 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
6: 0x104bb9820 v8::internal::ScavengeJob::Task::RunInternal() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
7: 0x104871fbc node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
8: 0x104870cb0 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
9: 0x106c2ffb8 uv__async_io [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
10: 0x106c42d6c uv__io_poll [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
11: 0x106c3066c uv_run [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
12: 0x10474d940 node::SpinEventLoop(node::Environment*) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
13: 0x10484fdb0 node::NodeMainInstance::Run() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
14: 0x1047d9efc node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
15: 0x1047da1e8 node::Start(int, char**) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
16: 0x18b427f28 start [/usr/lib/dyld]
Worker 79250 died with SIGABRT. Starting new worker.
Server worker (79576) running on http://localhost:3000/sparql
I don't understand why it is hitting the 4 GB heap limit, since the results from each endpoint are very small. Is there a memory leak?
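(As a sanity check, the heap ceiling a worker actually runs with can be inspected directly; the snippet below is illustrative and uses Node's built-in v8 module.)

```ts
// Check the heap ceiling this Node process actually runs with (value is machine-dependent).
import { getHeapStatistics } from 'node:v8';

const limitMiB = getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap limit: ${limitMiB.toFixed(0)} MiB`);

// If needed, the ceiling can be raised when starting the server, e.g.
//   NODE_OPTIONS=--max-old-space-size=8192 comunica-sparql-http ...
// (this only treats the symptom; it does not change how much data gets buffered).
```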
Also, checking the logs of one of the endpoints, it seems Comunica is issuing the following requests:
SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o. }
and
SELECT ?s ?p ?o WHERE { ?s ?p ?o. }
The LIMIT keyword seems to be lost somewhere.
I guess this is why things explode when a large endpoint is in the list; Comunica ends up querying everything from it.
Is it possible to forward the LIMIT to the endpoints to avoid such issues?
Environment:
software | version |
---|---|
Comunica Engine | 2.8.2 |
node | v18.17.1 |
npm | 9.6.7 |
yarn | 1.22.19 |
Operating System | darwin (Darwin 22.5.0) |
Thanks for reporting!
This is a consequence of the federation algorithm that we use, which splits up queries at the level of triple patterns, and sends those to each SPARQL endpoint separately. The advantage is that it's very simple, and works over any type of interface (also other than SPARQL endpoints), but the downside is that it can cause performance/memory issues for complex queries or large datasets.
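To make that concrete, the strategy behaves roughly like the deliberately simplified, non-streaming sketch below (illustrative only, not Comunica's actual code; fetchAllMatches is a hypothetical helper):

```ts
// Triple-pattern-wise federation, simplified: every source is asked for *all* matches of each
// pattern (cf. the "SELECT ?s ?p ?o WHERE { ?s ?p ?o. }" requests in the endpoint logs above),
// and DISTINCT and LIMIT are only applied on the merged results afterwards.
async function federatedSelectAll(
  sources: string[],
  fetchAllMatches: (source: string) => Promise<string[]>, // hypothetical helper: all rows from one endpoint
): Promise<string[]> {
  const perSource = await Promise.all(sources.map(fetchAllMatches)); // full results per endpoint
  const merged = perSource.flat();        // union of everything, held client-side
  const distinct = [...new Set(merged)];  // DISTINCT applied locally
  return distinct.slice(0, 100);          // LIMIT 100 applied locally, only on the merged results
}
```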
> The LIMIT keyword seems to be lost somewhere.
While pushing the LIMIT down to the endpoints would work in this specific single-pattern case, it will not in the general case: as soon as a query joins multiple triple patterns, limiting each pattern's results at the source can discard bindings that are needed to produce the final results.
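A toy illustration of why a blind push-down is unsound once joins are involved (hypothetical data, plain TypeScript, not a Comunica API):

```ts
// Think of a query like: SELECT * WHERE { ?s <p> ?x . ?x <q> ?y } LIMIT 1
type RowA = { s: string; x: string };
type RowB = { x: string; y: string };

const sourceA: RowA[] = [{ s: 's1', x: 'x1' }, { s: 's2', x: 'x2' }]; // matches of ?s <p> ?x
const sourceB: RowB[] = [{ x: 'x2', y: 'y2' }];                       // matches of ?x <q> ?y

const join = (a: RowA[], b: RowB[]) =>
  a.flatMap((l) => b.filter((r) => r.x === l.x).map((r) => ({ ...l, ...r })));

// Correct: join first, limit last -> [{ s: 's2', x: 'x2', y: 'y2' }]
console.log(join(sourceA, sourceB).slice(0, 1));

// Unsound: limit source A first -> only { s1, x1 } is fetched, which has no join
// partner in source B, so the final answer is empty even though one exists.
console.log(join(sourceA.slice(0, 1), sourceB).slice(0, 1));
```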
In general, optimizations are definitely possible for this. Alternative federation algorithms (such as FedX) already exist, but require some significant implementation effort. We may implement this in the upcoming major update of Comunica (v3), but this might not be something for the very near future.
FYI, we're working on Comunica v3, which will focus on improving performance of federated querying across SPARQL endpoints. This should resolve this issue when completed.
Great, thanks for the update!
Comunica v3.x has been released, which may solve this issue. I'm closing this issue, but feel free to re-open if the problem still occurs, in which case we can look at it more closely.