Performance issues when federating over multiple SPARQL endpoints
Issue type:
- :snail: Performance issue
Description:
I'm running the following command to create a federated endpoint:
comunica-sparql-http -w4 -t300 sparql@http://localhost:8081/sparql sparql@http://localhost:8082/sparql sparql@http://localhost:8083/sparql sparql@http://localhost:8084/sparql
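For reference, a roughly equivalent programmatic setup looks like the following (a minimal sketch, assuming @comunica/query-sparql v2.x; untested here):

```ts
// Minimal sketch: federate the same four endpoints programmatically with @comunica/query-sparql.
import { QueryEngine } from '@comunica/query-sparql';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  const bindingsStream = await engine.queryBindings(
    'SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100',
    {
      sources: [
        { type: 'sparql', value: 'http://localhost:8081/sparql' },
        { type: 'sparql', value: 'http://localhost:8082/sparql' },
        { type: 'sparql', value: 'http://localhost:8083/sparql' },
        { type: 'sparql', value: 'http://localhost:8084/sparql' },
      ],
    },
  );
  bindingsStream.on('data', (bindings) => console.log(bindings.toString()));
}

main().catch(console.error);
```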
Here are some useful metrics for each endpoint, obtained by running the following SPARQL query directly against each one (responses have content type application/sparql-results+json;charset=UTF-8):
SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100
endpoint | time | response size |
---|---|---|
1 | 161ms | 38.8 KB |
2 | 53.4s | 38.7 KB |
3 | 753ms | 38.5 KB |
4 | 227ms | 30.2 KB |
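(A measurement like this can be reproduced with a small script along the following lines; illustrative only, using Node 18's built-in fetch.)

```ts
// Illustrative timing check against a single endpoint (Node 18+, built-in fetch).
const query = 'SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100';

async function timeEndpoint(endpoint: string): Promise<void> {
  const start = Date.now();
  const response = await fetch(`${endpoint}?query=${encodeURIComponent(query)}`, {
    headers: { accept: 'application/sparql-results+json' },
  });
  const body = await response.text();
  console.log(`${endpoint}: ${Date.now() - start} ms, ${body.length} bytes`);
}

timeEndpoint('http://localhost:8081/sparql').catch(console.error);
```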
Running the query on the Comunica endpoint (http://localhost:3000/sparql), it takes about 2.57 min and returns nothing.
In the logs, I'm able to see the following:
Server running on http://localhost:3000/sparql
Server worker (79250) running on http://localhost:3000/sparql
Server worker (79248) running on http://localhost:3000/sparql
Server worker (79249) running on http://localhost:3000/sparql
Server worker (79247) running on http://localhost:3000/sparql
[200] POST to /sparql
Requested media type: application/sparql-results+json
Received query query: SELECT DISTINCT * WHERE { ?s ?p ?o } LIMIT 100
Worker 79250 got assigned a new query (0).
<--- Last few GCs --->
[79250:0x158040000] 158323 ms: Scavenge 4020.7 (4123.6) -> 4018.2 (4125.9) MB, 9.6 / 0.0 ms (average mu = 0.540, current mu = 0.479) task;
[79250:0x158040000] 158351 ms: Scavenge 4022.8 (4125.9) -> 4020.0 (4143.4) MB, 12.3 / 0.0 ms (average mu = 0.540, current mu = 0.479) task;
[79250:0x158040000] 163267 ms: Mark-sweep 4031.7 (4143.6) -> 4023.1 (4148.9) MB, 4863.0 / 0.0 ms (average mu = 0.254, current mu = 0.052) task; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 0x104811448 node::Abort() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
2: 0x10481162c node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
3: 0x104977fac v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
4: 0x104b367a0 v8::internal::EmbedderStackStateScope::EmbedderStackStateScope(v8::internal::Heap*, v8::internal::EmbedderStackStateScope::Origin, cppgc::EmbedderStackState) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
5: 0x104b351c4 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
6: 0x104bb9820 v8::internal::ScavengeJob::Task::RunInternal() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
7: 0x104871fbc node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
8: 0x104870cb0 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
9: 0x106c2ffb8 uv__async_io [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
10: 0x106c42d6c uv__io_poll [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
11: 0x106c3066c uv_run [/nix/store/3a685f2r0l2fnz899vwl70vl36yykj0r-libuv-1.46.0/lib/libuv.1.dylib]
12: 0x10474d940 node::SpinEventLoop(node::Environment*) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
13: 0x10484fdb0 node::NodeMainInstance::Run() [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
14: 0x1047d9efc node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
15: 0x1047da1e8 node::Start(int, char**) [/nix/store/n4pkh2cs837cak2kyjgd6sjskcqqb1gr-nodejs-18.17.1/bin/node]
16: 0x18b427f28 start [/usr/lib/dyld]
Worker 79250 died with SIGABRT. Starting new worker.
Server worker (79576) running on http://localhost:3000/sparql
I don't understand why it is hitting the 4 GB heap limit, since the results from each endpoint are very small. Is there a memory leak?
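(As a sanity check, the heap ceiling a worker actually runs with can be inspected directly; the snippet below is illustrative and uses Node's built-in v8 module.)

```ts
// Check the heap ceiling this Node process actually runs with (value is machine-dependent).
import { getHeapStatistics } from 'node:v8';

const limitMiB = getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap limit: ${limitMiB.toFixed(0)} MiB`);

// If needed, the ceiling can be raised when starting the server, e.g.
//   NODE_OPTIONS=--max-old-space-size=8192 comunica-sparql-http ...
// (this only treats the symptom; it does not change how much data gets buffered).
```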
Also, checking the logs of one of the endpoints, it seems Comunica is issuing the following requests:
SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o. }
and
SELECT ?s ?p ?o WHERE { ?s ?p ?o. }
The LIMIT keyword seems to be lost somewhere.
I guess this is why things explode when a large endpoint is in the list; Comunica ends up querying everything from it.
Is it possible to forward the LIMIT to the endpoints to avoid such issues?
Environment:
software | version |
---|---|
Comunica Engine | 2.8.2 |
node | v18.17.1 |
npm | 9.6.7 |
yarn | 1.22.19 |
Operating System | darwin (Darwin 22.5.0) |
Thanks for reporting!
This is a consequence of the federation algorithm that we use, which splits up queries at the level of triple patterns, and sends those to each SPARQL endpoint separately. The advantage is that it's very simple, and works over any type of interface (also other than SPARQL endpoints), but the downside is that it can cause performance/memory issues for complex queries or large datasets.
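To make that concrete, the strategy behaves roughly like the deliberately simplified, non-streaming sketch below (illustrative only, not Comunica's actual code; fetchAllMatches is a hypothetical helper):

```ts
// Triple-pattern-wise federation, simplified: every source is asked for *all* matches of each
// pattern (cf. the "SELECT ?s ?p ?o WHERE { ?s ?p ?o. }" requests in the endpoint logs above),
// and DISTINCT and LIMIT are only applied on the merged results afterwards.
async function federatedSelectAll(
  sources: string[],
  fetchAllMatches: (source: string) => Promise<string[]>, // hypothetical helper: all rows from one endpoint
): Promise<string[]> {
  const perSource = await Promise.all(sources.map(fetchAllMatches)); // full results per endpoint
  const merged = perSource.flat();        // union of everything, held client-side
  const distinct = [...new Set(merged)];  // DISTINCT applied locally
  return distinct.slice(0, 100);          // LIMIT 100 applied locally, only on the merged results
}
```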
> The LIMIT keyword seems to be lost somewhere.
While pushing the LIMIT down to the endpoints would work in this specific single-pattern case, it will not in the general case: as soon as a query joins multiple triple patterns, limiting each pattern's results at the source can discard bindings that are needed to produce the final results.
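A toy illustration of why a blind push-down is unsound once joins are involved (hypothetical data, plain TypeScript, not a Comunica API):

```ts
// Think of a query like: SELECT * WHERE { ?s <p> ?x . ?x <q> ?y } LIMIT 1
type RowA = { s: string; x: string };
type RowB = { x: string; y: string };

const sourceA: RowA[] = [{ s: 's1', x: 'x1' }, { s: 's2', x: 'x2' }]; // matches of ?s <p> ?x
const sourceB: RowB[] = [{ x: 'x2', y: 'y2' }];                       // matches of ?x <q> ?y

const join = (a: RowA[], b: RowB[]) =>
  a.flatMap((l) => b.filter((r) => r.x === l.x).map((r) => ({ ...l, ...r })));

// Correct: join first, limit last -> [{ s: 's2', x: 'x2', y: 'y2' }]
console.log(join(sourceA, sourceB).slice(0, 1));

// Unsound: limit source A first -> only { s1, x1 } is fetched, which has no join
// partner in source B, so the final answer is empty even though one exists.
console.log(join(sourceA.slice(0, 1), sourceB).slice(0, 1));
```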
In general, optimizations are definitely possible for this. Alternative federation algorithms (such as FedX) already exist, but require some significant implementation effort. We may implement this in the upcoming major update of Comunica (v3), but this might not be something for the very near future.
FYI, we're working on Comunica v3, which will focus on improving performance of federated querying across SPARQL endpoints. This should resolve this issue when completed.
Great, thanks for the update!
Comunica v3.x has been released, which may solve this issue. I'm closing this issue, but feel free to re-open if the problem still occurs, in which case we can look at it more closely.