
Idle timeout expired

Open SamuelBoerlin opened this issue 5 months ago • 5 comments

Version

5.2.0

Question

We very occasionally run into errors like these during large and long-running backups and restores (~50min.):

Backup:

[74449] Runtime IO Exception (client left?) RC = 500 : java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

Restore:

[line: 79482794, col: 22] Bad input stream [java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms]

Unfortunately the logs above are the only ones we have on this problem right now, as it happens very sporadically.

For backups e.g. we use curl like so:

curl --fail --silent --show-error -X GET -H "Accept: application/trig" -u "user:password" "http://db:3030/repo" > "backup.trig"

Curiously, despite the above error, curl exited successfully with exit code 0, which makes it seem like the connection was actually closed gracefully rather than dropped unexpectedly. The backup was of course incomplete/cut off.
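A rough way to at least catch a truncated TriG dump after the fact (assuming the riot command-line tool from the Jena distribution is available) is a parse check; a dump cut off mid-statement fails to parse, though a cut that happens to land exactly on a statement boundary would still pass:

riot --validate backup.trig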

What is the meaning of this "Idle timeout expired" exception?

Is the timeout configurable somewhere?

Thanks!

SamuelBoerlin avatar Jul 24 '25 13:07 SamuelBoerlin

So you are getting TimeoutException on download (backup) and upload (restore) cases?

java.util.concurrent.TimeoutException isn't from Jena itself.

"(client left?)" is Jena - it is part of a general catch for all kinds of network errors and they are commonly because the connection to the the client stopped responding (did not close the connection neatly) but that is not what is happening here. teh server used to print the stacktrace but clients leaving is something that happens in normal use and the logs fill up.

The project has encountered a situation that might be related, but only in the test suite and only when running as a GitHub Action.

One or two tests occasionally fail on some kind of low-level timeout in the JVM networking code. We don't know why - a guess is that the machine they run on is a heavily loaded, shared server, and the host VM has very long pauses due to kernel scheduling. In the past, we've seen quite long pauses (30s+) in scheduling.

In this situation there may be a compounding effect. application/trig is trying to output "pretty" TriG by running the Turtle pretty printer on each graph. That can be expensive in CPU, and during that preparation the network isn't being used. Combined with pauses, that might push something over the edge.

Unfortunately, there isn't a way to ask for one of the forms of TriG that is less costly. (Maybe Jena should use one for this case of dataset TriG output.)

Suggestion: use N-Quads for backup. It is streaming and efficient, and it does not have a potentially long network pause.
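For example, a sketch based on your curl command above, just swapping the Accept header to the N-Quads MIME type:

curl --fail --silent --show-error -X GET -H "Accept: application/n-quads" -u "user:password" "http://db:3030/repo" > "backup.nq"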

Questions:

  1. What environment is this running in? A cloud provider?
  2. Is there a proxy or gateway between curl and the Fuseki server?

There are some comments on Jetty issues about failed connections interfering with connection management, but they only look "maybe related" to me.

afs avatar Jul 24 '25 20:07 afs

Thanks for your detailed response!

Yes, there has been at least one case where it happened during a backup (download) and also during a restore (upload).

We run Fuseki in a Docker container on an Ubuntu 24.04 LTS VM (VMware). The backup/restore scripts and curl run in a different container than Fuseki, but on the same VM, so there's only the Docker overlay network in between.

I'll definitely look into n-quads, thanks for the suggestion.

Would info level logs possibly be of any use regarding this problem? We currently only log warning and above.

SamuelBoerlin avatar Jul 28 '25 15:07 SamuelBoerlin

Would info level logs possibly be of any use regarding this problem?

Not Jena logs; maybe java.net-level logging, but those logs will be quite large.

The other recurring failure seen in the tests is when HTTP requests return with zero bytes (which is never legal - there is always at least the status line). I'm guessing that this is somehow related, because this kind of error seems to occur under the same conditions as "Idle timeout expired" (busy times on GitHub Actions).

The curl example suggests the problem is server-side or networking - until now we haven't had evidence whether it is the server side, the environment, or the Jena client-side code sending the request. But the "exit 0" from curl is strange, although these large fetches are done "streaming" style - the HTTP response does not include a Content-Length because the size is not known when the response starts.
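If you want to confirm what the response looks like on the wire, something like the following (it discards the body and prints the received response headers, though it still streams the whole dump) should show Transfer-Encoding: chunked and no Content-Length:

curl --silent --show-error -o /dev/null -D - -H "Accept: application/trig" -u "user:password" "http://db:3030/repo"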

afs avatar Aug 03 '25 14:08 afs

The curl example suggests the problem is server-side or networking - until now we haven't had evidence whether it is the server side, the environment, or the Jena client-side code sending the request. But the "exit 0" from curl is strange, although these large fetches are done "streaming" style - the HTTP response does not include a Content-Length because the size is not known when the response starts.

I tried reproducing this behaviour with a simple chunk-encoded Python HTTP server. As you say, since the HTTP response does not include a response size header, the client cannot know whether the response was actually complete or not. However, if the HTTP server (or, I expect, the network connection) is killed outright during the transfer, then curl fails as expected, because the connection is not closed gracefully. So I think we can also rule out a network issue: since curl exited with 0, the connection must have been closed gracefully.
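The reproduction was along these lines (a minimal sketch, not the exact script; the server stops sending data partway through but still terminates the chunked stream and closes the socket cleanly, so curl exits 0 even though the body is cut short):

import socket

HOST, PORT = "127.0.0.1", 8080

def chunk(data: bytes) -> bytes:
    # HTTP/1.1 chunked encoding: hex length, CRLF, payload, CRLF
    return f"{len(data):X}\r\n".encode() + data + b"\r\n"

with socket.create_server((HOST, PORT)) as srv:
    conn, _ = srv.accept()
    conn.recv(65536)  # read and ignore the request
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/trig\r\n"
        b"Transfer-Encoding: chunked\r\n\r\n"
    )
    for i in range(5):
        conn.sendall(chunk(f"# chunk {i}\n".encode()))
    # Simulated mid-transfer failure: end the chunked stream early and
    # close the connection gracefully; curl still exits with 0.
    conn.sendall(b"0\r\n\r\n")
    conn.close()

Running the curl command from my first post against 127.0.0.1:8080 then gives exit code 0 despite the incomplete output; killing the server mid-chunk instead makes curl fail.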

SamuelBoerlin avatar Aug 04 '25 07:08 SamuelBoerlin

Good point. Another point against a network cause is that a web search for anything similar does not turn up anything.

afs avatar Aug 04 '25 07:08 afs