Copy.copyFileToPod() breaks on WebSocket
Describe the bug
A trivial example of copying a file larger than ~17MB to a pod using Copy.copyFileToPod breaks with:
Exception in thread "main" java.io.IOException: WebSocket has closed.
at io.kubernetes.client.util.WebSocketStreamHandler$WebSocketOutputStream.write(WebSocketStreamHandler.java:270)
at org.apache.commons.codec.binary.BaseNCodecOutputStream.flush(BaseNCodecOutputStream.java:124)
at org.apache.commons.codec.binary.BaseNCodecOutputStream.write(BaseNCodecOutputStream.java:177)
at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
at org.apache.commons.compress.utils.FixedLengthBlockOutputStream$BufferAtATimeOutputChannel.write(FixedLengthBlockOutputStream.java:244)
at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.writeBlock(FixedLengthBlockOutputStream.java:92)
at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.maybeFlush(FixedLengthBlockOutputStream.java:86)
at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.write(FixedLengthBlockOutputStream.java:122)
at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:454)
at io.kubernetes.client.util.Streams.copy(Streams.java:28)
at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:374)
at io.kubernetes.client.Copy.copyFileToPod(Copy.java:350)
at rs.kg.ac.k8sclient.K8sclient.main(K8sclient.java:35)
Suppressed: java.io.IOException: This archive contains unclosed entries.
at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.finish(TarArchiveOutputStream.java:289)
at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.close(TarArchiveOutputStream.java:307)
at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:378)
... 2 more
The code is trivial:
...
public static void main(String[] args)
throws IOException, ApiException, InterruptedException, CopyNotSupportedException {
String podName = "oip-robotics-blackfox-5c4f57dc87-k29wk";
String namespace = "default";
ApiClient client = Config.defaultClient();
Configuration.setDefaultApiClient(client);
Copy copy = new Copy();
copy.copyFileToPod(namespace, podName, null, Paths.get("Inputs-Djerdap-500k/300kscenarija.csv"), Paths.get("/tmp/300kscenarija.csv"));
System.out.println("Done!");
}
Even with small files, it hangs waiting for the process to complete, as reported in #1822.
Client Version
15.0.1
Kubernetes Version
1.22.9
Java Version
Java 8
Server (please complete the following information):
- OS: CentOS 7.8 Linux
- Environment: MicroK8s cluster of 7 physical nodes
I'll try to reproduce this, but I suspect that it may be a buffer filling up somewhere. We use PipedInput/OutputStreams in a variety of places in this code, and those buffers may be filling and blocking.
Can you confirm that kubectl cp ... works correctly in the same environment?
Have you tried Copy.copyFileToPodAsync(...)? Does that work correctly?
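For reference, calling the async variant would look roughly like this. This is only a sketch: it assumes copyFileToPodAsync takes the same arguments as copyFileToPod and returns a Future<Integer> (check the exact signature for your client version), and the pod/file names are placeholders.

import io.kubernetes.client.Copy;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.Configuration;
import io.kubernetes.client.util.Config;
import java.nio.file.Paths;
import java.util.concurrent.Future;

public class AsyncCopyExample {
  public static void main(String[] args) throws Exception {
    ApiClient client = Config.defaultClient();
    Configuration.setDefaultApiClient(client);
    Copy copy = new Copy();
    // Assumed signature: same arguments as copyFileToPod; a null container
    // selects the pod's only container, and the returned Future completes
    // when the copy finishes.
    Future<Integer> done =
        copy.copyFileToPodAsync(
            "default", "my-pod", null,
            Paths.get("local-file.csv"), Paths.get("/tmp/remote-file.csv"));
    done.get();
    System.out.println("Done!");
  }
}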
Thanks for the quick reply @brendandburns. I can confirm that kubectl cp... works perfectly and that Copy.copyFileToPodAsync(...) throws exactly the same error as Copy.copyFileToPod(...).
Thanks @imilos. What is the network environment between the client and the server? Is it a direct network connection?
@brendandburns I can confirm that the network connection is direct. I also tried running the client on the machine which belongs to the cluster using https://127.0.0.1:16443. Identical behavior.
Thanks for the info. I will try to replicate on my home cluster.
I just noticed that there was a similar conversation here for the C# client:
https://github.com/kubernetes-client/csharp/issues/949#issuecomment-1190019328
Looks like we need to add chunking for the WebSocket connection. If you are willing to recompile the client library you could try that; otherwise I will get to it eventually.
Dear @brendandburns, a colleague from my team raised the CSharp client issue :). Glad there is a solution. I would be happy to build and test the client, but I'm currently on vacation with limited internet access; I can do it after the 1st of August.
The Kubernetes implementation uses the defaults of golang.org/x/net/websocket, which has
DefaultMaxPayloadBytes = 32 << 20 // 32MB
I believe the chunk size must be smaller than that.
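To make the chunking idea concrete, here is a minimal sketch (not the actual patch): slice the payload into frames well below the 32MB limit before handing them to okhttp. CHUNK_SIZE and the webSocket parameter are placeholders for this example, and the real client would also have to prepend the stream/channel byte to each frame, which this sketch omits.

import java.io.IOException;
import okhttp3.WebSocket;
import okio.ByteString;

class ChunkedSender {
  // Well below the 32MB default maximum payload mentioned above.
  static final int CHUNK_SIZE = 4 * 1024 * 1024;

  // Send `data` as several frames of at most CHUNK_SIZE bytes each,
  // so that no single WebSocket frame exceeds the server's limit.
  static void sendChunked(WebSocket webSocket, byte[] data) throws IOException {
    for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
      int length = Math.min(CHUNK_SIZE, data.length - offset);
      if (!webSocket.send(ByteString.of(data, offset, length))) {
        throw new IOException("WebSocket has closed.");
      }
    }
  }
}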
Dear @brendandburns, may I try whether it works? I am available from now on.
I can confirm that this is happening to me too. Is there a solution for this in sight? I might rewrite my project in Go; would that solve my issue?
Hi, I think I've found the root of this issue. It is around the following piece of code:
https://github.com/kubernetes-client/java/blob/6d39601da33cc64ee20a5d6e7500451b3278a6e7/util/src/main/java/io/kubernetes/client/util/WebSocketStreamHandler.java#L263-L274
My hypothesis is that, despite the data being buffered to avoid flooding the WebSocket queue, we are still reaching the maximum queue size. This happens when the connection is slow or the remote process consumes the data slowly, so the queue is not drained as fast as we write. Even though we write in buffered chunks, our loop immediately adds more data, and when there is some latency while the WebSocket queue drains, we get the error because WebSocketStreamHandler.this.socket.send(ByteString.of(buffer)) returns false, due to the following piece of code in okhttp3.internal.ws.RealWebSocket:
...
// If this frame overflows the buffer, reject it and close the web socket.
if (queueSize + data.size > MAX_QUEUE_SIZE) {
close(CLOSE_CLIENT_GOING_AWAY, null)
return false
}
...
The proposal is to add some code to handle this case, something like an active wait:
...
System.arraycopy(b, offset + bytesWritten, buffer, 1, bufferSize); // buffer[0] holds the channel byte
ByteString byteString = ByteString.of(buffer);
// Wait until okhttp's outgoing queue has room for this frame, so that send()
// does not fail and close the socket when MAX_QUEUE_SIZE would be exceeded.
while (WebSocketStreamHandler.this.socket.queueSize() + byteString.size() > MAX_QUEUE_SIZE) {
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {
    throw new RuntimeException(e);
  }
}
if (!WebSocketStreamHandler.this.socket.send(byteString)) {
  throw new IOException("WebSocket has closed.");
}
...
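One caveat with the sketch above: the loop waits forever if the queue never drains (for example, if the connection has already dropped), so in practice a bounded wait, or a check that the socket is still open, would probably be needed as well.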
WDYT?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.