
Copy.copyFileToPod() breaks on WebSocket

Open · imilos opened this issue 3 years ago • 11 comments

Describe the bug

A trivial example of copying a file larger than ~17 MB to a pod using Copy.copyFileToPod breaks with:

Exception in thread "main" java.io.IOException: WebSocket has closed.
	at io.kubernetes.client.util.WebSocketStreamHandler$WebSocketOutputStream.write(WebSocketStreamHandler.java:270)
	at org.apache.commons.codec.binary.BaseNCodecOutputStream.flush(BaseNCodecOutputStream.java:124)
	at org.apache.commons.codec.binary.BaseNCodecOutputStream.write(BaseNCodecOutputStream.java:177)
	at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream$BufferAtATimeOutputChannel.write(FixedLengthBlockOutputStream.java:244)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.writeBlock(FixedLengthBlockOutputStream.java:92)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.maybeFlush(FixedLengthBlockOutputStream.java:86)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.write(FixedLengthBlockOutputStream.java:122)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:454)
	at io.kubernetes.client.util.Streams.copy(Streams.java:28)
	at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:374)
	at io.kubernetes.client.Copy.copyFileToPod(Copy.java:350)
	at rs.kg.ac.k8sclient.K8sclient.main(K8sclient.java:35)
	Suppressed: java.io.IOException: This archive contains unclosed entries.
		at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.finish(TarArchiveOutputStream.java:289)
		at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.close(TarArchiveOutputStream.java:307)
		at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:378)
		... 2 more

The code is trivial:

import io.kubernetes.client.Copy;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.Configuration;
import io.kubernetes.client.util.Config;
import io.kubernetes.client.util.exception.CopyNotSupportedException;
import java.io.IOException;
import java.nio.file.Paths;

public class K8sclient {
    public static void main(String[] args)
            throws IOException, ApiException, InterruptedException, CopyNotSupportedException {
        String podName = "oip-robotics-blackfox-5c4f57dc87-k29wk";
        String namespace = "default";
        ApiClient client = Config.defaultClient();
        Configuration.setDefaultApiClient(client);
        Copy copy = new Copy();
        copy.copyFileToPod(namespace, podName, null, Paths.get("Inputs-Djerdap-500k/300kscenarija.csv"), Paths.get("/tmp/300kscenarija.csv"));
        System.out.println("Done!");
    }
}

Even with small files, it hangs waiting for the process to complete, as reported in #1822.

Client Version: 15.0.1

Kubernetes Version: 1.22.9

Java Version: Java 8

Server (please complete the following information):

  • OS: CentOS 7.8 Linux
  • Environment: MicroK8s cluster of 7 physical nodes

imilos · Jul 13 '22 13:07

I'll try to reproduce this, but I suspect that it may be a buffer filling up somewhere. We use PipedInput/OutputStreams in a variety of places in this code and I suspect that those buffers may be filling and blocking.
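As a tiny illustration of that failure mode (demo code, not from the client library): a PipedOutputStream blocks as soon as the connected PipedInputStream's buffer fills and nobody reads.

    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    public class PipeStallDemo {
        public static void main(String[] args) throws Exception {
            PipedInputStream in = new PipedInputStream(); // default buffer is 1024 bytes
            PipedOutputStream out = new PipedOutputStream(in);
            out.write(new byte[2048]); // no reader: blocks once the 1024-byte buffer is full
            System.out.println("never reached without a consumer thread");
        }
    }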

Can you confirm that kubectl cp ... works correctly in the same environment?

Have you tried Copy.copyFileToPodAsync(...)? Does that work correctly?
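For reference, a minimal sketch of the async call (the Future-based return type and blocking get() are my reading of the v15 sources, not verified):

    // Same copy as above, but via the async variant. Assumes copyFileToPodAsync
    // returns a java.util.concurrent.Future<Integer>.
    Copy copy = new Copy();
    Future<Integer> upload = copy.copyFileToPodAsync(namespace, podName, null,
            Paths.get("Inputs-Djerdap-500k/300kscenarija.csv"),
            Paths.get("/tmp/300kscenarija.csv"));
    Integer status = upload.get(); // blocks until the upload completes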

brendandburns · Jul 13 '22 21:07

Thanks for the quick reply @brendandburns. I can confirm that kubectl cp... works perfectly and that Copy.copyFileToPodAsync(...) throws exactly the same error as Copy.copyFileToPod(...).

imilos · Jul 14 '22 06:07

Thanks @imilos. What is the network environment between client and server? Is it a direct network connection?

brendandburns · Jul 14 '22 16:07

@brendandburns I can confirm that the network connection is direct. I also tried running the client on the machine which belongs to the cluster using https://127.0.0.1:16443. Identical behavior.

imilos · Jul 15 '22 15:07

Thanks for the info. I will try to replicate on my home cluster.

brendandburns · Jul 15 '22 22:07

I just noticed that there was a similar conversation here for the C# client:

https://github.com/kubernetes-client/csharp/issues/949#issuecomment-1190019328

Looks like we need to add chunking for the WebSocket connection. If you are willing to recompile the client library, you could try that; otherwise I will get to it eventually.
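A rough sketch of what such chunking could look like (illustrative only: ChunkedSender, maxFrameBytes, and the leading stream-id byte are my names and assumptions, not the library's API):

    import java.io.IOException;
    import okhttp3.WebSocket;
    import okio.ByteString;

    final class ChunkedSender {
        // Send data over an okhttp WebSocket in bounded slices so that no
        // single frame exceeds maxFrameBytes.
        static void send(WebSocket socket, byte streamId, byte[] data, int maxFrameBytes)
                throws IOException {
            int written = 0;
            while (written < data.length) {
                int slice = Math.min(maxFrameBytes, data.length - written);
                byte[] frame = new byte[slice + 1];
                frame[0] = streamId; // channel id byte, mirroring the handler's framing
                System.arraycopy(data, written, frame, 1, slice);
                if (!socket.send(ByteString.of(frame))) { // false once the socket is closed or overflowing
                    throw new IOException("WebSocket has closed.");
                }
                written += slice;
            }
        }
    }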

brendandburns · Jul 20 '22 14:07

Dear @brendandburns, a colleague from my team raised the C# client issue :). Glad there is a solution. I would be happy to build and test the client, but I'm currently on vacation with limited internet access; I can do it after the 1st of August.

imilos · Jul 22 '22 18:07

The Kubernetes API server uses the default payload limit from golang.org/x/net/websocket:

DefaultMaxPayloadBytes = 32 << 20 // 32MB

I believe the chunk size must be smaller than that.
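In other words (sketch; the safety margin is illustrative, the 32 << 20 comes from the constant quoted above):

    // Any client-side frame must stay below the server's maximum payload.
    static final int SERVER_MAX_PAYLOAD_BYTES = 32 << 20;            // 32 MiB
    static final int SAFE_CHUNK_BYTES = SERVER_MAX_PAYLOAD_BYTES / 2; // illustrative margin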

tg123 · Jul 31 '22 21:07

Dear @brendandburns , may I try if it works? I am available from now on.

imilos · Aug 09 '22 12:08

I can confirm that this is happening to me too. Is there a solution for this in sight? I might rewrite my project in Go; would that solve my issue?

IdoWinter · Aug 17 '22 16:08

Hi, I think I've found the root of this issue. It is around the following piece of code:

https://github.com/kubernetes-client/java/blob/6d39601da33cc64ee20a5d6e7500451b3278a6e7/util/src/main/java/io/kubernetes/client/util/WebSocketStreamHandler.java#L263-L274

My hypothesis: even though the data is buffered to avoid flooding the WebSocket queue, we are still reaching the maximum queue size. This happens when the connection is slow, or when the remote process consumes data slowly, so the queue is not drained as fast as we write. Despite the buffered writes, our loop immediately adds more data, and once draining lags behind, WebSocketStreamHandler.this.socket.send(ByteString.of(buffer)) returns false due to the following piece of code in okhttp3.internal.ws.RealWebSocket (MAX_QUEUE_SIZE there is 16 MiB, which would explain the ~17 MB threshold reported above):

...
    // If this frame overflows the buffer, reject it and close the web socket.
    if (queueSize + data.size > MAX_QUEUE_SIZE) {
      close(CLOSE_CLIENT_GOING_AWAY, null)
      return false
    }
...

The proposal is to add some code to handle this case, something like an active wait:

...
        System.arraycopy(b, offset + bytesWritten, buffer, 1, bufferSize);
        ByteString byteString = ByteString.of(buffer);

        // Poll until okhttp's outbound queue has room for this frame, instead
        // of letting send() tear down the socket on overflow.
        while (WebSocketStreamHandler.this.socket.queueSize() + byteString.size() > MAX_QUEUE_SIZE) {
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            throw new RuntimeException(e);
          }
        }

        if (!WebSocketStreamHandler.this.socket.send(byteString)) {
          throw new IOException("WebSocket has closed.");
        }
...
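
A variation on the same idea with a bounded wait, so that a connection which never drains fails fast instead of spinning forever (sketch; the timeout and poll interval are illustrative):

        long waitedMs = 0;
        final long timeoutMs = 30_000; // illustrative upper bound
        while (WebSocketStreamHandler.this.socket.queueSize() + byteString.size() > MAX_QUEUE_SIZE) {
          if (waitedMs >= timeoutMs) {
            throw new IOException("Timed out waiting for the WebSocket send queue to drain.");
          }
          try {
            Thread.sleep(100);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IOException("Interrupted while waiting for the send queue.", e);
          }
          waitedMs += 100;
        }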

WDYT?

dani8art · Sep 22 '22 13:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 21 '22 14:12

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jan 20 '23 15:01

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Feb 19 '23 15:02

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Feb 19 '23 15:02