
Copy.copyFileToPod() breaks on WebSocket

Open · imilos opened this issue 3 years ago • 11 comments

Describe the bug

A trivial example of copying a file larger than ~17 MB to a pod using Copy.copyFileToPod breaks with:

Exception in thread "main" java.io.IOException: WebSocket has closed.
	at io.kubernetes.client.util.WebSocketStreamHandler$WebSocketOutputStream.write(WebSocketStreamHandler.java:270)
	at org.apache.commons.codec.binary.BaseNCodecOutputStream.flush(BaseNCodecOutputStream.java:124)
	at org.apache.commons.codec.binary.BaseNCodecOutputStream.write(BaseNCodecOutputStream.java:177)
	at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream$BufferAtATimeOutputChannel.write(FixedLengthBlockOutputStream.java:244)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.writeBlock(FixedLengthBlockOutputStream.java:92)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.maybeFlush(FixedLengthBlockOutputStream.java:86)
	at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.write(FixedLengthBlockOutputStream.java:122)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:454)
	at io.kubernetes.client.util.Streams.copy(Streams.java:28)
	at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:374)
	at io.kubernetes.client.Copy.copyFileToPod(Copy.java:350)
	at rs.kg.ac.k8sclient.K8sclient.main(K8sclient.java:35)
	Suppressed: java.io.IOException: This archive contains unclosed entries.
		at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.finish(TarArchiveOutputStream.java:289)
		at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.close(TarArchiveOutputStream.java:307)
		at io.kubernetes.client.Copy.copyFileToPodAsync(Copy.java:378)
		... 2 more

The code is trivial:

import io.kubernetes.client.Copy;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.Configuration;
import io.kubernetes.client.util.Config;
import io.kubernetes.client.util.exception.CopyNotSupportedException;
import java.io.IOException;
import java.nio.file.Paths;

public class K8sclient {
    public static void main(String[] args)
            throws IOException, ApiException, InterruptedException, CopyNotSupportedException {
        String podName = "oip-robotics-blackfox-5c4f57dc87-k29wk";
        String namespace = "default";
        ApiClient client = Config.defaultClient();
        Configuration.setDefaultApiClient(client);
        Copy copy = new Copy();
        copy.copyFileToPod(namespace, podName, null, Paths.get("Inputs-Djerdap-500k/300kscenarija.csv"), Paths.get("/tmp/300kscenarija.csv"));
        System.out.println("Done!");
    }
}

Even with small files, it hangs waiting for the process to complete, as reported in #1822.

Client Version: 15.0.1

Kubernetes Version: 1.22.9

Java Version: Java 8

Server (please complete the following information):

  • OS: CentOS 7.8 Linux
  • Environment: MicroK8s cluster of 7 physical nodes

imilos · Jul 13 '22 13:07

I'll try to reproduce this, but I suspect that it may be a buffer filling up somewhere. We use PipedInput/OutputStreams in a variety of places in this code and I suspect that those buffers may be filling and blocking.
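As a tiny illustration of that failure mode (demo code, not from the client library): a PipedOutputStream blocks as soon as the connected PipedInputStream's buffer fills and nobody reads.

    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    public class PipeStallDemo {
        public static void main(String[] args) throws Exception {
            PipedInputStream in = new PipedInputStream(); // default buffer is 1024 bytes
            PipedOutputStream out = new PipedOutputStream(in);
            out.write(new byte[2048]); // no reader: blocks once the 1024-byte buffer is full
            System.out.println("never reached without a consumer thread");
        }
    }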

Can you confirm that kubectl cp ... works correctly in the same environment?

Have you tried Copy.copyFileToPodAsync(...)? Does that work correctly?
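For reference, a minimal sketch of the async call (the Future-based return type and blocking get() are my reading of the v15 sources, not verified):

    // Same copy as above, but via the async variant. Assumes copyFileToPodAsync
    // returns a java.util.concurrent.Future<Integer>.
    Copy copy = new Copy();
    Future<Integer> upload = copy.copyFileToPodAsync(namespace, podName, null,
            Paths.get("Inputs-Djerdap-500k/300kscenarija.csv"),
            Paths.get("/tmp/300kscenarija.csv"));
    Integer status = upload.get(); // blocks until the upload completes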

brendandburns · Jul 13 '22 21:07

Thanks for the quick reply @brendandburns. I can confirm that kubectl cp... works perfectly and that Copy.copyFileToPodAsync(...) throws exactly the same error as Copy.copyFileToPod(...).

imilos · Jul 14 '22 06:07

Thanks @imilos. What is the network environment between client and server? Is it a direct network connection?

brendandburns · Jul 14 '22 16:07

@brendandburns I can confirm that the network connection is direct. I also tried running the client on the machine which belongs to the cluster using https://127.0.0.1:16443. Identical behavior.

imilos · Jul 15 '22 15:07

Thanks for the info. I will try to replicate on my home cluster.

brendandburns · Jul 15 '22 22:07

I just noticed that there was a similar conversation here for the C# client:

https://github.com/kubernetes-client/csharp/issues/949#issuecomment-1190019328

Looks like we need to add chunking for the WebSocket connection. If you are willing to recompile the client library, you could try that; otherwise I will get to it eventually.
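A rough sketch of what such chunking could look like (illustrative only: ChunkedSender, maxFrameBytes, and the leading stream-id byte are my names and assumptions, not the library's API):

    import java.io.IOException;
    import okhttp3.WebSocket;
    import okio.ByteString;

    final class ChunkedSender {
        // Send data over an okhttp WebSocket in bounded slices so that no
        // single frame exceeds maxFrameBytes.
        static void send(WebSocket socket, byte streamId, byte[] data, int maxFrameBytes)
                throws IOException {
            int written = 0;
            while (written < data.length) {
                int slice = Math.min(maxFrameBytes, data.length - written);
                byte[] frame = new byte[slice + 1];
                frame[0] = streamId; // channel id byte, mirroring the handler's framing
                System.arraycopy(data, written, frame, 1, slice);
                if (!socket.send(ByteString.of(frame))) { // false once the socket is closed or overflowing
                    throw new IOException("WebSocket has closed.");
                }
                written += slice;
            }
        }
    }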

brendandburns · Jul 20 '22 14:07

Dear @brendandburns, a colleague from my team raised the C# client issue :). Glad there is a solution. I would be happy to build and test the client, but I'm currently on vacation with limited internet access; I can do it after the 1st of August.

imilos · Jul 22 '22 18:07

The Kubernetes API server uses the default payload limit from golang.org/x/net/websocket:

DefaultMaxPayloadBytes = 32 << 20 // 32MB

I believe the chunk size must be smaller than that.
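In other words (sketch; the safety margin is illustrative, the 32 << 20 comes from the constant quoted above):

    // Any client-side frame must stay below the server's maximum payload.
    static final int SERVER_MAX_PAYLOAD_BYTES = 32 << 20;            // 32 MiB
    static final int SAFE_CHUNK_BYTES = SERVER_MAX_PAYLOAD_BYTES / 2; // illustrative margin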

tg123 · Jul 31 '22 21:07

Dear @brendandburns , may I try if it works? I am available from now on.

imilos · Aug 09 '22 12:08

I can confirm that this is happening to me too. Is there a solution for this in sight? I might rewrite my project in Go; would that solve my issue?

IdoWinter · Aug 17 '22 16:08

Hi, I think I've found the root of this issue. It is around the following piece of code:

https://github.com/kubernetes-client/java/blob/6d39601da33cc64ee20a5d6e7500451b3278a6e7/util/src/main/java/io/kubernetes/client/util/WebSocketStreamHandler.java#L263-L274

My hypothesis: even though the data is buffered to avoid flooding the WebSocket queue, we are still reaching the maximum queue size. This happens when the connection is slow, or when the remote process consumes data slowly, so the queue is not drained as fast as we write. Despite the buffered writes, our loop immediately adds more data, and once draining lags behind, WebSocketStreamHandler.this.socket.send(ByteString.of(buffer)) returns false due to the following piece of code in okhttp3.internal.ws.RealWebSocket (MAX_QUEUE_SIZE there is 16 MiB, which would explain the ~17 MB threshold reported above):

...
    // If this frame overflows the buffer, reject it and close the web socket.
    if (queueSize + data.size > MAX_QUEUE_SIZE) {
      close(CLOSE_CLIENT_GOING_AWAY, null)
      return false
    }
...

The proposal is to add some code to handle this case, something like an active wait:

...
        System.arraycopy(b, offset + bytesWritten, buffer, 1, bufferSize);
        ByteString byteString = ByteString.of(buffer);

        // Poll until okhttp's outbound queue has room for this frame, instead
        // of letting send() tear down the socket on overflow.
        while (WebSocketStreamHandler.this.socket.queueSize() + byteString.size() > MAX_QUEUE_SIZE) {
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            throw new RuntimeException(e);
          }
        }

        if (!WebSocketStreamHandler.this.socket.send(byteString)) {
          throw new IOException("WebSocket has closed.");
        }
...
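
A variation on the same idea with a bounded wait, so that a connection which never drains fails fast instead of spinning forever (sketch; the timeout and poll interval are illustrative):

        long waitedMs = 0;
        final long timeoutMs = 30_000; // illustrative upper bound
        while (WebSocketStreamHandler.this.socket.queueSize() + byteString.size() > MAX_QUEUE_SIZE) {
          if (waitedMs >= timeoutMs) {
            throw new IOException("Timed out waiting for the WebSocket send queue to drain.");
          }
          try {
            Thread.sleep(100);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IOException("Interrupted while waiting for the send queue.", e);
          }
          waitedMs += 100;
        }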

WDYT?

dani8art · Sep 22 '22 13:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 21 '22 14:12

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jan 20 '23 15:01

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Feb 19 '23 15:02

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Feb 19 '23 15:02