
`Transfer-Encoding: chunked` responses fail

Open lbjarre opened this issue 3 years ago • 14 comments

Hello,

I have been working on a project using the oras client (v2) against a repository hosted on ECR. We are storing an artifact with quite a lot of individual items (431 layers on the final manifest), which seems to cause ECR to chunk the response:

# REPO contains the API of the target repository, private in this case
❯ curl -X GET -I --user "AWS:$(aws ecr get-login-password)" "${REPO}/manifests/latest"
HTTP/1.1 200 OK
Content-Type: application/vnd.oci.image.manifest.v1+json
Docker-Distribution-Api-Version: registry/2.0
Sizes:
Date: Fri, 23 Sep 2022 09:14:00 GMT
Transfer-Encoding: chunked

The library currently checks net/http.Response.ContentLength explicitly when validating the response; for these chunked responses it is -1, so the call errors out with GET "${REPO}/manifests/latest": unknown response Content-Length, coming from registry/remote/repository.go.
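
To illustrate the failure mode, here is a minimal sketch (not the actual oras-go code) of a Content-Length-based check: Go's net/http sets Response.ContentLength to -1 when the length is unknown, which is exactly what a chunked response produces.

package main

import (
	"fmt"
	"log"
	"net/http"
)

// verifySize sketches a Content-Length-based validation step. It is not the
// oras-go implementation; it only shows why a chunked response
// (ContentLength == -1 in net/http) trips this kind of check.
func verifySize(resp *http.Response, expected int64) error {
	switch {
	case resp.ContentLength < 0:
		// Transfer-Encoding: chunked (or any other unknown length) lands here.
		return fmt.Errorf("%s %q: unknown response Content-Length",
			resp.Request.Method, resp.Request.URL)
	case resp.ContentLength != expected:
		return fmt.Errorf("content length mismatch: got %d, want %d",
			resp.ContentLength, expected)
	}
	return nil
}

func main() {
	// Hypothetical manifest endpoint, used only to show the call shape.
	resp, err := http.Get("https://registry.example.com/v2/repo/manifests/latest")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(verifySize(resp, 456))
}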

I tried to replicate the behavior with the open source registry image, but even with a manifest with >10000 layers it responds non-chunked, so this might unfortunately be a bit annoying to replicate fully locally.
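
For what it's worth, the symptom can be simulated locally without ECR: a small Go test server that flushes mid-write forces net/http to drop Content-Length and switch to Transfer-Encoding: chunked (a rough simulation of the response shape, not of ECR's actual behavior).

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Flushing before the handler returns prevents net/http from computing a
	// Content-Length, so the HTTP/1.1 response is sent with
	// Transfer-Encoding: chunked.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/vnd.oci.image.manifest.v1+json")
		w.Write([]byte(`{"schemaVersion":2,`))
		w.(http.Flusher).Flush()
		w.Write([]byte(`"layers":[]}`))
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// ContentLength is -1 and TransferEncoding reports "chunked": the same
	// response shape that trips the client-side check.
	fmt.Println(resp.ContentLength, resp.TransferEncoding)
}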

lbjarre avatar Sep 23 '22 09:09 lbjarre

@nima Could you take a look since it is an ECR issue?

shizhMSFT avatar Sep 27 '22 12:09 shizhMSFT

Yes I can take a look at this; please assign it to me.

nima avatar Sep 27 '22 17:09 nima

I've started working on this as of today and will post some updates in the coming week.

nima avatar Oct 06 '22 02:10 nima

Testing suite is complete. Working on a repro now. The largest number of layers I was able to create via a Dockerfile was ~128; beyond that it hits a limit.

docker image inspect 080565210187.dkr.ecr.local:5000/utnubu:multilayer | jq '.[0].RootFS.Layers|length'
124

This unfortunately doesn't reproduce the issue...

curl -X GET -I --user AWS:${AWS_PWD} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/multilayer
HTTP/2 200
content-type: application/vnd.docker.distribution.manifest.v2+json
docker-distribution-api-version: registry/2.0
sizes:
date: Sun, 16 Oct 2022 05:52:42 GMT

Comparing this to a basic image, though, the returned headers have changed, which I don't have an explanation for.

% curl -k -X GET -I --user AWS:${AWS_PWD} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/oras
HTTP/2 200
content-type: application/vnd.oci.image.manifest.v1+json
docker-distribution-api-version: registry/2.0
sizes:
content-length: 456
date: Sun, 16 Oct 2022 05:46:47 GMT

nima avatar Oct 16 '22 05:10 nima

I'm hitting the same issue on ECR - this is what I get from curl:

HTTP/1.1 200 OK
Content-Type: application/vnd.docker.distribution.manifest.v2+json
Docker-Distribution-Api-Version: registry/2.0
Sizes: 
Date: Sun, 16 Oct 2022 07:58:57 GMT
Transfer-Encoding: chunked

yuvalturg avatar Oct 16 '22 08:10 yuvalturg

Testing suite is complete. Working on a repro now. The largest number of layers I was able to create via a Dockerfile was ~128; beyond that it hits a limit.

FYI, we did not create the artifact via Dockerfiles; we used the oras SDK directly.

lbjarre avatar Oct 16 '22 09:10 lbjarre

Re the imposed layer limit of 127: https://github.com/docker/docs/issues/8230#issuecomment-468630187

@lbjarre do you have a quick recipe to generate the artifact via the oras SDK that I can use? Or, alternatively, could you provide such an image that I can use?

nima avatar Oct 16 '22 17:10 nima

@lbjarre do you have a quick recipe to generate the artifact via the oras SDK that I can use? Or, alternatively, could you provide such an image that I can use?

package main

import (
	"bytes"
	"context"
	"fmt"
	"log"

	v1 "github.com/opencontainers/image-spec/specs-go/v1"
	"oras.land/oras-go/v2"
	"oras.land/oras-go/v2/content"
	"oras.land/oras-go/v2/content/memory"
	"oras.land/oras-go/v2/registry/remote"
)

func main() {
	ctx := context.Background()

	mem := memory.New()
	manifest, err := makeArtifact(ctx, mem, 200)
	check(err)

	tag := "large-artifact"
	err = mem.Tag(ctx, manifest, tag)
	check(err)

	// TODO: setup auth to ECR.
	repo, err := remote.NewRepository("")
	check(err)

	_, err = oras.Copy(ctx, mem, tag, repo, "", oras.DefaultCopyOptions)
	check(err)
}

// makeArtifact pushes nLayers small blobs to target and packs them into a
// single manifest, returning the manifest descriptor.
func makeArtifact(ctx context.Context, target content.Pusher, nLayers int) (v1.Descriptor, error) {
	var layers []v1.Descriptor
	for i := 0; i < nLayers; i++ {
		data := []byte(fmt.Sprintf("layer %d", i))
		desc := content.NewDescriptorFromBytes("application/text", data)
		if err := target.Push(ctx, desc, bytes.NewReader(data)); err != nil {
			return v1.Descriptor{}, err
		}
		layers = append(layers, desc)
	}

	return oras.Pack(ctx, target, layers, oras.PackOptions{})
}

func check(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

This is roughly what I used for my testing setup.

lbjarre avatar Oct 17 '22 08:10 lbjarre

Thanks @lbjarre; I've adapted this for our environment. I've set the number of layers to 400, and here are the results...

% cd ~/Development/github/oras && go build multilayer.go && mv multilayer /tmp/ && cd - && /tmp/multilayer
~/Development/brazil/DoodleContainers/src/DoodleContainers
Creating layer 10 of 400...
Creating layer 20 of 400...
Creating layer 30 of 400...
...
Creating layer 390 of 400...
Creating layer 400 of 400...
&{200 OK 200 HTTP/2.0 2 0 map[Content-Type:[application/vnd.oci.image.manifest.v1+json] Date:[Thu, 20 Oct 2022 05:14:33 GMT] Docker-Distribution-Api-Version:[registry/2.0] Sizes:[]] {0xc000112600} -1 [] false false map[] 0xc00052a200 0xc0001d4420}
200
%

With the 400-layer manifest pushed, now trying the repro...

% curl -I -X GET --basic --user "AWS:${AWS_SECRET}" "https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/fatty"
HTTP/2 200
content-type: application/vnd.oci.image.manifest.v1+json
docker-distribution-api-version: registry/2.0
sizes:
date: Thu, 20 Oct 2022 05:16:50 GMT

That is, I'm still unable to reproduce what you've reported here. Anything else you can think of that I might be missing?

Here's some validation for the 400-layer object existing in the registry...

% skopeo inspect --raw docker://080565210187.dkr.ecr.local:5000/utnubu:fatty|jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.unknown.config.v1+json",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2
  },
  "layers": [
    {
      "mediaType": "application/text",
      "digest": "sha256:3ef3bd6c80bd7c673acc76fde924749be5cb8c7553fd17c2c3979f37a754f75b",
      "size": 7
    },
    {
      "mediaType": "application/text",
      "digest": "sha256:54db133a109fd7f0d6eb72da16df1af078f2bf86e917ae3c780a13d811d6aa6f",
      "size": 7
    },
    {
      "mediaType": "application/text",
      "digest": "sha256:0c6eea24ed5f3274df121af2d789e43839b3bcfb328db8deb86b46896d57be67",
      "size": 7
    },
    ...
    {
      "mediaType": "application/text",
      "digest": "sha256:b2f744a156137ae4a5bec0b9e91f91e3250b2492ff206606af9c967d2e393187",
      "size": 9
    },
    {
      "mediaType": "application/text",
      "digest": "sha256:1da268e8e05ab78ebc086c2f6f4ac28cb73d4a621aa1f12f5612ea32c3d77b4f",
      "size": 9
    },
    {
      "mediaType": "application/text",
      "digest": "sha256:c4c794002aded41f9abc6f592f8b6d40f113fb684311708a144a83c1fd911727",
      "size": 9
    }
  ]
}
%
% skopeo inspect --raw docker://080565210187.dkr.ecr.local:5000/utnubu:fatty|jq '.layers|length'
400
%

nima avatar Oct 20 '22 05:10 nima

Trying this again with the originally reported 431 layers, in case that's the tipping point. I tried 500, but that was rejected for exceeding the maximum allowed.

nima avatar Oct 20 '22 18:10 nima

It reproduces for me with a large (4.5 GB) image and not that many layers @nima

yuvalturg avatar Oct 20 '22 18:10 yuvalturg

I'm a bit confused about how the manifest response would become chunked when the data it references (and not the manifest itself) grows in size. In your repro @yuvalturg, do you know whether the response is not chunked when the layers are tiny?

nima avatar Oct 20 '22 21:10 nima

Hmmm, seems like it does make a difference.

Here's a 4-layer, 32B/layer image...

% curl --http1.1 -X GET -I -u AWS:${password} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/l4s32
HTTP/1.1 200 OK
Connection: close
Content-Type: application/vnd.oci.image.manifest.v1+json
Docker-Distribution-Api-Version: registry/2.0
Sizes:
Date: Sat, 22 Oct 2022 01:39:33 GMT
Content-Length: 751

Here's a 43-layer, 1KiB/layer image...

% curl --http1.1 -X GET -I -u AWS:${password} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/l43s1024
HTTP/1.1 200 OK
Connection: close
Content-Type: application/vnd.oci.image.manifest.v1+json
Docker-Distribution-Api-Version: registry/2.0
Sizes:
Date: Sat, 22 Oct 2022 01:39:57 GMT
Transfer-Encoding: chunked

Note how the Content-Length drops off, and in comes the Transfer-Encoding: chunked.
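
On possible client-side handling (purely a sketch of one mitigation idea, not the fix that was ultimately made on the ECR side): a client could avoid depending on Content-Length entirely by reading the manifest through a bounded reader and deriving the size from the bytes themselves.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readBounded reads a response body without relying on Content-Length,
// enforcing an upper limit so an unknown-length (chunked) response cannot
// exhaust memory. Sketch only; not the oras-go implementation.
func readBounded(resp *http.Response, maxBytes int64) ([]byte, error) {
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBytes+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > maxBytes {
		return nil, fmt.Errorf("response exceeds %d bytes", maxBytes)
	}
	return body, nil
}

func main() {
	// Fake chunked-style response: ContentLength is -1, so only the body
	// itself can tell us the size.
	resp := &http.Response{
		ContentLength: -1,
		Body:          io.NopCloser(strings.NewReader(`{"schemaVersion":2,"layers":[]}`)),
	}
	body, err := readBounded(resp, 4<<20)
	fmt.Println(len(body), err)
}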

As an aside, the reason the previous repro attempt did not work is that HTTP/2 does not support HTTP/1.1's chunked transfer encoding:

Note: HTTP/2 doesn’t support HTTP 1.1's chunked transfer encoding mechanism, as it provides its own, more efficient, mechanisms for data streaming.
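
The same constraint matters on the Go side: the default client negotiates HTTP/2 over TLS, so to observe the HTTP/1.1 chunked behavior from a Go program you can disable HTTP/2 on the transport, e.g. by setting a non-nil, empty TLSNextProto map (a sketch; the registry URL and auth are placeholders).

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// A non-nil, empty TLSNextProto map disables the HTTP/2 upgrade in
	// net/http, so requests to an HTTPS registry go out as HTTP/1.1 and any
	// chunked transfer encoding from the server becomes observable.
	client := &http.Client{
		Transport: &http.Transport{
			TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
		},
	}

	// Placeholder registry URL; substitute the real manifest endpoint and auth.
	resp, err := client.Get("https://registry.example.com/v2/repo/manifests/latest")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println(resp.Proto, resp.ContentLength, resp.TransferEncoding)
}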

nima avatar Oct 22 '22 01:10 nima

@VishrutPatel's fix addresses the issue.

Re-running the same two tests from before, now with the fix in place. For the first one, l4s32, we expect no change:

% curl --http1.1 -X GET -I -u AWS:${password} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/l4s32
HTTP/1.1 200 OK
Connection: close
Content-Length: 751
Content-Type: application/vnd.oci.image.manifest.v1+json
Docker-Distribution-Api-Version: registry/2.0
Sizes:
Date: Sat, 22 Oct 2022 02:05:34 GMT

But with l43s1024 we do expect a change: the reappearance of Content-Length and the consequent disappearance of Transfer-Encoding:

% curl --http1.1 -X GET -I -u AWS:${password} https://080565210187.dkr.ecr.local:5000/v2/utnubu/manifests/l43s1024
HTTP/1.1 200 OK
Connection: close
Content-Length: 5747
Content-Type: application/vnd.oci.image.manifest.v1+json
Docker-Distribution-Api-Version: registry/2.0
Sizes:
Date: Sat, 22 Oct 2022 02:05:42 GMT

FYI, the root-cause explanation (why this happens for higher image layer counts and/or layer sizes) is courtesy of @VishrutPatel. Thanks!

nima avatar Oct 22 '22 02:10 nima

Testing of the fix is in progress; no issues so far. This should be resolved by mid-week and then deployed worldwide.

nima avatar Oct 25 '22 00:10 nima

Update: Fix is being deployed WW; will communicate here when that's complete.

nima avatar Oct 31 '22 21:10 nima

The WW deployment is currently sitting at 44%; I'll update this again when we get closer to (or reach) 100%.

nima avatar Nov 09 '22 01:11 nima

[INFO] Today we are at ~ 58% coverage.

nima avatar Nov 10 '22 18:11 nima

[INFO] Okay, now at 69% WW coverage. Note that we will (for unrelated reasons) cease further deployments for the remainder of the week. I'll re-enable the deployment on Monday 11/14.

nima avatar Nov 10 '22 19:11 nima

[INFO] Deployments will commence again tomorrow.

nima avatar Nov 15 '22 03:11 nima

[INFO] Deployments are complete; @lbjarre can you confirm if the issue is resolved for you?

nima avatar Nov 17 '22 03:11 nima

I actually don't have the same testing setup anymore; we solved our issue by reducing the manifest size. However, if your test cases are working, then I'm comfortable calling this resolved. Thanks for the help!

lbjarre avatar Nov 17 '22 07:11 lbjarre