
c/image/docker errors don’t include JSON from the "detail" field, which Quay uses for user-intended error messages

Open dmick opened this issue 4 years ago • 17 comments

/kind bug

Description

I'm having problems with image push to Quay because of, among other things, https://github.com/containers/buildah.io/issues/93, but this really applies to any sort of failure during remote transactions:

podman will report the basic error, such as "manifest invalid", but does not report the 'detail' information as documented here:

https://github.com/opencontainers/distribution-spec/blob/main/spec.md#error-codes

For example, podman's report for a particular failure against a test server (set up so the HTTP traffic could be snooped) is, with or without --log-level=debug:

Error: error copying image to the remote destination: Error writing manifest <manifest text here>: Error uploading manifest 6.7.4-x86_64 to smithi014:80/ceph/ceph-grafana: manifest invalid: manifest invalid

but the server's http response is

HTTP/1.1 400 BAD REQUEST
Server: nginx/1.12.1
Date: Fri, 21 May 2021 00:19:03 GMT
Content-Type: application/json
Content-Length: 197
Connection: close

{"errors":[{"code":"MANIFEST_INVALID","detail":{"message":"Missing referenced parent image sha256:300e315adb2f96afe5f0b2780b87f28ae95231fe3bdd1e16b9ba606307728f55"},"message":"manifest invalid"}]}

That detail message would have saved me weeks of scrabbling around for further information about what was wrong with the manifest.
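
For illustration, here is a minimal Go sketch (type and variable names are invented for this example; it is not podman's or c/image's actual code) of parsing a spec-shaped error body like the one above and printing both "message" and "detail":

package main

import (
	"encoding/json"
	"fmt"
)

// registryError mirrors the error shape in the distribution spec; "detail"
// may be arbitrary JSON, so it is kept as a raw message here.
type registryError struct {
	Code    string          `json:"code"`
	Message string          `json:"message"`
	Detail  json.RawMessage `json:"detail,omitempty"`
}

type registryErrorBody struct {
	Errors []registryError `json:"errors"`
}

func main() {
	body := []byte(`{"errors":[{"code":"MANIFEST_INVALID","detail":{"message":"Missing referenced parent image sha256:300e315adb2f96afe5f0b2780b87f28ae95231fe3bdd1e16b9ba606307728f55"},"message":"manifest invalid"}]}`)

	var parsed registryErrorBody
	if err := json.Unmarshal(body, &parsed); err != nil {
		fmt.Println("not a spec-shaped error body:", err)
		return
	}
	for _, e := range parsed.Errors {
		fmt.Printf("%s: %s", e.Code, e.Message)
		if len(e.Detail) > 0 {
			// Printing the raw detail would already surface the hint above.
			fmt.Printf(" (detail: %s)", e.Detail)
		}
		fmt.Println()
	}
}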

Steps to reproduce the issue:

  1. podman push a manifest the server flags as invalid (other server-flagged error conditions likely behave the same, but I haven't checked)

  2. observe reported errors

Describe the results you received:

Only the error structure's "message" field is reported: "manifest invalid"

Describe the results you expected:

"message" and "detail.message", to diagnose the problem from the client without any other visibility into the server (no ability to snoop https, no ability to see server logs).

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

This has happened with 1.6, 2.2.1, and currently with 3.1.2

# podman --version
podman version 3.1.2

Output of podman info --debug:

# podman info --debug
host:
  arch: amd64
  buildahVersion: 1.20.1
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.28-1.el8.2.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.28, commit: '
  cpus: 8
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: journald
  hostname: smithi014
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-193.28.1.el8_2.x86_64
  linkmode: dynamic
  memFree: 15532503040
  memTotal: 33443176448
  ociRuntime:
    name: crun
    package: crun-0.19.1-1.el8.2.1.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.19.1
      commit: 1535fedf0b83fb898d449f9680000f729ba719f5
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    selinuxEnabled: true
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 98h 42m 15.12s (Approximately 4.08 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 8
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.1.2
  Built: 1619185402
  BuiltTime: Fri Apr 23 13:43:22 2021
  GitCommit: ""
  GoVersion: go1.14.12
  OsArch: linux/amd64
  Version: 3.1.2

Package info (e.g. output of rpm -q podman or apt list podman):

# rpm -q podman
podman-3.1.2-1.el8.2.1.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

This issue occurs on pushes to quay.io, quay.ceph.io (local to the Ceph project), and my test instance on a local host (smithi014 above).

dmick avatar May 21 '21 23:05 dmick

@vrothberg Looks like something for Libimage?

mheon avatar May 23 '21 04:05 mheon

Thanks for reaching out! I moved the issue to containers/image

vrothberg avatar May 25 '21 08:05 vrothberg

From a very quick look, this does look correct per the linked spec: message is the specified field, detail.message is not.

OTOH the specs are in practice only relevant to a point; we should look into what common implementations (both client and server) do in practice.

mtrmac avatar May 25 '21 10:05 mtrmac

Not sure I understand; both fields are in the spec, and detail is clearly meant for humans to read to understand the problem. Are you saying the spec somehow prohibits printing the detail message? That seems unlikely and not very useful.

dmick avatar May 25 '21 19:05 dmick

The detail field is OPTIONAL and MAY contain arbitrary JSON data providing information the client can use to resolve the issue.

“Arbitrary JSON data” could just as well be a base64-encoded JPEG image, a binary backtrace, something protobuf-encoded, or anything at all; without further specification, it’s unclear how to display it. If the server wants to provide a human-readable message, message is the specified way to do that, and I can’t see a good reason for it not to do so and to put that data in an undocumented field instead.

Again, what happens in practice is more important, at this point I’m just noting what the spec says.

mtrmac avatar May 26 '21 10:05 mtrmac

  1. the spec also says "A 4XX response code from the registry MAY return a body in any format.", meaning there may not even be a body, and if there is, even the top-level body may not be JSON

  2. the spec also says "The message field is OPTIONAL, and if present, it SHOULD be a human readable string", thus does not guarantee the presence or format of the currently-printed message string

  3. if detail is JSON, it's printable, and it's hard to see how printing it can do harm, and indeed it can only be helpful at providing more information; it may range from not useful (empty, or echoing message) to very useful (as above).

  4. even if the server did report binary data (which, given all the hints and suggestions above and the common practice of using printable data for interchange in client-server apps, is pretty unlikely), printing a possibly-binary string escaped is a standard thing to do, and it might still give the client a hint about how to resolve the issue; see the sketch after this list.
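
A minimal sketch of points 3 and 4 (the function name is invented; this is not proposing a specific c/image API): valid JSON detail is re-encoded compactly, and anything else is printed Go-quoted so even binary data stays readable:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// renderDetail turns an arbitrary "detail" payload into something printable.
func renderDetail(raw []byte) string {
	if json.Valid(raw) {
		var buf bytes.Buffer
		if err := json.Compact(&buf, raw); err == nil {
			return buf.String()
		}
	}
	// Fall back to an escaped string, so even binary data prints safely.
	return fmt.Sprintf("%q", raw)
}

func main() {
	fmt.Println(renderDetail([]byte(`{ "message": "Missing referenced parent image" }`)))
	fmt.Println(renderDetail([]byte{0xff, 0xfe, 'h', 'i'}))
}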

dmick avatar May 26 '21 19:05 dmick

I've been exploring the podman error handling code and have discovered some things. Is anyone looking at this issue?

dmick avatar Jun 10 '21 22:06 dmick

Feel free to pick it :) What are your findings?

vrothberg avatar Jun 11 '21 08:06 vrothberg

It seems like the errors package, with its Wrap() functionality, is used throughout the various libraries, and calling Cause() on the passed-back error would allow unwrapping down to the original error (formed from a json.Unmarshal in github.com/docker/distribution/registry/client/errors.parseHTTPErrorResponse). However, a user-defined type, types.ManifestTypeRejectedError, is used as a 'sentinel' error to collect several different underlying errors and compare against for special behavior. The top-level code wants to use Unwrap/Cause to get to the original error, but since ManifestTypeRejectedError doesn't define Cause(), Cause() can't do the full unwrap back to the original error. (The original error contains the Detail field, as in the spec, which holds the information critical to the user.)

So this seems to be a mismatch between two styles of error processing: wrapping/unwrapping and a user-defined sentinel error type.

I'm currently thinking that errors.Is() (or errors.As(), since this is a type rather than a value) would be a better way to detect the ManifestTypeRejectedError, allowing that type to still have a Cause() and participate in the unwrapping, but I'm still reading the actual source of errors.Is() to be sure.
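
To make the problem concrete, here is a self-contained sketch of the pattern (the types are invented stand-ins, not the real c/image ones): a typed wrapper that implements Unwrap() can still be detected with errors.As while leaving the underlying error, and its Detail, reachable:

package main

import (
	"errors"
	"fmt"
)

// detailedError stands in for the error built from the registry's JSON body.
type detailedError struct {
	Message string
	Detail  string
}

func (e *detailedError) Error() string { return e.Message }

// manifestRejectedError stands in for a wrapper like ManifestTypeRejectedError.
// Because it implements Unwrap, errors.As/Is can keep walking the chain.
type manifestRejectedError struct {
	Err error
}

func (e *manifestRejectedError) Error() string { return "manifest rejected: " + e.Err.Error() }
func (e *manifestRejectedError) Unwrap() error { return e.Err }

func main() {
	orig := &detailedError{Message: "manifest invalid", Detail: "Missing referenced parent image"}
	wrapped := fmt.Errorf("uploading manifest: %w", &manifestRejectedError{Err: orig})

	// The sentinel-style type check still works...
	var rejected *manifestRejectedError
	fmt.Println("is manifest-rejected:", errors.As(wrapped, &rejected))

	// ...and the detail is still reachable through the chain.
	var de *detailedError
	if errors.As(wrapped, &de) {
		fmt.Println("detail:", de.Detail)
	}
}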

dmick avatar Jun 11 '21 09:06 dmick

Sure, there is a ManifestTypeRejectedError sentinel type, but the original report was about the contents of the error message, and whether it uses the message vs. the detail.message field. No amount of Unwrap/Cause calls can change which field is present in a pre-existing text; if anything, it would just remove some of the data.

mtrmac avatar Jun 12 '21 01:06 mtrmac

(The original error contains the Detail field as in the spec that has the information critical to the user.)

The top-level Podman caller really has no business special-casing that; that protocol-level knowledge should live deep in c/image/docker. IOW, the message vs. detail.message vs. … decision should happen before the error is wrapped in ManifestTypeRejectedError, whatever that decision or heuristic ends up being.

mtrmac avatar Jun 12 '21 01:06 mtrmac

I wasn't very clear. The original idea is to retrieve the error info sent back from the repo server, if present, yes. To do that, unwrapping the nested errors with Cause() would get to the original error, where it could be type-asserted and Detail found and printed.

However, in at least one case a sentinel error stops the unwrapping, preventing that strategy. I conflated ways of working around that with the original problem of accessing the ultimate error.

That aside, as for layering: yes, agreed. I'm not at all clear yet about what division of responsibility is appropriate, but I agree that special-casing at the top is not right at all, and I haven't actually proposed a fix as a result. I was just noting what I'd found so far and the problems that need solving somehow; I've also been reading a lot about different techniques for handling errors in Go 1.x, as well as proposals for 2.x, and the area seems somewhat in flux.

Ideally the detail information would be generically available, but I haven't yet observed, in podman or elsewhere, an obvious convention for "a simple string representation of an error" and "extra information" (other than stack traces or per-layer messages). It could be I'm just missing something obvious so far.

dmick avatar Jun 12 '21 01:06 dmick

I should also note that I have two fairly-separate purposes in my investigations so far: one is to simply hack, as expediently as possible, some way to see the detailed info, so that I can continue to accomplish what I need to accomplish with podman and repos that caused discovery of this issue. The second is to get familiar enough with the code and conventions to try to offer naive suggestions toward some sort of resolution. I don't imagine myself enough of an expert in this code, or even Go, to be able to propose a well-fleshed-out fix, at least not any time soon.

dmick avatar Jun 12 '21 01:06 dmick

For a quick hack, I’d just edit something directly in vendor/github.com/docker/distribution/registry/api/errcode; that would not be an acceptable PR, though.

(Note to self: looking at docker/distribution aka distribution/distribution now, almost everything uses ErrorCode.WithDetail(originalGoError), uses of ErrorCode.WithMessage() are quite rare. Both the presence of Detail and its lack of use (nothing ever uses or shows it) seem consistent enough to be intentional. ??)
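
A stand-in sketch of that server-side pattern (the types here are invented to mimic it, not the real errcode package): attaching a Go error's text as the detail produces exactly the kind of body shown at the top of this issue, while a client that prints only Message never sees it:

package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// apiError mimics the wire shape from the spec; Detail can hold anything
// that marshals to JSON, including an error's text.
type apiError struct {
	Code    string      `json:"code"`
	Message string      `json:"message"`
	Detail  interface{} `json:"detail,omitempty"`
}

func main() {
	cause := errors.New("Missing referenced parent image sha256:300e315adb2f96afe5f0b2780b87f28ae95231fe3bdd1e16b9ba606307728f55")

	body := struct {
		Errors []apiError `json:"errors"`
	}{
		Errors: []apiError{{
			Code:    "MANIFEST_INVALID",
			Message: "manifest invalid",
			// The interesting information rides along only in Detail.
			Detail: map[string]string{"message": cause.Error()},
		}},
	}

	// What goes over the wire:
	_ = json.NewEncoder(os.Stdout).Encode(body)

	// What a client that prints only Message reports:
	fmt.Println(body.Errors[0].Message)
}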

mtrmac avatar Jun 12 '21 13:06 mtrmac

Proposing a fix to docker/distribution that would allow a small fix to the image package to resolve this issue: https://github.com/distribution/distribution/pull/3447

dmick avatar Jun 30 '21 05:06 dmick

Well, the distribution/distribution team was unenamored of the solution. Trying Yet Another Approach. Sure wouldn't mind some help.

dmick avatar Jul 02 '21 02:07 dmick

I’m open to some cleanly integrated heuristic of that kind, in principle (somewhere around docker/errors.go:registryHTTPResponseToError). I’m afraid I can’t spend the time to collect the relevant data right now, but I do think it’s worth tracking.
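
Sketching the kind of heuristic meant here (the function name and placement are hypothetical, not c/image's actual code): if the arbitrary-JSON detail happens to be an object with a non-empty string "message", append it to the reported text; otherwise leave the error alone:

package main

import (
	"encoding/json"
	"fmt"
)

// augmentWithDetail appends detail.message to msg when the detail payload is
// a JSON object containing a non-empty "message" string that isn't a duplicate.
func augmentWithDetail(msg string, detail json.RawMessage) string {
	var d struct {
		Message string `json:"message"`
	}
	if err := json.Unmarshal(detail, &d); err == nil && d.Message != "" && d.Message != msg {
		return fmt.Sprintf("%s: %s", msg, d.Message)
	}
	return msg
}

func main() {
	detail := json.RawMessage(`{"message":"Missing referenced parent image sha256:300e315adb2f96afe5f0b2780b87f28ae95231fe3bdd1e16b9ba606307728f55"}`)
	fmt.Println(augmentWithDetail("manifest invalid", detail))
}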

(BTW is there an upstream Quay.io bug report to fix the specified message field in this case? That won’t help existing server deployments, of course, but it would avoid the issue going forward.)

mtrmac avatar Jul 02 '21 13:07 mtrmac