c/image/docker errors don’t include JSON from the "detail" field, which Quay uses for user-facing error messages
/kind bug
Description
I'm having problems pushing images to Quay because of, among other things, https://github.com/containers/buildah.io/issues/93, but this really applies to any sort of failure during remote transactions:
Podman will report the basic error, such as "manifest invalid", but does not report the 'detail' information as documented here:
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#error-codes
For example, Podman's report for a particular failure against a test server (used so the HTTP traffic could be snooped) is the following, with or without --log-level=debug:
Error: error copying image to the remote destination: Error writing manifest <manifest text here>: Error uploading manifest 6.7.4-x86_64 to smithi014:80/ceph/ceph-grafana: manifest invalid: manifest invalid
but the server's HTTP response is:
HTTP/1.1 400 BAD REQUEST
Server: nginx/1.12.1
Date: Fri, 21 May 2021 00:19:03 GMT
Content-Type: application/json
Content-Length: 197
Connection: close
{"errors":[{"code":"MANIFEST_INVALID","detail":{"message":"Missing referenced parent image sha256:300e315adb2f96afe5f0b2780b87f28ae95231fe3bdd1e16b9ba606307728f55"},"message":"manifest invalid"}]}
That detail message would have saved me weeks of scrabbling around for further information about what was wrong with the manifest.
Steps to reproduce the issue:
- podman push a manifest with errors (and likely other server-flagged error conditions; I don't know)
- observe the reported errors
Describe the results you received:
Only the error structure's "message" field: "manifest invalid".
Describe the results you expected:
"message" and "detail.message", to diagnose the problem from the client without any other visibility into the server (no ability to snoop https, no ability to see server logs).
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
This has happened with 1.6, 2.2.1, and currently with 3.1.2
# podman --version
podman version 3.1.2
Output of podman info --debug:
# podman info --debug
host:
  arch: amd64
  buildahVersion: 1.20.1
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.28-1.el8.2.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.28, commit: '
  cpus: 8
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: journald
  hostname: smithi014
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-193.28.1.el8_2.x86_64
  linkmode: dynamic
  memFree: 15532503040
  memTotal: 33443176448
  ociRuntime:
    name: crun
    package: crun-0.19.1-1.el8.2.1.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.19.1
      commit: 1535fedf0b83fb898d449f9680000f729ba719f5
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    selinuxEnabled: true
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 98h 42m 15.12s (Approximately 4.08 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 8
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.1.2
  Built: 1619185402
  BuiltTime: Fri Apr 23 13:43:22 2021
  GitCommit: ""
  GoVersion: go1.14.12
  OsArch: linux/amd64
  Version: 3.1.2
Package info (e.g. output of rpm -q podman or apt list podman):
# rpm -q podman
podman-3.1.2-1.el8.2.1.x86_64
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
This issue occurs on pushes to quay.io, quay.ceph.io (local to the Ceph project), and my test instance on a local host (smithi014 above).
@vrothberg Looks like something for Libimage?
Thanks for reaching out! I moved the issue to containers/image
From a very quick look, this does look correct per the linked spec: message is the specified field, details.message is not.
OTOH the specs are in practice only relevant to a point; we should look into what common implementations (both client and server) do in practice.
Not sure I understand; both fields are in the spec, and clearly detail is meant for humans to read to understand the problem. Are you saying the spec somehow prohibits printing the detail message? That seems unlikely and unhelpful.
The detail field is OPTIONAL and MAY contain arbitrary JSON data providing information the client can use to resolve the issue.
“Arbitrary JSON data”: this could just as well contain a base64-encoded JPEG image, a binary backtrace, something protobuf-encoded, or anything at all; without further specification, it's unclear how to display it. If the server wants to provide a human-readable message, message is the specified way to do that, and I can't see a good reason for it not to do that and to put that data in an undocumented field instead.
Again, what happens in practice is more important, at this point I’m just noting what the spec says.
- The spec also says "A 4XX response code from the registry MAY return a body in any format.", meaning there may not even be a body, and if there is one, even the top-level body may not be JSON.
- The spec also says "The message field is OPTIONAL, and if present, it SHOULD be a human readable string", so it does not guarantee the presence or format of the currently-printed message string either.
- If detail is JSON, it's printable, and it's hard to see how printing it could do harm; it can only help by providing more information, ranging from not useful (empty, or echoing message) to very useful (as above).
- Even if the server did report binary data (which, given all the hints and suggestions above, and the common practice of using printable data for interchange in client-server apps, is pretty low-probability), printing a possibly-binary string with escaping is a pretty standard thing to do, and might still provide a hint to the client about how to resolve the issue.
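As a sketch of that last point, a hypothetical renderDetail helper (not part of any existing library) could pass valid JSON through compacted and fall back to Go-style escaping for anything else, so even binary garbage prints safely:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strconv"
)

// renderDetail turns the spec's "arbitrary JSON data" detail field into a
// single printable line: valid JSON is compacted and printed as-is; anything
// else (including binary data) is quoted with escape sequences.
func renderDetail(detail []byte) string {
	var buf bytes.Buffer
	if json.Compact(&buf, detail) == nil {
		return buf.String()
	}
	// Not valid JSON: strconv.Quote escapes non-printable and
	// invalid-UTF-8 bytes as \xNN sequences.
	return strconv.Quote(string(detail))
}

func main() {
	fmt.Println(renderDetail([]byte(`{"message": "Missing referenced parent image"}`)))
	fmt.Println(renderDetail([]byte{0x00, 0xff, 'h', 'i'}))
}
```

Either way the output is a single safe line, which is the "printing escaped binary can't hurt" argument in code form.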
I've been exploring the podman error handling code and have discovered some things. Is anyone looking at this issue?
Feel free to pick it :) What are your findings?
It seems the errors package, with its Wrap() functionality, is used throughout the various libraries, and calling Cause() on the passed-back error would allow unwrapping down to the original error (formed by a json.Unmarshal in github.com/docker/distribution/registry/client/errors.parseHTTPErrorResponse). However, a user-defined type, types.ManifestTypeRejectedError, is used as a 'sentinel' error to collect several different underlying errors and compare against for special behavior. The top-level code wants to use Unwrap/Cause to get to the original error, but since ManifestTypeRejectedError doesn't define Cause(), Cause() can't unwrap all the way back to the original error. (The original error contains the Detail field from the spec, which has the information critical to the user.)
So this is a sort of mixture of error-processing styles: a mismatch between wrapping/unwrapping and a user-defined sentinel error type.
I'm currently thinking that errors.Is() would be a better way to detect ManifestTypeRejectedError, allowing that type to still define Cause() and participate in unwrapping, but I'm still reading the actual source of errors.Is() to be sure.
Sure, there is a ManifestTypeRejectedError sentinel type, but the original report was about the contents of the error message, and whether it uses the message vs. the details.message field. No amount of Unwrap/Cause calls can change which field is present in a pre-existing text; if anything, it would just remove some of the data.
(The original error contains the Detail field as in the spec that has the information critical to the user.)
The top-level Podman caller has really no business special-casing that; that protocol-level knowledge should exist deep in c/image/docker. IOW the message vs. details.message vs. … decision should happen before the error is wrapped in ManifestTypeRejectedError, whatever the decision or heuristic ends up being.
I wasn't very clear. The original idea is to retrieve the error info sent back from the repo server, if present, yes. To do that, unwrapping the nested errors with Cause() would get to the original error, where it could be type-asserted and Detail found and printed.
However, in at least one case, a sentinel error stops the unwrapping, preventing that strategy. I conflated ways of dealing with that with the original problem of accessing the ultimate error.
That aside, as for layering: yes, agreed. I'm not at all clear yet about what division of responsibility is appropriate, but I agree that special-casing at the top is not right at all, which is why I haven't actually proposed a fix yet. I was just noting what I'd found so far and the problems that need solving somehow. I've also been reading a lot about different techniques for handling errors in Go 1.x, as well as proposals for 2.x; the area seems to be somewhat in flux.
Ideally the detail information would be generically available, but I haven't yet observed, in podman or elsewhere, an obvious convention for "a simple string representation of an error" and "extra information" (other than stack traces or per-layer messages). It could be I'm just missing something obvious so far.
I should also note that I have two fairly-separate purposes in my investigations so far: one is to simply hack, as expediently as possible, some way to see the detailed info, so that I can continue to accomplish what I need to accomplish with podman and repos that caused discovery of this issue. The second is to get familiar enough with the code and conventions to try to offer naive suggestions toward some sort of resolution. I don't imagine myself enough of an expert in this code, or even Go, to be able to propose a well-fleshed-out fix, at least not any time soon.
For a quick hack, I’d just edit something directly in vendor/github.com/docker/distribution/registry/api/errcode; that would not be an acceptable PR, though.
(Note to self: looking at docker/distribution aka distribution/distribution now, almost everything uses ErrorCode.WithDetail(originalGoError), uses of ErrorCode.WithMessage() are quite rare. Both the presence of Detail and its lack of use (nothing ever uses or shows it) seem consistent enough to be intentional. ??)
Proposing a fix to docker/distribution that would allow a small fix to the image package to resolve this issue: https://github.com/distribution/distribution/pull/3447
Well, the distribution/distribution team was unenamored of the solution. Trying Yet Another Approach. Sure wouldn't mind some help.
I’m open to some cleanly integrated heuristics of that kind, in principle (somewhere around docker/errors.go:registryHTTPResponseToError), I’m afraid I can’t spend the time to collect the relevant data right now, but I do think it’s worth tracking.
(BTW is there an upstream Quay.io bug report to fix the specified message field in this case? That won’t help existing server deployments, of course, but it would avoid the issue going forward.)