
Registry degrades and reports corruption

Open Henkis opened this issue 10 years ago • 39 comments

We are using a privately hosted registry on Amazon which seems to degrade after we have pulled larger images (1 GB+) from it a few times; eventually it stops responding. A restart of the registry container seems to fix the problem for a while:

df29cb2b16e8: Download complete
8673c2be5440: Download complete
efb1cc4f166f: Error downloading dependent layers
2014/08/26 23:09:14 Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/server, flate: corrupt input before offset 13576289

And again:

Pulling repository 256.31.14.183/server
a44fa854766a: Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/adserver, archive/tar: invalid tar header 1/, archive/tar: invalid tar header
5e66087f3ffe: Error downloading dependent layers
2014/08/27 06:03:14 Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/server, archive/tar: invalid tar header

We often see these messages in the log but not always:

<Greenlet at 0x7f8bbb7269b0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f8bbb2be550>>('/tmp/tmpPPltfO', 1, 13495293, 26990585)> failed with timeout

We currently use the latest registry (registry 0.8.0 cd3581c06bdc) but have had the same problems earlier.

Registry config:

Dockerfile:

FROM registry
ADD ./config.yml /docker-registry/config/config.yml

config.yml:

common:
    loglevel: info
    secret_key: <OURLITTLESECRET>


# This is the default configuration when no flavor is specified
dev:
    storage: local
    storage_path: /tmp/registry
    loglevel: debug

# To specify another flavor, set the environment variable SETTINGS_FLAVOR
# $ export SETTINGS_FLAVOR=prod
prod:
    storage: s3
    storage_path: /prod
    loglevel: debug
    # Amazon S3 Storage Configuration
    s3_access_key: <OURLITTLESECRET>
    s3_secret_key: <OURLITTLESECRET>
    s3_bucket: arkivdigital-docker-registry
    boto_bucket: ourrepo-registry
    # Enabling LRU cache for small files. This speeds up read/write on small files
    # when using a remote storage backend (like S3).
    cache:
        host: docker-registryblabla.cache.amazonaws.com
        port: 6379
        db: 0
    cache_lru:
        host: docker-registryblabla.cache.amazonaws.com
        port: 6379
        db: 0

    standalone: true
    disable_token_auth: true

# This flavor is automatically used by unit tests
test:
    storage: local
    storage_path: /tmp/test

Henkis avatar Aug 27 '14 06:08 Henkis

A little more from the log:

Wed, 27 Aug 2014 06:03:14 GMT
/docker-registry/prod/images/6e66087f3ffe002664507d225d07b6929843c3f0299f5335a70c1727c8833737/layer
2014-08-27 06:03:14,189 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:7gRFO3tyNcihZCHvmLPGn9WWVFI=
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
    boto_key.get_contents_to_file(f, headers={'Range': brange})
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
    response_headers=response_headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
    query_args=None)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
    for bytes in self:
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
    data = self.resp.read(self.BufferSize)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
    return httplib.HTTPResponse.read(self, amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
    self._wait(self._read_event)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
    self.hub.wait(watcher)
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
    result = waiter.get()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
    return self.hub.switch()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
    return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f8bbb7269b0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f8bbb2be550>>('/tmp/tmpPPltfO', 1, 13495293, 26990585)> failed with timeout

Henkis avatar Aug 27 '14 06:08 Henkis

After a few more pulls it stops responding and nothing is written to the container log.

Client:

2014/08/27 15:53:22 Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/server, Failed to download json: Get http://256.31.14.183/v1/images/31f7549bbb3e5ffba93ed7a37ae6cc86fafe92cfd343eca00a788a39c1e57023/json: read tcp 256.31.14.183:80: i/o timeout
root@ip-172-31-7-6:~# ./test.sh
2014/08/27 16:07:14 Error: Invalid Registry endpoint: Get http://256.31.14.183/v1/_ping: read tcp 256.31.14.183:80: i/o timeout

Henkis avatar Aug 27 '14 16:08 Henkis

This tells me you are having issues reaching your S3 bucket. Can you look into details on that front? AWS region?

@shin- what do you think?

dmp42 avatar Aug 27 '14 19:08 dmp42

If it's S3 related, I will deploy a boto.cfg with the region set and debug enabled. Initially a restart of the registry container helped, so I did not blame S3 connectivity, but I'm not entirely sure any longer.

It's currently not possible to set S3 region in the registry config, correct?

Henkis avatar Aug 27 '14 20:08 Henkis

Right - it's not possible if you are running it inside a docker container (possibly a gevent version issue). Keep me posted.

dmp42 avatar Aug 27 '14 20:08 dmp42

I've hard-wired boto against eu-west-1 but I am still seeing the same issues. I noticed in the log today that worker threads reported problems inside the container:

2014-09-01 20:27:14 [108] [WARNING] Worker graceful timeout (pid:108)
2014-09-01 20:27:14,172 WARNING: Worker graceful timeout (pid:108)
2014-09-01 20:27:15 [108] [INFO] Worker exiting (pid: 108)

If it's S3 related, any ideas on how to proceed with debugging? I've seen quite a few 404s in the logs, but they are always for an _inprogress file; is that expected?

Another error with current debug:

Client:

b848adeb3e09: Error downloading dependent layers
2014/09/01 22:58:37 Error pulling image (fb469e5e3584041a51bdd35175776033a829e6b6) from 172.31.9.215/adserver, flate: corrupt input before offset 13576377

Server:

Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,804 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,822 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Method: HEAD
2014-09-01 20:58:25,822 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Data:
2014-09-01 20:58:25,822 DEBUG: Headers: {}
2014-09-01 20:58:25,822 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,823 DEBUG: Port: 80
2014-09-01 20:58:25,823 DEBUG: Params: {}
2014-09-01 20:58:25,823 DEBUG: Token: None
2014-09-01 20:58:25,823 DEBUG: StringToSign:
HEAD


Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,823 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,834 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Method: HEAD
2014-09-01 20:58:25,834 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Data:
2014-09-01 20:58:25,834 DEBUG: Headers: {'Range': 'bytes=13561477-67820546'}
2014-09-01 20:58:25,834 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,834 DEBUG: Port: 80
2014-09-01 20:58:25,834 DEBUG: Params: {}
2014-09-01 20:58:25,835 DEBUG: Token: None
2014-09-01 20:58:25,835 DEBUG: StringToSign:
HEAD


Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,835 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,854 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: Method: GET
2014-09-01 20:58:25,854 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,855 DEBUG: Data:
2014-09-01 20:58:25,855 DEBUG: Headers: {}
2014-09-01 20:58:25,855 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,855 DEBUG: Port: 80
2014-09-01 20:58:25,855 DEBUG: Params: {}
2014-09-01 20:58:25,856 DEBUG: Token: None
2014-09-01 20:58:25,856 DEBUG: StringToSign:
GET


Mon, 01 Sep 2014 20:58:25 GMT
/secret-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,856 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:QaHXpfcR+J4yOoAEKiLTHNqW28Y=
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
    boto_key.get_contents_to_file(f, headers={'Range': brange})
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
    response_headers=response_headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
    query_args=None)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
    for bytes in self:
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
    data = self.resp.read(self.BufferSize)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
    return httplib.HTTPResponse.read(self, amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
    self._wait(self._read_event)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
    self.hub.wait(watcher)
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
    result = waiter.get()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
    return self.hub.switch()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
    return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e6898910: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 1, 13564110, 27128219)> failed with timeout

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
    boto_key.get_contents_to_file(f, headers={'Range': brange})
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
    response_headers=response_headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
    query_args=None)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
    for bytes in self:
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
    data = self.resp.read(self.BufferSize)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
    return httplib.HTTPResponse.read(self, amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
    self._wait(self._read_event)
  File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
    self.hub.wait(watcher)
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
    result = waiter.get()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
    return self.hub.switch()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
    return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e68987d0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 4, 54256440, 67820545)> failed with timeout

Henkis avatar Sep 01 '14 21:09 Henkis

@Henkis the 404s are irrelevant.

What does matter is the timeouts reaching your bucket.

Unfortunately, I have little help to offer as far as debugging is concerned. I would start a Python script from scratch using boto S3 (https://github.com/boto/boto), query my bucket repeatedly until I trigger the issue, then dig into boto.
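A minimal sketch of such a probe (not from the original comment), assuming boto 2.x; the credentials, bucket and key names are placeholders:

#!/usr/bin/python
# Repeatedly fetch a single layer key from S3 and time each attempt,
# to see whether the stalls/resets happen outside the registry code.
import time
from boto.s3.connection import S3Connection

conn = S3Connection('<ACCESS_KEY>', '<SECRET_KEY>')
bucket = conn.get_bucket('<your-registry-bucket>')
key = bucket.get_key('prod/images/<image_id>/layer')

for attempt in range(20):
    start = time.time()
    try:
        key.get_contents_to_filename('/tmp/layer_probe')
        print 'attempt %d: ok in %.1fs' % (attempt, time.time() - start)
    except Exception as e:
        print 'attempt %d: failed after %.1fs: %s' % (attempt, time.time() - start, e)

If plain boto fetches run from the registry host show the same timeouts, the problem is between the host and S3 rather than in the registry itself.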

Keep me posted on any progress on this.

dmp42 avatar Sep 05 '14 18:09 dmp42

@Henkis can you try bumping your boto timeout values:

[Boto]
http_socket_timeout = 60

inside boto.cfg

... and report here if that helps?

Thanks.

dmp42 avatar Sep 10 '14 01:09 dmp42

I have tried adding the socket timeout to my boto.cfg; it doesn't help.

Same problems, but here is some more debug output:

[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/json -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-c94034bf8411f: Download complete
[debug] image.go:325 Json string: {{"id":"b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a","parent":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","created":"2014-08-26T19:40:30.559806204Z","container":"9328ff054fc27a5163c1ac44bfcb026380b9b15: Pulling fs layer
ainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp"b026380b9b15: Downloading 40.68 MB/67.82 MB 1s
b026380b9b15: Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609 orrupt input before offset 40690609 7] -job pull(55.67.195.101/server, 59c2ef25615aee308a641a79d79547445b6a2831) = ERR (1)le install --gemfile /home/app/webapp/Gemfile --path /home/app/bundle --deploymentb026380b9b15: Error downloading dependent layers
 assets:precompile;    cp config/database.production.yml config/database.yml"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"docker_version":"0.9.1","config":{"Hostname":"95699cec7d3e","Domainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp":{}},"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["HOME=/root","PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["/sbin/my_init"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"architecture":"amd64","os":"linux","Size":69388467}}
[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/layer -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-commit/fa7b24f kernel/3.13.0-24-generic os/linux arch/amd64]]
[debug] session.go:188 server supports resume
[debug] image.go:97 Start untar layer
[debug] archive.go:88 [tar autodetect] n: [31 139 8 0 0 9 110 136 0 255]
[info] encountered error during pull and clearing it before resume: unexpected EOF
2014/09/14 14:09:54 Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609

Henkis avatar Sep 14 '14 12:09 Henkis

Ok, the greenlet is timing out actually. That might be a gunicorn bug you are hitting.

By any chance, would you be able to run off master? (I know there are quite some changes...) - or better, force gunicorn to 19.1 instead of 18.0 (inside requirements/main.txt).

Thanks!

dmp42 avatar Sep 15 '14 21:09 dmp42

Hi,

I have tried your solution of forcing gunicorn to 19.1, and I still have the same problem but a different error. The first error was just like Henkis's: the greenlet was timing out.

Now I have that:

Sep 23 10:15:59  docker[23048]: 2014-09-23 10:15:59 [18] [ERROR] Error handling request
Sep 23 10:15:59  docker[23048]: Traceback (most recent call last):
Sep 23 10:15:59  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:15:59  docker[23048]: resp.write(item)
Sep 23 10:15:59  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:15:59  docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:15:59  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:15:59  docker[23048]: sock.sendall(data)
Sep 23 10:15:59  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:15:59  docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:15:59  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:15:59  docker[23048]: return sock.send(data, flags)
Sep 23 10:15:59  docker[23048]: error: [Errno 104] Connection reset by peer
Sep 23 10:16:10  docker[23048]: 2014-09-23 10:16:10 [19] [ERROR] Error handling request
Sep 23 10:16:10  docker[23048]: Traceback (most recent call last):
Sep 23 10:16:10  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:16:10  docker[23048]: resp.write(item)
Sep 23 10:16:10  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:16:10  docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:16:10  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:16:10  docker[23048]: sock.sendall(data)
Sep 23 10:16:10  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:16:10  docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:16:10  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:16:10  docker[23048]: return sock.send(data, flags)
Sep 23 10:16:10  docker[23048]: error: [Errno 32] Broken pipe

or that :

Sep 23 10:35:08  docker[23048]: 2014-09-23 10:35:08 [16] [ERROR] Error handling request
Sep 23 10:35:08  docker[23048]: Traceback (most recent call last):
Sep 23 10:35:08  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:35:08  docker[23048]: resp.write(item)
Sep 23 10:35:08  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:35:08  docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:35:08  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:35:08  docker[23048]: sock.sendall(data)
Sep 23 10:35:08  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:35:08  docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:35:08  docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
Sep 23 10:35:08  docker[23048]: return sock.send(data, flags)
Sep 23 10:35:08  docker[23048]: error: [Errno 104] Connection reset by peer

mcadam avatar Sep 23 '14 10:09 mcadam

At least the greenlet is no longer going in lalaland.

So:

  • do you experience degradation and ultimately unresponsiveness after these errors?
  • are you running nginx in front of your registry?

This smells to me like a known regression where cancelled requests wouldn't be handled correctly.

Assuming you are indeed behind nginx, can you try killing keep-alive? (eg: proxy_set_header Connection "";)

dmp42 avatar Sep 23 '14 17:09 dmp42

Right now no nginx; we access the registry directly on the local network on port 5000. After these errors, the docker pull command just fails and that's it; after multiple tries it works...

mcadam avatar Sep 23 '14 17:09 mcadam

@mcadam that's likely this: https://github.com/benoitc/gunicorn/issues/818

dmp42 avatar Sep 23 '14 18:09 dmp42

I've been experiencing exactly the same symptoms and it appears the problem, at least in my case, is in fact downloading from S3. I created a tiny boto script to download a 69M layer:

#!/usr/bin/python
from boto.s3.connection import S3Connection, Location
conn = S3Connection('***', '***')
bucket = conn.get_bucket('link-docker')
key = bucket.get_key('test/images/d497ad3926c8997e1e0de74cdd5285489bb2c4acd6db15292e04bbab07047cd0/layer')
key.get_contents_to_filename("test_layer")

and found it took 4s to run the first time, then 15s, 23s, 9s and finally after 2m3s:

Traceback (most recent call last):
...
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 1471, in _get_file_internal
    for bytes in self:
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 365, in next
    data = self.resp.read(self.BufferSize)
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 415, in read
    return httplib.HTTPResponse.read(self, amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)  
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
    return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer

Presumably this kind of unreliability with S3 isn't normal so I'm investigating further...

ofrasergreen avatar Oct 22 '14 20:10 ofrasergreen

Same issue here:

b1b7444f3af2: Error downloading dependent layers 
2014/10/29 21:09:51 Error pulling image (latest) from myregistry.com:5000/img-myimage, flate: corrupt input before offset 86796209

Restarting does not solve the problem.

rodlogic avatar Oct 29 '14 23:10 rodlogic

We're having similar problems with 0.9.0. Everything seems to be working fine when just one or two machines are pulling at the same time, but when we let our frontend array (30+ machines) pull at the same time then all hell breaks loose.

Our workaround has been to put HAProxy in front, configured to allow only as many connections to the registry as there are worker threads running, and to spread out the image pulls over a larger time period. We're also running the registry on three different machines, all using the same S3 backend, so that HAProxy has more backends to spread the requests across.

garo avatar Nov 12 '14 07:11 garo

@garo and others

  1. As a workaround, I would suggest you delegate the layer delivery directly to S3 (see storage_redirect under https://github.com/docker/docker-registry/#general-options - doc sucks but it should be easy to figure out from the S3 driver code) - alternatively, you may even use CloudFront directly on top of the S3 bucket and have the registry 302 to CF-signed URLs (ping me on irc docker-dev if you need help)
  2. This will receive some scrutiny for the 1.0 release - for now, most of the energy is dedicated to V2 (in Go - branch "next-generation"), expected to land Q1, so I don't want you to expect too much from the current version and would advise you to use 1. from above to work around your actual problem...

dmp42 avatar Nov 12 '14 17:11 dmp42

Thanks! The storage_redirect option seems to fix all our issues. To others: just add "storage_redirect: true" to the registry.yaml. You can verify that it works by using tcpdump: "tcpdump -l -A -s 1500 src port 5000 | grep Location:" and you should get nice headers like: "Location: https://yourbucket.s3.amazonaws.com/..."
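For a concrete reference (not part of the original comment), a sketch of what the prod flavor from the config earlier in this thread might look like with the redirect enabled; the bucket name and keys are placeholders:

prod:
    storage: s3
    storage_path: /prod
    s3_access_key: <ACCESS_KEY>
    s3_secret_key: <SECRET_KEY>
    s3_bucket: <your-registry-bucket>
    boto_bucket: <your-registry-bucket>
    # 302-redirect layer downloads straight to S3 instead of proxying them through the registry
    storage_redirect: true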

garo avatar Nov 13 '14 12:11 garo

To add to @garo's comment (and to save you some digging), you can add -e STORAGE_REDIRECT=true to your docker command if you are running this in a container.
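A sketch (not from the original comment) of what that might look like for the setup described at the top of this thread, where <our-registry-image> stands for the image built from the Dockerfile above:

docker run -d -p 5000:5000 \
    -e SETTINGS_FLAVOR=prod \
    -e STORAGE_REDIRECT=true \
    <our-registry-image>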

trenton42 avatar Nov 26 '14 19:11 trenton42

We've got this issue too on our registries. We're running them inside an AWS VPC, with one registry container per host and 6 instances currently. Those instances are fronted by an elastic load balancer whose timeout is set to 5 minutes. We're using S3 as a storage backend.

After finding this thread, we set the registry with the -e STORAGE_REDIRECT=true option to delegate image downloads directly to S3. That has helped tremendously with the myriad of errors we were getting such as EOF.

However, we're still getting EOF errors on some registry calls that aren't actual image layer retrievals. For example, I saw an EOF error on a /ancestry call. I don't see any errors in the Docker Registry logs when these types of issues happen so I'm sort of at a loss.

These errors seem to happen when we have a heavier load on the registries, such as 5-10 images pulling from the registries at the same time. However, that doesn't seem like a heavy load, especially with storage redirect and 6 instances of the registry behind a load balancer.

dsw88 avatar Mar 10 '15 21:03 dsw88

@all we recently removed the ParallelKey fetch from S3 from the registry - it was triggering timeouts for large objects and cluttering disk space with orphaned temporary files. It's hard to tell whether any of the side-effects you experienced here are linked to it or not, but that removal can surely only help.

dmp42 avatar Mar 10 '15 21:03 dmp42

Hmm, that might help. Is there any documentation on how to disable that feature on our registries? I'm using the container, so I've usually just had environment variables that map to what's in the YAML config file, but I don't see the option you mentioned in that YAML config file schema.

dsw88 avatar Mar 11 '15 04:03 dsw88

It's code removal (#961). You can test by using the latest master.

dmp42 avatar Mar 11 '15 20:03 dmp42

Ok I'm trying the fix out now, I'll have to wait a few days to validate whether our percentage of deploy failures when pulling from the registries goes down.

If anyone else is interested in trying this fix quickly, I have a container built on https://registry.hub.docker.com/u/dsw88/registry/. The tag is 0.9.1-parallelkey-fix. I built the container from master, which appears safe at this point since there haven't been any major code changes since the 0.9.1 stable release on Jan. 8: https://github.com/docker/docker-registry/commits/master

dsw88 avatar Mar 16 '15 18:03 dsw88

Ok it looks like I'm still seeing the EOF errors after applying the parallel key fix. I don't know yet whether their occurrence has been reduced, since I'll need to wait a few days to have enough real-world data from deploys.

But regardless, it appears that the ParallelKey removal isn't the ultimate fix for this issue. Any other suggestions? Do you think it would be good to set up a Redis LRU cache? I'm wondering if having that cache would decrease the number of errors since it will need to make fewer round-trips to S3.

dsw88 avatar Mar 16 '15 23:03 dsw88

I would definitely recommend using the Redis LRU cache. I would also advise to delegate actual bits delivery to either S3 or a Cloudfront view of it (https://github.com/docker/docker-registry/blob/master/ADVANCED.md#s3-storage-with-cloudfront)

dmp42 avatar Mar 17 '15 01:03 dmp42

Ok so I'm now using the LRU cache and doing S3 storage redirect. We continue to get EOF errors, but we'll see if this at least cuts it down.

I'm still concerned about the underlying issue here, however, as adding a cache is at best masking the problem. I'll try to do some debugging in the registry to find out what's causing this error, but I'm not super familiar with the codebase or technology stack so it'll be slow going.

One of my problems is I can't see any errors in the registry logs when this problem occurs. @dmp42 do you have any suggestions about how I could go about getting enough information about the error to start debugging? The Docker client isn't much help when pulling images because it just says "unexpected EOF".

dsw88 avatar Mar 18 '15 18:03 dsw88

It appears that the small files like /ancestry and others are still failing periodically for us:

$ docker pull <host_name>/<image_name>:238   
Pulling repository <host_name>/<image_name>                      
2015/03/18 20:44:16 Error pulling image (<tag_name>) from <host_name>/<image_name>, 
Get http://<host_name>/v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry: EOF                                                                                  

When I look in the registry logs, it shows that it got that request and even returned a 200, so presumably it thought it returned the image layer correctly:

[18/Mar/2015:20:44:16 +0000] "GET /v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry HTTP/1.1" 200 3196 "-" "docker/1.1.2 go/go1.2.1 git-commit/d84a070 kernel/3.14.0-0.bpo.2-amd64 os/linux arch/amd64"

So are the threads in the webapp dying or timing out while it's streaming the response or something?

dsw88 avatar Mar 18 '15 21:03 dsw88

@dsw88 I now think you are not talking about the same issue that was initially described here (the recommended workarounds were/are here to address registry-side EOF errors/issues communicating with S3). Your problem is more likely in the client-side pipeline.

Do you use an http proxy? or a reverse-proxy in front of your registry? (nginx, HAproxy) Do you use boot2docker?

Also, you are running a quite old version of docker (1.1.2).

I would strongly suggest upgrading if you can...

dmp42 avatar Mar 18 '15 23:03 dmp42