postgres
crash looping on GKE: wal-e pipe capacity issue with newer kernels
This is apparently due to https://github.com/wal-e/wal-e/issues/270
example spew:
wal_e.retries WARNING MSG: retrying after encountering exception
DETAIL: Exception information dump:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/wal_e/retries.py", line 62, in shim
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/wal_e/worker/gs/gs_worker.py", line 76, in fetch_partition
    with get_download_pipeline(PIPE, PIPE, self.decrypt) as pl:
  File "/usr/local/lib/python2.7/dist-packages/wal_e/pipeline.py", line 92, in __enter__
    self.stdin = pipebuf.NonBlockBufferedWriter(stdin)
  File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 225, in __init__
    _setup_fd(self._fd)
  File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 62, in _setup_fd
    set_buf_size(fd)
  File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 53, in set_buf_size
    fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, OS_PIPE_SZ)
IOError: [Errno 1] Operation not permitted
HINT: A better error message should be written to handle this exception. Please report this output and, if possible, the situation under which it arises.
STRUCTURED: time=2016-10-22T20:14:41.882253-00 pid=221
Thanks for the report @tmc! What provider and k8s version did you use to deploy Workflow? That can help us nail down the issue so we can figure out a fix we could propose upstream and then bump wal-e to a release with that fix.
Did the fix in https://github.com/wal-e/wal-e/issues/270#issuecomment-240811659 work for you?
It might be a neat experiment to try a slightly older kernel version with kubernetes and see if this issue still persists.
Yes, that fix worked. This is on GKE with k8s 1.4.
This is still present on k8s 1.4.6 on GKE.
Unfortunately there is nothing we can do on our end to fix this other than to use the provided workaround or to fix it in wal-e and bump the installed version. If you can provide a patch that fixes this issue for you, please make a PR upstream and we can bump wal-e forwards to the fix once it's merged.
I haven't seen this issue in the wild on GKE with k8s 1.4+ so I don't have a reliable test case (or even the slightest idea how this issue crops up) to test a fix against. Until then I cannot help you.
Hey all, WAL-E maintainer here.
I will accept a patch that either uses a lower pipe size that works with the default limits without tanking performance, or adds some adaptive code to deal with this new limit. I suspect the adaptive approaches may be more trouble than they're worth, but if someone can surprise me, that'd be great.
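For illustration only, here is a rough sketch of what the "adaptive" option might look like. It is not wal-e's code: the function name `set_pipe_size_adaptive` and the `floor` and `setter` parameters are hypothetical, and it assumes Python 3 on Linux. It requests the desired size and, on the EPERM seen in the traceback above, retries with progressively smaller sizes before giving up and leaving the pipe at whatever size it already has. Whether a smaller request actually succeeds depends on the kernel and the current limits.

```python
import errno
import fcntl

# F_SETPIPE_SZ is Linux-specific. The fcntl module only exposes it on
# newer Python versions, so fall back to the raw value from <fcntl.h>.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)


def set_pipe_size_adaptive(fd, wanted, floor=4096, setter=None):
    """Try to set the pipe buffer to `wanted` bytes, halving on EPERM.

    Returns the size that was accepted, or None if every attempt down to
    `floor` was refused (the pipe then keeps its existing size).
    `setter` is a hypothetical hook so the logic can be tested without
    a real pipe; by default it issues the F_SETPIPE_SZ fcntl.
    """
    if setter is None:
        def setter(fd, size):
            fcntl.fcntl(fd, F_SETPIPE_SZ, size)

    size = wanted
    while size >= floor:
        try:
            setter(fd, size)
            return size
        except (IOError, OSError) as e:
            # Only EPERM means "refused by the pipe limits";
            # anything else is a genuine error and is re-raised.
            if e.errno != errno.EPERM:
                raise
            size //= 2
    return None
```

A caller would pass the write end of a pipe, for example `set_pipe_size_adaptive(w, 1024 * 1024)` after `r, w = os.pipe()`, and treat a `None` result as "continue with the default buffer" rather than as a fatal error.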
I'm on k8s 1.5.2 on GKE. When I try the workaround from https://github.com/wal-e/wal-e/issues/270#issuecomment-240811659 I get:
root@deis-database-540367895-x6r6d:/# echo 0 > /proc/sys/fs/pipe-user-pages-soft
bash: /proc/sys/fs/pipe-user-pages-soft: Read-only file system
Am I doing something wrong?
For anyone else who runs into the same problem I had: I was able to solve it by downloading the workflow chart, unpacking it, and editing database-deployment.yml to add the annotation security.alpha.kubernetes.io/sysctls: fs.pipe-user-pages-soft=0 to the Deployment.
I ran into this again on an upgrade.
security.alpha.kubernetes.io will not work in GKE, because all alpha features are disabled there.
I think one could write a patch with some arithmetic to solve this (by reading the value, generously estimating how many pipes will be created, and dividing).
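The arithmetic described above could be sketched roughly as follows. This is an assumption-laden illustration, not a patch against wal-e: `read_soft_limit_pages`, `per_pipe_budget`, and the `expected_pipes` estimate are hypothetical names, a 4096-byte page is assumed by default, and a value of 0 is treated as "no soft limit", matching the `echo 0` workaround earlier in the thread. The result is rounded down to a power-of-two number of pages on the assumption that the kernel may round a request up.

```python
def read_soft_limit_pages(path="/proc/sys/fs/pipe-user-pages-soft"):
    """Return the per-user soft limit in pages, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        # File missing (older kernel), unreadable, or unexpected content.
        return None


def per_pipe_budget(soft_limit_pages, expected_pipes,
                    page_size=4096, wanted=1024 * 1024):
    """Divide the soft limit across a generous estimate of pipe count.

    Returns the number of bytes to request per pipe, never more than
    `wanted` and never less than one page.
    """
    if not soft_limit_pages:
        # None (could not read the value) or 0 (no soft limit applied).
        return wanted
    pages = max(1, soft_limit_pages // expected_pipes)
    # Round down to a power of two so a kernel that rounds the request
    # up does not push the total past the budget.
    pages = 1 << (pages.bit_length() - 1)
    return min(wanted, pages * page_size)
```

For example, with an assumed soft limit of 16384 pages and a generous estimate of 1000 pipes, `per_pipe_budget(16384, 1000)` yields 16 pages, i.e. 65536 bytes per pipe. The hard part in practice is choosing `expected_pipes`, since other processes running as the same user also count against the limit.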