
chown in /start -- is it necessary?

Open tilgovi opened this issue 9 years ago • 17 comments

The chown in /start can take time and messes with the consistency of things like the dokku checks timeouts. My understanding is that it's necessary in case something else (like a dokku plugin) modifies the filesystem after the compile step.

  • Is my understanding correct?
  • Is it necessary?
  • Is there a better way we can ensure those post-compile steps don't mess up permissions, such as by running them under the application user?
  • If not, could we deal with this in dokku explicitly as another step of generating the slug before deploy, so that it can be executed between hooks?

tilgovi avatar Feb 13 '15 00:02 tilgovi

Take a look at the Herokuish branch and project. I'd like to have this conversation based on how it's doing things since it'll be merged in and replace a lot of buildstep soon.

progrium avatar Feb 16 '15 16:02 progrium

There's also a relation to #133 - chowning the dir helps with the mounting of data volumes and random users, as the permissions for the mounted directories get „fixed" before the app starts as the unprivileged user, at least for volumes mounted below /app.

@tilgovi I've tried to replicate the impact on the checks plugin but haven't been able to produce a failure caused by a large app directory. Any indication what is needed to make the chown so slow that it has a noticeable impact? I tried with a dummy directory containing 131072 files in a 2-level hashed directory and even on my slowest drives it took less than a second.
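For reference, roughly how such a test tree can be generated and timed (paths and counts are illustrative, not the exact script I used):

# illustrative only: build a 2-level hashed tree holding 131072 empty files
# (256 x 256 directories with 2 files each), then time a recursive chown
mkdir -p /tmp/chown-test
for i in $(seq 0 255); do
  for j in $(seq 0 255); do
    d=$(printf '/tmp/chown-test/%02x/%02x' "$i" "$j")
    mkdir -p "$d"
    touch "$d/a" "$d/b"
  done
done
time chown -R nobody:nogroup /tmp/chown-test   # run as root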

mjonuschat avatar Feb 16 '15 17:02 mjonuschat

@yabawock I've seen the issue almost solely under high resource utilization.

michaelshobbs avatar Feb 16 '15 18:02 michaelshobbs

I can't seem to trigger a situation where the chown is the sole culprit of the failing check. When torturing a system so that the chown takes more than a second to complete (iowait > 80%) starting the application without running chown also takes an extremely long time (enough to fail the checks on its own), at least with an application that's not just serving a static „Hello World“ page.

I'm pretty sure there is a usage mix that can trigger a failed check, but if that is „normal“ utilization of your host I'd consider removing the chown a worse stop-gap measure than increasing the wait/timeout values for the checks plugin.

@progrium herokuish doesn't do anything different, the chown is in procfile-setup-home().

mjonuschat avatar Feb 16 '15 20:02 mjonuschat

Wouldn't your underlying filesystem have a significant role in this?

progrium avatar Feb 17 '15 05:02 progrium

I've tried with both ext4 and btrfs. Maybe XFS (the Red Hat default?) is a possible candidate; there are contradicting reports - some say that XFS is slow for this kind of filesystem operation, others report it's faster than ext4.

mjonuschat avatar Feb 17 '15 13:02 mjonuschat

The box where I have this issue is actually very memory constrained. I don't know if that's the cause but I suspect so. There's a fair bit of swapping during this time.

If it's not reasonable to change this behavior then that's totally fine. I just thought it would be worth discussing.

So, reasons for slowness aside, what are the reasons we couldn't run the chown and checkpoint the result so that it's done already at deploy?
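For illustration, the kind of thing I have in mind (image names and the uid are made up; this is just a sketch of the idea, not an existing dokku feature):

# sketch: run the chown once against the built slug image and commit the
# result, so /start no longer has to do it on every boot
cid=$(docker run -d my-app-slug chown -R u23590:u23590 /app)
docker wait "$cid"                        # block until the chown finishes
docker commit "$cid" my-app-slug:chowned  # persist the fixed ownership
docker rm "$cid"
# containers would then be started from my-app-slug:chowned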

tilgovi avatar Feb 17 '15 16:02 tilgovi

Given the current way things work, the chown is optional and mostly fixes modifications made to the image/slug after the compile phase. As I already mentioned, it also helps with volumes mounted below /app, which can have permission problems due to the user account changing on each deploy.

If the user changed on every start of the application (as it does on Heroku), the chown would no longer be optional, since the ownership of /app would need to be adjusted for the current user. At the moment this isn't feasible due to bugs in docker, but that might change in the future.
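As a rough sketch of what that per-start user change would imply in /start (variable names and the setuidgid call are illustrative, not actual buildstep code):

# hypothetical /start fragment: pick a fresh unprivileged uid on every boot,
# adjust ownership of /app to match, then drop privileges and exec the app
uid=$(( (RANDOM % 50000) + 10000 ))
useradd -d /app -u "$uid" "u$uid"
chown -R "u$uid:u$uid" /app        # not optional here: the uid differs each start
exec setuidgid "u$uid" "$@"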

Weighing all pros and cons, I would argue for keeping the chown, as I think using volumes for persistent storage below /app (although not in line with an ephemeral filesystem) is currently much more common than a severely resource-constrained box.

mjonuschat avatar Feb 17 '15 20:02 mjonuschat

Any reason not to make that a distinct phase before deploy, though?

tilgovi avatar Feb 17 '15 20:02 tilgovi

The distinct phase before deploy would be something a container manager does, wouldn't it?

mjonuschat avatar Feb 17 '15 21:02 mjonuschat

Yes. I should maybe refile this against dokku. But first we'd need to separate the start script and the chown so the latter can be executed separately, I think.

tilgovi avatar Feb 17 '15 21:02 tilgovi

Please take into consideration that buildstep is rather generic and it's not just dokku using it. Even if there were only dokku, your proposal seems more prone to errors than the current situation to me. To make a reliable chown of everything under /app at build time, all data volumes would need to be mounted. But then you get into the situation that an older version of the application could be running with account u12345 and appropriate file permissions on the data volumes. Now, while building, you change the permissions within the docker volume to the new account, say u9876. Suddenly the running application (still using account u12345) can no longer access/write the persistent data, while the new application is still being deployed.
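To illustrate the race with made-up paths and image names (not actual dokku commands):

# old release still running and writing to the data volume as u12345
docker run -d --name myapp.old \
  -v /var/lib/dokku-data/myapp:/app/storage myapp:old

# building/deploying the new release chowns /app - including the mounted
# volume - to the new random user u9876
docker run --rm \
  -v /var/lib/dokku-data/myapp:/app/storage myapp:new \
  chown -R u9876:u9876 /app

# until the new container actually replaces it, the still-running old
# container (u12345) can no longer write to /app/storage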

I know that the same situation can occur now if you are using a dokku version with zero downtime deploys, but at least it will be fixed on application restart if a deployment error occurs.

When I originally implemented it, it was meant as a safeguard against „careless" post-processing of the slug and a courtesy to docker volumes below /app. Currently it could be removed - in master as well as the herokuish branch - as both operate with the same user during the build and run phases.

All in all it seems like a band-aid for a problem with a different root cause to me. I am not opposed to delegating the responsibility for correct permissions to the container manager if any kind of modification happens to the slug after the build phase, even mounting a volume. But I fear it will do more harm than good - especially since full user randomization was originally in the herokuish branch and only got removed due to bugs in docker. So the chown might become a requirement in the future.

@progrium What's your take on the situation?

mjonuschat avatar Feb 17 '15 22:02 mjonuschat

If it can be removed, we can take it out in herokuish (which should be landing here soon) and revisit it as issues around it come up again. I'm all for experiments like that, especially if they end up simplifying or removing code.

progrium avatar Feb 19 '15 02:02 progrium

I am deploying a node.js app with dokku and running into the problem that this chown step runs forever, uses a lot of system resources/IO, and somehow gets stuck.

I have 2 containers deploying the same code, one running as web and one as worker. When they start up they are sometimes (strangely, not always) stuck on the following:

root     11370  0.3  0.1  12884  1040 ?        D    02:20   0:00 chown -R u23590:u23590 /app
root     11522  0.2  0.1  12884  1048 ?        D    02:20   0:00 chown -R u23590:u23590 /app

vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1      0  68908  72208 555784    0    0   522   895  206  570  3  1  1 58 38
 3  0      0  64812  71876 559660    0    0  4224     0   52   94  0  1  0 98  1
 0  2      0  85064  70864 543220    0    0   768 20480   58   58  0  0  0 100  0
 0  2      0  84444  70864 543732    0    0   256  4080   45   23  0  0  0 100  0
 0  2      0  83948  70864 544244    0    0   256  4096   57   47  1  0  0 99  0

iotop shows that these chown processes are the reason the system is locked up:

4028 be/4 root      235.78 K/s  389.03 K/s  0.00 % 76.72 % [kworker/0:3]
   46 be/4 root      102.17 K/s  310.44 K/s  0.00 %  8.79 % [kworker/0:1]
11522 be/4 root       31.44 K/s    0.00 B/s  0.00 %  7.80 % chown -R u23590:u23590 /app
11370 be/4 root       23.58 K/s    0.00 B/s  0.00 %  6.12 % chown -R u23590:u23590 /app

Looking at what files these chown processes are accessing, it seems they are busy going through the node_modules files:

sudo lsof -p 11370
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
chown   21948 root  cwd    DIR   0,33     4096     2 /
chown   21948 root  rtd    DIR   0,33     4096     2 /
chown   21948 root  txt    REG   0,33    60160    85 /bin/chown
chown   21948 root  mem    REG   0,33             64 /lib/x86_64-linux-gnu/libnss_files-2.19.so (path dev=202,1, inode=401387)
chown   21948 root  mem    REG   0,33             62 /lib/x86_64-linux-gnu/libnss_nis-2.19.so (path dev=202,1, inode=401379)
chown   21948 root  mem    REG   0,33             60 /lib/x86_64-linux-gnu/libnsl-2.19.so (path dev=202,1, inode=401375)
chown   21948 root  mem    REG   0,33             58 /lib/x86_64-linux-gnu/libnss_compat-2.19.so (path dev=202,1, inode=401374)
chown   21948 root  mem    REG   0,33             51 /lib/x86_64-linux-gnu/libc-2.19.so (path dev=202,1, inode=401384)
chown   21948 root  mem    REG   0,33             42 /lib/x86_64-linux-gnu/ld-2.19.so (path dev=202,1, inode=401377)
chown   21948 root    0u   CHR    1,3      0t0 53490 /dev/null
chown   21948 root    1w  FIFO    0,8      0t0 53237 pipe
chown   21948 root    2w  FIFO    0,8      0t0 53238 pipe
chown   21948 root    3r   DIR   0,33     4096  9198 /app/node_modules/bower/lib/node_modules
chown   21948 root    5r   DIR   0,33    12288  9364 /app/node_modules/bower/lib/node_modules/lodash

dokku is running on a VPS and this consumes all available IO - sometimes after 20 minutes or so it's finally finished and resources are released. When I sudo docker exec -it container /bin/bash and run chown -R u23590:u23590 /app manually, it completes within 1 or 2 seconds as expected.
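For reference, a quick way to see how many entries that recursive chown actually has to walk (assuming the app lives under /app inside the container, as above):

sudo docker exec -it container /bin/sh -c 'find /app | wc -l'   # total entries
sudo docker exec -it container /bin/sh -c 'du -sh /app'         # total size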

Can I somehow disable this chown step if it's not required?

I'm using this .buildpacks file:

https://github.com/ddollar/heroku-buildpack-apt
https://github.com/heroku/heroku-buildpack-nodejs.git#v87
https://github.com/captain401/heroku-buildpack-xvfb

bymodude avatar Mar 05 '16 02:03 bymodude

Note: dokku no longer uses buildstep, so you are almost certainly running on an unsupported version of dokku.

josegonzalez avatar Mar 07 '16 16:03 josegonzalez

No, I'm running the current version of dokku, but from your comment I realize it now uses herokuish, which is doing the same chown step that takes forever in my case.

Is that an issue that should be filed over at herokuish then, or over at dokku, or is it something that I'm just out of luck with? I would assume that any node.js deployment on dokku with many dependencies (30K files in node_modules) may run into similar problems when running on a smaller VPS.

$ sudo dokku version
0.4.14

bymodude avatar Mar 07 '16 16:03 bymodude

Related herokuish issue: https://github.com/gliderlabs/herokuish/issues/114, which you may wish to comment upon.

josegonzalez avatar Mar 07 '16 18:03 josegonzalez