Isolation: added support for per-application cgroups.
Firstly, this is not to be confused with CLONE_NEWCGROUP, which unit already supports and which relates to namespaces. To recap, namespaces allow processes to have different views of various parts of the system, such as filesystem mounts, networking, hostname etc.
cgroup, on the other hand, is a Linux kernel facility for collecting a bunch of processes together to perform some task on the group as a whole, for example to implement resource limits.
There are two parts to cgroup: the core part of organising processes into a hierarchy, and the controllers, which are responsible for enforcing resource limits etc.
There are currently two versions of the cgroup sub-system: the original cgroup and version 2, introduced in Linux 3.16 (August 2014) and marked stable in 4.5 (March 2016).
This commit supports the cgroup V2 API and implements the ability to place applications into their own cgroup on a per-application basis. You can put each application into its own cgroup, or you can group several together. The ability to set resource limits can easily be added in the future.
The initial use case of this would be to aid in the observability of unit applications, which becomes much easier if you can just monitor them on a per-cgroup basis.
One thing to note about cgroup is that, unlike namespaces, which are controlled via system calls such as clone(2) and unshare(2), cgroups are set up and controlled through the cgroupfs pseudo-filesystem.
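As a rough standalone illustration of what that means in practice (not Unit code; the paths match the example further down), creating a cgroup and moving a process into it is plain directory and file I/O:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    FILE  *fp;

    /* Creating a cgroup is just mkdir(2) under the cgroup2 mount point. */
    if (mkdir("/sys/fs/cgroup/unit/python", 0777) == -1 && errno != EEXIST) {
        return 1;
    }

    /* Moving a process in is a write to that group's cgroup.procs file. */
    fp = fopen("/sys/fs/cgroup/unit/python/cgroup.procs", "we");
    if (fp == NULL) {
        return 1;
    }

    fprintf(fp, "%d\n", getpid());

    return fclose(fp) == EOF;
}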
cgroup is Linux only, and this support will only be enabled if configure finds the cgroup2 filesystem mount, e.g.
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
To make use of this in unit, a new "cgroup" section has been added to the isolation configuration, e.g.
"applications": {
"python": {
"type": "python",
"processes": 5,
"path": "/opt/unit/unit-cgroup-test/",
"module": "app",
"isolation": {
"cgroup": {
"name": "unit/python"
}
}
}
}
This will place the "python" application into its own cgroup under CGROUP_ROOT/unit/python.
If you know of an already existing cgroup where you'd like to place it, you can, e.g.
"name": "system.slice/unit/python"
Where system.slice has already been created by systemd and may already have some overall system limits applied, which would also apply to unit. Limits apply down the hierarchy, and lower groups can't exceed the limits set by the groups above.
So what does this actually look like? Let's take the unit-calculator application and place each of its applications into its own cgroup. If we give each application a new section like
"isolation": {
"cgroup": {
"name": "unit/unit-calculator/add"
}
}
changing the name for each one, we can visualise the result with the systemd-cgls command, e.g.
│ └─session-5.scope (#4561)
│   ├─ 6667 sshd: andrew [priv]
│   ├─ 6684 sshd: andrew@pts/0
│   ├─ 6685 -bash
│   ├─ 12632 unit: main v1.28.0 [/opt/unit/sbin/unitd --control 127.0.0.1:808>
│   ├─ 12634 unit: controller
│   ├─ 12635 unit: router
│   ├─ 13550 systemd-cgls
│   └─ 13551 less
├─unit (#4759)
│ └─unit-calculator (#5037)
│   ├─subtract (#5069)
│   │ ├─ 12650 unit: "subtract" prototype
│   │ └─ 12651 unit: "subtract" application
│   ├─multiply (#5085)
│   │ ├─ 12653 unit: "multiply" prototype
│   │ └─ 12654 unit: "multiply" application
│   ├─divide (#5101)
│   │ ├─ 12671 unit: "divide" prototype
│   │ └─ 12672 node divide.js
│   ├─sqroot (#5117)
│   │ ├─ 12679 unit: "sqroot" prototype
│   │ └─ 12680 /home/andrew/src/unit-calculator/sqroot/sqroot
│   └─add (#5053)
│     ├─ 12648 unit: "add" prototype
│     └─ 12649 unit: "add" application
We can see that the main unit processes are in the same cgroup as the shell from which they were started; by default, child processes are placed into the same cgroup as the parent.
Then we can see that each application has been placed into its own cgroup.
The cgroups are removed on shutdown or as required on reconfiguration.
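In rough terms the cleanup walks back up the tree (a hedged sketch, not the actual nxt_cgroup_cleanup(); the real code stops at unit's own cgroup rather than at the mount root):

#include <string.h>
#include <unistd.h>

#define CGROUP_ROOT  "/sys/fs/cgroup"

/* E.g. cgpath = "/sys/fs/cgroup/unit/unit-calculator/add".  rmdir(2)
 * fails once we reach a non-empty or shared parent, which is the
 * natural stopping point. */
static void
cgroup_cleanup(char *cgpath)
{
    char  *p;

    while (strcmp(cgpath, CGROUP_ROOT) != 0) {
        if (rmdir(cgpath) == -1) {
            break;
        }

        p = strrchr(cgpath, '/');   /* trim to the parent group */
        *p = '\0';
    }
}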
One open question is how to handle failures to create or add processes to cgroups. I see three options:
a) Ignore failures and carry on
b) Fail hard
c) Carry on but log failures
I'm leaning towards c.
If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK) (by fail hard I mean don't allow the reconfiguration and keep the old one).
> If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK).
but do you have a diff to see what would mean fail hard vs log & continue?
On Mon, 18 Jul 2022 15:39:29 -0700 Alejandro Colomar @.***> wrote:
> If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK).
Yes, start up and reconfigure is when cgroups could be created/populated.
Actually, I haven't noticed whether there is any process monitoring code. Say an application process dies and unit re-creates it; then this would become a runtime failure. But, as I say, I haven't noticed any such code.
On Mon, 18 Jul 2022 15:49:00 -0700 Alejandro Colomar @.***> wrote:
@alejandro-colomar commented on this pull request.
+static nxt_int_t
+nxt_cgroup_create(const nxt_task_t *task, nxt_process_t *process)
+{
+    char  cgroot[PATH_MAX];
+    char  cgpath[PATH_MAX];
+
+    if (task->thread->runtime->type != NXT_PROCESS_MAIN
+        || nxt_process_type(process) != NXT_PROCESS_PROTOTYPE) {
+        return NXT_OK;
+    }
+
+    if (process->isolation.cgroup.name == NULL) {
+        return NXT_OK;
+    }
+
+    nxt_make_cgpaths(process, cgpath, sizeof(cgpath), cgroot, sizeof(cgroot));
I have a longstanding concern about this kind of sizeof usage:
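(For readers outside the review: the usual pitfall is that sizeof only measures the buffer while the variable is a true array in scope. A minimal illustration, not Unit code:)

#include <stdio.h>

static void
take(char *path)
{
    /* Here sizeof(path) is sizeof(char *), e.g. 8 on 64-bit systems,
     * NOT the size of the caller's buffer. */
    printf("in callee: %zu\n", sizeof(path));
}

int
main(void)
{
    char  cgpath[4096];

    /* Fine today: cgpath is an array in scope, so this is 4096... */
    printf("in caller: %zu\n", sizeof(cgpath));

    /* ...but if cgpath ever becomes a heap pointer, the very same
     * sizeof expression silently yields the pointer size instead. */
    take(cgpath);

    return 0;
}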
You know, I might just get rid of that function; I wasn't particularly happy with it myself, and only the cleanup function actually needs both cgpath and cgroot.
This updated commit should address all the above feedback (except the error handling).
Notable changes are
- The removal of nxt_make_cgpaths(), it was introduced for a problem that no longer exists.
- The auto/cgroup script has been re-worked to make it POSIX compliant.
Apart from most of the above, the biggest change here is to unconditionally include the cgroup_cleanup member in nxt_process_isolation_t; that way we can get rid of the #if/#endif in nxt_main_process_cleanup().
Changes this time
- cgroup name validation
- mark the cgroup name as required
- check the return value of the fprintf(3) in nxt_cgroup_proc_add()
@ac000 @alejandro-colomar
If the user provides the config:
{
    "isolation": {
        "cgroup": {
            "name": ""
        }
    }
}
It will add the process to the root of the cgroup hierarchy (the path will be /sys/fs/cgroup//cgroup.procs). This could be an easy fix or work by design, but given that the cgroup root is a valid configuration, I'm now wondering if it makes sense to call this name
...
cgroup users are used to placing their configurations in the cgroup fs and organizing their profiles using a directory hierarchy, so maybe we can make this clear in the config if we document it as a path:
{
    "isolation": {
        "cgroup": {
            "path": "relative/path"
        }
    }
}
If the path is relative, then it's relative to Unit's cgroup slice (we can check Unit's cgroup path by reading /proc/self/cgroup).
But if the path is absolute, it's an absolute path to the cgroup root:
{
    "isolation": {
        "cgroup": {
            "path": "/unit/apps/my_app"
        }
    }
}
In this case, the slice used is /sys/fs/cgroup/unit/apps/my_app.
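(For context, finding Unit's own cgroup means picking the v2 entry out of /proc/self/cgroup, which is the line starting "0::". A minimal sketch, assuming a cgroup2-only setup:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char    *buf = NULL, *nl;
    FILE    *fp;
    size_t   size = 0;

    fp = fopen("/proc/self/cgroup", "re");
    if (fp == NULL) {
        return 1;
    }

    /* The cgroup v2 entry has hierarchy ID 0 and an empty controller
     * list, i.e. a line of the form "0::/<path>". */
    while (getline(&buf, &size, fp) != -1) {
        if (strncmp(buf, "0::", 3) == 0) {
            nl = strchr(buf, '\n');
            if (nl != NULL) {
                *nl = '\0';
            }
            printf("our cgroup: /sys/fs/cgroup%s\n", buf + 3);
            break;
        }
    }

    free(buf);
    return fclose(fp) == EOF;
}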
This way, users also have an easy way of placing all applications inside the resource limits of unit itself, automatically, without needing to know in advance where it is. This is useful because commonly Unit's "daemon config"/"systemd unit" is configured in a separate repo from the applications' configs.
This also makes it very clear we are dealing with cgroup paths, which cgroup users are used to.
WDYT?
I'll give it some thought, but on the face of it, that sounds reasonable.
- Implemented @i4ki's suggestion.
- Updated the commit message to reflect this change.
So essentially it means that specifying an absolute path will retain the old behaviour of creating the cgroups under the cgroupfs mount point, whereas specifying a relative path will create the cgroups under unit's main process cgroup.
For testing I have the unit-calculator app set up with all the ruby/java/go/etc. dependencies installed, but it takes a bit of fudging to get going.
I also have a simpler python app test, using the following config
{
    "listeners": {
        "[::1]:8080": {
            "pass": "applications/python"
        }
    },
    "applications": {
        "python": {
            "type": "python",
            "processes": 5,
            "path": "/opt/unit/unit-cgroup-test/",
            "module": "app",
            "isolation": {
                "cgroup": {
                    "path": "app/python"
                }
            }
        }
    }
}
Adjust paths etc as required.
/opt/unit/unit-cgroup-test contains the following python app.py
def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return (b"Hello, Python on Unit!")
Then I usually just use systemd-cgls to check things are going where they should. For example, the above config produces
│ │ │ ├─app-glib-cinnamon\x2dcustom\x2dlauncher\x2d3-43951.scope (#90951)
│ │ │ │ ├─ 30853 unit: main v1.28.0 [/opt/unit/sbin/unitd --no-daemon]
│ │ │ │ ├─ 30855 unit: controller
│ │ │ │ ├─ 30856 unit: router
│ │ │ │ ├─ 43951 xterm -bg rgb:20/20/20 -fg white -fa DejaVu Sans Mono
│ │ │ │ ├─ 43956 bash
│ │ │ │ ├─ 58828 sudo -i
│ │ │ │ ├─ 58831 -bash
│ │ │ │ └─app (#110298)
│ │ │ │   └─python (#110314)
│ │ │ │     ├─ 36421 unit: "python" prototype
│ │ │ │     ├─ 36422 unit: "python" application
│ │ │ │     ├─ 36423 unit: "python" application
│ │ │ │     ├─ 36424 unit: "python" application
│ │ │ │     ├─ 36425 unit: "python" application
│ │ │ │     └─ 36426 unit: "python" application
By specifying a relative path the cgroup has been created as a sub group of the main unit process cgroup.
Whereas by specifying an absolute path the cgroup path is created directly under the cgroupfs mount point separate from the main unit cgroup.
│ │ │ ├─app-glib-cinnamon\x2dcustom\x2dlauncher\x2d3-43951.scope (#90951)
│ │ │ │ ├─ 30853 unit: main v1.28.0 [/opt/unit/sbin/unitd --no-daemon]
│ │ │ │ ├─ 30855 unit: controller
│ │ │ │ ├─ 30856 unit: router
│ │ │ │ ├─ 43951 xterm -bg rgb:20/20/20 -fg white -fa DejaVu Sans Mono
│ │ │ │ ├─ 43956 bash
│ │ │ │ ├─ 58828 sudo -i
│ │ │ │ └─ 58831 -bash
...
├─unit (#110189)
│ └─app (#110233)
│   └─python (#110249)
│     ├─ 36029 unit: "python" prototype
│     ├─ 36030 unit: "python" application
│     ├─ 36031 unit: "python" application
│     ├─ 36032 unit: "python" application
│     ├─ 36033 unit: "python" application
│     └─ 36034 unit: "python" application
hey @ac000 @alejandro-colomar,
about the error handling and logging, there are a lot of weird cases where writing to cgroup.procs can fail; some examples are below:
- ENOTSUP: The subsystem can be in an invalid state (cgroup.type reports it) in which all write operations for moving processes around the hierarchy fail.
- EBUSY: The "no internal processes" rule.
- EPERM: Basic file permissions.
- etc.
So in my opinion, we should always log the error returned if we failed to move the process into the cgroup. This is important to understand why the application didn't join the cgroup.
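(Agreed, the errno is the interesting bit. A hedged sketch of what "log the error" amounts to, using plain libc rather than Unit's logging API:)

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Returns 0 on success, -1 on failure, logging the cause. */
static int
cgroup_proc_add(const char *cgprocs, pid_t pid)
{
    int    err = 0;
    FILE  *fp;

    fp = fopen(cgprocs, "we");
    if (fp == NULL) {
        fprintf(stderr, "open %s: %s\n", cgprocs, strerror(errno));
        return -1;
    }

    /* The kernel vets the move when the write reaches cgroup.procs, so
     * both the fprintf(3) and the flush at fclose(3) can fail with
     * EBUSY, ENOTSUP, EPERM, ... */
    if (fprintf(fp, "%d\n", (int) pid) < 0) {
        err = errno;
    }
    if (fclose(fp) == EOF && err == 0) {
        err = errno;
    }

    if (err != 0) {
        fprintf(stderr, "adding pid %d to %s: %s\n",
                (int) pid, cgprocs, strerror(err));
        return -1;
    }

    return 0;
}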
With regard to "hard fail or log + carry on", I see that both approaches are valid in different use cases. The "hard fail" could be very important if the application requires strict resource control. For example, say the application is a sandbox/playground/webshell/buildbot/CI/etc., and the number of processes spawned and memory usage must be restricted (pids.max, memory.max, io.max, etc.); then, if we fail to place the process into the right cgroup when creating or respawning it, we must not continue, because otherwise it could be a security issue, or it would allow users to use more than the allowed quota in the cloud, etc. The developers of such applications would need to take more care of the cgroup state/hierarchy, in order to make sure their processes can always join the intended cgroups.
I also understand that for most applications, this is not a requirement, and not joining a cgroup is just a harmless issue. In this case, the high availability of the applications is way more important.
So maybe we could have an isolation.cgroup.strict boolean option to control this behavior? Then, if it's false by default, it just logs the errors and continues, but if true, we fail on (re)configuration and on runtime re-spawns.
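To make the shape concrete, a hypothetical config (the option name and default are just this proposal, nothing merged) would sit alongside the existing cgroup settings:

{
    "isolation": {
        "cgroup": {
            "path": "app/python",
            "strict": true
        }
    }
}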
In the future, I think we can even make the Unit implementation respect (or be polite) in the case of external tools also manipulating the cgroup hierarchy, but this could be done later, after the user's feedback.
Sorry if this was already discussed or if it doesn't make much sense, just trying to help.
This has the following changes
- Don't bother forbidding '/./' in the cgroup path
- Forbid an empty cgroup path
- Use getline(3) instead of fscanf(3) for reading /proc/self/cgroup
- Just use PATH_MAX directly instead of 'pathlen'
- Handle the possibility of multiple cgroup2 mount points
@i4ki For the initial implementation we've decided to just follow what the rest of the isolation code does: if we can't put a process in a cgroup, we simply fail to start that process and log a message.
There was a question of whether, if we want to allow enabling strict behaviour for cgroups, we should also apply that to the isolation code in general.
Changes this time around :-
- During configure, output whether we found cgroupv2 support.
- If we fail to place a process in a cgroup, fail that process and log a message.
Changes
- If we fail to add a process to a cgroup, try to clean up the directories that we may have created.
Please remember to push once after the rebase and once again after the changes (in no specific order), so that github doesn't mess with the interdiff. I think you also need to send a small message after each rebase so that github doesn't merge the interdiff, but I'm not sure about this.
:)
> Please remember to push once after the rebase and once again after the changes (in no specific order), so that github doesn't mess with the interdiff. I think you also need to send a small message after each rebase so that github doesn't merge the interdiff, but I'm not sure about this.
Hmm, that's a bit of a bugger...
For
cb4b9ae "Isolation: wired up cgroup to build system." and
66830d2 "Isolation: wired up cgroup support to the config system.":
Reviewed-by: Alejandro Colomar <[email protected]>
Great job!!
Changes:-
- Don't leak memory from looping getline(3)
- Re-arrange includes in src/nxt_cgroup.c
- Re-appropriate ENODATA for when we don't find a cgroupv2 entry in /proc/self/cgroup
- Fix variable shadowing in nxt_mk_cgpath(); len was declared both outside and inside a loop (see the sketch below)
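(For anyone skimming, the shadowing pattern fixed above looks like this; illustrative only, not the actual Unit code. gcc's -Wshadow flags it.)

#include <stdio.h>

int
main(void)
{
    int  len = -1;                /* the outer len: the intended result */

    for (int i = 0; i < 3; i++) {
        int  len = i * 2;         /* oops: a NEW len shadowing the outer one */

        printf("inner len = %d\n", len);
    }

    /* Still -1: every assignment above hit the inner variable. */
    printf("outer len = %d\n", len);

    return 0;
}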
Changes :-
- Revert changes around getline(3)
43c651a "Isolation: wired up per-application cgroup support internally."
Reviewed-by: Alejandro Colomar <[email protected]>
Rebased with master
Changes :-
- PATH_MAX -> NXT_MAX_PATH_LEN
- Better termination of the prototype processes in case of error
What do you think about this simplification? Since we know truncation is not possible, it makes sense to simply not check for it. Let's reconsider; we almost shipped a bug that was in code supposedly preventing bugs... Simple code has fewer bugs. And given that snprintf(3) is so misdesigned that it's hard to use correctly, I prefer going for stpcpy(3).
However, if we apply this patch, I'd write a clear note in the commit message that truncation is not possible, and detail that we already check for it in the config validation code (and BTW, it makes sense to me to validate only once).
Cheers, Alex
diff --git a/src/nxt_cgroup.c b/src/nxt_cgroup.c
index 20497a45..0d1ec4f0 100644
--- a/src/nxt_cgroup.c
+++ b/src/nxt_cgroup.c
@@ -8,9 +8,9 @@
#include <nxt_cgroup.h>
-static int nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
-    char *cgpath);
+static char *nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
+    char *cgpath);
-static nxt_int_t nxt_mk_cgpath(nxt_task_t *task, const char *dir,
-    char *cgpath);
+static char *nxt_mk_cgpath(nxt_task_t *task, const char *dir,
+    char *cgpath);
@@ -18,6 +18,7 @@ nxt_int_t
nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
{
int len;
+ char *p;
char cgprocs[PATH_MAX];
FILE *fp;
nxt_int_t ret;
@@ -29,8 +30,8 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
return NXT_OK;
}
- ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ p = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
+ if (nxt_slow_path(p == NULL)) {
return NXT_ERROR;
}
@@ -39,13 +40,7 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
return NXT_ERROR;
}
- len = strlen(cgprocs);
-
- len = snprintf(cgprocs + len, PATH_MAX - len, "/cgroup.procs");
- if (nxt_slow_path(len >= PATH_MAX - len)) {
- nxt_errno = ENAMETOOLONG;
- return NXT_ERROR;
- }
+ stpcpy(p, "/cgroup.procs");
fp = nxt_file_fopen(task, cgprocs, "we");
if (nxt_slow_path(fp == NULL)) {
@@ -67,17 +62,16 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
void
nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
{
- char *ptr;
- char cgroot[PATH_MAX], cgpath[PATH_MAX];
- nxt_int_t ret;
+ char *ptr, *ret;
+ char cgroot[PATH_MAX], cgpath[PATH_MAX];
ret = nxt_mk_cgpath(task, "", cgroot);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ if (nxt_slow_path(ret == NULL)) {
return;
}
ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ if (nxt_slow_path(ret == NULL)) {
return;
}
@@ -89,11 +83,11 @@ nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
}
-static int
+static char *
nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
{
- int i, len;
- char *buf, *ptr;
+ int i;
+ char *buf, *ptr, *p;
FILE *fp;
size_t size;
ssize_t nread;
@@ -101,10 +95,10 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
fp = nxt_file_fopen(task, "/proc/self/cgroup", "re");
if (nxt_slow_path(fp == NULL)) {
- return -1;
+ return NULL;
}
- len = -1;
+ p = NULL;
buf = NULL;
found = 0;
while ((nread = getline(&buf, &size, fp)) != -1) {
@@ -133,21 +127,22 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
ptr++;
}
- len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s/%s", ptr, dir);
+ p = stpcpy(cgpath, NXT_CGROUP_ROOT);
+ p = stpcpy(p, ptr);
+ *p++ = '/';
+ p = stpcpy(p, dir);
out_free_buf:
nxt_free(buf);
- return len;
+ return p;
}
-static nxt_int_t
+static char *
nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
{
- int len;
-
/*
* If the path from the config is relative, we need to make
* the cgroup path include the main unit processes cgroup. I.e
@@ -155,19 +150,8 @@ nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
* NXT_CGROUP_ROOT/<main process cgroup>/<cgroup path>
*/
if (dir[0] != '/') {
- len = nxt_mk_cgpath_relative(task, dir, cgpath);
+ return nxt_mk_cgpath_relative(task, dir, cgpath);
} else {
- len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s", dir);
+ return stpcpy(stpcpy(cgpath, NXT_CGROUP_ROOT), dir);
}
-
- if (len == -1) {
- return NXT_ERROR;
- }
-
- if (len >= PATH_MAX) {
- nxt_errno = ENAMETOOLONG;
- return NXT_ERROR;
- }
-
- return NXT_OK;
}
$ git diff --stat
src/nxt_cgroup.c | 62 +++++++++++++++++++++++---------------------------------------
1 file changed, 23 insertions(+), 39 deletions(-)
While we do check at config time for the absolute path case (you could argue it's superfluous, but it's a quick sanity check before things go too far), we don't for the relative path case, so truncation checks are still required.
I think it's best to leave it as is at this late stage.
> While we do check at config time for the absolute path case (you could argue it's superfluous, but it's a quick sanity check before things go too far), we don't for the relative path case, so truncation checks are still required.
> I think it's best to leave it as is at this late stage.
Yes, I have to agree. We can merge the patch as is, which looks good to me (ugly but good), and then improve it afterwards.
Cheers,
Alex
Hi,
If we don't test here, that's fine, but we'll need to know that we need to truncate later. So, my previous simplification proposal would be invalid, and we'd need to add stpecpy() at some point. We can do that after the next release.
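(For reference, stpecpy() is the truncation-safe copy function Alex has proposed for the Linux man-pages; a sketch along the lines of its string_copying(7) description:)

#include <string.h>

/*
 * Copy src into the buffer [dst, end), always NUL-terminating it, and
 * return a pointer to the terminating NUL -- or end if the copy was
 * truncated.  Calls chain like stpcpy(3), but can never overrun.
 */
char *
stpecpy(char *dst, char *end, const char *src)
{
    char  *p;

    if (dst == end) {
        return end;
    }

    p = memccpy(dst, src, '\0', end - dst);
    if (p != NULL) {
        return p - 1;
    }

    /* Truncation: terminate at the last byte of the buffer. */
    end[-1] = '\0';
    return end;
}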
> Yes, I have to agree. We can merge the patch as is, which looks good to me (ugly but good), and then improve it afterwards.
I suggest we finish the feature in one go.
- The feature is clear now, and so is its logic.
- I believe the 1.30 version will be more challenging; once a feature is released, we should move on and focus on something new.
- It seems a bit strange to do the same feature again. I understand it may be an improvement, but it's better to finish it together.
The initial cgroups support is done. There is no 1.30 version; extra features will likely be added, but there is no time frame.
As agreed, this will be committed imminently. If you waited until things were perfect, you'd never ship anything!
@hongzhidao
My improvement is related to the bug that I reported last week about buffer overruns. I just found another case of a buffer overrun in another part of Unit. Since this case from Andrew was safe (for some particular meaning of safe) so far, I'm fine with merging it and fixing the bigger issue separately in 1.30. The only remaining issue before releasing 1.29 is the isolation bug that I fixed, so I prefer focusing on that. I'll soon post my tests, and maybe we can develop some pytests for that.