RFE: Add support for maximum supported kernel version

Open drakenclimber opened this issue 10 months ago • 20 comments

This patchset proposes to solve issue #11 - RFE: support "maximum kernel version".

Significant changes in this patchset

  • Updates syscalls.csv with the kernel versions in which syscalls were added for x86, x86_64, and x32. (See the discussion heading below for why I only did these three architectures.)
  • Adds two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVERMAX, for managing the maximum supported kernel version and what to do with syscalls that are newer than that version
  • If this feature is enabled by the user, then libseccomp will add a rule for every single known syscall up to the maximum supported kernel version. These rules will perform the DEFAULT action. (See the discussion below for more info.)
  • Adds supporting documentation and a test
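
To make the third bullet concrete, here is a minimal Python sketch of the expansion idea. It is not libseccomp's implementation: the `SYSCALL_ADDED` table and the `expand_rules` helper are hypothetical stand-ins for the data in syscalls.csv, and the version entries are illustrative only.

```python
# Hypothetical excerpt: syscall name -> (major, minor) kernel version that
# introduced it. Real data would come from syscalls.csv; (3, 0) stands for
# "v3.0 or earlier". These entries are illustrative only.
SYSCALL_ADDED = {
    "read": (3, 0),
    "openat2": (5, 6),
    "cachestat": (6, 5),
    "mseal": (6, 10),
}


def expand_rules(explicit_rules, default_action, kver_max):
    """Return the explicit rules plus a default-action rule for every known
    syscall that existed at or before kver_max and has no explicit rule."""
    rules = dict(explicit_rules)
    for name, added in SYSCALL_ADDED.items():
        if added <= kver_max and name not in rules:
            rules[name] = default_action
    return rules


# The user allowed read(); everything else known as of v6.5 gets EPERM.
rules = expand_rules({"read": "ALLOW"}, "ERRNO(EPERM)", (6, 5))
for name, action in sorted(rules.items()):
    print(name, action)
```

In this model, any syscall without an entry in `rules` (here `mseal`) would fall through to the action configured for syscalls newer than the maximum kernel version.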

Fixes: #11 CC: @kolyshkin @cyphar

Finally, I am hoping to discuss this issue at Linux Security Summit 2025 in Denver, Colorado USA on June 26th and 27th. I would love to get community feedback about the problem, the proposed solution, etc.

Edit - Here's the link to the LSS 2025 presentation/discussion: https://www.youtube.com/watch?v=24k-G7RTV1c

drakenclimber avatar Feb 12 '25 22:02 drakenclimber

According to my system calls table, there are holes in syscall numbering on several architectures (I looked at arm64, arm, armoabi, x86-64, x32 and i386). New-style architectures share syscall numbering, and new entries are added at the end of the table.

Your syscalls.csv showed me that I had missed the "parisc64" architecture. I will have to add support for it. (Edit: DONE)

When it comes to LTS/stable kernels, I think one of their rules is "no new stuff", which in this case means no new system calls. Distribution kernels may add them, and many have done so in the past, so an "is the syscall present" check may need to be more complex than "is the kernel version high enough".

Since you already support syscall.tbl for the x86 variants, a first step could be to extend that to other architectures. It will not cover all system calls, but you will get data for many.

I used these scripts for a quick check with my syscalls-table project:

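#!/bin/bash

# Path to a local clone of the Linux kernel git repository
KERNELDIR=~/devel/sources/linux/

# Walk every kernel release from v3.0 to v6.13
for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
do
    echo "$kernel_version"
    (cd "$KERNELDIR" && git checkout "v${kernel_version}")
    # Regenerate the tables and reinstall the package
    bash scripts/update-tables.sh "$KERNELDIR"
    pip install .
    python examples/tables-to-yaml.py "$kernel_version"
    # Keep a per-version snapshot of the tables and the YAML file
    cp -r data/tables "data/tables-${kernel_version}"
    cp syscalls.yml "syscalls-${kernel_version}.yml"
done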

The examples/tables-to-yaml.py script:

#!/usr/bin/python3

import sys
import system_calls
import yaml

kernel_version = ""

if len(sys.argv) > 1:
    kernel_version = sys.argv[1]

syscalls = system_calls.syscalls()

with open("syscalls.yml", "r") as yf:
    yml = yaml.safe_load(yf)

for syscall_name in yml["syscalls"]:
    entry = yml["syscalls"][syscall_name]

    # Record the first kernel version in which the syscall was seen
    if not entry["from"]:
        entry["from"] = kernel_version

    for arch in syscalls.archs():
        try:
            number = syscalls.get(syscall_name, arch)
        except system_calls.NotSupportedSystemCall:
            # The syscall does not exist on this architecture
            number = ""
        arch_entry = entry["archs"][arch]
        arch_entry["number"] = number
        # Compare against "" so that syscall number 0 still counts as present
        if number != "" and not arch_entry["from"]:
            arch_entry["from"] = kernel_version


with open("syscalls.yml", "w") as yf:
    yaml.dump(yml, yf)

I have not checked the results for correctness yet.

hrw avatar Feb 13 '25 07:02 hrw

Coverage Status

coverage: 89.436% (+0.4%) from 89.046% when pulling 9990a2805812d1beec278b1a9a5383c650cc5707 on drakenclimber:issues/11 into e7e633c28aed5333b185bfc0ad6f8d70b5fc20be on seccomp:main.

coveralls avatar Feb 13 '25 18:02 coveralls

Moved the discussion list to the v3 comment

Here's a side-by-side diff between v1 of this patchset's syscalls.csv and v2's syscalls.csv

drakenclimber avatar Feb 18 '25 20:02 drakenclimber

Isn't it the case that 'kernel wide' new system calls are added at the end, and 'new on this architecture' ones are added where they were supposed to be?

I remember system calls that were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.

hrw avatar Feb 19 '25 11:02 hrw

Please note that "afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg, gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg, putpmsg, security, stty, tuxcall, ulimit, vserver" are officially unimplemented system calls. My syscalls-table has them on its ignore list, which may be why you see some differences.

The problem with x32 is that you need the x32 headers installed on the system for them to be handled properly. Otherwise you get the x86-64 ones. My GitHub action that updates the syscalls-table data has an extra step to make sure they are present.

hrw avatar Feb 19 '25 11:02 hrw

Posted on mastodon about it: https://society.oftrolls.com/@hrw/114030254556485861 as some other people may find it useful too.

hrw avatar Feb 19 '25 11:02 hrw

Isn't it the case that 'kernel wide' new system calls are added at the end, and 'new on this architecture' ones are added where they were supposed to be?

I remember system calls that were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.

Yes, that was my recollection as well, but I wanted data to back it up. I expect this model to continue going forward.

For libseccomp I think that means that we can't rely on a "less than" rule for unknown syscalls. We'll either need an explicit rule for each syscall or a series of ranges.
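
The "series of ranges" option can be sketched as follows. This is a minimal Python illustration with made-up syscall numbers; `to_ranges` and `is_known` are hypothetical helpers, not part of libseccomp.

```python
def to_ranges(numbers):
    """Collapse syscall numbers into sorted, inclusive (start, end) ranges."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            # Extend the current range by one
            ranges[-1] = (ranges[-1][0], n)
        else:
            # A hole in the numbering starts a new range
            ranges.append((n, n))
    return ranges


def is_known(nr, ranges):
    """True if syscall number nr falls inside one of the known ranges."""
    return any(lo <= nr <= hi for lo, hi in ranges)


# Made-up table with holes at 4 and 7..9
known_numbers = [0, 1, 2, 3, 5, 6, 10]
ranges = to_ranges(known_numbers)
print(ranges)
```

A single "less than or equal to 10" check would treat 4 and 7 through 9 as known, whereas the range list keeps them in the unknown set.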

Thanks for the verification, @hrw

drakenclimber avatar Feb 19 '25 14:02 drakenclimber

https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html lets you disable and reorder columns, which can be handy when you want to compare numbers between architectures.

I recommend sorting by the arm64 or riscv64 column to see which new system calls are present on each architecture.

Note that everything from 'avr32' to the right no longer exists in the current Linux kernel; those columns are kept for historical purposes.

hrw avatar Feb 19 '25 14:02 hrw

Changes for v3:

  • Fixed the x32 syscall numbers. Thanks to @hrw for the guidance here

Moved the discussion list to the v4 comment. Here's a side-by-side diff of before and after this patchset (v3)

drakenclimber avatar Feb 19 '25 19:02 drakenclimber

There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR

As I wrote above, afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

hrw avatar Feb 19 '25 19:02 hrw

As I wrote above, afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

Ack. That's on my todo list :)

drakenclimber avatar Feb 19 '25 19:02 drakenclimber

It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?

A lot of work went into unifying the syscall tables for newer syscalls a few years ago. For all future syscalls (barring a few esoteric architectures) the syscall numbers will match between architectures and so any holes should be expected to be kept (except maybe for arch-specific syscalls, I don't know if there's a proper policy around that).

For completeness though, it might be necessary to have a more complicated rule. In runc we just do the hacky solution, which is okay in general but is not theoretically correct.

cyphar avatar Feb 28 '25 10:02 cyphar

Changes for v4:

  • Add handling for special syscalls like afs_syscall() in scmp_populate_syscalls_csv.py
  • Fix a bug in scmp_populate_syscalls_csv.py where it wasn't handling x32 read() properly
  • Regenerate syscalls.csv with this new info
  • All automated tests now pass :)

I think this is ready for more in-depth review

Discussion

  • ~Should we support every architecture from the start?~
    • ~This patchset only adds kernel versions for x86, [x86_64](https://github.com/seccomp/libseccomp/commit/e2b42b6a159d23df97e4da4098ea7ac7f8eb99f9), and x32. They have had a consistent syscall.tbl since 2015 (kernel version 4.0), so they were an easy initial candidate to prove out the logic. I would prefer to support all architectures from the start, but I'm not certain how easy/hard it will be to flesh out the remainder of syscalls.csv~
  • ~libseccomp has been around since kernel version 3.7.10 or so. Do we need to go that far back with our kernel version table?~
    • ~This patchset only goes back to 2015 (linux kernel version 4.0)~
    • Patch ~55bf2ea~ ~b424f57~ 6f34216db56cdcf3b3e7d4086fdf339d3d7cc369 now lists kernel versions all the way back to kernel v3.0
  • One thing that has kept me up at night with this patchset - did I get the correct kernel versions in which a syscall was added?
    • ~I wrote a simple Python script to populate the x86-ish syscall kernel versions, and I'm reasonably confident the numbers are right, but "reasonably confident" is insufficient when security is concerned. @hrw has written a tool to determine syscall kernel versions, and it could be used to populate our table (or perhaps verify my numbers)~
    • Patch ~005280d~ ~9b285ef~ 22854ce7fd39e780e9832c705ce4bc76898e53ae uses the syscalls-table tool to populate syscalls.csv. libseccomp's kernel versions (prior to this patch) ~align very, very closely to the output from the syscalls-table tool with the exception of x32~ now match to the best of my knowledge.
      • Here's a side-by-side diff of before and after this patchset (v4)
      • ~There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR~
      • ~As mentioned above, we need to figure out what's up with x32~
      • ~x32 syscall numbers now largely match our previous numbers~
      • x32 syscall numbers now match our previous numbers
  • Can we simplify the logic and shrink the filter? I don't think so
    • @pcmoore has wondered if we could simplify the logic to only return -ENOSYS for syscalls greater than the maximum supported number. (Again, this patchset explicitly creates a rule for every known syscall rather than a single if syscall_num > max_num rule.) Note that most (all?) architectures have several holes in their syscall table. It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?
    • Running this script as follows ./tools/scmp_populate_syscalls_csv.py -d ~/git/other/syscalls-table/data -v shows that syscalls have been added in the middle 112 times since kernel v3.0. arm, s390, x86_64, parisc, x32, and more have all historically done it. Unfortunately, I don't think we can safely rely on new syscalls being added to the end of the list :(
    • In comment https://github.com/seccomp/libseccomp/pull/457#issuecomment-2690349272, @cyphar shared that future syscalls should be added at the end, but unfortunately that doesn't solve older kernels. I'm leaning toward adding an explicit rule for each known syscall as this is guaranteed to work on older kernels and will work on newer kernels regardless of what the kernel community does or doesn't do. Thoughts?
  • As written, SCMP_FLTATR_CTL_KVERMAX must be set at the end of creating the libseccomp context. Any seccomp_arch_add() after setting the maximum kernel version will result in -EINVAL.
    • Aside - libseccomp doesn't allow overwriting of existing rules, and (regardless of this patchset) silently ignores the "new" rule and doesn't add it to the filter. Thus as currently implemented, we must populate the known rules logic at the very end of the filter construction.
    • Should we consider changing the existing behavior of silently ignoring new rules and instead overwrite the existing rules? That would simplify this patchset.
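
For readers who want to reproduce the "added in the middle" count discussed above, here is a minimal Python sketch of one plausible way to define it. The logic inside scmp_populate_syscalls_csv.py is not shown in this thread, so `count_mid_table_additions` and the sample tables are assumptions for illustration only.

```python
def count_mid_table_additions(tables):
    """tables: list of {syscall_name: number} dicts, oldest kernel first.

    Counts syscalls that first appear with a number below the highest
    number already present in the previous kernel version's table.
    """
    count = 0
    for prev, cur in zip(tables, tables[1:]):
        if not prev:
            continue
        prev_max = max(prev.values())
        for name, nr in cur.items():
            if name not in prev and nr < prev_max:
                count += 1
    return count


# Made-up tables for three consecutive kernel versions of one architecture
tables = [
    {"sys_a": 0, "sys_b": 1, "sys_d": 3},
    {"sys_a": 0, "sys_b": 1, "sys_d": 3, "sys_e": 4},              # appended
    {"sys_a": 0, "sys_b": 1, "sys_c": 2, "sys_d": 3, "sys_e": 4},  # fills a hole
]
print(count_mid_table_additions(tables))
```

Only `sys_c` counts here, because it lands in a hole below the previous maximum; `sys_e` was appended at the end.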

drakenclimber avatar Feb 28 '25 21:02 drakenclimber

What is the benefit of this over having an ENOSYS default action?

kees avatar Mar 12 '25 15:03 kees

What is the benefit of this over having an ENOSYS default action?

Good question. Some users have requested different behavior for an invalid syscall vs. an unsupported syscall.

But if an application is content without having such a distinction, then an ENOSYS default should work quite well for those users.

drakenclimber avatar Mar 17 '25 22:03 drakenclimber

For some more background on why ENOSYS actions are not enough, one struggle we have is that Docker has defined their seccomp policy as being EPERM by default and then they have intentionally omitted syscalls from their policy set instead of explicitly setting an EPERM errno. For those intentionally disallowed syscalls, we would prefer EPERM over ENOSYS, but for newer syscalls we would prefer ENOSYS.

We could fix the Docker policy (though they still haven't moved to ENOSYS at all, so I wouldn't hold out hope for that), but the past decade of user policies are probably based on copies of the original Docker policy and we can't update those.

In runc, we work around this by taking the largest syscall number specified in the filter and then prefixing a stub filter to the libseccomp one which will return ENOSYS if the syscall being attempted is larger than the largest syscall number (otherwise it will fallthrough to the libseccomp filter). This mostly works but is still quite hacky -- the approach I proposed in the past was to have a minimum kernel version that would set the boundary between the EPERM and ENOSYS behaviour (basically, every libseccomp filter that set a minimum kernel version would have an implicit ENOSYS fallback for ACT_ERRNO returns if the syscall number was too large). I'm not sure if a maximum kernel version knob will help with our issues though...
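
The workaround described above can be modelled roughly like this. It is a simplified Python sketch, not runc's code: `stub_then_filter` and the sample rule table are hypothetical, and real filters are BPF programs rather than dictionaries.

```python
import errno


def stub_then_filter(nr, filter_rules, default_errno=errno.EPERM):
    """Model a stub placed in front of the main filter.

    The stub returns ENOSYS for any syscall number above the largest number
    named in the filter; everything else falls through to the filter, whose
    default action here is an errno return.
    """
    largest = max(filter_rules)
    if nr > largest:
        return ("ERRNO", errno.ENOSYS)
    return filter_rules.get(nr, ("ERRNO", default_errno))


# Hypothetical policy: syscall numbers 0, 1 and 60 are explicitly allowed
example_rules = {0: ("ALLOW", 0), 1: ("ALLOW", 0), 60: ("ALLOW", 0)}

print(stub_then_filter(0, example_rules))    # explicitly allowed
print(stub_then_filter(59, example_rules))   # below the largest: default errno
print(stub_then_filter(400, example_rules))  # above the largest: ENOSYS
```

One case where this model could misfire is a syscall that a newer kernel places in a hole below the largest listed number: it would receive the default errno rather than ENOSYS.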

cyphar avatar Aug 21 '25 02:08 cyphar

In runc, we work around this by taking the largest syscall number specified in the filter and then prefixing a stub filter to the libseccomp one which will return ENOSYS if the syscall being attempted is larger than the largest syscall number (otherwise it will fallthrough to the libseccomp filter). This mostly works but is still quite hacky -- the approach I proposed in the past was to have a minimum kernel version that would set the boundary between the EPERM and ENOSYS behaviour (basically, every libseccomp filter that set a minimum kernel version would have an implicit ENOSYS fallback for ACT_ERRNO returns if the syscall number was too large). I'm not sure if a maximum kernel version knob will help with our issues though...

Thanks for the additional details; that really helps.

Based on what you've written, I think this should meet your needs. Ignoring the semantics of minimum vs maximum, this proposal would allow Docker to do the following:

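// error handling intentionally omitted
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
// one seccomp_rule_add() call per allowed syscall, for example:
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
// attributes and SCMP_KV_6_5 as proposed in this patchset
seccomp_attr_set(ctx, SCMP_FLTATR_ACT_ENOSYS, SCMP_ACT_ERRNO(ENOSYS));
seccomp_attr_set(ctx, SCMP_FLTATR_CTL_KVERMAX, SCMP_KV_6_5);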

In the above code, there are three possible results:

  • syscall is explicitly allowed
  • syscall is denied (and existed in kernel 6.5 or older), -EPERM is returned
  • syscall is denied (and is newer than 6.5), -ENOSYS is returned

drakenclimber avatar Aug 21 '25 19:08 drakenclimber

syscall is denied (and is newer than 6.5)

Ah okay, I missed that this wouldn't apply to syscalls you explicitly set. In that case, it sounds exactly like what we need then. Thanks so much for working on this, I'll try to set aside some time to look through it in more detail.

cyphar avatar Aug 22 '25 02:08 cyphar

v5 changes:

  • Cleaned up the scripts I used to generate and update the syscalls.csv table. (I wasn't sure if we wanted to keep them, so they weren't very polished.)
  • I added clear steps in the commit messages on how to create and update syscalls.csv using these new tools
  • Moved the syscall version enum to its own header file
  • Made the creation of the syscalls.csv header line smarter
  • Updated for kernel v6.16

drakenclimber avatar Aug 28 '25 13:08 drakenclimber

I'm going to go ahead and merge the first five patches into main as those shouldn't be too controversial and it would be good to start populating the syscall table with version numbers.

The first five patches can be seen at 2bc718995e782a8473ba9db8509a398ef69b2edc.

pcmoore avatar Oct 08 '25 21:10 pcmoore