Skip channel program tests on Apple Silicon
Motivation and Context
I have a small VM set up on my MacBook for basic testing. Unfortunately, the channel program tests consistently cause problems there. The issue appears to be that something is wrong with our Lua interpreter when running on Apple Silicon.
Description
This PR simply skips the channel program tests if it detects that it's running on Apple Silicon.
How Has This Been Tested?
Ran the test suite on my MacBook and on a normal Linux VM.
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
FWIW, I really do believe that if there's "something wrong with our lua interpreter when running on apple silicon", then we should forbid channel program support on Apple Silicon rather than forbid the tests, which seems counterproductive.
@adamdmoss that's a good point. @pcd1193182 what are your thoughts on disabling all channel programs on Apple silicon until we get a real fix?
You can't currently disable channel programs; snapshot destruction requires them (see dsl_destroy_snapshots_nvl()).
Regardless, this is a small, targeted test skip for a specific platform that keeps the rest of the test suite running on that platform and so keeps that platform working and useful for development. We do this all the time - we have Linux and FreeBSD skips throughout the test suite.
@pcd1193182 can you give some more detail on the problem you're seeing? Is it a test-case problem, or a real problem with lua on Apple silicon?
It's a real problem: if you run any Lua program that triggers an exception (like the one in the divide-by-zero test), the setjmp/longjmp mechanism doesn't appear to work, and you end up with a panic on Linux.
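For readers unfamiliar with the mechanism: Lua's C implementation reports runtime errors by calling longjmp back to a setjmp point recorded by a protected call. The following is a minimal user-space sketch of that pattern, not the interpreter's actual code; `protected_call`, `raise_error`, `checked_divide`, and `protected_divide` are hypothetical names chosen for illustration.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for an interpreter's per-call error state. */
struct protected_call {
	jmp_buf env;
	/* volatile: written between setjmp() and longjmp(), read afterwards */
	volatile int status;
};

/* Innermost active protected call, or NULL when none is active. */
static struct protected_call *current_call = NULL;

/*
 * Hypothetical error raiser: record the error code and unwind to the
 * innermost protected call. Must only be called inside a protected call.
 */
static void
raise_error(int code)
{
	current_call->status = code;
	longjmp(current_call->env, 1);
}

/* Divide a by b, raising error 1 on division by zero. */
static int
checked_divide(int a, int b)
{
	if (b == 0)
		raise_error(1);
	return (a / b);
}

/*
 * Run checked_divide() under protection. Returns 0 and stores the
 * quotient in *result on success; returns the error code and leaves
 * *result untouched on failure.
 */
static int
protected_divide(int a, int b, int *result)
{
	struct protected_call pc;
	struct protected_call *prev = current_call;

	pc.status = 0;
	current_call = &pc;
	if (setjmp(pc.env) == 0)
		*result = checked_divide(a, b);
	/* Reached on normal return and again after longjmp(). */
	current_call = prev;
	return (pc.status);
}
```

With a working setjmp/longjmp, `protected_divide(1, 0, &r)` returns the error code to the caller. If the saved context is not restored correctly, control never gets back to the protected call, which would be consistent with the panic described here.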
> You can't currently disable channel programs; snapshot destruction requires them (see dsl_destroy_snapshots_nvl()).

> It's a real problem: if you run any Lua program that triggers an exception (like the one in the divide-by-zero test), the setjmp/longjmp mechanism doesn't appear to work, and you end up with a panic on Linux.
So the end result is that snapshot destruction and Lua should be disabled on Linux on Apple Silicon for the time being?
I think that is probably the right call; not sure where to hook that into the channel program logic, though.
Per the discussion in the leadership call, I attempted to narrow the skip to only the tests that actually induce problems. However, there are two issues with this. The first is that every single synctask_core test appears to cause the problem. This is true even of tests that shouldn't have any issues; tst.snapshot_simple, for example, is basically just "snapshot this filesystem", but manual testing shows that even an operation that simple causes the system to panic.
The second issue is that these aren't the only problematic tests. There are tests in pyzfs that also exercise the channel programs. I think the best fix here is to try to disable channel programs in general on Apple Silicon, rather than just disabling the tests.
How would we delete snapshots without channel programs these days?
I was thinking we would disable the zfs program command, basically. The destroy_snaps ioctl invokes the program directly from the kernel. Interestingly, that ioctl works without issues, and the simple snapshot channel program test in pyzfs also works, but tst.snapshot_simple doesn't. I can't quite figure out why.