Skip channel program tests on Apple Silicon

Open pcd1193182 opened this issue 10 months ago • 7 comments

Motivation and Context

I have a small VM set up on my MacBook for basic testing. Unfortunately, the channel program tests consistently cause problems there. The issue appears to be that something is wrong with our Lua interpreter when running on Apple Silicon.

Description

This PR simply skips the channel program tests if it detects that it's running on Apple Silicon.
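For readers wondering what such a detection might look like, here is a minimal sketch in C. It is not the check this PR uses. It assumes that a Linux guest on Apple Silicon reports the Apple implementer code (`0x61`) on the `CPU implementer` line of `/proc/cpuinfo`, which depends on the hypervisor passing the host value through unmodified. The helper names `line_is_apple_implementer` and `is_apple_silicon` are hypothetical and do not exist in OpenZFS.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * ARM implementer code reported by Apple-designed cores. Assumption: the
 * guest kernel sees this value unmodified in /proc/cpuinfo.
 */
#define	APPLE_IMPLEMENTER	"0x61"

/* Returns true if a single cpuinfo line names Apple as the implementer. */
static bool
line_is_apple_implementer(const char *line)
{
	/* "CPU implementer" is 15 characters long. */
	if (strncmp(line, "CPU implementer", 15) != 0)
		return (false);
	return (strstr(line, APPLE_IMPLEMENTER) != NULL);
}

/*
 * Scans a cpuinfo-style file line by line. Hypothetical helper; a missing
 * or unreadable file is treated as "not Apple Silicon".
 */
static bool
is_apple_silicon(const char *cpuinfo_path)
{
	char line[256];
	bool found = false;
	FILE *fp = fopen(cpuinfo_path, "r");

	if (fp == NULL)
		return (false);
	while (fgets(line, sizeof (line), fp) != NULL) {
		if (line_is_apple_implementer(line)) {
			found = true;
			break;
		}
	}
	(void) fclose(fp);
	return (found);
}
```

A caller would pass `"/proc/cpuinfo"` to `is_apple_silicon()` and skip the channel program tests when it returns true. Treating an unreadable file as a non-match keeps the tests enabled by default on every other platform.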

How Has This Been Tested?

Ran the test suite on my MacBook and on a normal Linux VM.

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

pcd1193182 avatar Feb 13 '25 21:02 pcd1193182

FWIW, I'd much rather that, if there's "something wrong with our lua interpreter when running on apple silicon", we disable channel program support on Apple Silicon instead of skipping the tests. Skipping only the tests seems counterproductive.

adamdmoss avatar Feb 18 '25 19:02 adamdmoss

@adamdmoss that's a good point. @pcd1193182 what are your thoughts on disabling all channel programs on Apple silicon until we get a real fix?

tonyhutter avatar Feb 18 '25 21:02 tonyhutter

You can't currently disable channel programs; snapshot destruction requires them (see dsl_destroy_snapshots_nvl()).

Regardless, this is a small, targeted test skip for a specific platform that keeps the rest of the test suite running on that platform and so keeps that platform working and useful for development. We do this all the time - we have Linux and FreeBSD skips throughout the test suite.

robn avatar Feb 18 '25 21:02 robn

@pcd1193182 can you give some more detail on the problem you're seeing? Is it a test-case problem, or a real problem with lua on Apple silicon?

tonyhutter avatar Feb 18 '25 21:02 tonyhutter

It's a real problem; if you try to run any Lua program that triggers an exception (like the one in the divide by zero test), the setjmp/longjmp mechanism doesn't appear to work, and you end up with a panic on Linux.
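For context, this is roughly how a Lua-style interpreter turns a runtime error into a recoverable failure. The sketch below is a userspace illustration that uses libc's setjmp/longjmp; it is not the OpenZFS kernel code, and every name in it (`struct handler`, `raise_error`, `checked_divide`, `protected_divide`) is hypothetical. A protected call records a recovery point with setjmp, and an error deep in the call stack jumps back to it with longjmp. If that jump misbehaves, there is nothing left to catch the error.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* One recovery point per protected call; handlers nest like a stack. */
struct handler {
	struct handler *previous;	/* enclosing protected call, if any */
	jmp_buf env;			/* saved context to jump back to */
	volatile int status;		/* 0 = no error, else error code */
};

static struct handler *current = NULL;

/* Hypothetical stand-in for an interpreter's "throw". */
static void
raise_error(int code)
{
	if (current == NULL)
		abort();		/* no protected call is active */
	current->status = code;
	longjmp(current->env, 1);	/* unwind to the matching setjmp */
}

/* Raises an error instead of dividing by zero. */
static int
checked_divide(int a, int b)
{
	if (b == 0)
		raise_error(1);		/* does not return */
	return (a / b);
}

/*
 * Runs checked_divide() under a recovery point. Returns 0 and stores the
 * quotient on success; returns the error code and leaves *result untouched
 * if an error was raised.
 */
static int
protected_divide(int a, int b, int *result)
{
	struct handler h;

	h.previous = current;
	h.status = 0;
	current = &h;
	if (setjmp(h.env) == 0)
		*result = checked_divide(a, b);
	current = h.previous;		/* pop this handler on either path */
	return (h.status);
}
```

Calling `protected_divide(1, 0, &r)` returns the error code through the longjmp path, while `protected_divide(10, 2, &r)` returns 0 through the normal path. The status field is volatile because it is written between the setjmp and the longjmp and read afterwards.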

pcd1193182 avatar Feb 20 '25 20:02 pcd1193182

> You can't currently disable channel programs; snapshot destruction requires it (see dsl_destroy_snapshots_nvl()).

> It's a real problem; if you try to run any lua program that triggers an exception (like the one in the divide by zero test), the setjmp/longjmp mechanism doesn't appear to work, and you end up with a panic on linux.

So the end result is that snapshot destruction and Lua should be disabled on Linux on Apple Silicon for the time being?

tonyhutter avatar Feb 25 '25 02:02 tonyhutter

I think that is probably the right call; not sure where to hook that into the channel program logic, though.

pcd1193182 avatar Mar 03 '25 17:03 pcd1193182

Per the discussion in the leadership call, I attempted to isolate this to only the tests that actually induce problems. However, there are two issues with this. The first is that every single synctask_core test appears to cause the problem. This is true even of tests that shouldn't have any issues; tst.snapshot_simple, for example, is basically just "snapshot this filesystem", but manual testing shows that even an operation that simple causes the system to panic.

The second issue is that these aren't the only problematic tests. There are tests in pyzfs that also exercise channel programs. I think the best fix here is to try to disable channel programs in general on Apple Silicon, rather than just disabling the tests.

pcd1193182 avatar Jul 01 '25 19:07 pcd1193182

How would we delete snapshots without channel programs these days?

amotin avatar Jul 01 '25 19:07 amotin

I was thinking we would disable the zfs channel command, basically. The destroy_snaps ioctl invokes the program directly from the kernel. Although it is interesting that that ioctl works without issues; the simple snapshot channel program test in pyzfs also works, but tst.snapshot_simple doesn't. I can't quite figure out why.

pcd1193182 avatar Jul 01 '25 19:07 pcd1193182