obs-studio icon indicating copy to clipboard operation
obs-studio copied to clipboard

UI: Use an alternate signal stack on Linux/macOS

Open fzwoch opened this issue 1 year ago • 2 comments

Description

Create an alternate signal stack and set up our signals to run on this stack instead of the default one.

Motivation and Context

This works around an issue that seems to manifests itself on macOS arm64 where the use of sigalstack() will return with an EPERM error after a certain point of OBS Studio UI initialization.

One example where this is an issue is the use of plugins written in Go. The Go runtime will try to create an alternate signal stack if none has been created yet and move existing signal handlers to the new stack. In case of an error the Go runtime will panic causing the OBS process to deadlock.

Currently, the Teleport plugin does not work because of that puzzling behavior. Seems like other applications seem to show the similar behavior regarding sigaltstack() - for which no one seems to have a good explanation yet.

As far as I know having an alternate signal stack should be save and not interfere with anything else(?)

How Has This Been Tested?

Build OBS with this applied and loaded, tested with Teleport plugin. Directly on M1 Macboon Air. Teleport via loopback.

Types of changes

Tweak (non-breaking change to improve existing functionality)

Checklist:

  • [x] My code has been run through clang-format.
  • [x] I have read the contributing document.
  • [x] My code is not on the master branch.
  • [x] The code has been tested.
  • [x] All commit messages are properly formatted and commits squashed where appropriate.
  • [x] I have included updates to all appropriate documentation.

fzwoch avatar Dec 28 '22 21:12 fzwoch

In case of an error the Go runtime will panic causing the OBS process to deadlock.

If your plugin is trying to crash, it seems like we should let it crash/hang. This is an attempt to keep going despite it entering an invalid state which seems more dangerous than just crashing.

kkartaltepe avatar Dec 28 '22 21:12 kkartaltepe

Having the stack should be a legal state. The Go runtime will work with the stack instead of trying to create one with sigalstack() itself.

Both cases are perfectly legal. The actual issue is that the sigaltstack() syscall on macOS arm64 fails for some reason. I could not find anyone yet who could explain why, and why also on this specific platform. (it begins to happen at some specific code locations, that does not give a specific reason why it should do so)

I'm not trying to keep going, I'm just selecting a code path that works around a potential OS bug(?). If there is an explanation why sigaltstack() suddenly fails and there are means to mitigate that I'm open to that too.

https://github.com/obsproject/obs-studio/discussions/7795#discussioncomment-4499099 is my latest gibberish for investigating the issue.

fzwoch avatar Dec 28 '22 22:12 fzwoch

As an aside, sigalstack is written in the commit message, but is seemingly a typo.

RytoEX avatar Jan 15 '23 00:01 RytoEX

Thanks, fixed the typo.

Some additional remarks:

  • The original reason of failure is still unclear to me. It works on other platforms and I have no grasp why it is different here. I suspected a memory corruption somewhere, but with Apple's address sanitizer I do not get any reports - on the other hand the issue also does not occur when running with the sanitizer enabled.
  • Go expects the signal handler to be installed with the SA_ONSTACK flag on all platforms, so it can move them to an alternate stack (which Go creates internally by itself, and moves the signals to that one). It is no a real issue until SIGINT is triggered (where Go will probably terminate with a complaint that a signal is triggered that is not on the current signal stack).
  • I'm not sure whether signal(SIGPIPE, SIG_IGN); needed to be moved to a call to sigaction(). I guess it does not need the SA_ONSTACK flag as it will be ignored and never hits the stack? But I'm not sure on that one. I read elsewhere that is in general recommend to use sigaction() instead of signal() so I went for it.

I can imagine it is discuss-able whether to merge this is or not. For proper Go support in plugins it would be nice to have the SA_ONSTACK flag for the installed signal handler though. But also it would be nice to understand and fix the original problem which I, frankly, do not.

fzwoch avatar Jan 16 '23 09:01 fzwoch

@fzwoch Have you tried to reproduce the crash using an OBS built without the browser source enabled? Given that CEF more or less "takes over" the program once loaded and Chromium is interfering with signal stacks for crash handling iirc.

PatTheMav avatar Jan 18 '23 19:01 PatTheMav

I tried removing as many plugins as possible.

On the other hand, in the above link to the discussion you can see that issue can be traced to a particular line in the code (may be different for others, not sure):

In ui_OBSBasic.h:

statusbar = new OBSBasicStatusBar(OBSBasic);

This line is run before any plugin is loaded. So I don't think there can be much interference from plugins.

I also send something in Apple's direction, but my experience from the last decade is that this bug tracker is a void-hole. I still have open tickets from 2015 there. They eventually fixed the issues I reported, but if it was due to the ticket I have no idea.. but lets see if have things have changed in the meantime.

fzwoch avatar Jan 18 '23 20:01 fzwoch

I tried removing as many plugins as possible.

On the other hand, in the above link to the discussion you can see that issue can be traced to a particular line in the code (may be different for others, not sure):

In ui_OBSBasic.h:

statusbar = new OBSBasicStatusBar(OBSBasic);

This line is run before any plugin is loaded. So I don't think there can be much interference from plugins.

I also send something in Apple's direction, but my experience from the last decade is that this bug tracker is a void-hole. I still have open tickets from 2015 there. They eventually fixed the issues I reported, but if it was due to the ticket I have no idea.. but lets see if have things have changed in the meantime.

FWIW the code you use seems to rely on the old (and deprecated) sigaltstack struct - the current stack_t struct does not take a char pointer, but instead a void pointer.

Using this code (instead of yours) works fine:

    stack_t ss;
    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = malloc(SIGSTKSZ);
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) < 0) {
        perror("sigaltstack");
        abort();
    }

Dunno if this is the actual fix, but I think the code does the same in essence, and I haven't gotten OBS to crash yet. So maybe that's what Go needs to do instead?

PatTheMav avatar Jan 18 '23 23:01 PatTheMav

Small mistake here. You need to check for < 0.

It does not seem to make a differnce for me whether I malloc some data or point to char stack data.

fzwoch avatar Jan 19 '23 00:01 fzwoch

Small mistake here. You need to check for < 0.

It does not seem to make a differnce for me whether I malloc some data or point to char stack data.

That's right, but even with the comparison fixed I cannot get it to abort on my arm64 Macs (I moved it after the calls you identified and even further back in the program execution). The moment I used the implementation you provided (using a char array instead) in place of mine the abort-branch is taken.

PatTheMav avatar Jan 19 '23 01:01 PatTheMav

That let's me believe there is some kind of memory corruption somewhere. I guess having a data array on the stack changes the stack memory layout compared to allocating it on the heap.

For me, there is no difference in aborting if I use the heap instead of the stack. But it does work with the address sanitizer, which probably changes memory layout by a lot.

So the change of the signal stack may not the actual solution to the issue, but changes the memory layout in a way that prevents the failure.

Can you get away with using char signalStack[SIGSTKSZ]; when you move the code block as a first action ot the beginning of main()?

I have a Macbook Air M1 2020 with Monterey 12.6.1. It is the only machine I have and I need it to stay on Monterey at the moment.

I got reports from someone using a Mac Mini M1 and someone using Ventura Beta 2 with the same issue.

fzwoch avatar Jan 19 '23 08:01 fzwoch

I'm calling this a Heisenbug now, because the closer you try to investigate it, the less likely it seems to occur - your variant obviously worked at the beginning of the main() function, when I moved it further back in OBSInit() however, it also never ran afoul of the check.

PatTheMav avatar Jan 19 '23 16:01 PatTheMav

Thanks for checking it out @PatTheMav. So there is something behaving weirdly. A bit of sanity has returned to myself..

fzwoch avatar Jan 20 '23 08:01 fzwoch

Another question @PatTheMav. Did you change something on your machines while testing when you got different behaviors?

I noticed something particular on my side. I can run the code successfully when I attach a monitor via USB-C and close the laptop's lid. When I run the code with the laptop lid open it will crash. Lid open and monitor attached also results in a crash.

fzwoch avatar Jan 29 '23 11:01 fzwoch

Another question @PatTheMav. Did you change something on your machines while testing when you got different behaviors?

I noticed something particular on my side. I can run the code successfully when I attach a monitor via USB-C and close the laptop's lid. When I run the code with the laptop lid open it will crash. Lid open and monitor attached also results in a crash.

Can't really tell. Given that OBS is a) multithreaded and b) has Qt, CEF, macOS' native application framework, and depending on your setup Python's runtime all thrown into the mix, I'm not surprised that futzing around with the signal stack can lead to errors.

PatTheMav avatar Jan 29 '23 16:01 PatTheMav

I will actually close this. Removing the stack also does not work as advertised on macOS. The man page contradicts the code in their kernel. I just give up on this platform.

fzwoch avatar May 03 '23 15:05 fzwoch

I will actually close this. Removing the stack also does not work as advertised on macOS. The man page contradicts the code in their kernel. I just give up on this platform.

If the docs are incorrect, please log a bug to Apple and explain what you feel is broken so it can be properly investigated.

RytoEX avatar May 03 '23 18:05 RytoEX