krossbow icon indicating copy to clipboard operation
krossbow copied to clipboard

Fix memory crash on iOS with old memory model

Open remy-bardou-lifeonair opened this issue 2 years ago • 13 comments

This fixes a crash in DarwinWebSocketClient when using the old memory model.

The project I'm working on is still using the old memory model and I kept crashing upon initiating the web socket connection.

The crash is related to memory corruption when accessing the NSURLSessionWebSocketTask in any of the callbacks from NSURLSessionDelegate, which makes me think maybe we should be freezing it, or always access it from the same thread?

I couldn’t figure out how to fix the root cause other than enforcing access to it from the same thread, which this achieves by always using withContext(Dispatchers.Main).

The amount of processing done seems pretty light so it shouldn’t be too problematic to have it happen on the main thread?

FWIW, the crash doesn't happen with the new memory model.

We've been using this for several weeks on my project with no issues. Just thought it would be nice to contribute this fix back, even if it's not the most elegant way to fix it.

Stacktrace:

Uncaught Kotlin exception: kotlin.native.IncorrectDereferenceException: illegal attempt to access non-shared <object>@b00988 from other thread
    at 0   IosApp                        0x106cdfbff        kfun:kotlin.Throwable#<init>(kotlin.String?){} + 95 
    at 1   IosApp                        0x106cd9213        kfun:kotlin.Exception#<init>(kotlin.String?){} + 91 
    at 2   IosApp                        0x106cd93c7        kfun:kotlin.RuntimeException#<init>(kotlin.String?){} + 91 
    at 3   IosApp                        0x106cec04b        kfun:kotlin.native.IncorrectDereferenceException#<init>(kotlin.String){} + 91 
    at 4   IosApp                        0x106cef283        ThrowIllegalObjectSharingException + 423 
    at 5   IosApp                        0x106e46bef        _ZN12_GLOBAL__N_128throwIllegalSharingExceptionEP9ObjHeader + 27 
    at 6   IosApp                        0x106e481c7        _ZN12_GLOBAL__N_136terminateWithIllegalSharingExceptionEP9ObjHeader + 11 
    at 7   IosApp                        0x106e51c73        _ZNK16KRefSharedHolder3refIL11ErrorPolicy3EEEP9ObjHeaderv + 111 
    at 8   IosApp                        0x106cbe0ff        _ZL39Kotlin_Interop_unwrapKotlinObjectHolderP11objc_object + 47 
    at 9   IosApp                        0x1071f9593        _6f72672e68696c64616e2e6b726f7373626f773a6b726f7373626f772d776562736f636b65742d636f7265_knbridge18 + 163 
    at 10  libdispatch.dylib                   0x104668803        _dispatch_call_block_and_release + 31 
    at 11  libdispatch.dylib                   0x10466a3a7        _dispatch_client_callout + 19 
    at 12  libdispatch.dylib                   0x104673777        _dispatch_lane_serial_drain + 979 
    at 13  libdispatch.dylib                   0x104674813        _dispatch_lane_invoke + 491 
    at 14  libdispatch.dylib                   0x1046840c7        _dispatch_workloop_worker_thread + 1231 
    at 15  libsystem_pthread.dylib             0x10416bf83        _pthread_wqthread + 287 
    at 16  libsystem_pthread.dylib             0x104173a9b        start_wqthread + 7 

remy-bardou-lifeonair avatar May 19 '22 13:05 remy-bardou-lifeonair

Again thanks a lot, your contributions are super valuable. I remember having to jump through some hoops when Krossbow was still using the old memory model. I didn't know it was even possible to still use Krossbow on the old model.

In any case, it's great that there is a function to distinguish the cases at runtime! I'll take a closer look at this. I wonder how I'll be able to write tests for this, though 😆

Thanks again!

joffrey-bion avatar May 19 '22 13:05 joffrey-bion

The amount of processing done seems pretty light so it shouldn’t be too problematic to have it happen on the main thread?

The problem is not necessarily the amount of processing done inside those methods, but rather the context switch itself. Going back and forth between a background thread and the single-thread may degrade performance visibly in environments with high websocket traffic, because every single message will incur 2 context switches.

This is why usually the fact that things run on the main thread should be expressed at a higher level to limit context switches. I just checked the old implementation of Krossbow and with the old memory model the production code looked the same, it was on test side that we ensured everything was called from the main thread. I guess in real-life applications it might be worth calling from the main thread directly? After all this whole PR just adds withContext(Main) to things that the user would call, not callbacks from the websocket.

joffrey-bion avatar May 19 '22 14:05 joffrey-bion

Again thanks a lot, your contributions are super valuable. I remember having to jump through some hoops when Krossbow was still using the old memory model. I didn't know it was even possible to still use Krossbow on the old model.

In any case, it's great that there is a function to distinguish the cases at runtime! I'll take a closer look at this. I wonder how I'll be able to write tests for this, though 😆

Thanks again!

Yeah, I was wondering the same regarding tests. I feel like it would need a new test module where the new memory model is disabled?

remy-bardou-lifeonair avatar May 19 '22 14:05 remy-bardou-lifeonair

(Just FYI, I'll come back to this PR a bit later)

remy-bardou-lifeonair avatar May 23 '22 08:05 remy-bardou-lifeonair

Ok, so you're right calling connect() from within withContext(Dispatchers.Main) does work but if you don't specify the context, or even use Dispatchers.Default, it will crash!

The reason is that when we create the URLSession.sessionWithConfiguration and capture the NSOperationQueue.currentQueue(), this returns nil, because it couldn't determine which iOS queue this was running on.

https://developer.apple.com/documentation/foundation/nsoperationqueue/1413097-currentqueue

Which means the URLSession is setup to use its own background queue, which is different from the one that was used to call connect() and instantiate the webSocket object (NSURLSessionWebSocketTask).

I don't think the current code can work without crashing with the old memory model, unless connect is called from the main thread. I also don't think it's very intuitive from a developer POV to use withContext(Dispatchers.Main) before calling an API called connect() because that sounds like a bad idea. Unless a dev goes to the implementation and realises that almost nothing is done on the main thread and it's rather used as the callback thread.

Because of this, I updated the PR to be much simpler and basically fallback to using main thread if the currentQueue() can't be determined. I know it's still not ideal, but what do you think?

remy-bardou-lifeonair avatar May 24 '22 10:05 remy-bardou-lifeonair

Thanks for coming back to this, and for the detailed explanations.

My initial understanding of the problem was that the send operations needed to be performed on the same thread as the callback queue. What you seem to be suggesting is that the problem is just that the callback queue for sessionWithConfiguration needs to be the same as the "current one" at the moment of the webSocketTaskWithURL call, but the send operations don't have to be on that same queue, is this correct? This is quite different. I remember trying to use mainQueue() in the past and it didn't solve the issue, and that would explain why.

As I said, it's hard for me to test iOS stuff because I have no mac, so I have to go through a CI feedback loop (combined with the absence of docker on GitHub macOS runners and the 10min required to install it, it kills productivity quite a lot). So could you please test the following for me? 🙏

  1. With your current PR code, on the old MM, does it work when calling both connect and send operations on Dispatchers.Default? AFAIU, connect() will implicitly switch to Main in that case (because currentQueue() will give null). If my initial assumption was correct, calling session.sendText(...) on another thread should fail. But in light of what you said it might just work.

  2. Could you please clarify in which cases you observed currentQueue() returning null? Are there other cases than Dispatchers.Default? Any custom dispatcher with a custom thread pool maybe?

  3. Could you please attach somewhere here the actual error/stacktrace you get when it doesn't work (for instance with the current released code used in old MM). That would be helpful for posterity. For instance, you can update the description of the PR or add it in a reply.

Sorry if those are big asks but that would help me quite a lot with how to judge the proper way forward 👼

I also don't think it's very intuitive from a developer POV to use withContext(Dispatchers.Main) before calling an API called connect() because that sounds like a bad idea

I wholeheartedly agree with this. It would be best if users wouldn't have to think about any of this, especially multiplatform users. That's why I'm trying to figure out what exactly the problem is, in which cases it can occur, and what the users should be aware of or should not care about. We could also add artificial crashes with nice error messages to help them fix problems if they have things to ensure on their side.

joffrey-bion avatar May 24 '22 12:05 joffrey-bion

I feel like it would need a new test module where the new memory model is disabled?

I had tried to avoid doing that, but you're bringing my old ghosts to haunt me now. I was like "let's forget about the old MM, it will soon come to pass, and users will migrate". And you come here and hit me with a taste of reality 😂

joffrey-bion avatar May 24 '22 13:05 joffrey-bion

I feel like it would need a new test module where the new memory model is disabled?

I had tried to avoid doing that, but you're bringing my old ghosts to haunt me now. I was like "let's forget about the old MM, it will soon come to pass, and users will migrate". And you come here and hit me with a taste of reality 😂

Haha, well I don't know how representative I am of the community though. And tbh, we just updated to adopt the new memory model and the migration went fine, and now it doesn't crash.

The only reason why I'm trying to fix this is because I assume there might be other projects not using the new memory model? And it can't really be enforced by the library (unless maybe we want to crash explicitly when we detect it's using the old MM, but that's a bit harsh).

Anyway, I'll come back to your regarding your other questions!

remy-bardou-lifeonair avatar May 24 '22 13:05 remy-bardou-lifeonair

  1. No, doesn't work. It then crashes on receiveMessageWithCompletionHandler with
Uncaught Kotlin exception: kotlin.native.IncorrectDereferenceException: illegal attempt to access non-shared <object>@15cc5c8 from other thread
    at 0   IosApp                        0x109189bb7        kfun:kotlin.Throwable#<init>(kotlin.String?){} + 95 
    at 1   IosApp                        0x1091831cb        kfun:kotlin.Exception#<init>(kotlin.String?){} + 91 
    at 2   IosApp                        0x10918337f        kfun:kotlin.RuntimeException#<init>(kotlin.String?){} + 91 
    at 3   IosApp                        0x109196003        kfun:kotlin.native.IncorrectDereferenceException#<init>(kotlin.String){} + 91 
    at 4   IosApp                        0x10919923b        ThrowIllegalObjectSharingException + 423 
    at 5   IosApp                        0x1092f0ba7        _ZN12_GLOBAL__N_128throwIllegalSharingExceptionEP9ObjHeader + 27 
    at 6   IosApp                        0x1092f217f        _ZN12_GLOBAL__N_136terminateWithIllegalSharingExceptionEP9ObjHeader + 11 
    at 7   IosApp                        0x1092fbc2b        _ZNK16KRefSharedHolder3refIL11ErrorPolicy3EEEP9ObjHeaderv + 111 
    at 8   IosApp                        0x1091680b7        _ZL39Kotlin_Interop_unwrapKotlinObjectHolderP11objc_object + 47 
    at 9   IosApp                        0x108c2063f        _737061726b706c75672e6d702d65706963636f6e6e6563743a6d702d65706963636f6e6e656374_knbridge90 + 163 
    at 10  libdispatch.dylib                   0x106b00803        _dispatch_call_block_and_release + 31 
    at 11  libdispatch.dylib                   0x106b023a7        _dispatch_client_callout + 19 
    at 12  libdispatch.dylib                   0x106b0b777        _dispatch_lane_serial_drain + 979 
    at 13  libdispatch.dylib                   0x106b0c813        _dispatch_lane_invoke + 491 
    at 14  libdispatch.dylib                   0x106b1c0c7        _dispatch_workloop_worker_thread + 1231 
    at 15  libsystem_pthread.dylib             0x10664bf83        _pthread_wqthread + 287 
    at 16  libsystem_pthread.dylib             0x106653a9b        start_wqthread + 7 
  1. I've observed in ALL my attempts except when running on Dispatchers.Main that currentQueue() returns null. This includes, Dispatchers.Default, newSingleTreadedContext() and using the default context from a flow

  2. Done, I've updated the PR description with the stack trace

remy-bardou-lifeonair avatar May 24 '22 14:05 remy-bardou-lifeonair

Thank you so much for the additional details.

  1. This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on Main. Callers need to be aware of this and also perform send operations on Main. As I mentioned earlier, context-switching to the Main thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early in sendText/sendBytes with a nice error message might be more appropriate. WDYT?

  2. and using the default context from a flow - what do you mean? Flow collection happens in the current coroutine context, and previous operators too unless flowOn is used. I don't quite get what you mean by the "default context". Could you please elaborate?

  3. Thanks a lot!

joffrey-bion avatar May 24 '22 14:05 joffrey-bion

This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on Main. Callers need to be aware of this and also perform send operations on Main. As I mentioned earlier, context-switching to the Main thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early in sendText/sendBytes with a nice error message might be more appropriate. WDYT?

Actually, NSOperationQueue.currentQueue() does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call the sendMessage APIs on the main thread, which is why it works!

That said, I think there are a few things we could do to make users "opt" into that "not as nice" option:

  • add a log when we catch this?
  • crash early with error message when we can't determine the currentQueue + add an explicit parameter either on connect or in the constructor to allow using Main thread for the web socket operations
  1. and using the default context from a flow - what do you mean?

yeah actually that's what I meant, using the default coroutine context used by a flow, which in the case my app is a custom coroutine context (not default)

remy-bardou-lifeonair avatar May 24 '22 15:05 remy-bardou-lifeonair

This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on Main. Callers need to be aware of this and also perform send operations on Main. As I mentioned earlier, context-switching to the Main thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early in sendText/sendBytes with a nice error message might be more appropriate. WDYT?

Actually, NSOperationQueue.currentQueue() does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call the sendMessage APIs on the main thread, which is why it works!

That said, I think there are a few things we could do to make users "opt" into that "not as nice" option:

  • add a log when we catch this?
  • crash early with error message when we can't determine the currentQueue + add an explicit parameter either on connect or in the constructor to allow using Main thread for the web socket operations
  1. and using the default context from a flow - what do you mean?

yeah actually that's what I meant, using the default coroutine context used by a flow, which in the case my app is a custom coroutine context (not default)

Something like this:

/**
 * An implementation of [WebSocketClient] using darwin's native [NSURLSessionWebSocketTask].
 * This is only available is iOS 13.0+, tvOS 13.0+, watchOS 6.0+, macOS 10.15+
 * (see [documentation](https://developer.apple.com/documentation/foundation/urlsessionwebsockettask))
 *
 * A custom [sessionConfig] can be passed to customize the behaviour of the connection.
 * Also, if a non-null [maximumMessageSize] if provided, it will be used to configure the web socket.
 * Finally, if you're using the old memory model and are okay using the main thread for websocket callbacks,
 * set [enableMainThreadCallbacksAsFallback] as true. It will avoid a mutability exception.
 */
class DarwinWebSocketClient(
    private val sessionConfig: NSURLSessionConfiguration = NSURLSessionConfiguration.defaultSessionConfiguration(),
    private val maximumMessageSize: Long? = null,
    private val enableMainThreadCallbacksAsFallback: Boolean = false,
) : WebSocketClient {

    @OptIn(ExperimentalStdlibApi::class)
    override suspend fun connect(url: String): WebSocketConnectionWithPing {
        val socketEndpoint = NSURL.URLWithString(url)!!

        return if (isExperimentalMM()) {
            connect(url, socketEndpoint)
        } else {
            if (NSOperationQueue.currentQueue() != null) {
                connect(url, socketEndpoint)
            } else if (!enableMainThreadCallbacksAsFallback) {
                throw Exception("Unable to determine the current NSOperationQueue. Aborting connection to avoid a crash. Please consider enabling enableMainThreadCallbacksAsFallback.")
            } else {
                /**
                 * With the old memory model, if [NSURLSessionWebSocketTask] is unable
                 * to reuse the current queue for its callbacks, things will crash at runtime.
                 */
                withContext(Dispatchers.Main) {
                    connect(url, socketEndpoint)
                }
            }
        }
    }

remy-bardou-lifeonair avatar May 24 '22 15:05 remy-bardou-lifeonair

Actually, NSOperationQueue.currentQueue() does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call the sendMessage APIs on the main thread, which is why it works!

Wait, I'm confused now.

You say "and therefore also call the sendMessage APIs on the main thread" - calling sendMessage doesn't happen during the connect call nor the callbacks, it depends on whatever context the user chooses to use when calling session.sendText/sendBytes. That's the whole point of my response to 1) (this was the point of this test as well, maybe we misunderstood each other here?).

You say "which is why it works" - but you just confirmed it does NOT work. Calling connect() with old MM will always use Dispatchers.Main with your current code (because on other dispatchers currentQueue() will be null, so we use withContext(Main)). But calling session.sendText() can be done on anything by the user, and fails if not done on Main as you have just checked (or is it not what you have checked earlier? Because that's what I wanted to check 😆).

My point is about telling the user to also use Main when calling session.sendText() - either by failing early with a nice message or by using withContext(Main) at this low level, thus incurring a potential performance penalty.

For the connect call itself, I don't think the fallback to Main is a problem, because as you said only Main provides a currentQueue() and is viable anyway. And connect is not called as many times as sendText/sendBytes.

joffrey-bion avatar May 24 '22 17:05 joffrey-bion

I'm closing this PR since the new memory model is now the default and unlikely to be used in the future. I guess we'll just support the new memory model from now on.

joffrey-bion avatar Nov 20 '22 23:11 joffrey-bion

Oops my bad, I had forgot about this PR! Good call though, probably not worth pursuing this now :)

remy-bardou-lifeonair avatar Nov 21 '22 11:11 remy-bardou-lifeonair

No problem at all, and thank you so much for making the initial PR by the way, it was a very good idea at the time

joffrey-bion avatar Nov 21 '22 11:11 joffrey-bion