krossbow
krossbow copied to clipboard
Fix memory crash on iOS with old memory model
This fixes a crash in DarwinWebSocketClient
when using the old memory model.
The project I'm working on is still using the old memory model and I kept crashing upon initiating the web socket connection.
The crash is related to memory corruption when accessing the NSURLSessionWebSocketTask
in any of the callbacks from NSURLSessionDelegate
, which makes me think maybe we should be freezing it, or always access it from the same thread?
I couldn’t figure out how to fix the root cause other than enforcing access to it from the same thread, which this achieves by always using withContext(Dispatchers.Main)
.
The amount of processing done seems pretty light so it shouldn’t be too problematic to have it happen on the main thread?
FWIW, the crash doesn't happen with the new memory model.
We've been using this for several weeks on my project with no issues. Just thought it would be nice to contribute this fix back, even if it's not the most elegant way to fix it.
Stacktrace:
Uncaught Kotlin exception: kotlin.native.IncorrectDereferenceException: illegal attempt to access non-shared <object>@b00988 from other thread
at 0 IosApp 0x106cdfbff kfun:kotlin.Throwable#<init>(kotlin.String?){} + 95
at 1 IosApp 0x106cd9213 kfun:kotlin.Exception#<init>(kotlin.String?){} + 91
at 2 IosApp 0x106cd93c7 kfun:kotlin.RuntimeException#<init>(kotlin.String?){} + 91
at 3 IosApp 0x106cec04b kfun:kotlin.native.IncorrectDereferenceException#<init>(kotlin.String){} + 91
at 4 IosApp 0x106cef283 ThrowIllegalObjectSharingException + 423
at 5 IosApp 0x106e46bef _ZN12_GLOBAL__N_128throwIllegalSharingExceptionEP9ObjHeader + 27
at 6 IosApp 0x106e481c7 _ZN12_GLOBAL__N_136terminateWithIllegalSharingExceptionEP9ObjHeader + 11
at 7 IosApp 0x106e51c73 _ZNK16KRefSharedHolder3refIL11ErrorPolicy3EEEP9ObjHeaderv + 111
at 8 IosApp 0x106cbe0ff _ZL39Kotlin_Interop_unwrapKotlinObjectHolderP11objc_object + 47
at 9 IosApp 0x1071f9593 _6f72672e68696c64616e2e6b726f7373626f773a6b726f7373626f772d776562736f636b65742d636f7265_knbridge18 + 163
at 10 libdispatch.dylib 0x104668803 _dispatch_call_block_and_release + 31
at 11 libdispatch.dylib 0x10466a3a7 _dispatch_client_callout + 19
at 12 libdispatch.dylib 0x104673777 _dispatch_lane_serial_drain + 979
at 13 libdispatch.dylib 0x104674813 _dispatch_lane_invoke + 491
at 14 libdispatch.dylib 0x1046840c7 _dispatch_workloop_worker_thread + 1231
at 15 libsystem_pthread.dylib 0x10416bf83 _pthread_wqthread + 287
at 16 libsystem_pthread.dylib 0x104173a9b start_wqthread + 7
Again thanks a lot, your contributions are super valuable. I remember having to jump through some hoops when Krossbow was still using the old memory model. I didn't know it was even possible to still use Krossbow on the old model.
In any case, it's great that there is a function to distinguish the cases at runtime! I'll take a closer look at this. I wonder how I'll be able to write tests for this, though 😆
Thanks again!
The amount of processing done seems pretty light so it shouldn’t be too problematic to have it happen on the main thread?
The problem is not necessarily the amount of processing done inside those methods, but rather the context switch itself. Going back and forth between a background thread and the single-thread may degrade performance visibly in environments with high websocket traffic, because every single message will incur 2 context switches.
This is why usually the fact that things run on the main thread should be expressed at a higher level to limit context switches. I just checked the old implementation of Krossbow and with the old memory model the production code looked the same, it was on test side that we ensured everything was called from the main thread. I guess in real-life applications it might be worth calling from the main thread directly? After all this whole PR just adds withContext(Main)
to things that the user would call, not callbacks from the websocket.
Again thanks a lot, your contributions are super valuable. I remember having to jump through some hoops when Krossbow was still using the old memory model. I didn't know it was even possible to still use Krossbow on the old model.
In any case, it's great that there is a function to distinguish the cases at runtime! I'll take a closer look at this. I wonder how I'll be able to write tests for this, though 😆
Thanks again!
Yeah, I was wondering the same regarding tests. I feel like it would need a new test module where the new memory model is disabled?
(Just FYI, I'll come back to this PR a bit later)
Ok, so you're right calling connect()
from within withContext(Dispatchers.Main)
does work
but if you don't specify the context, or even use Dispatchers.Default
, it will crash!
The reason is that when we create the URLSession.sessionWithConfiguration
and capture the NSOperationQueue.currentQueue()
, this returns nil
, because it couldn't determine which iOS queue this was running on.
https://developer.apple.com/documentation/foundation/nsoperationqueue/1413097-currentqueue
Which means the URLSession is setup to use its own background queue, which is different from the one that was used to call connect()
and instantiate the webSocket
object (NSURLSessionWebSocketTask
).
I don't think the current code can work without crashing with the old memory model, unless connect
is called from the main thread.
I also don't think it's very intuitive from a developer POV to use withContext(Dispatchers.Main)
before calling an API called connect()
because that sounds like a bad idea. Unless a dev goes to the implementation and realises that almost nothing is done on the main thread and it's rather used as the callback thread.
Because of this, I updated the PR to be much simpler and basically fallback to using main thread if the currentQueue()
can't be determined. I know it's still not ideal, but what do you think?
Thanks for coming back to this, and for the detailed explanations.
My initial understanding of the problem was that the send operations needed to be performed on the same thread as the callback queue. What you seem to be suggesting is that the problem is just that the callback queue for sessionWithConfiguration
needs to be the same as the "current one" at the moment of the webSocketTaskWithURL
call, but the send operations don't have to be on that same queue, is this correct? This is quite different. I remember trying to use mainQueue()
in the past and it didn't solve the issue, and that would explain why.
As I said, it's hard for me to test iOS stuff because I have no mac, so I have to go through a CI feedback loop (combined with the absence of docker on GitHub macOS runners and the 10min required to install it, it kills productivity quite a lot). So could you please test the following for me? 🙏
-
With your current PR code, on the old MM, does it work when calling both connect and send operations on
Dispatchers.Default
? AFAIU,connect()
will implicitly switch to Main in that case (becausecurrentQueue()
will give null). If my initial assumption was correct, callingsession.sendText(...)
on another thread should fail. But in light of what you said it might just work. -
Could you please clarify in which cases you observed
currentQueue()
returning null? Are there other cases thanDispatchers.Default
? Any custom dispatcher with a custom thread pool maybe? -
Could you please attach somewhere here the actual error/stacktrace you get when it doesn't work (for instance with the current released code used in old MM). That would be helpful for posterity. For instance, you can update the description of the PR or add it in a reply.
Sorry if those are big asks but that would help me quite a lot with how to judge the proper way forward 👼
I also don't think it's very intuitive from a developer POV to use withContext(Dispatchers.Main) before calling an API called connect() because that sounds like a bad idea
I wholeheartedly agree with this. It would be best if users wouldn't have to think about any of this, especially multiplatform users. That's why I'm trying to figure out what exactly the problem is, in which cases it can occur, and what the users should be aware of or should not care about. We could also add artificial crashes with nice error messages to help them fix problems if they have things to ensure on their side.
I feel like it would need a new test module where the new memory model is disabled?
I had tried to avoid doing that, but you're bringing my old ghosts to haunt me now. I was like "let's forget about the old MM, it will soon come to pass, and users will migrate". And you come here and hit me with a taste of reality 😂
I feel like it would need a new test module where the new memory model is disabled?
I had tried to avoid doing that, but you're bringing my old ghosts to haunt me now. I was like "let's forget about the old MM, it will soon come to pass, and users will migrate". And you come here and hit me with a taste of reality 😂
Haha, well I don't know how representative I am of the community though. And tbh, we just updated to adopt the new memory model and the migration went fine, and now it doesn't crash.
The only reason why I'm trying to fix this is because I assume there might be other projects not using the new memory model? And it can't really be enforced by the library (unless maybe we want to crash explicitly when we detect it's using the old MM, but that's a bit harsh).
Anyway, I'll come back to your regarding your other questions!
- No, doesn't work. It then crashes on
receiveMessageWithCompletionHandler
with
Uncaught Kotlin exception: kotlin.native.IncorrectDereferenceException: illegal attempt to access non-shared <object>@15cc5c8 from other thread
at 0 IosApp 0x109189bb7 kfun:kotlin.Throwable#<init>(kotlin.String?){} + 95
at 1 IosApp 0x1091831cb kfun:kotlin.Exception#<init>(kotlin.String?){} + 91
at 2 IosApp 0x10918337f kfun:kotlin.RuntimeException#<init>(kotlin.String?){} + 91
at 3 IosApp 0x109196003 kfun:kotlin.native.IncorrectDereferenceException#<init>(kotlin.String){} + 91
at 4 IosApp 0x10919923b ThrowIllegalObjectSharingException + 423
at 5 IosApp 0x1092f0ba7 _ZN12_GLOBAL__N_128throwIllegalSharingExceptionEP9ObjHeader + 27
at 6 IosApp 0x1092f217f _ZN12_GLOBAL__N_136terminateWithIllegalSharingExceptionEP9ObjHeader + 11
at 7 IosApp 0x1092fbc2b _ZNK16KRefSharedHolder3refIL11ErrorPolicy3EEEP9ObjHeaderv + 111
at 8 IosApp 0x1091680b7 _ZL39Kotlin_Interop_unwrapKotlinObjectHolderP11objc_object + 47
at 9 IosApp 0x108c2063f _737061726b706c75672e6d702d65706963636f6e6e6563743a6d702d65706963636f6e6e656374_knbridge90 + 163
at 10 libdispatch.dylib 0x106b00803 _dispatch_call_block_and_release + 31
at 11 libdispatch.dylib 0x106b023a7 _dispatch_client_callout + 19
at 12 libdispatch.dylib 0x106b0b777 _dispatch_lane_serial_drain + 979
at 13 libdispatch.dylib 0x106b0c813 _dispatch_lane_invoke + 491
at 14 libdispatch.dylib 0x106b1c0c7 _dispatch_workloop_worker_thread + 1231
at 15 libsystem_pthread.dylib 0x10664bf83 _pthread_wqthread + 287
at 16 libsystem_pthread.dylib 0x106653a9b start_wqthread + 7
-
I've observed in ALL my attempts except when running on
Dispatchers.Main
thatcurrentQueue()
returns null. This includes,Dispatchers.Default
,newSingleTreadedContext()
and using the default context from a flow -
Done, I've updated the PR description with the stack trace
Thank you so much for the additional details.
-
This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on
Main
. Callers need to be aware of this and also perform send operations onMain
. As I mentioned earlier, context-switching to theMain
thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early insendText
/sendBytes
with a nice error message might be more appropriate. WDYT? -
and using the default context from a flow - what do you mean? Flow collection happens in the current coroutine context, and previous operators too unless
flowOn
is used. I don't quite get what you mean by the "default context". Could you please elaborate? -
Thanks a lot!
This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on Main. Callers need to be aware of this and also perform send operations on Main. As I mentioned earlier, context-switching to the Main thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early in sendText/sendBytes with a nice error message might be more appropriate. WDYT?
Actually, NSOperationQueue.currentQueue()
does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call the sendMessage
APIs on the main thread, which is why it works!
That said, I think there are a few things we could do to make users "opt" into that "not as nice" option:
- add a log when we catch this?
- crash early with error message when we can't determine the currentQueue + add an explicit parameter either on
connect
or in the constructor to allow using Main thread for the web socket operations
- and using the default context from a flow - what do you mean?
yeah actually that's what I meant, using the default coroutine context used by a flow, which in the case my app is a custom coroutine context (not default)
This is very interesting (and also unfortunate). It is therefore not sufficient to start the connection on Main. Callers need to be aware of this and also perform send operations on Main. As I mentioned earlier, context-switching to the Main thread for send operations in the implementation of the session would slow down messaging if the users are still calling it from another thread. That said, maybe it's acceptable if they're using the slow memory model in the first place? The users would still be able to speed things up by switching to the main thread themselves up front, but what bothers me is that they would have no way to be made aware of this. Crashing early in sendText/sendBytes with a nice error message might be more appropriate. WDYT?
Actually,
NSOperationQueue.currentQueue()
does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call thesendMessage
APIs on the main thread, which is why it works!That said, I think there are a few things we could do to make users "opt" into that "not as nice" option:
- add a log when we catch this?
- crash early with error message when we can't determine the currentQueue + add an explicit parameter either on
connect
or in the constructor to allow using Main thread for the web socket operations
- and using the default context from a flow - what do you mean?
yeah actually that's what I meant, using the default coroutine context used by a flow, which in the case my app is a custom coroutine context (not default)
Something like this:
/**
* An implementation of [WebSocketClient] using darwin's native [NSURLSessionWebSocketTask].
* This is only available is iOS 13.0+, tvOS 13.0+, watchOS 6.0+, macOS 10.15+
* (see [documentation](https://developer.apple.com/documentation/foundation/urlsessionwebsockettask))
*
* A custom [sessionConfig] can be passed to customize the behaviour of the connection.
* Also, if a non-null [maximumMessageSize] if provided, it will be used to configure the web socket.
* Finally, if you're using the old memory model and are okay using the main thread for websocket callbacks,
* set [enableMainThreadCallbacksAsFallback] as true. It will avoid a mutability exception.
*/
class DarwinWebSocketClient(
private val sessionConfig: NSURLSessionConfiguration = NSURLSessionConfiguration.defaultSessionConfiguration(),
private val maximumMessageSize: Long? = null,
private val enableMainThreadCallbacksAsFallback: Boolean = false,
) : WebSocketClient {
@OptIn(ExperimentalStdlibApi::class)
override suspend fun connect(url: String): WebSocketConnectionWithPing {
val socketEndpoint = NSURL.URLWithString(url)!!
return if (isExperimentalMM()) {
connect(url, socketEndpoint)
} else {
if (NSOperationQueue.currentQueue() != null) {
connect(url, socketEndpoint)
} else if (!enableMainThreadCallbacksAsFallback) {
throw Exception("Unable to determine the current NSOperationQueue. Aborting connection to avoid a crash. Please consider enabling enableMainThreadCallbacksAsFallback.")
} else {
/**
* With the old memory model, if [NSURLSessionWebSocketTask] is unable
* to reuse the current queue for its callbacks, things will crash at runtime.
*/
withContext(Dispatchers.Main) {
connect(url, socketEndpoint)
}
}
}
}
Actually, NSOperationQueue.currentQueue() does work on the main thread, therefore, NSURLSession callbacks are called on the same thread as connect (main thread), and therefore also call the sendMessage APIs on the main thread, which is why it works!
Wait, I'm confused now.
You say "and therefore also call the sendMessage APIs on the main thread" - calling sendMessage
doesn't happen during the connect
call nor the callbacks, it depends on whatever context the user chooses to use when calling session.sendText
/sendBytes
. That's the whole point of my response to 1) (this was the point of this test as well, maybe we misunderstood each other here?).
You say "which is why it works" - but you just confirmed it does NOT work. Calling connect()
with old MM will always use Dispatchers.Main
with your current code (because on other dispatchers currentQueue()
will be null, so we use withContext(Main)
). But calling session.sendText()
can be done on anything by the user, and fails if not done on Main
as you have just checked (or is it not what you have checked earlier? Because that's what I wanted to check 😆).
My point is about telling the user to also use Main
when calling session.sendText()
- either by failing early with a nice message or by using withContext(Main)
at this low level, thus incurring a potential performance penalty.
For the connect
call itself, I don't think the fallback to Main
is a problem, because as you said only Main
provides a currentQueue()
and is viable anyway. And connect
is not called as many times as sendText
/sendBytes
.
I'm closing this PR since the new memory model is now the default and unlikely to be used in the future. I guess we'll just support the new memory model from now on.
Oops my bad, I had forgot about this PR! Good call though, probably not worth pursuing this now :)
No problem at all, and thank you so much for making the initial PR by the way, it was a very good idea at the time