KeenClient-iOS
KeenClient-iOS copied to clipboard
SQLite Crash on deleteEvents
A customer reported an issue with Keen's interaction with SQLite causing their app to crash often. Given a crash log, it appears that the crash originates in [KIODBStore deleteEvent]
, or more specifically in the keen_io_sqlite3_bind_int64
statement.
They also reported an issue with duplicate events, similar to #170. This makes sense, because if we fail to delete an event from SQLite after uploading, it will be uploaded again after the app resumes.
Anonymized crash log from customer:
CrashReporter Key: 602EC7B8-403B-446B-9070-69CDFEE531AA
Hardware Model: iPhone7,2
Process: APPNAME [279]
Path: /var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/APPNAME
Identifier: BUNDLEIDENTIFER
Version: 1.7.2 (10117)
Code Type: ARM-64
Parent Process: ??? [1]
Date/Time: 2017-02-03T11:08:30Z
Launch Time: 2017-02-03T09:15:37Z
OS Version: iPhone OS 10.2.1 (14D27)
Report Version: 104
Exception Type: SIGSEGV
Exception Codes: SEGV_ACCERR at 0x80000000101c3fae
Crashed Thread: 10
Thread 0:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 GraphicsServices 0x000000018a0d2198 GSEventRunModal + 176
5 UIKit 0x000000018e6657fc -[UIApplication _run] + 680
6 UIKit 0x000000018e660534 UIApplicationMain + 204
7 APPNAME 0x000000010005c06c main (main.m:19)
8 libdyld.dylib 0x00000001876015b8 start + 0
Thread 1:
0 libsystem_kernel.dylib 0x0000000187712e1c __psynch_cvwait + 8
1 libc++.1.dylib 0x00000001871013ec std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 52
2 JavaScriptCore 0x000000018cf225d0 void std::__1::condition_variable_any::wait<std::__1::unique_lock<bmalloc::Mutex> >(std::__1::unique_lock<bmalloc::Mutex>&) + 108
3 JavaScriptCore 0x000000018cf22544 bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>::threadRunLoop() + 164
4 JavaScriptCore 0x000000018cf22424 bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>::threadEntryPoint(bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*) + 8
5 JavaScriptCore 0x000000018cf226d4 void* std::__1::__thread_proxy<std::__1::tuple<void (*)(bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*), bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*> >(void*) + 88
6 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
7 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
8 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 2:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 WebCore 0x000000018d224490 RunWebThread(void*) + 452
5 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
6 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
7 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 3:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 Foundation 0x000000018915b26c -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 300
5 Foundation 0x000000018917bdd0 -[NSRunLoop(NSRunLoop) runUntilDate:] + 92
6 UIKit 0x000000018efd9c38 -[UIEventFetcher threadMain] + 132
7 Foundation 0x0000000189258e68 __NSThread__start__ + 1020
8 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
9 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
10 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 4:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 CFNetwork 0x0000000188e23a70 +[NSURLConnection(Loader) _resourceLoadLoop:] + 332
5 Foundation 0x0000000189258e68 __NSThread__start__ + 1020
6 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
7 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
8 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 5:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 Foundation 0x000000018915b26c -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 300
5 Foundation 0x00000001891afaa0 -[NSRunLoop(NSRunLoop) run] + 84
6 AFNetworking 0x00000001009873b0 +[AFURLConnectionOperation networkRequestThreadEntryPoint:] (AFURLConnectionOperation.m:163)
7 Foundation 0x0000000189258e68 __NSThread__start__ + 1020
8 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
9 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
10 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 6:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 CFNetwork 0x0000000188f7da40 _privateRunloopEmulationSet(void*) + 260
5 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
6 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
7 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 7:
0 libsystem_kernel.dylib 0x00000001876f5188 mach_msg_trap + 8
1 CoreFoundation 0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2 CoreFoundation 0x00000001886f01ec __CFRunLoopRun + 1128
3 CoreFoundation 0x000000018861e2b8 CFRunLoopRunSpecific + 440
4 CoreFoundation 0x000000018866bb44 CFRunLoopRun + 108
5 CoreMotion 0x000000018f4e1120 CLStartStopAdvertisingBeacon + 203192
6 libsystem_pthread.dylib 0x00000001877d9850 _pthread_body + 236
7 libsystem_pthread.dylib 0x00000001877d9760 _pthread_start + 280
8 libsystem_pthread.dylib 0x00000001877d6d94 thread_start + 0
Thread 8:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 9:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 10 Crashed:
0 KeenClient 0x00000001010f0a5c vdbeUnbind (keen_io_sqlite3.c:66777)
1 KeenClient 0x00000001010f0c10 keen_io_sqlite3_bind_int64 (keen_io_sqlite3.c:66875)
2 KeenClient 0x000000010115af18 __26-[KIODBStore deleteEvent:]_block_invoke (KIODBStore.m:566)
3 libdispatch.dylib 0x00000001875ce1fc _dispatch_call_block_and_release + 20
4 libdispatch.dylib 0x00000001875ce1bc _dispatch_client_callout + 12
5 libdispatch.dylib 0x00000001875dc3dc _dispatch_queue_serial_drain + 924
6 libdispatch.dylib 0x00000001875d19a4 _dispatch_queue_invoke + 648
7 libdispatch.dylib 0x00000001875de34c _dispatch_root_queue_drain + 568
8 libdispatch.dylib 0x00000001875de0ac _dispatch_worker_thread3 + 120
9 libsystem_pthread.dylib 0x00000001877d72a0 _pthread_wqthread + 1284
10 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 11:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 12:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 13:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 14:
0 libsystem_kernel.dylib 0x0000000187713a88 __workq_kernreturn + 8
1 libsystem_pthread.dylib 0x00000001877d6d8c start_wqthread + 0
Thread 10 crashed with ARM-64 Thread State:
pc: 0x00000001010f0a5c fp: 0x000000016df8ec70 sp: 0x000000016df8ec40 x0: 0x0000000000000000
x1: 0x0000000000000001 x2: 0x000000000000028a x3: 0x000000018775922c x4: 0x0000000000000032
x5: 0x0000000000000000 x6: 0x0000000000000000 x7: 0x0000000000000000 x8: 0x80000000101c3f96
x9: 0x0000000000000000 x10: 0x0000000000000184 x11: 0x0000000000000004 x12: 0x00000001886e068c
x13: 0x00000000000028a3 x14: 0x00000000000028a3 x15: 0x7000000000000000 x16: 0x00000001ae02d7f0
x17: 0x000000018866fee4 x18: 0x0000000000000000 x19: 0x0000000101c3b420 x20: 0x0000000000000001
x21: 0x000000000000028a x22: 0x00000000000104d7 x23: 0x00000001700f9680 x24: 0x0000000000000000
x25: 0x0000000000000000 x26: 0x0000000000000014 x27: 0x000000016df8f0e0 x28: 0xffffffffffffffff
lr: 0x00000001010f0a34 cpsr: 0x0000000060000000
Binary Images:
0x10003c000 - 0x10075ffff +APPNAME arm64 <aa438f9c30b1326db7a44cf742389f66> /var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/APPNAME
0x100978000 - 0x1009affff +AFNetworking arm64 <2a3607079b1c3b6b8bb39a7f876c267b> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/AFNetworking.framework/AFNetworking
0x100a0c000 - 0x100a13fff +ColorUtils arm64 <161da403f1033bb18aed8d7eaad154f0> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/ColorUtils.framework/ColorUtils
0x100aa8000 - 0x100ac7fff +BlocksKit arm64 <745684cb132935afb05441c03d61832d> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/BlocksKit.framework/BlocksKit
0x100afc000 - 0x100c2bfff +SOMELIBRARYSDK arm64 <94cd1afbe4213d8480baadd645e43314> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/SOMELIBRARYSDK.framework/SOMELIBRARYSDK
0x100e48000 - 0x100edbfff +SOMELIBRARYUI arm64 <5b13fd3f7b0d3ba8b252b820aa5428e5> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/SOMELIBRARYUI.framework/SOMELIBRARYUI
0x10100c000 - 0x10102bfff +CocoaLumberjack arm64 <7b4d4f0e175031198f558e8ce50a7ed1> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/CocoaLumberjack.framework/CocoaLumberjack
0x101058000 - 0x10105ffff +DTTJailbreakDetection arm64 <55ba159d98c9314f8074353cd8203eb7> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/DTTJailbreakDetection.framework/DTTJailbreakDetection
0x10106c000 - 0x101073fff +FXNotifications arm64 <4d6f3443e17137f2b005d7ca8247cc7c> /private/var/containe...```
Seems that the closeDB has been called before the dispatch worker thread processes the deleteEvent enqueued block has been executed, meaning the delete statement has been finalized, and would cause the segmentation fault seen in the callstack.
Previously I was thinking this could be due to a race condition due to client calls to closeDB, but that isn't actually exposed to clients. Seems that closeDB could be called if a sqlite call fails for whatever reason. Not sure if the keen client has logging enabled by default, but if it could be enabled that might shed some light on if/what is failing and causing the db to be closed.
We never experienced the crashes in development and don't have logging enabled in production (and wouldn't be able to access those logs anyway).
As we see in the crash log the crash happens in keen_io_sqlite3_bind_int64
of deleteEvent
- could you perhaps be more defensive while parsing eventIds in handleEventAPIResponse
Or even better - could you check your server logs? Perhaps we see there your servers response and then can understand what happens in handleEventAPIResponse
. I can supply you with the exact time of lots of crashes
I checked our server logs for your account a few days ago. There were no errors on writes in the time period we're looking at, so the response should have been a standard {"success": true}
message.
but the responsedata would also contain collectionNames - and just wondering if anything could go wrong here: NSNumber *eid = [[eventIds objectForKey:collectionName] objectAtIndex:count];
if a collectionName returned by your API, wouldn't be found in the passed eventIds
dictionary then eid
would be nil which could perhaps lead to any problems
:( Our logs are unfortunately not that verbose as far as what is in the responses. We'd have to fill the every data center everywhere with {"success": true}
if we did that. All I get to see if # of failures and # of successes (for writes).
@rist not sure if that could potentially lead to the state we're seeing that causes the crash, but the crash itself seems to be because the SQLite database used internally has been closed by the time the internal delete operation is processed through the KIODBStore queue. We might investigate what previous SQLite operation failed which is the only way the database gets closed. @josephwegner would it be possible that a failure is generated on that level due to a nil eventId passed to the SDK? I imagine there are checks in place for that, or otherwise wouldn't lead to a SQLite failure?
Ok, if the theory ist that the DB was closed before we execute some delete statement - why do you check if the DB is open at the beginning of - (void)deleteEvent:(NSNumber *)eventId
but not later in the async thread where the delete actually happens.
The DB could have been closed in the meantime
Since the offending line is keen_io_sqlite3_mutex_enter(p->db->mutex);
either the statement has been deleted or the db is trashed, it would seem. I don't see how that line would fail due to a nil eventId.
Almost all the code paths that lead to closeDB
being called happen inside dispatch_*()
blocks, maybe all but the openAndInit...
code. Since KIODBStore uses a sequential queue, it seems to have to happen when something in the queue gets dispatched and causes a failure, which hits closeDB
after the checkOpenDB
but before the dispatch_*()
block in deleteEvent()
is executed. Scattering checkOpenDB
messages all over to every dispatch_*()
block could potentially recover or fail gracefully, but we'd want to figure out what caused the closeDB() call in the first place.
Also, if we fail to delete events from the local store, I don't know if there's code in KeenClient designed to handle that, so it might be the case that such events, which succeeded in being pushed, get pushed again later due to a failure upon attempting to delete them (@josephwegner mentioned this, and it's mentioned in Issue #170 and Issue #156). One would need a callback set on KIODBStore so that it could notifyCannotDelete(eventId)
or something along those lines, and KeenClient could set a handler to then place those event IDs in a list of events to try to purge or at the least not re-push the next time we try to flush the local store. I'm not super familiar with this particular SDK, so I'm just brainstorming.
Either way, it's important to not just try to mask the original failure which caused the closeDB
code to execute. If the DB is fundamentally in a bad state, closing and reopening all the time isn't going to necessarily mean that delete events will succeed in a future if doing so failed just now.
yes #170 is definitely an issue for us as well - we see lots of duplicated event
@baumatron has worked on creating some potential repro scenarios and identified a few issues that could be addressed around how DB corruption is handled. We could try to address those as well as how local delete failures are handled and how DB closed is handled by dispatched blocks as described above. Some combination of these patches could ameliorate the situation. Since none of that identifies the root cause, we'd still love to introduce some sort of better logging. @baumatron and I were tossing around the idea of adding some callback logging interface the client code could opt into in order to use a third-party logging solution that actually posts logs to somewhere useful. That could be turned on by client code for certain clients in order to help track down the root cause. Does that sound interesting @rist ?
I was just clicking around in our crash reporting tool and found another overview screen about our Keen related crashes - do you see anything helpful/intersting there?
Here's a bit of an update. The general issue here is that the db gets closed all over the place when encountering anything considered to be an error from the db, which is a bit dubious to begin with, but causes issues because this closing of the db isn't synchronized with code elsewhere that is responsible for opening the db again. So you can get into some call in KIODBStore that checks to be sure the db is open, then drops a block in the queue that will do some db operation, in the meantime some other operation "fails" and closes the db, and then the queued block executes and crashes since the db is now closed.
Since I last updated the bug, a couple of issues have been discovered and fixed where this "failure" condition is triggered erroneously, meaning the db is closed more often than it should be, making this issue observed more frequently. Those fixes should help mitigate this in the short term (#196, #200).
A long term fix would be to do an audit of how failures are handled in KIODBStore. I believe that in general these failures should not be closing the db, but instead surfacing a meaningful error to the SDK client when a call can't be handled. In cases where an error case should be explicitly handled because KIODBStore knows how to deal with it, such as is the case for existing corruption handling code-paths, that should all be done immediately where the error is encountered instead of partially where the error is encountered and then handled by checkOpenDB in random places salted throughout the code. I would suggest checkOpenDB gets removed and a single explicit initialization takes it's place, and all methods should be able to assume the db is open if called and should throw an exception if that isn't the case.