sqlite_orm
sqlite_orm copied to clipboard
Unexpected closing of connection in multi-threaded application
We've ran into the following issue: While multiple threads try to access the same database through the same connection repeatedly, sometimes one of the threads would drop the connection to the database, while another would try to use that same connection. This sometimes resulted in an SQLITE_NOMEM (out of memory) error. After looking into it further this seems to be an SQLITE_MISUSE error in reality. This happens even in SQLite multi-threaded mode, in which it should be safe to use the same connection with different threads.
A small snippet to reproduce the error:
struct Test {
int id;
int value;
};
inline auto create_db(const std::string& db_path) {
using namespace sqlite_orm;
return make_storage(
db_path, make_table("test", make_column("id", &Test::id, primary_key(), autoincrement()), make_column("value", &Test::value)));
}
using Database = decltype(create_db(""));
void get_all_tests(Database* db) {
while (true) {
auto every_test = db->get_all<Test>();
// Do something else
}
}
void f() {
auto db = create_db("any_path");
db.open_forever();
db.sync_schema();
std::thread t1(get_all_tests, &db);
std::thread t2(get_all_tests, &db);
}
Note: It can take minutes and thousands of iterations to run into the problem. If the threads have a bit more to do it usually comes up a lot quicker. The threads don't have to be running in busy loops, but the more frequent the database accesses are, the more likely for the error to occur.
After some investigation, what we've found was a potential race condition in connection_holder
. To me it seems retain()
and release()
may try to write _retain_count
simultaneously, which may trigger release()
to close the connection, after retain()
has already passed the check to open the connection. Thus, the thread calling retain()
would then try to use the connection, which by that point was closed.
We've created a vcpkg patch to make _retain_count
atomic, which seems to have solved the issue.
Are we using sqlite_orm properly in this scenario? Did we miss something else perhaps?
I don't understand how retain_count
can be decremented up to zero if you called db.open_forever
. open_forever
call will keep db opened until dtor is fired
Maybe it's not decremented to zero - without memory fences anything can happen, maybe the release()
ing thread thinks that the counter 0 and closes the connection, while another thread already incremented it. Counter needs to be atomic to prevent this sort of stuff.
to make it atomic first I need to be sure that the reason is that retain_count becomes zero. Logically it looks like that it cannot happen. Can you make sure that it becomes zero? Maybe add logging. Also just adding atomic will affect all other users so we need to think about more elegant solution
@fnc12 @PandarinDev is right in that anything can happen because even the increment/decrement operation itself is not atomic (it consists of read/operation/write). On top of it, the retain()
and release()
methods involve an additional read after the increment and decrement.
Consider following scenario.
-
open_forever()
->_retain_count == 1
- thread A reads
_retain_count
->1
- thread B reads
_retain_count
->1
- thread A increases
1
and sets_retain_count = 2
- thread B increases
1
and sets_retain_count = 2
- thread A decreases
_retain_count = 1
- thread B reads
_retain_count
->1
- thread B decreases
1
and sets_retain_count = 0
Hence, the counter must be converted to an atomic variable so that it can be used in multithreaded applications.
Added a log to the release()
method and using the example code provided by @Akster009 I managed to trigger the if (0 == this->_retain_count)
branch in just a minute of running the application.
guys please review my comment in the PR @trueqbit created https://github.com/fnc12/sqlite_orm/pull/1054#discussion_r898817873
@Akster009 @PandarinDev Are you able to try the PR branch whether it solves your problem? Just note that this is on latest dev.
Just ran the example code with your branch for 10+ minutes without any crashes. With the dev branch the example code typically crashes in 1-2 minutes. Thanks for the PR!
I've also tried it, seems to be working without crashing on my end as well.
Even with an atomic retain count - If the storage is not open_forever()
, one can still run into a problem where the sqlite3* db is not opened.
I think that would happen when another thread decreases the retain_count at sometime between the lines:
++retain_count
(now another thread decreases the retain_count)
if(1==retain_count)
This makes the if
statement false
, and the sqlite3_open
never gets called.
@robin-arcsite can you please provide test code for it? thanks
@robin-arcsite can you please provide test code for it? thanks
I can reproduce this by reusing @Akster009 's sample code above, simply remove the open_forever()
line. (Windows 11, Visual Studio 2022)
@robin-arcsite can you repro it with the current dev
branch?
@fnc12 Yes, I just tried the current dev
branch.
It hits the prepare_stmt()
function (line 8791) with a NULL sqlite3* db
.
here's my modified test function:
(open_forever()
line is removed and threads are join()
ed)
void f() {
auto db = create_db("D:\\Developer\\Projects\\sample1\\sample1\\my_db.db");
db.sync_schema();
std::thread t1(get_all_tests, &db);
std::thread t2(get_all_tests, &db);
t1.join();
t2.join();
}
@robin-arcsite You are absolutely right about this kind of race condition. The way connections are currently handled requires that you use open_forever()
when using multiple threads. At this stage I would say this is a feature, unless @fnc12 disagrees?
@trueqbit @robin-arcsite why races happen even after your (Klaus'es) update with std::atomic_int
refactoring?
@fnc12 Because anything can happen between increasing/decreasing the reference count and calling the sqlite APIs in a multi-threaded scenario.
@trueqbit it sounds sad and not solvable. If it is not solvable the only thing we can do is close this issue as not solvable. But IMHO we can add a mutex inside storage opening to avoid data races. It is not a desire for the next PR. I am just talking that there is always a way of solving data races with heavy weapon like mutex. @robin-arcsite @trueqbit what do you think?
@fnc12 I agree that using mutex could be a solution, and I understand your concerns about the performance inpact with that solution.
At this point the only advise I have is to document the usage of open_forever()
somewhere.
(To be honest I was not aware of the open_forever function until I hit this issue ...)
As @robin-arcsite said, just "document the use of open_forever()
somewhere" in terms of multithreading.
comment is here. While it is not merged yet please feel free to suggest your edits if you wish
comment was added. Closing. If something else appears feel free to reopen with a detailed comment