sqlite_orm Unexpected closing of connection in multi-threaded application

We've ran into the following issue: While multiple threads try to access the same database through the same connection repeatedly, sometimes one of the threads would drop the connection to the database, while another would try to use that same connection. This sometimes resulted in an SQLITE_NOMEM (out of memory) error. After looking into it further this seems to be an SQLITE_MISUSE error in reality. This happens even in SQLite multi-threaded mode, in which it should be safe to use the same connection with different threads.

A small snippet to reproduce the error:

struct Test {
    int id;
    int value;
};

inline auto create_db(const std::string& db_path) {
    using namespace sqlite_orm;
    return make_storage(
            db_path, make_table("test", make_column("id", &Test::id, primary_key(), autoincrement()), make_column("value", &Test::value)));
}

using Database = decltype(create_db(""));

void get_all_tests(Database* db) {
    while (true) {
        auto every_test = db->get_all<Test>();
        // Do something else
    }
}

void f() {
    auto db = create_db("any_path");
    db.open_forever();
    db.sync_schema();
    std::thread t1(get_all_tests, &db);
    std::thread t2(get_all_tests, &db);
}

Note: It can take minutes and thousands of iterations to run into the problem. If the threads have a bit more to do it usually comes up a lot quicker. The threads don't have to be running in busy loops, but the more frequent the database accesses are, the more likely for the error to occur.

After some investigation, what we've found was a potential race condition in connection_holder. To me it seems retain() and release() may try to write _retain_count simultaneously, which may trigger release() to close the connection, after retain() has already passed the check to open the connection. Thus, the thread calling retain() would then try to use the connection, which by that point was closed.

We've created a vcpkg patch to make _retain_count atomic, which seems to have solved the issue.

Are we using sqlite_orm properly in this scenario? Did we miss something else perhaps?

Jun 15 '22 10:06 Akster009

I don't understand how retain_count can be decremented up to zero if you called db.open_forever. open_forever call will keep db opened until dtor is fired

Jun 15 '22 10:06 fnc12

Maybe it's not decremented to zero - without memory fences anything can happen, maybe the release()ing thread thinks that the counter 0 and closes the connection, while another thread already incremented it. Counter needs to be atomic to prevent this sort of stuff.

Jun 15 '22 12:06 PandarinDev

to make it atomic first I need to be sure that the reason is that retain_count becomes zero. Logically it looks like that it cannot happen. Can you make sure that it becomes zero? Maybe add logging. Also just adding atomic will affect all other users so we need to think about more elegant solution

Jun 16 '22 03:06 fnc12

@fnc12 @PandarinDev is right in that anything can happen because even the increment/decrement operation itself is not atomic (it consists of read/operation/write). On top of it, the retain() and release() methods involve an additional read after the increment and decrement.

Consider following scenario.

open_forever() -> _retain_count == 1
thread A reads _retain_count -> 1
thread B reads _retain_count -> 1
thread A increases 1 and sets _retain_count = 2
thread B increases 1 and sets _retain_count = 2
thread A decreases _retain_count = 1
thread B reads _retain_count -> 1
thread B decreases 1 and sets _retain_count = 0

Hence, the counter must be converted to an atomic variable so that it can be used in multithreaded applications.

Jun 16 '22 06:06 trueqbit

Added a log to the release() method and using the example code provided by @Akster009 I managed to trigger the if (0 == this->_retain_count) branch in just a minute of running the application.

Jun 16 '22 06:06 PandarinDev

guys please review my comment in the PR @trueqbit created https://github.com/fnc12/sqlite_orm/pull/1054#discussion_r898817873

Jun 16 '22 08:06 fnc12

@Akster009 @PandarinDev Are you able to try the PR branch whether it solves your problem? Just note that this is on latest dev.

Jun 16 '22 08:06 trueqbit

Just ran the example code with your branch for 10+ minutes without any crashes. With the dev branch the example code typically crashes in 1-2 minutes. Thanks for the PR!

Jun 16 '22 09:06 PandarinDev

I've also tried it, seems to be working without crashing on my end as well.

Jun 16 '22 09:06 Akster009

Even with an atomic retain count - If the storage is not open_forever(), one can still run into a problem where the sqlite3* db is not opened.

I think that would happen when another thread decreases the retain_count at sometime between the lines:

++retain_count (now another thread decreases the retain_count) if(1==retain_count)

This makes the if statement false, and the sqlite3_open never gets called.

Aug 31 '22 06:08 robin-arcsite

@robin-arcsite can you please provide test code for it? thanks

Aug 31 '22 06:08 fnc12

@robin-arcsite can you please provide test code for it? thanks

I can reproduce this by reusing @Akster009 's sample code above, simply remove the open_forever() line. (Windows 11, Visual Studio 2022)

Aug 31 '22 07:08 robin-arcsite

@robin-arcsite can you repro it with the current dev branch?

Aug 31 '22 07:08 fnc12

@fnc12 Yes, I just tried the current dev branch.

It hits the prepare_stmt() function (line 8791) with a NULL sqlite3* db.

Aug 31 '22 07:08 robin-arcsite

here's my modified test function: (open_forever() line is removed and threads are join() ed)

void f() {
    auto db = create_db("D:\\Developer\\Projects\\sample1\\sample1\\my_db.db");
    db.sync_schema();
    std::thread t1(get_all_tests, &db);
    std::thread t2(get_all_tests, &db);

    t1.join();
    t2.join();
}

Aug 31 '22 08:08 robin-arcsite

@robin-arcsite You are absolutely right about this kind of race condition. The way connections are currently handled requires that you use open_forever() when using multiple threads. At this stage I would say this is a feature, unless @fnc12 disagrees?

Aug 31 '22 09:08 trueqbit

@trueqbit @robin-arcsite why races happen even after your (Klaus'es) update with std::atomic_int refactoring?

Aug 31 '22 11:08 fnc12

@fnc12 Because anything can happen between increasing/decreasing the reference count and calling the sqlite APIs in a multi-threaded scenario.

Aug 31 '22 11:08 trueqbit

@trueqbit it sounds sad and not solvable. If it is not solvable the only thing we can do is close this issue as not solvable. But IMHO we can add a mutex inside storage opening to avoid data races. It is not a desire for the next PR. I am just talking that there is always a way of solving data races with heavy weapon like mutex. @robin-arcsite @trueqbit what do you think?

Aug 31 '22 11:08 fnc12

@fnc12 I agree that using mutex could be a solution, and I understand your concerns about the performance inpact with that solution.

At this point the only advise I have is to document the usage of open_forever() somewhere. (To be honest I was not aware of the open_forever function until I hit this issue ...)

Sep 01 '22 01:09 robin-arcsite

Снимок экрана 2023-10-14 в 17 50 04 hey guys. Does the issue still persist with the current state of `connection_holder` class? Looks like the latest issue related to race condition inside `retain` and `release` has gone

Oct 14 '23 11:10 fnc12

As @robin-arcsite said, just "document the use of open_forever() somewhere" in terms of multithreading.

Oct 14 '23 20:10 trueqbit

comment is here. While it is not merged yet please feel free to suggest your edits if you wish

Oct 15 '23 06:10 fnc12

comment was added. Closing. If something else appears feel free to reopen with a detailed comment

Oct 15 '23 11:10 fnc12

sqlite_orm sqlite_orm copied to clipboard

Unexpected closing of connection in multi-threaded application

sqlite_orm
sqlite_orm copied to clipboard