liteq
close #28: fix concurrency issues and deadlocks
This is my take at #28, which partially revisits #26. This PR basically:
- makes `do_db_execute` take the lock, because every call to it is a write;
- makes `publish` take the lock, same reason;
- substitutes `EXCLUSIVE` locks with `IMMEDIATE` locks.
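For reference, here is a minimal sketch of the IMMEDIATE-transaction pattern this relies on, using plain DBI/RSQLite. The `with_immediate_lock()` helper is my own illustration, not liteq's actual `do_db_execute`:

```r
# Sketch: run a write inside a BEGIN IMMEDIATE transaction. BEGIN IMMEDIATE
# reserves the write lock up front but still lets other connections read;
# BEGIN EXCLUSIVE would (outside WAL mode) block readers as well.
library(DBI)

with_immediate_lock <- function(con, code) {
  dbExecute(con, "BEGIN IMMEDIATE")
  ok <- FALSE
  on.exit(if (!ok) try(dbExecute(con, "ROLLBACK"), silent = TRUE), add = TRUE)
  res <- force(code)            # evaluate the caller's write statement(s)
  dbExecute(con, "COMMIT")
  ok <- TRUE
  res
}

# Usage (hypothetical SQL): with_immediate_lock(con, dbExecute(con, some_update_sql))
```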
This works very well for my use case in #28, but please take a look to ensure that I'm not missing something. Compared to master, tests are much quicker now, but two tests fail; I believe these are spurious failures, due to the fact that IMMEDIATE locks can coexist with readers:
══ Failed tests ════════════════════════════════════════════════════════════════
── Failure (test-db.R:79:3): db_lock ───────────────────────────────────────────
`db_query(con3, "SELECT * FROM meta")` did not throw an error.
── Failure (test-db.R:82:3): db_lock ───────────────────────────────────────────
`db_query(con2, "SELECT * FROM meta")` did not throw an error.
[ FAIL 2 | WARN 0 | SKIP 11 | PASS 173 ]
Thanks! TBH I am not sure about the IMMEDIATE locks. This is from the sqlite3 docs:
> EXCLUSIVE is similar to IMMEDIATE in that a write transaction is started immediately. EXCLUSIVE and IMMEDIATE are the same in WAL mode, but in other journaling modes, EXCLUSIVE prevents other database connections from reading the database while the transaction is underway.
While I think that IMMEDIATE will work for us, I don't understand why EXCLUSIVE did not work. AFAICT the only difference is that IMMEDIATE allows reads, but not writes. But in #28 the first failure is after a write transaction, so that should still fail the same way.
Also, why are these problems not solved by a longer busy timeout?
> Thanks! TBH I am not sure about the IMMEDIATE locks. This is from the sqlite3 docs:
>
> > EXCLUSIVE is similar to IMMEDIATE in that a write transaction is started immediately. EXCLUSIVE and IMMEDIATE are the same in WAL mode, but in other journaling modes, EXCLUSIVE prevents other database connections from reading the database while the transaction is underway.
>
> While I think that IMMEDIATE will work for us, I don't understand why EXCLUSIVE did not work. AFAICT the only difference is that IMMEDIATE allows reads, but not writes. But in #28 the first failure is after a write transaction, so that should still fail the same way.
>
> Also, why are these problems not solved by a longer busy timeout?
I don't fully grasp why either, but it's evident that there's some kind of deadlock going on (sometimes I have to kill the session with my example above). Some more info:
> In order to avoid deadlocks in SQLite, programmers who want to modify a SQLite database start the transaction with BEGIN IMMEDIATE. When you begin your transaction with BEGIN EXCLUSIVE, the session is never aborted due to deadlock or lock contention with another transaction. Non-exclusive transactions are allowed to execute concurrently with the exclusive transaction, but the non-exclusive transactions will have their locks released if deadlock with the exclusive transaction occurs. If two or more exclusive transactions are running at the same time, they will be forced to execute in serial.
IMO, reads should be allowed, and thus all writers should be carefully guarded with IMMEDIATE locks. With this PR, the behaviour seems to be just as expected. Sometimes though, if I run the example multiple times, I see some tasks in FAILED state, and I don't understand why either. It doesn't matter so much thanks to liteq's recovery mechanism, but still... there's something off we don't see here.
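A toy two-connection example (my own, not from the test suite) of the intended behaviour: while one connection holds an IMMEDIATE transaction, another connection can still read, and only a second writer gets "database is locked":

```r
library(DBI)

db   <- tempfile(fileext = ".db")
con1 <- dbConnect(RSQLite::SQLite(), db)
con2 <- dbConnect(RSQLite::SQLite(), db)
dbExecute(con1, "CREATE TABLE meta (x INTEGER)")       # toy table, not liteq's schema

dbExecute(con1, "BEGIN IMMEDIATE")                     # con1 reserves the write lock
dbGetQuery(con2, "SELECT * FROM meta")                 # a concurrent read still works
try(dbExecute(con2, "INSERT INTO meta VALUES (1)"))    # a concurrent write errors
                                                       # (possibly after the busy timeout)
dbExecute(con1, "COMMIT")
dbDisconnect(con1); dbDisconnect(con2)
```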
FWIW, I added this patch to a Shiny app that manages cron jobs with ~20 workers reading from a liteq queue, and there have been no more issues.
I think I solved the issue of FAILED messages with the commit above: now make_message locks with retries, just like the other functions, using db_lock. And I have disabled the tests that expect an EXCLUSIVE lock. Those tests could be rewritten so that all connections try to write instead of reading.
I found this interesting read. As it turns out, RSQLite doesn't define HAVE_USLEEP (but, as the article argues, it's safer to assume it won't be available for some users anyway). Therefore, multiple threads on a multicore system trying to exclusively lock the database in an infinite loop, with the busy handler sleeping whole seconds at a time, are doomed to fail. That's why a longer busy timeout doesn't help if the right circumstances arise.
Right. But you are saying that this is not the case for RSQLite, so we still don't have a complete explanation for our case, right?
What is not the case for RSQLite? As I said, RSQLite does not define HAVE_USLEEP. So it's sleeping whole seconds.
Right, sorry, misunderstood.
Quick question. I suppose that this is low priority among your (many) other projects, but is there any roadmap or estimate for this? Just to adjust my expectations. Also, please, let me know if there's something more I could do to help. Thanks in advance.
I'll take a look some time this week, sorry for the delay.
Do you have some tips to make the code in #28 fail more often? I cannot make it reliably fail, so far I have seen one failure out of a hundred runs.
Should I just run it longer? Increase the number of subprocesses? Decrease the sleep time in the while loop? Decrease the busy timeout?
Setting a one second timeout and increasing the number of subprocesses to 20 did it seemingly...
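For concreteness, the kind of reproducer and knobs being discussed might look roughly like this. This is a hypothetical sketch with made-up parameters and a made-up `worker()` helper, not the exact code from #28:

```r
# Many background workers publish to and consume from the same liteq queue
# while the main session keeps reading it; more workers and shorter sleeps
# make "database is locked" errors more likely.
library(liteq)
library(callr)

db <- tempfile(fileext = ".db")
q  <- ensure_queue("jobs", db = db)

worker <- function(db, n = 50) {
  q <- liteq::ensure_queue("jobs", db = db)
  for (i in seq_len(n)) {
    liteq::publish(q, title = paste0("msg-", i), message = "payload")
    msg <- liteq::try_consume(q)
    if (!is.null(msg)) liteq::ack(msg)
    Sys.sleep(stats::runif(1, 0, 0.05))   # shrink this to increase contention
  }
}

procs <- lapply(1:20, function(i) callr::r_bg(worker, args = list(db = db)))
while (any(vapply(procs, function(p) p$is_alive(), logical(1)))) {
  print(nrow(list_messages(q)))   # reading while the workers write is what failed in #28
  Sys.sleep(0.1)
}
```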
FYI I did compile RSQLite with -DHAVE_USLEEP and that seems to solve the issue completely, so we should submit a patch to RSQLite I think.
OTOH, unfortunately this PR does not fix the issue when RSQLite is compiled without -DHAVE_USLEEP, so we'll have to think about another workaround.
> Setting a one second timeout and increasing the number of subprocesses to 20 did it seemingly...
Yes, there's a certain combination that I suppose depends on the number of cores, the scheduler...
> FYI I did compile RSQLite with `-DHAVE_USLEEP` and that seems to solve the issue completely, so we should submit a patch to RSQLite I think.
+1, but liteq should be robust to 1-second sleeps too I think. Plus I don't think there's any reason to avoid coexistence of readers (e.g. list_messages) and writers here.
> OTOH, unfortunately this PR does not fix the issue when RSQLite is compiled without `-DHAVE_USLEEP`, so we'll have to think about another workaround.
Are you sure? Did you reload the namespace properly before trying? Because I don't have any issues with the patch regardless of the number of workers and the sleep between publish calls.
> Are you sure? Did you reload the namespace properly before trying? Because I don't have any issues with the patch regardless of the number of workers and the sleep between publish calls.
Yeah, totally sure. I added your code as a test case, and increased the number of subprocesses.
~/works/liteq Enchufa2-fix/28* 10s
❯ R CMD INSTALL .
* installing to library ‘/Users/gaborcsardi/Library/R/4.0/library’
* installing *source* package ‘liteq’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (liteq)
~/works/liteq Enchufa2-fix/28*
❯ R -q -e 'devtools::test(filter = "concurrency-2")'
> devtools::test(filter = "concurrency-2")
Loading liteq
Testing liteq
✔ | OK F W S | Context
✖ | 0 1 | concurrency-2 [8.1 s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-concurrency-2.R:26:3): high concurrency
Error: database is locked
Backtrace:
1. base::nrow(list_messages(q)) test-concurrency-2.R:26:2
2. liteq::list_messages(q)
3. liteq::db_list_messages(queue$db, queue$name) /Users/gaborcsardi/works/liteq/R/messages.R:170:2
4. liteq::do_db(db, q, tablename = db_queue_name(queue)) /Users/gaborcsardi/works/liteq/R/db.R:397:2
5. liteq::db_query(con, query, ...) /Users/gaborcsardi/works/liteq/R/db.R:68:2
7. DBI::dbGetQuery(con, sqlInterpolate(con, query, ...))
8. DBI:::.local(conn, statement, ...)
10. RSQLite::dbSendQuery(conn, statement, ...)
11. RSQLite:::.local(conn, statement, ...)
15. RSQLite:::result_create(conn@ptr, statement) /Users/gaborcsardi/works/RSQLite/R/query.R:16:4
────────────────────────────────────────────────────────────────────────────────────────────────────────────
══ Results ═════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 8.1 s
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 0 ]
> +1, but liteq should be robust to 1-second sleeps too I think.
Ideally yes, but that will not be easy. We would need to implement our own randomized exponential backoff.
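For illustration, randomized exponential backoff around a locked write could look something like the following sketch (made-up helper name and parameters, not liteq code):

```r
# Sketch: retry a write with randomized exponential backoff instead of relying
# only on SQLite's busy handler (which, without HAVE_USLEEP, sleeps whole seconds).
library(DBI)

execute_with_backoff <- function(con, sql, max_tries = 10, base_wait = 0.05) {
  for (i in seq_len(max_tries)) {
    res <- tryCatch(dbExecute(con, sql), error = identity)
    if (!inherits(res, "error")) return(res)
    if (!grepl("database is locked", conditionMessage(res))) stop(res)
    # Sleep up to base_wait * 2^(i - 1) seconds, with random jitter, so that
    # concurrent writers do not all wake up and retry at the same instant.
    Sys.sleep(stats::runif(1, 0, base_wait * 2 ^ (i - 1)))
  }
  stop("database is locked (gave up after ", max_tries, " tries)")
}
```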
> Plus I don't think there's any reason to avoid coexistence of readers (e.g. list_messages) and writers here.
Sure. We can still merge this, but it will not solve all concurrency issues.
At least now I know what is going on. :) I think...
> 3. liteq::db_list_messages(queue$db, queue$name) /Users/gaborcsardi/works/liteq/R/messages.R:170:2
Isn't this running the wrong liteq?
(Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
> Isn't this running the wrong liteq?
There is only one, just installed it above this.
> (Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
Yeah, that is weird. It seems that the PRAGMA in db_set_timeout() makes it a write transaction. At least if I remove that from db_connect() then the PR works and there is no error.
> > Isn't this running the wrong liteq?
>
> There is only one, just installed it above this.
Sorry, I got confused with the branch name in your prompt.
> > (Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
>
> Yeah, that is weird. It seems that the `PRAGMA` in `db_set_timeout()` makes it a write transaction. At least if I remove that from `db_connect()` then the PR works and there is no error.
db_set_timeout is called in db_connect, db_query (so twice for list_messages) and db_execute. It should be enough to call it in the last one, because readers do not need it with IMMEDIATE locks.
Or maybe the other way around: just send the pragma once in db_connect, prior to any transaction.
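A sketch of that second option, with a hypothetical connect helper (the real `db_connect()` in liteq may look different):

```r
# Sketch: set the busy timeout once per connection, right after connecting and
# before any transaction, instead of re-sending the PRAGMA in every helper.
library(DBI)

db_connect_once <- function(dbname, timeout_ms = 1000) {
  con <- dbConnect(RSQLite::SQLite(), dbname)
  # PRAGMA busy_timeout is per-connection, so issuing it once here should be
  # enough, unless something resets it later.
  dbExecute(con, paste0("PRAGMA busy_timeout = ", timeout_ms))
  con
}
```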
> Or maybe the other way around: just send the pragma once in db_connect, prior to any transaction.
Maybe. It is set multiple times because sqlite3 seems to reset it.
But anyway, this does not solve the concurrency issues: there might be multiple concurrent writers as well, so this needs to be solved either in RSQLite or by implementing our own backoff.
I'll merge this soon (probably today) after some changes.
Codecov Report
Merging #29 (215fd4f) into master (ef18193) will not change coverage. The diff coverage is 0.00%.
@@          Coverage Diff           @@
##          master      #29   +/-  ##
======================================
  Coverage   0.00%    0.00%
======================================
  Files          5        5
  Lines        318      321    +3
======================================
- Misses       318      321    +3
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/db.R | 0.00% <0.00%> (ø) | |
| R/messages.R | 0.00% <0.00%> (ø) | |
Great, thanks. Let me know if I can help.
This gave me a good excuse to understand the DB functions (again) and refactor them, so they are easier to maintain.
I'll merge when the checks come back clean.
FYI I wrote up the concurrency issues at https://blog.r-hub.io/2021/03/13/rsqlite-parallel/