liteq
close #28: fix concurrency issues and deadlocks
This is my take at #28, which partially revisits #26. This PR basically:
- makes `do_db_execute` take the lock, because every call to it is a write;
- makes `publish` take the lock, same reason;
- substitutes `EXCLUSIVE` locks with `IMMEDIATE` locks.
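For reference, here is a minimal sketch of the IMMEDIATE-transaction pattern this relies on, using plain DBI/RSQLite. The `with_immediate_lock()` helper is my own illustration, not liteq's actual `do_db_execute`:

```r
# Sketch: run a write inside a BEGIN IMMEDIATE transaction. BEGIN IMMEDIATE
# reserves the write lock up front but still lets other connections read;
# BEGIN EXCLUSIVE would (outside WAL mode) block readers as well.
library(DBI)

with_immediate_lock <- function(con, code) {
  dbExecute(con, "BEGIN IMMEDIATE")
  ok <- FALSE
  on.exit(if (!ok) try(dbExecute(con, "ROLLBACK"), silent = TRUE), add = TRUE)
  res <- force(code)            # evaluate the caller's write statement(s)
  dbExecute(con, "COMMIT")
  ok <- TRUE
  res
}

# Usage (hypothetical SQL): with_immediate_lock(con, dbExecute(con, some_update_sql))
```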
This works very well for my use case in #28, but please take a look to ensure that I'm not missing something. Compared to master, tests are much quicker now, but two tests fail; I believe these are spurious failures, due to the fact that IMMEDIATE locks can coexist with readers:
══ Failed tests ════════════════════════════════════════════════════════════════
── Failure (test-db.R:79:3): db_lock ───────────────────────────────────────────
`db_query(con3, "SELECT * FROM meta")` did not throw an error.
── Failure (test-db.R:82:3): db_lock ───────────────────────────────────────────
`db_query(con2, "SELECT * FROM meta")` did not throw an error.
[ FAIL 2 | WARN 0 | SKIP 11 | PASS 173 ]
Thanks! TBH I am not sure about the IMMEDIATE locks. This is from the sqlite3 docs:
> EXCLUSIVE is similar to IMMEDIATE in that a write transaction is started immediately. EXCLUSIVE and IMMEDIATE are the same in WAL mode, but in other journaling modes, EXCLUSIVE prevents other database connections from reading the database while the transaction is underway.
While I think that IMMEDIATE will work for us, I don't understand why EXCLUSIVE did not work. AFAICT the only difference is that IMMEDIATE allows reads, but not writes. But in #28 the first failure is after a write transaction, so that should still fail the same way.
Also, why are these problems not solved by a longer busy timeout?
> Thanks! TBH I am not sure about the IMMEDIATE locks. This is from the sqlite3 docs:
>
> > EXCLUSIVE is similar to IMMEDIATE in that a write transaction is started immediately. EXCLUSIVE and IMMEDIATE are the same in WAL mode, but in other journaling modes, EXCLUSIVE prevents other database connections from reading the database while the transaction is underway.
>
> While I think that IMMEDIATE will work for us, I don't understand why EXCLUSIVE did not work. AFAICT the only difference is that IMMEDIATE allows reads, but not writes. But in #28 the first failure is after a write transaction, so that should still fail the same way.
>
> Also, why are these problems not solved by a longer busy timeout?
I don't fully grasp why either, but it's evident that there's some kind of deadlock going on (sometimes I have to kill the session with my example above). Some more info:
> In order to avoid deadlocks in SQLite, programmers who want to modify a SQLite database start the transaction with BEGIN IMMEDIATE. When you begin your transaction with BEGIN EXCLUSIVE, the session is never aborted due to deadlock or lock contention with another transaction. Non-exclusive transactions are allowed to execute concurrently with the exclusive transaction, but the non-exclusive transactions will have their locks released if deadlock with the exclusive transaction occurs. If two or more exclusive transactions are running at the same time, they will be forced to execute in serial.
IMO, reads should be allowed, and thus all writers should be carefully guarded with IMMEDIATE locks. With this PR, the behaviour seems to be just as expected. Sometimes though, if I run the example multiple times, I see some tasks in FAILED state, and I don't understand why either. It doesn't matter so much thanks to liteq's recovery mechanism, but still... there's something off we don't see here.
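A toy two-connection example (my own, not from the test suite) of the intended behaviour: while one connection holds an IMMEDIATE transaction, another connection can still read, and only a second writer gets "database is locked":

```r
library(DBI)

db   <- tempfile(fileext = ".db")
con1 <- dbConnect(RSQLite::SQLite(), db)
con2 <- dbConnect(RSQLite::SQLite(), db)
dbExecute(con1, "CREATE TABLE meta (x INTEGER)")       # toy table, not liteq's schema

dbExecute(con1, "BEGIN IMMEDIATE")                     # con1 reserves the write lock
dbGetQuery(con2, "SELECT * FROM meta")                 # a concurrent read still works
try(dbExecute(con2, "INSERT INTO meta VALUES (1)"))    # a concurrent write errors
                                                       # (possibly after the busy timeout)
dbExecute(con1, "COMMIT")
dbDisconnect(con1); dbDisconnect(con2)
```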
FWIW, I added this patch to a Shiny app that manages cron jobs with ~20 workers reading from a liteq queue, and there have been no more issues.
I think I solved the issue of FAILED messages with the commit above: now make_message locks with retries, just like the other functions, using db_lock. And I have disabled the tests that expect an EXCLUSIVE lock. Those tests could be rewritten so that all connections try to write instead of reading.
I found this interesting read. As it turns out, RSQLite doesn't define HAVE_USLEEP (but, as the article argues, it's safer to assume it won't be available for some users anyway). Therefore, multiple threads on a multicore system trying to exclusively lock the database in an infinite loop, with the busy handler sleeping whole seconds at a time, are doomed to fail. That's why a longer busy timeout doesn't help if the right circumstances arise.
Right. But you are saying that this is not the case for RSQLite, so we still don't have a complete explanation for our case, right?
What is not the case for RSQLite? As I said, RSQLite does not define HAVE_USLEEP. So it's sleeping whole seconds.
Right, sorry, misunderstood.
Quick question. I suppose that this is low priority among your (many) other projects, but is there any roadmap or estimate for this? Just to adjust my expectations. Also, please, let me know if there's something more I could do to help. Thanks in advance.
I'll take a look some time this week, sorry for the delay.
Do you have some tips to make the code in #28 fail more often? I cannot make it reliably fail, so far I have seen one failure out of a hundred runs.
Should I just run it longer? Increase the number of subprocesses? Decrease the sleep time in the while loop? Decrease the busy timeout?
Setting a one second timeout and increasing the number of subprocesses to 20 did it seemingly...
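For concreteness, the kind of reproducer and knobs being discussed might look roughly like this. This is a hypothetical sketch with made-up parameters and a made-up `worker()` helper, not the exact code from #28:

```r
# Many background workers publish to and consume from the same liteq queue
# while the main session keeps reading it; more workers and shorter sleeps
# make "database is locked" errors more likely.
library(liteq)
library(callr)

db <- tempfile(fileext = ".db")
q  <- ensure_queue("jobs", db = db)

worker <- function(db, n = 50) {
  q <- liteq::ensure_queue("jobs", db = db)
  for (i in seq_len(n)) {
    liteq::publish(q, title = paste0("msg-", i), message = "payload")
    msg <- liteq::try_consume(q)
    if (!is.null(msg)) liteq::ack(msg)
    Sys.sleep(stats::runif(1, 0, 0.05))   # shrink this to increase contention
  }
}

procs <- lapply(1:20, function(i) callr::r_bg(worker, args = list(db = db)))
while (any(vapply(procs, function(p) p$is_alive(), logical(1)))) {
  print(nrow(list_messages(q)))   # reading while the workers write is what failed in #28
  Sys.sleep(0.1)
}
```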
FYI I did compile RSQLite with -DHAVE_USLEEP and that seems to solve the issue completely, so we should submit a patch to RSQLite I think.
OTOH, unfortunately this PR does not fix the issue when RSQLite is compiled without -DHAVE_USLEEP, so we'll have to think about another workaround.
> Setting a one second timeout and increasing the number of subprocesses to 20 did it seemingly...
Yes, there's a certain combination that I suppose depends on the number of cores, the scheduler...
> FYI I did compile RSQLite with `-DHAVE_USLEEP` and that seems to solve the issue completely, so we should submit a patch to RSQLite I think.
+1, but liteq should be robust to 1-second sleeps too I think. Plus I don't think there's any reason to avoid coexistence of readers (e.g. list_messages) and writers here.
> OTOH, unfortunately this PR does not fix the issue when RSQLite is compiled without `-DHAVE_USLEEP`, so we'll have to think about another workaround.
Are you sure? Did you reload the namespace properly before trying? Because I don't have any issues with the patch regardless of the number of workers and the sleep between publish calls.
> Are you sure? Did you reload the namespace properly before trying? Because I don't have any issues with the patch regardless of the number of workers and the sleep between publish calls.
Yeah, totally sure. I added your code as a test case, and increased the number of subprocesses.
~/works/liteq Enchufa2-fix/28* 10s
❯ R CMD INSTALL .
* installing to library ‘/Users/gaborcsardi/Library/R/4.0/library’
* installing *source* package ‘liteq’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (liteq)
~/works/liteq Enchufa2-fix/28*
❯ R -q -e 'devtools::test(filter = "concurrency-2")'
> devtools::test(filter = "concurrency-2")
Loading liteq
Testing liteq
✔ | OK F W S | Context
✖ | 0 1 | concurrency-2 [8.1 s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-concurrency-2.R:26:3): high concurrency
Error: database is locked
Backtrace:
1. base::nrow(list_messages(q)) test-concurrency-2.R:26:2
2. liteq::list_messages(q)
3. liteq::db_list_messages(queue$db, queue$name) /Users/gaborcsardi/works/liteq/R/messages.R:170:2
4. liteq::do_db(db, q, tablename = db_queue_name(queue)) /Users/gaborcsardi/works/liteq/R/db.R:397:2
5. liteq::db_query(con, query, ...) /Users/gaborcsardi/works/liteq/R/db.R:68:2
7. DBI::dbGetQuery(con, sqlInterpolate(con, query, ...))
8. DBI:::.local(conn, statement, ...)
10. RSQLite::dbSendQuery(conn, statement, ...)
11. RSQLite:::.local(conn, statement, ...)
15. RSQLite:::result_create(conn@ptr, statement) /Users/gaborcsardi/works/RSQLite/R/query.R:16:4
────────────────────────────────────────────────────────────────────────────────────────────────────────────
══ Results ═════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 8.1 s
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 0 ]
> +1, but liteq should be robust to 1-second sleeps too I think.
Ideally yes, but that will not be easy. We would need to implement our own randomized exponential backoff.
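For illustration, randomized exponential backoff around a locked write could look something like the following sketch (made-up helper name and parameters, not liteq code):

```r
# Sketch: retry a write with randomized exponential backoff instead of relying
# only on SQLite's busy handler (which, without HAVE_USLEEP, sleeps whole seconds).
library(DBI)

execute_with_backoff <- function(con, sql, max_tries = 10, base_wait = 0.05) {
  for (i in seq_len(max_tries)) {
    res <- tryCatch(dbExecute(con, sql), error = identity)
    if (!inherits(res, "error")) return(res)
    if (!grepl("database is locked", conditionMessage(res))) stop(res)
    # Sleep up to base_wait * 2^(i - 1) seconds, with random jitter, so that
    # concurrent writers do not all wake up and retry at the same instant.
    Sys.sleep(stats::runif(1, 0, base_wait * 2 ^ (i - 1)))
  }
  stop("database is locked (gave up after ", max_tries, " tries)")
}
```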
> Plus I don't think there's any reason to avoid coexistence of readers (e.g. list_messages) and writers here.
Sure. We can still merge this, but it will not solve all concurrency issues.
At least now I know what is going on. :) I think...
> 3. liteq::db_list_messages(queue$db, queue$name) /Users/gaborcsardi/works/liteq/R/messages.R:170:2
Isn't this running the wrong liteq?
(Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
> Isn't this running the wrong liteq?
There is only one, just installed it above this.
> (Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
Yeah, that is weird. It seems that the PRAGMA in db_set_timeout() makes it a write transaction. At least if I remove that from db_connect() then the PR works and there is no error.
> > Isn't this running the wrong liteq?
>
> There is only one, just installed it above this.
Sorry, I got confused with the branch name in your prompt.
> > (Sorry to insist, but note that list_messages raising "database is locked" when the writers are using IMMEDIATE locks makes no sense)
>
> Yeah, that is weird. It seems that the `PRAGMA` in `db_set_timeout()` makes it a write transaction. At least if I remove that from `db_connect()` then the PR works and there is no error.
db_set_timeout is called in db_connect, db_query (so twice for list_messages) and db_execute. It should be enough to call it in the last one, because readers do not need it with IMMEDIATE locks.
Or maybe the other way around: just send the pragma once in db_connect, prior to any transaction.
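A sketch of that second option, with a hypothetical connect helper (the real `db_connect()` in liteq may look different):

```r
# Sketch: set the busy timeout once per connection, right after connecting and
# before any transaction, instead of re-sending the PRAGMA in every helper.
library(DBI)

db_connect_once <- function(dbname, timeout_ms = 1000) {
  con <- dbConnect(RSQLite::SQLite(), dbname)
  # PRAGMA busy_timeout is per-connection, so issuing it once here should be
  # enough, unless something resets it later.
  dbExecute(con, paste0("PRAGMA busy_timeout = ", timeout_ms))
  con
}
```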
> Or maybe the other way around: just send the pragma once in db_connect, prior to any transaction.
Maybe. It is set multiple times because sqlite3 seems to reset it.
But anyway, this does not solve the concurrency issues: there might be multiple concurrent writers as well, so this needs to be solved either in RSQLite or by implementing our own backoff.
I'll merge this soon (probably today) after some changes.
Codecov Report
Merging #29 (215fd4f) into master (ef18193) will not change coverage. The diff coverage is 0.00%.
@@          Coverage Diff           @@
##          master      #29   +/-  ##
======================================
  Coverage   0.00%    0.00%
======================================
  Files          5        5
  Lines        318      321    +3
======================================
- Misses       318      321    +3
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/db.R | 0.00% <0.00%> (ø) | |
| R/messages.R | 0.00% <0.00%> (ø) | |
Great, thanks. Let me know if I can help.
This gave me a good excuse to understand the DB functions (again) and refactor them, so they are easier to maintain.
I'll merge when the checks come back clean.
FYI I wrote up the concurrency issues at https://blog.r-hub.io/2021/03/13/rsqlite-parallel/