
[BUG] LiteDB is locking up in the DiskWriterQueue

Open dgodwin1175 opened this issue 3 years ago • 12 comments

Version
LiteDB: 5.0.16
.NET Framework: 4.6.2

Describe the bug
Requests to read or write a collection (e.g. `LiteDatabase.GetCollection<Type>(CollectionName)`) hang and never return when the database is used by multiple threads. The only way to recover from this situation is to kill the process hosting the database.

Note that this is becoming a critical issue for our application and our customers, so we would appreciate a prompt response if at all possible.

Steps to Reproduce
Our application has concurrent threads that simultaneously read and update documents in the database. Our LiteDatabase object is a singleton, and we lock on it to ensure only one thread updates the database at a time. After some time, all threads block waiting for the lock on the database, which is held by a call to GetCollection() that never returns.
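For clarity, the access pattern described above (one shared database object, with all access serialized behind a lock) looks roughly like this. This is a minimal Python model of the pattern, not our actual C# code; `FakeDb` is a hypothetical stand-in for the LiteDatabase singleton:

```python
import threading

class FakeDb:
    """Stand-in for the shared LiteDatabase object (illustrative only,
    not LiteDB's API)."""
    def __init__(self):
        self.docs = {}

    def upsert(self, key, value):
        self.docs[key] = value

db = FakeDb()
db_lock = threading.Lock()  # serializes all database access, as described

def worker(worker_id):
    for n in range(100):
        with db_lock:       # only one thread touches the db at a time
            db.upsert((worker_id, n), n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(db.docs))  # 800: all writes completed, one thread at a time
```

Even with this serialization on our side, the hang occurs.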

I downloaded the latest LiteDB code and added logging to pinpoint where the problem is. It appears to be in DiskWriterQueue: the ExecuteQueue function is scheduled by Task.Run(ExecuteQueue), but ExecuteQueue never actually runs, so the queue is never processed. See the log snippets below (all of this logging is mine):

```
2023-04-11 11:56:58.054 [17] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.054 [17] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.163 [41] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:58.163 [41] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.163 [41] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.195 [9] ERROR Got lock in DiskWriter.Run(), queue size is 62
2023-04-11 11:56:58.195 [9] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.195 [9] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.210 [38] ERROR Got lock in DiskWriter.Run(), queue size is 63
2023-04-11 11:56:58.226 [38] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [38] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.226 [53] ERROR Got lock in DiskWriter.Run(), queue size is 64
2023-04-11 11:56:58.226 [53] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [53] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [32] ERROR Got lock in DiskWriter.Run(), queue size is 65
2023-04-11 11:56:58.242 [32] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [32] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [44] ERROR Got lock in DiskWriter.Run(), queue size is 66
2023-04-11 11:56:58.242 [44] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [44] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [58] ERROR Got lock in DiskWriter.Run(), queue size is 67
2023-04-11 11:56:58.257 [58] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [58] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.257 [39] ERROR Got lock in DiskWriter.Run(), queue size is 69
2023-04-11 11:56:58.257 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.273 [4] ERROR Got lock in DiskWriter.Run(), queue size is 70
2023-04-11 11:56:58.273 [4] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.273 [4] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [54] ERROR Got lock in DiskWriter.Run(), queue size is 71
2023-04-11 11:56:58.288 [54] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.288 [54] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [62] ERROR Got lock in DiskWriter.Run(), queue size is 72
2023-04-11 11:56:58.304 [62] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.304 [62] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.320 [40] ERROR Got lock in DiskWriter.Run(), queue size is 73
2023-04-11 11:56:58.320 [62] ERROR ExecuteQueue, queue size is 73
2023-04-11 11:56:58.320 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.351 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.429 [57] ERROR Got lock in DiskWriter.Run(), queue size is 4
2023-04-11 11:56:58.429 [57] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.429 [57] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.476 [62] ERROR ExecuteQueue, page is null, queue size 0
2023-04-11 11:56:58.476 [62] ERROR Queue is empty, exiting ExecuteQueue
2023-04-11 11:56:58.714 [54] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.714 [54] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Releasing lock in DiskWriter.Run()
```

Note that after this, ExecuteQueue never runs, so the queue is never emptied and keeps growing:

```
2023-04-11 11:56:59.403 [40] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:59.403 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.403 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:59.757 [39] ERROR Got lock in DiskWriter.Run(), queue size is 90
2023-04-11 11:56:59.772 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.772 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:57:01.401 [55] ERROR Got lock in DiskWriter.Run(), queue size is 120
2023-04-11 11:57:01.401 [55] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:57:01.401 [55] ERROR Releasing lock in DiskWriter.Run()
```

Anything that hits the DiskWriterQueue.Wait() function after this waits forever for _task to complete and empty the queue, which never happens.

The only way to recover from this is to restart our process.

Expected behavior
No locked-up threads.

Additional context
Note that we tried removing all locking on our side, but this resulted in "Lock timeout on transaction after 01:00" exceptions, followed by "Maximum number of transactions reached". I assume this is just a different manifestation of the same problem.

Note that if we repeat the test with version 4.1.4 of LiteDB, we do not see this issue.

dgodwin1175 avatar Apr 11 '23 08:04 dgodwin1175

We have run into this issue as well and have been able to reproduce it with a unit test. The root cause is this check: https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Disk/DiskWriterQueue.cs#L95 If this condition is hit, the _running flag is never set back to 0, and no new ExecuteQueue task is ever started.

So when does this happen? It happens if thread T1 calls EnqueuePage while T2 is running ExecuteQueue, and T2 processes the queued item exactly while T1, inside Run, is between these lines: https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Disk/DiskWriterQueue.cs#L59-L61 T2 sets _running to 0 just before T1 tries to change it, so T1 starts a new ExecuteQueue task while the queue is empty, and everything is dead after that.

The fix is to get rid of that if condition, but I first want to see whether a simpler reproduction test can be written, and will then open a change request.
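The stuck state can be shown with a deterministic simulation. Below is a minimal Python model of the _running flag handshake described above (the class and method names are mine; this is not LiteDB's code). Once a drain pass exits without resetting the flag, every later compare-exchange sees 1 and never starts a new task:

```python
from collections import deque

class DiskWriterQueueModel:
    """Simplified model of the _running flag handshake described above.
    Not LiteDB code; it only reproduces the state machine."""

    def __init__(self):
        self.pages = deque()
        self.running = 0  # 0 = no drain task, 1 = a drain task is assumed alive

    def enqueue_and_run(self, page):
        self.pages.append(page)
        # Equivalent of Interlocked.CompareExchange(ref _running, 1, 0):
        if self.running == 0:
            self.running = 1
            return True   # caller would now Task.Run(ExecuteQueue)
        return False      # assumes an already-running task will drain the page

    def execute_queue(self, hit_early_return):
        while self.pages:
            self.pages.popleft()
        if hit_early_return:
            return        # models the exit path that skips resetting _running
        self.running = 0  # normal exit: a later Run() can start a new task

m = DiskWriterQueueModel()
m.enqueue_and_run("page1")               # starts the "drain task"
m.execute_queue(hit_early_return=True)   # task exits via the problematic path
started = m.enqueue_and_run("page2")     # compare-exchange sees _running == 1
print(started, m.running, len(m.pages))  # False 1 1: flag stuck, page stranded
```

From here on, every Run() call sees _running == 1 and returns without scheduling a drain, which matches the ever-growing queue sizes in the logs above.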

m-omerovic avatar Apr 26 '23 07:04 m-omerovic

This is my reproduction: https://github.com/mbdavid/LiteDB/compare/master...m-omerovic:LiteDB:master-DiskWriterQueue-multi-threading-issue
This is the PR for the fix: https://github.com/mbdavid/LiteDB/pull/2317

m-omerovic avatar Apr 26 '23 08:04 m-omerovic

@m-omerovic, I'm not sure if this is the same issue as we've been seeing, but we'll test it out to see if your fix resolves it. Thanks.

dgodwin1175 avatar May 05 '23 04:05 dgodwin1175

I have a strong suspicion that it is, please let me know if it does not solve your problem.

m-omerovic avatar May 05 '23 12:05 m-omerovic

Hopefully it's ok to write here without much information.

I suspect I am having the same issue. Everything works on 4.1.4 (as OP wrote), but the issue still persists on 5.0.17. I've added the LiteDB code to my solution and use it directly. When I run it in debug mode, there is no issue; when I run it in release mode, the issue occurs again.

Since I can't reproduce the bug in debug mode, I will try to reproduce and debug it using prints and update here. Hopefully the information I've provided (debug vs. release) helps on its own.

Ido00 avatar Aug 06 '23 13:08 Ido00

I'll elaborate a bit on what I'm doing. I'm writing a server that concurrently receives messages, inserts them into the DB, then forwards them to their destination. For every client, an async task is launched to handle that client's connection and receive its messages. LiteDB is used as a singleton. I have a test in which 100 clients concurrently send a message (through the server) to each other.

On 4.1.4 (and 5.0.17 on debug) there is no issue, the test passes quickly and successfully.

However, on 5.0.17 release the issue is as follows: after around 1000-2000 messages (roughly 1-2 seconds), every time this line is reached, the ReaderWriterLockSlim _transaction shows a larger WaitingReadCount. That is, more and more tasks are waiting on the reader lock for some reason. (I reach this place from LiteDB.Engine: LiteEngine.Insert -> LiteEngine.AutoTransaction -> TransactionMonitor.GetTransaction -> LockService.EnterTransaction.)

If I add a Console.WriteLine(...); print anywhere in this Run function, the issue does not occur, so I assume it happens somewhere in that function. However, I am not sure how to investigate further, since I can't use the debugger and can't use prints.

Ido00 avatar Aug 06 '23 16:08 Ido00

> I have a strong suspicion that it is, please let me know if it does not solve your problem.

Unfortunately, we have tested the latest code together with your PR changes, and the problem persists. For now we have had to revert to 4.1.4.

dgodwin1175 avatar Aug 07 '23 01:08 dgodwin1175

@Ido00 @dgodwin1175 That's a bummer. I will try to run my original reproduction test on the new release and see what happens. If you guys have any reproduction code, then please share. I guess debug build and console output will alter the timing and probably decrease the chance of the issue occurring.

m-omerovic avatar Aug 07 '23 07:08 m-omerovic

> @Ido00 @dgodwin1175 That's a bummer. I will try to run my original reproduction test on the new release and see what happens. If you guys have any reproduction code, then please share. I guess debug build and console output will alter the timing and probably decrease the chance of the issue occurring.

@m-omerovic, we don't have any test code that we can share at this stage; it may take some effort to write a test harness that replicates what our application is doing. If we get time to do this, we will. Note that when reproducing on our side, it makes no difference whether the build is debug or release, and adding all of the additional logging from my original post didn't affect our ability to reproduce either.

dgodwin1175 avatar Aug 08 '23 01:08 dgodwin1175

Is this reproducible with .NET Core (.NET 6, for example)?

Also, is there any difference when creating LiteDatabase instances at the point of use (and disposing of them afterwards) instead of using a singleton?

tjmoore avatar Aug 22 '23 17:08 tjmoore

When this occurs, are there any errors, or is it just hung?

In particular are there any "Access denied" or "File in use by another process" exceptions occurring just before this happens?

I've been working for some time on customer issues where we see these errors and then something locks up. I've removed all possible bottlenecks in our own code, but something is still locking up. Just wondering if it's the same issue.

Oddly, this only occurs on certain customer servers. It is always Windows Server, though the version varies, and so far a common factor is that MsSense is active on these servers and accessing the database files, but I can't prove it has anything to do with it.

tjmoore avatar Oct 23 '23 10:10 tjmoore

I tried 5.0.17 and it doesn't resolve the issue I'm seeing.

Using Process Monitor, I do see a few DELETE PENDING entries on the '-log.db' file just before it locks up. In normal operation I don't see that.

From a process dump, I can see it's blocked on the mutex in the private OpenDatabase method in SharedEngine. It seems a similar issue has been found with that before: #2135

tjmoore avatar Oct 24 '23 10:10 tjmoore