
Database locked while consumer runs

Open robsdedude opened this issue 5 years ago • 11 comments

First of all: Thanks for the nice project!

I've got the project running on a Raspberry Pi 3 B+ (so basically a toaster). This means that the consumer takes a looong time to consume the PDFs, which per se is no problem for me. However, I noticed that the consumer seems to lock the database while it's processing PDFs. So I can't edit already consumed documents while there are any left in the consumption dir. I get a 500 (OperationalError: database is locked) when I try to save any model while the consumer is working.

Is this necessary, or could the consumer close/unlock the database connection until OCR, guesswork, and the rest are done?

https://github.com/the-paperless-project/paperless/blob/master/src/documents/consumer.py#L115 this seems to be the code in question.
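For what it's worth, the failure mode is easy to reproduce with nothing but Python's stdlib `sqlite3` module; the table and file names below are made up for illustration. One connection holds an open write transaction (like the consumer does for the whole consumption run), and a second writer fails immediately:

```python
import os
import sqlite3
import tempfile

# A shared on-disk database (in-memory DBs aren't shared between connections).
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, timeout=0)
writer.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
writer.commit()

# Connection 1 opens a write transaction and holds it, like the consumer's
# long-running transaction during OCR.
writer.execute("INSERT INTO docs (title) VALUES ('scan-001')")

# Connection 2 (like the web UI saving a model) now tries to write.
# timeout=0 means "don't wait for the lock", so it fails right away.
reader = sqlite3.connect(path, timeout=0)
try:
    reader.execute("INSERT INTO docs (title) VALUES ('edited')")
    locked = False
except sqlite3.OperationalError as e:
    locked = "locked" in str(e)

writer.commit()  # releasing the transaction lets other writers proceed again
print(locked)  # → True
```

With the default 5-second timeout instead of 0, the second writer would block for 5 seconds and then raise the same error, which matches the 500s seen in the web UI.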

robsdedude avatar Jun 03 '19 10:06 robsdedude

Thanks for bringing this up! I have run into this as well. I'm fairly sure this has not always been the case and I used to edit documents whilst some were still being consumed.

ddddavidmartin avatar Jun 03 '19 11:06 ddddavidmartin

I've noticed this too, and it makes paperless unusable if you're doing something like an initial ingestion of a boatload of PDFs and you also want to manage them in the UI...

stgarf avatar Jun 04 '19 23:06 stgarf

The issue seems to be specific to SQLite https://docs.djangoproject.com/en/2.2/ref/databases/#database-is-locked-errors
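As a side note, that Django docs page suggests raising SQLite's busy timeout as one mitigation, so competing writers wait for the lock instead of failing immediately. A settings fragment (the 20-second value is just an example, the default is 5 seconds):

```python
# settings.py — make SQLite wait longer for a lock before raising
# OperationalError: database is locked.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "db.sqlite3",
        "OPTIONS": {
            # Seconds to wait on a locked database before giving up.
            "timeout": 20,
        },
    }
}
```

That only papers over the problem here, though: with consumption runs measured in minutes on slow hardware, no reasonable timeout will outlast the lock.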

There is only one database write in the try_consume_file function which happens at the end of the consumption process https://github.com/the-paperless-project/paperless/blob/8e6d7cba1329c959659cd133522e30cda1ae3943/src/documents/consumer.py#L154

Is the transaction.atomic decorator necessary if you don't have document_consumption_started or document_consumption_finished signal handlers that write to the database?

Would it be possible to have a setting that toggles atomic transactions (defaulting to True) and then dispatch to try_consume_file_atomic accordingly?

if CONSUME_FILE_ATOMIC:
    result = self.try_consume_file_atomic(file)
else:
    result = self.try_consume_file(file)

@transaction.atomic
def try_consume_file_atomic(self, file):
    return self.try_consume_file(file)

def try_consume_file(self, file):
    ...

joshwizzy avatar Jun 05 '19 05:06 joshwizzy

Would it help to just decorate the _store method with the @transaction.atomic decorator instead of the whole try_consume_file consumption method? Then I'd think it would not lock the database for the whole consumption but only for actually writing the consumed file to the database.

I'd expect that there would be only one instance of the consumer running in any case.

ddddavidmartin avatar Jun 05 '19 05:06 ddddavidmartin

Isn't the @transaction.atomic only necessary if you are writing to the database in the signal handlers, so that there is a set of DB operations that may be rolled back? The _store method is only one database operation, so it doesn't need @transaction.atomic.
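To illustrate the rollback behavior an atomic block buys you when there are several related writes, here is a stdlib-only sketch (table names invented, and using sqlite3's connection context manager in place of @transaction.atomic):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")

# Two related steps in one transaction: if the second fails (e.g. a signal
# handler raises), the first write must be undone too.
try:
    with db:  # one transaction spanning the whole block
        db.execute("INSERT INTO documents (title) VALUES ('scan')")
        raise RuntimeError("signal handler failed")  # simulated handler error
except RuntimeError:
    pass

count = db.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)  # → 0, the insert was rolled back
```

With a single write there is nothing to roll back together, which is the point above.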

joshwizzy avatar Jun 05 '19 05:06 joshwizzy

Isn't the @transaction.atomic only necessary if you are writing to the database in the signal handlers, so that there is a set of DB operations that may be rolled back?

My experience with databases and django is practically zero, so no idea. My suggestion was just based on the lock being possibly too broad and that a lot of time in the initial consumption in try_consume_file seems unrelated to the database.

ddddavidmartin avatar Jun 05 '19 05:06 ddddavidmartin

Because of this, I have to preprocess my documents with OCR before consumption. The consumer then doesn't run as long, and the database lock is held for a shorter time.

LorenzBischof avatar Oct 24 '19 07:10 LorenzBischof

Thinking through this for the third time now, there doesn't seem to be a good way to properly handle this while keeping the same assumptions: anything connected to the pre- or post-consumption signals may change the database, and if a pre- or post-signal handler fails, we should roll back everything.

I'm in favor of changing the behavior and moving @transaction.atomic to _store() instead of try_consume_file(). There doesn't seem to be anything happening in the DB before the pre-signal. We have to make sure all the steps in _store() are atomic, as that's where all the document's data is saved. In the default pre- and post-signal handlers, we shell out into another process. Since we run in a transaction, no other process should be able to change the same table we do (especially for sqlite DBs).
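A stdlib-only sketch of the proposed shape (function and table names are illustrative, not Paperless's actual code): the slow, DB-free work runs with no transaction open, and only the write is wrapped, so the lock is held briefly.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "paperless.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")
db.commit()

def run_ocr(file):
    # Placeholder for the slow, database-free work (OCR, guessing, ...).
    return f"text of {file}"

def store(content):
    # The write lock is only held for the duration of this block,
    # analogous to moving @transaction.atomic onto _store().
    with db:  # sqlite3 connection as context manager == one transaction
        db.execute("INSERT INTO documents (content) VALUES (?)", (content,))

def try_consume_file(file):
    content = run_ocr(file)  # long-running, no transaction open here
    store(content)           # short transaction around the write only

try_consume_file("scan-001.pdf")
count = db.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)  # → 1
```

During run_ocr() other connections (the web UI) can read and write freely; only the short store() window can collide.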

MasterofJOKers avatar Feb 28 '20 23:02 MasterofJOKers

Another nasty side-effect of this: while the consumer is doing its work you cannot log into the web interface.

languitar avatar Mar 07 '20 14:03 languitar

Is there anything which can be done regarding the immensely long runtime? I remember that in my "old" installation using mariadb/mysql, I never experienced the long runtime nor the database locking. I imagine that the database locking wouldn't be much of a problem if the runtime wasn't so long.

stueja avatar May 30 '20 15:05 stueja

@stueja It could be related to the new tesseract version that was activated (v3->v4) https://github.com/the-paperless-project/paperless/commit/3050ff159466e873e3542e898e76848d6aaae3e6#diff-3254677a7917c6c01f55212f86c57fbf It uses neural networks and I noticed a decrease in performance on my system. But generally OCR is expensive and takes a while. Not much you can do...

Please open a new issue if you are having performance issues. This issue is about the database lock being too broad and this causing usability issues, because the database is locked longer than need be.

LorenzBischof avatar May 30 '20 16:05 LorenzBischof