
`docspell-consumedir` (`dsc`) processes files before scanning is complete

Open MateusMolina opened this issue 2 months ago • 4 comments

When scanning documents directly into a folder watched by the docspell-consumedir container, my scanner creates the file immediately and continues writing to it as each new page is scanned. This causes dsc to start processing the file before the scan has finished, resulting in incomplete or corrupted imports.

I can mitigate this somewhat by setting the dsc watch --delay option to a large value, but that introduces a trade-off:

  • A longer delay allows multi-page scans to finish before processing.
  • A shorter delay enables faster curation in Docspell but risks partial file processing.

At the moment, I’ve set a delay of around five minutes and avoid delete or move operations after processing to prevent data loss. However, this is not an ideal or reliable long-term solution.

Question / Request

Is there a better way to ensure that docspell-consumedir only processes files after they are fully written by the scanner?

I wish I could configure my scanner to have an extension suffix (.tmp) for incomplete files, but that doesn't seem to be possible.

MateusMolina, Nov 07 '25 19:11

Hi @MateusMolina, this is unfortunately not so easy (to me, anyway). It is not possible to tell from the outside whether a file is "done" or not; any check would involve guessing a delay after which one hopes no further changes take place. One approach that is often used is, as you already suggested, to write to a temp file and move it once it's done. The move is atomic on most filesystems (as long as source and target are on the same filesystem), so the watcher would always observe a complete file.
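A minimal sketch of the temp-file-then-move pattern described above, for cases where the producing side can be scripted. The function name `write_then_move` and the `staging_dir` / `watched_dir` layout are hypothetical, and the sketch assumes both directories live on the same filesystem, since the rename is only atomic in that case.

```python
import os
import tempfile


def write_then_move(data: bytes, staging_dir: str, watched_dir: str, final_name: str) -> str:
    """Write data to a temp file in staging_dir, then rename it into watched_dir.

    Assumption: staging_dir and watched_dir are on the same filesystem, so
    os.replace() is a single atomic rename and the watcher never sees a
    partially written file.
    """
    # The temp file is created outside the watched directory on purpose.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        final_path = os.path.join(watched_dir, final_name)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

This only helps when the writer is under your control; a scanner that writes straight into the watched folder cannot use it.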

I can't think of a good solution right now. Luckily for me, my scanner seems not to be affected. I could imagine having two scan profiles and two dsc processes configured with different delays. For me that would be fine, because I rarely scan documents with many pages. It is not ideal, of course, but it might serve as a middle-ground workaround for the time being.

eikek, Nov 11 '25 10:11

Thanks for the quick reply!

One solution would be to periodically check whether the file size is changing: if it is, don't process the file and wait for another cycle; if it is not, process it. This could be exposed as an additional parameter in dsc, e.g. --wait-for-idle, which takes the checking interval.

This wouldn't replace the --delay option, but work alongside it. We should always wait for the delay time and it could be the case that we wait a bit more because of multiple --wait-for-idle cycles.
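The proposed behaviour could be sketched roughly as follows. This is not dsc's code and `--wait-for-idle` does not exist yet; `wait_until_idle`, its parameters, and the `max_checks` safety limit are hypothetical names used for illustration only.

```python
import os
import time


def wait_until_idle(path: str, delay: float, interval: float, max_checks: int = 120) -> bool:
    """Wait `delay` seconds, then poll the file size every `interval` seconds.

    Returns True as soon as two consecutive polls report the same size.
    Returns False if the size is still changing after `max_checks` polls.
    """
    time.sleep(delay)                      # the existing --delay always applies
    last_size = os.path.getsize(path)
    for _ in range(max_checks):
        time.sleep(interval)               # one proposed --wait-for-idle cycle
        size = os.path.getsize(path)
        if size == last_size:
            return True                    # no growth during this cycle
        last_size = size                   # still growing, wait another cycle
    return False
```

A stable size is still a heuristic: if the scanner pauses between pages for longer than one interval, the file would be treated as idle too early, so the interval has to exceed the longest expected pause.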

At least, this is the way my workaround script is working: watching a folder and then moving to the folder being watched by dsc when the file becomes "idle." But, in my opinion, it would be nice if the cli could natively support this use case.
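A workaround of this kind might look like the sketch below. It is not the actual script; `move_idle_files` and the staging/watched folder layout are assumptions. Each call is one polling cycle: a file is moved only if its size equals the size recorded in the previous cycle.

```python
import os
import shutil


def move_idle_files(staging_dir: str, watched_dir: str, seen_sizes: dict) -> list:
    """Run one polling cycle and return the names of the files that were moved.

    `seen_sizes` maps file name -> size observed in the previous cycle and
    must be passed in again (the same dict) on every call.
    """
    moved = []
    current = sorted(os.listdir(staging_dir))
    # Forget files that disappeared from staging since the last cycle.
    for name in list(seen_sizes):
        if name not in current:
            del seen_sizes[name]
    for name in current:
        src = os.path.join(staging_dir, name)
        if not os.path.isfile(src):
            continue
        size = os.path.getsize(src)
        if seen_sizes.get(name) == size:
            # Size unchanged since the last cycle: treat the file as idle.
            shutil.move(src, os.path.join(watched_dir, name))
            del seen_sizes[name]
            moved.append(name)
        else:
            # New or still growing: remember the size and check again later.
            seen_sizes[name] = size
    return moved
```

In use, this would be called in a loop with a sleep of one checking interval between calls, with the scanner writing into the staging folder and dsc watching the other one.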

My printer/scanner is the Canon GX2050, which seems to be a common choice for these types of workflows.

I could take a shot at the impl. if you think the idea fits the project.

MateusMolina, Nov 12 '25 14:11

Hi @MateusMolina, that sounds good to me. Just to see whether I got it: you would wait --delay and after that start the --wait-for-idle cycles, right? I think that is pretty nice. If you have time and energy to take a shot at the impl, that would be great!

eikek, Nov 12 '25 19:11

Yes, exactly :)

I'll see if I can start a draft PR in the next couple weeks.

BTW: Amazing project, seriously well done!

MateusMolina, Nov 13 '25 07:11