Copying multiple files to watchedFolder causes app to grab zero byte files
I copied around 200 PDFs into the watchedFolder and then noticed more than 350 PDFs in the processing folder, which seemed wrong. Many of the PDFs were duplicated, and some of them were zero bytes in size.
As I suspected, the app was starting to process files before they were completely copied over. I confirmed this by copying only 20 PDFs into the watchedFolder: same behavior.
I wish there were a way to tell the app to wait a bit before processing a file, similar to the PAPERLESS_CONSUMER_INOTIFY_DELAY variable in paperless-ngx.
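For illustration, a "wait a bit" setting could be approximated with a quiet-period check: a file is picked up only once it is non-empty and its last-modified time is older than a configurable delay. This is a minimal sketch under stated assumptions, not the app's actual code. The class QuietPeriodFilter, its methods, and the quietPeriod parameter are hypothetical names, and the check relies on the copy tool updating the modification time while it writes, which may not hold for every tool or network share.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the app): illustrates a configurable settle delay.
class QuietPeriodFilter {

    // True if the file is non-empty and has not been modified for at least quietPeriod.
    static boolean isSettled(Path file, Duration quietPeriod, Instant now) {
        try {
            Instant lastModified = Files.getLastModifiedTime(file).toInstant();
            boolean quiet = !lastModified.plus(quietPeriod).isAfter(now);
            return quiet && Files.size(file) > 0;
        } catch (IOException e) {
            // The file vanished or is unreadable: skip it on this pass.
            return false;
        }
    }

    // Lists the regular files in dir that have passed the quiet period.
    static List<Path> settledFiles(Path dir, Duration quietPeriod) throws IOException {
        Instant now = Instant.now();
        try (Stream<Path> paths = Files.list(dir)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> isSettled(p, quietPeriod, now))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java QuietPeriodFilter <watchedFolder> <delaySeconds>
        Path dir = Path.of(args[0]);
        Duration delay = Duration.ofSeconds(Long.parseLong(args[1]));
        settledFiles(dir, delay).forEach(System.out::println);
    }
}
```

Files that are still being written are simply left for a later scan, so nothing blocks while waiting for them.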
The only workaround I have found so far is to stop the container, copy the files over, and then start the container again.
A good callout, and a bug. I will work on this over the weekend.
Hi @Frooodle, I think this is a good issue to assign to someone like me who wishes to contribute to open source 😃
My initial idea for solving this issue is to update collectFilesForProcessing so that it only collects files that are fully copied, either by checking whether the size is still growing or by using OS-level features (e.g. lsof).
Let me know if you have any comments on this solution 😆
@kkdlau how's this going?
Here is an example. I haven't tested it, and I'm not a Java expert, but it could lead in the right direction.
PipelineDirectoryProcessor.java
// [...]
import java.util.concurrent.TimeUnit;
// [...]
public class PipelineDirectoryProcessor {
    // [...]
    private static final long STABILITY_CHECK_DELAY = 1000; // 1 second between size checks
    private static final int STABILITY_CHECK_COUNT = 5; // Check 5 times

    private File[] collectFilesForProcessing(Path dir, Path jsonFile, PipelineOperation operation) throws IOException {
        try (Stream<Path> paths = Files.list(dir)) {
            if ("automated".equals(operation.getParameters().get("fileInput"))) {
                // Only pick up files whose size has stopped changing
                return paths.filter(path -> !Files.isDirectory(path) && !path.equals(jsonFile) && isFileStable(path))
                        .map(Path::toFile)
                        .toArray(File[]::new);
            } else {
                String fileInput = (String) operation.getParameters().get("fileInput");
                return new File[] { new File(fileInput) };
            }
        }
    }

    // Must not throw a checked exception, because it is called inside the stream filter lambda
    private boolean isFileStable(Path path) {
        try {
            long initialSize = Files.size(path);
            for (int i = 0; i < STABILITY_CHECK_COUNT; i++) {
                TimeUnit.MILLISECONDS.sleep(STABILITY_CHECK_DELAY);
                if (Files.size(path) != initialSize) {
                    return false; // Still being written
                }
            }
            return initialSize > 0; // Also ensuring the file is not zero bytes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt and skip the file
            return false;
        } catch (IOException e) {
            return false; // File vanished or is unreadable; skip it for now
        }
    }
    // [...]
}
// [...]
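One cost of the example above is that every file is checked in turn, so each file adds STABILITY_CHECK_DELAY × STABILITY_CHECK_COUNT (about 5 seconds) to the scan, which adds up quickly for a batch of 200 PDFs. A possible variation, sketched here under assumptions, is to snapshot the sizes of all files at once, wait once, and keep only the files whose size did not change. The class BatchStabilityCheck and its method names are hypothetical and are not taken from the project or from any PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the app): one wait per scan instead of one per file.
class BatchStabilityCheck {

    // Records the current size of every regular file in dir.
    static Map<Path, Long> snapshotSizes(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.list(dir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<Path, Long> sizes = new HashMap<>();
        for (Path file : files) {
            try {
                sizes.put(file, Files.size(file));
            } catch (IOException e) {
                // The file disappeared between listing and sizing: leave it out.
            }
        }
        return sizes;
    }

    // Keeps files that are non-empty and have the same size in both snapshots.
    static List<Path> stableFiles(Map<Path, Long> before, Map<Path, Long> after) {
        List<Path> stable = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : after.entrySet()) {
            Long previous = before.get(entry.getKey());
            if (previous != null && previous.equals(entry.getValue()) && entry.getValue() > 0) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }

    // Takes two snapshots delayMillis apart and returns the files that did not change.
    static List<Path> collectStable(Path dir, long delayMillis)
            throws IOException, InterruptedException {
        Map<Path, Long> before = snapshotSizes(dir);
        TimeUnit.MILLISECONDS.sleep(delayMillis);
        return stableFiles(before, snapshotSizes(dir));
    }
}
```

Files that appear or grow between the two snapshots are left out and would be picked up by a later scan. A single size comparison can still be fooled by a copy that stalls for longer than the delay, so the delay is a trade-off rather than a guarantee.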
Hi, I was busy with my full-time work 😞 But I already have a draft of the PR. I just need to go through a couple of regression tests to ensure it doesn't break existing features.
Will create a PR tonight (APAC time) 👍🏻
Thanks for the idea 👍🏻
My draft is quite similar except for the isFileStable implementation.
I will share more details when I open the PR.