
Copying multiple files to watchedFolder causes app to grab zero byte files

Open seakrebel opened this issue 1 year ago • 6 comments

I copied around 200 PDFs into the watchedFolder and then noticed more than 350 PDFs in the processing folder, which seemed odd. Many of the PDFs were duplicated, and some of them were zero bytes in size.

As I suspected, the app was starting to process files before they had been completely copied over. I confirmed this by copying only 20 PDFs into the watchedFolder: same behavior.

I wish there were a way to tell the app to wait a bit before processing a file, similar to the PAPERLESS_CONSUMER_INOTIFY_DELAY variable in paperless-ngx.

The only workaround I have found so far is to stop the container, copy the files over, and then start the container again.
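
Another possible workaround, sketched below and untested against Stirling-PDF itself: copy each file into a staging directory first, then rename it into the watched folder in one step, so a directory scan should never see a half-written file. The class name `StagedCopy`, the method `copyViaStaging`, and the staging directory are hypothetical, and the sketch assumes the staging directory and the watched folder are on the same filesystem (otherwise the atomic move is expected to fail).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of Stirling-PDF.
final class StagedCopy {
    private StagedCopy() {}

    /**
     * Copies source into stagingDir, then moves the finished copy into watchedDir.
     * ATOMIC_MOVE throws AtomicMoveNotSupportedException when the move cannot be
     * done atomically, e.g. when the two directories are on different filesystems.
     */
    static Path copyViaStaging(Path source, Path stagingDir, Path watchedDir) throws IOException {
        Path staged = stagingDir.resolve(source.getFileName());
        // The slow part happens outside the watched folder.
        Files.copy(source, staged, StandardCopyOption.REPLACE_EXISTING);
        Path target = watchedDir.resolve(source.getFileName());
        // The file appears in the watched folder only once it is complete.
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With a Docker bind mount, the staging directory would need to sit on the same host filesystem as the mounted watched folder for this to apply.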

seakrebel avatar May 15 '24 23:05 seakrebel

Good callout, and a real bug. I will work on this over the weekend.

Frooodle avatar May 16 '24 05:05 Frooodle

Hi @Frooodle, I think this is a good issue to assign to someone like me who wishes to contribute to open source 😃

My initial idea for solving this issue is to update collectFilesForProcessing so that it only collects files that are fully copied, either by checking whether the size is still growing or by using OS-level features (e.g. lsof).

Let me know if you have any comments on this solution 😆
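
To make the "is it still being written" idea concrete, here is a minimal sketch of one variant that avoids sleeping: treat a file as settled only if it is non-empty and has not been modified for a quiet period. This swaps the size-growth check for a last-modified check, which is an assumption on my part and not what the project does. The names `QuietPeriodCheck` and `isSettled` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not part of Stirling-PDF.
final class QuietPeriodCheck {
    private QuietPeriodCheck() {}

    /**
     * Returns true when the path is a non-empty regular file whose
     * last-modified time is at least quietPeriod before now.
     */
    static boolean isSettled(Path file, Duration quietPeriod, Instant now) throws IOException {
        if (!Files.isRegularFile(file) || Files.size(file) == 0) {
            return false; // missing, a directory, or still zero bytes
        }
        Instant lastModified = Files.getLastModifiedTime(file).toInstant();
        // Settled if lastModified + quietPeriod is not after now.
        return !lastModified.plus(quietPeriod).isAfter(now);
    }
}
```

A caller could run `isSettled(path, Duration.ofSeconds(5), Instant.now())` on each scan and simply skip files that are not settled yet, picking them up on a later scan. Timestamps on network shares may be unreliable, so this would need testing on the storage actually used.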

kkdlau avatar May 18 '24 13:05 kkdlau

@kkdlau how's this going?

Frooodle avatar May 23 '24 18:05 Frooodle

Here is an example. I haven't tested it, and I'm not a Java expert, but it could point in the right direction.

PipelineDirectoryProcessor.java

// [...]
import java.util.concurrent.TimeUnit;
// [...]
public class PipelineDirectoryProcessor {
    // [...]

    private static final long STABILITY_CHECK_DELAY = 1000; // 1 second
    private static final long STABILITY_CHECK_COUNT = 5; // Check 5 times

    private File[] collectFilesForProcessing(Path dir, Path jsonFile, PipelineOperation operation) throws IOException {
        try (Stream<Path> paths = Files.list(dir)) {
            if ("automated".equals(operation.getParameters().get("fileInput"))) {
                return paths.filter(path -> !Files.isDirectory(path) && !path.equals(jsonFile) && isFileStable(path))
                            .map(Path::toFile)
                            .toArray(File[]::new);
            } else {
                String fileInput = (String) operation.getParameters().get("fileInput");
                return new File[] { new File(fileInput) };
            }
        }
    }

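    // Returns true only if the file is non-empty and its size stays the same across all checks.
    // Handles IOException here because the stream lambda above cannot throw checked exceptions.
    private boolean isFileStable(Path path) {
        try {
            long initialSize = Files.size(path);
            for (int i = 0; i < STABILITY_CHECK_COUNT; i++) {
                TimeUnit.MILLISECONDS.sleep(STABILITY_CHECK_DELAY);
                if (Files.size(path) != initialSize) {
                    return false; // Size changed, so the file is still being written
                }
            }
            return initialSize > 0; // Also ensuring the file is not zero bytes
        } catch (IOException e) {
            return false; // Unreadable or removed; leave it for the next scan
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt status
            return false;
        }
    }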
    // [...]
}
// [...]
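
One thing to note about the per-file check above: with 5 checks of 1 second each, every file costs about 5 seconds, so 200 files would take a long time to scan. A possible variation, again untested and only a sketch, is to record the sizes of all files once, wait once, and record them again, keeping the files whose size did not change. The names `BatchStabilityCheck` and `stableFiles` are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Stirling-PDF.
final class BatchStabilityCheck {
    private BatchStabilityCheck() {}

    /** Returns the regular files in dir that are non-empty and kept the same size across one delay. */
    static List<Path> stableFiles(Path dir, long delayMillis) throws IOException, InterruptedException {
        Map<Path, Long> firstPass = snapshotSizes(dir);
        Thread.sleep(delayMillis); // one wait for the whole directory, not one per file
        Map<Path, Long> secondPass = snapshotSizes(dir);

        List<Path> stable = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : firstPass.entrySet()) {
            Long later = secondPass.get(entry.getKey());
            // Keep the file only if it still exists, is non-empty, and did not change size.
            if (later != null && later > 0 && later.equals(entry.getValue())) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }

    // Maps each regular file directly inside dir to its current size in bytes.
    private static Map<Path, Long> snapshotSizes(Path dir) throws IOException {
        Map<Path, Long> sizes = new HashMap<>();
        try (Stream<Path> paths = Files.list(dir)) {
            for (Path path : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(path)) {
                    sizes.put(path, Files.size(path));
                }
            }
        }
        return sizes;
    }
}
```

Files that appear between the two passes are skipped and would be picked up on the next scan. A single unchanged-size comparison can still be fooled by a copy that stalls for longer than the delay, so the delay would need tuning.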

seakrebel avatar May 23 '24 18:05 seakrebel

Hi, I was busy with my full-time work 😞 But I already have a draft of the PR. I just need to run a couple of regression tests to make sure it doesn't break existing features.

I will create a PR tonight (APAC time) 👍🏻

kkdlau avatar May 24 '24 01:05 kkdlau

Thanks for the idea 👍🏻 My draft is quite similar except for the isFileStable implementation. I will share more details when I open the PR.

kkdlau avatar May 24 '24 02:05 kkdlau