bagit-java

How to check BagIt archive with a large number of files (~20'000) with BagVerifier

Open UkDv opened this issue 5 years ago • 3 comments

What is the best configuration of the ExecutorService passed to BagVerifier for checking a very big BagIt archive? I am using version 5.0.3 of the library.

By default, the 'isValid' function creates a thread for each file. With ~20'000 files, the process crashed.

I tried the following options:

  • ExecutorService exeService = new ThreadPoolExecutor(0, 10000, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>()); => crash
  • ExecutorService exeService = Executors.newFixedThreadPool(3000); => very long
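
One way the first configuration can fail, independent of BagVerifier: a SynchronousQueue holds no tasks, so a ThreadPoolExecutor built on it rejects any task that arrives while all of its threads are busy. The thread does not say which error the crash produced, so the following is only a small stand-alone sketch of that one failure mode, scaled down to a few threads. The class and method names are made up for the illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SynchronousQueueRejectionSketch {

    /**
     * Submits taskCount blocking tasks to a pool configured like the first
     * option above (0 core threads, SynchronousQueue) but capped at maxThreads,
     * and returns how many submissions were rejected.
     */
    static int rejectedCount(int maxThreads, int taskCount) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, maxThreads, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < taskCount; i++) {
                try {
                    // Each task stays busy until released, like a slow file read.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // No idle thread and no queue capacity: the task is refused.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + rejectedCount(2, 5) + " of 5 tasks");
    }
}
```

With 10000 threads allowed, the process may also run out of memory or file handles before the rejection limit is ever reached.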

What is your advice? Thx

UkDv avatar Feb 25 '20 16:02 UkDv

The validation speed is mostly determined by the IO throughput as the hashing is typically done by a specialized unit on your CPU and is therefore very low overhead.

Where is the bag located - spinning disk, SSD, NFS, Samba mount, etc.? All of those choices will dramatically affect the rate of the verification.

Also, how many threads can your CPU actively use? If it is like mine where it has 4 cores, having more than 4 threads isn't going to help very much as they will just be waiting anyway.

So, my suggestion would be these:

  • use the fastest IO media you can (RAM disk would be fastest, followed by a good SSD)
  • set the thread pool size to the number of cores: ExecutorService exeService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
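
To illustrate why a fixed pool copes with ~20'000 tasks, here is a minimal stand-alone sketch that uses only java.util.concurrent. Tasks beyond the pool size wait in the pool's internal queue, so the number running at once never exceeds the pool size. The class name, the method name and the dummy task body are placeholders for the illustration and are not part of bagit-java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPoolSketch {

    /** Runs taskCount dummy tasks on poolSize threads; returns the peak number running at once. */
    static int peakConcurrency(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.yield(); // placeholder for the per-file work
                    running.decrementAndGet();
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to complete
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int peak = peakConcurrency(cores, 20_000);
        System.out.println("pool size " + cores + ", peak concurrent tasks " + peak);
    }
}
```

The same exeService would then be handed to BagVerifier as in the question. If validation is IO-bound on a network mount, a pool somewhat larger than the core count may be worth measuring.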

jscancella avatar Feb 25 '20 16:02 jscancella

It may be a good idea to have the validator default to a fixed thread pool executor like the one described. We ran into the same issue with a bag of 10000+ files. In our case it resulted in a FileSystemException with the message "Too many open files" from gov.loc.repository.bagit.verify.CheckIfFileExistsTask.existsNormalized(CheckIfFileExistsTask.java:59).

We ended up resolving it with a similar solution, but I also had to avoid calling close on the BagVerifier instance (which is AutoCloseable), since that would shut down the executor it was given, which we were sharing across BagVerifiers.
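
The sharing pattern can be sketched roughly as follows. ClosingVerifier is a hypothetical stand-in that only mimics the close behavior described above; it is not the real BagVerifier, and its verify method is a dummy. The idea is that the code which creates the shared executor is also the one that shuts it down, exactly once, after all verifications are done.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class SharedExecutorSketch {

    /** Hypothetical stand-in: like the verifier described above, close() shuts down the executor. */
    static final class ClosingVerifier implements AutoCloseable {
        private final ExecutorService executor;

        ClosingVerifier(ExecutorService executor) {
            this.executor = executor;
        }

        boolean verify(String bagName) {
            try {
                // Dummy check standing in for real verification work.
                return executor.submit(() -> !bagName.isEmpty()).get();
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void close() {
            executor.shutdown();
        }
    }

    /** Verifies all bags on one shared pool; the pool's owner shuts it down at the end. */
    static int verifyAll(String... bagNames) {
        ExecutorService shared =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        int valid = 0;
        try {
            for (String name : bagNames) {
                // Deliberately no try-with-resources: close() would kill the shared pool.
                ClosingVerifier verifier = new ClosingVerifier(shared);
                if (verifier.verify(name)) {
                    valid++;
                }
            }
        } finally {
            shared.shutdown();
        }
        return valid;
    }

    /** Shows the problem: closing one verifier makes the next submission fail. */
    static boolean closingBreaksSharing() {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        try (ClosingVerifier first = new ClosingVerifier(shared)) {
            first.verify("bag-1");
        }
        try {
            new ClosingVerifier(shared).verify("bag-2");
            return false;
        } catch (RejectedExecutionException e) {
            return true; // the pool was already shut down by close()
        }
    }

    public static void main(String[] args) {
        System.out.println("valid bags: " + verifyAll("bag-1", "bag-2", "bag-3"));
        System.out.println("closing breaks sharing: " + closingBreaksSharing());
    }
}
```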

bbpennel avatar Aug 19 '21 20:08 bbpennel

Or you could use my fork, which I actually maintain and which has fixed these problems.

jscancella avatar Aug 20 '21 01:08 jscancella