How to check a BagIt archive with a large number of files (~20'000) with BagVerifier
What is the best configuration of the ExecutorService passed to BagVerifier for checking a very large BagIt archive? I am using version 5.0.3 of the library.
By default, the 'isValid' function creates a thread for each file. With ~20'000 files, the process crashed.
I tried the following options:
- ExecutorService exeService = new ThreadPoolExecutor(0, 10000, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>()); => crash
- ExecutorService exeService = Executors.newFixedThreadPool(3000); => very long
What is your advice? Thx
Validation speed is mostly determined by IO throughput, since the hashing is typically done by a specialized unit on your CPU and therefore has very low overhead.
Where is the bag located - spinning disk, SSD, NFS, Samba mount, etc.? All of those choices will dramatically affect the rate of the verification.
Also, how many threads can your CPU actively use? If it is like mine, with 4 cores, having more than 4 threads isn't going to help very much, as the extra threads will just be waiting anyway.
So, my suggestions would be these:
- use the fastest IO media you can (RAM disk would be fastest, followed by a good SSD)
- set the thread pool size to the number of cores:
ExecutorService exeService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
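To illustrate why a small fixed pool is enough, here is a minimal sketch that uses only the JDK and does not touch the BagIt library at all. The class name BoundedChecksumSketch and its methods are made up for this example, and SHA-256 is just an assumed algorithm. It hashes any number of files while never running more than `threads` workers, so at most that many files are open at once. Tasks beyond the pool size simply wait in the executor's queue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of the BagIt library.
final class BoundedChecksumSketch {

    // Streams one file through SHA-256 and returns the lowercase hex digest.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Hashes every file using at most `threads` worker threads.
    // Extra tasks queue up inside the pool instead of spawning new threads.
    static Map<Path, String> sha256All(List<Path> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<String> task = () -> sha256(file);
                pending.add(pool.submit(task));
            }
            // Collect results in the same order the files were submitted.
            Map<Path, String> result = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                result.put(files.get(i), pending.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `BoundedChecksumSketch.sha256All(files, Runtime.getRuntime().availableProcessors())` mirrors the pool size suggested above. On slow or networked storage it may be worth trying a somewhat different number and measuring.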
It may be a good idea to have the validator default to a fixed thread pool executor like the one described. We ran into the same issue for a bag with 10000+ files; in our case it resulted in a FileSystemException with the message "Too many open files" from gov.loc.repository.bagit.verify.CheckIfFileExistsTask.existsNormalized(CheckIfFileExistsTask.java:59).
We ended up resolving it with a similar solution, but I also had to avoid calling close on the BagVerifier instance (which is AutoCloseable), since it would shut down the executor it was given, which we were sharing across BagVerifiers.
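For anyone hitting the same close() problem, here is a rough JDK-only sketch of the pattern. ClosingVerifier is a hypothetical stand-in that only imitates the behavior described above (shutting down the executor it was given when closed); it is not the library's BagVerifier, and its verify method is a placeholder. The point is that the code which creates the shared pool is also the code that shuts it down, once, after all verifications are done.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical example class, not part of the BagIt library.
final class SharedExecutorSketch {

    // Stand-in for a verifier that shuts down its executor on close().
    static final class ClosingVerifier implements AutoCloseable {
        private final ExecutorService executor;

        ClosingVerifier(ExecutorService executor) {
            this.executor = executor;
        }

        // Placeholder check: runs one task on the pool and waits for it.
        boolean verify(String bagName) throws Exception {
            Future<Boolean> outcome = executor.submit(() -> !bagName.isEmpty());
            return outcome.get();
        }

        @Override
        public void close() {
            executor.shutdown();
        }
    }

    // Verifies all bags on one shared pool and returns how many passed.
    static int verifyAll(String[] bagNames, int threads) throws Exception {
        ExecutorService shared = Executors.newFixedThreadPool(threads);
        try {
            int passed = 0;
            for (String name : bagNames) {
                // Deliberately no try-with-resources here: close() would shut
                // down the shared pool and break the next verifier.
                ClosingVerifier verifier = new ClosingVerifier(shared);
                if (verifier.verify(name)) {
                    passed++;
                }
            }
            return passed;
        } finally {
            // The owner of the pool shuts it down exactly once.
            shared.shutdown();
            shared.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```

If a verifier in this sketch were closed after the first bag, the next submit on the shared pool would be rejected with a RejectedExecutionException, which is why the shutdown is kept in a single place.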
Or you could use my fork, which I actually maintain and which has fixed these problems.