
Better parallelisation of exiftool for faster report generation

Open fbuchinger opened this issue 8 years ago • 1 comments

In https://github.com/mattburns/exiftool.js-test/blob/master/test.js#L66 you invoke a new instance of exiftool for every image found. This is inefficient: starting exiftool carries a large overhead (Perl interpreter warm-up, module loading, ...), and we pay it for every sample image we find.

Better approaches would be:

a) Invoke exiftool once and let it do the batch processing (e.g. exiftool <OTHER OPTIONS> *.jpg -w .jpg.json). This might require some refactoring in the report generation.

b) Use exiftool's -stay_open option together with an ARGFILE to which we write the commands to run on each image. Exiftool then stays in memory and executes the commands written to the ARGFILE until we write a terminate command there.

Both approaches can bring speedups of up to 60 times compared to single-command invocation. Approach b) could perform even better, since we can prefork multiple "daemonized" instances of exiftool and share the work between them.
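To make approach b) concrete, here is a minimal Python sketch of the stay_open protocol, using stdin as the ARGFILE (-@ -). It relies on exiftool's documented behaviour of printing {ready} after each -execute and exiting on -stay_open False. The helper names (build_command, read_until_ready, extract_all) and the choice of -json as the per-image option are assumptions for illustration, not what test.js currently does.

```python
import subprocess


def build_command(path, options=("-json",)):
    """Return the ARGFILE text for one image: one argument per line,
    terminated by -execute so exiftool runs the command immediately."""
    lines = list(options) + [path, "-execute"]
    return "\n".join(lines) + "\n"


def read_until_ready(stream):
    """Collect output lines until exiftool's {ready} marker appears."""
    chunks = []
    for line in stream:
        if line.strip() == "{ready}":
            break
        chunks.append(line)
    return "".join(chunks)


def extract_all(paths, exiftool="exiftool"):
    """Run one long-lived exiftool process over all paths (hypothetical helper)."""
    proc = subprocess.Popen(
        [exiftool, "-stay_open", "True", "-@", "-"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = {}
    try:
        for path in paths:
            proc.stdin.write(build_command(path))
            proc.stdin.flush()
            results[path] = read_until_ready(proc.stdout)
    finally:
        # Tell the resident exiftool process to shut down.
        proc.stdin.write("-stay_open\nFalse\n")
        proc.stdin.flush()
        proc.wait()
    return results
```

The same write, flush, read-until-{ready} loop should carry over to node.js with child_process.spawn, although the output would have to be buffered asynchronously there.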

fbuchinger avatar Sep 04 '15 07:09 fbuchinger

I evaluated the performance of the different exiftool invocation options using pyexiftool, since it already has built-in support for exiftool's faster stay_open mode. I compared the following scenarios:

  • invoking one exiftool instance per image
  • exiftool's internal batch execution
  • "external" batch execution using stay_open mode
  • "external" batch execution with preforking multiple exiftool instances (multiprocessing.Pool in Python)
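The difference between the first two scenarios comes down to how many processes are started. The sketch below is not the original test script; it only illustrates the idea, with -json as an assumed option and per_image_commands, batch_command and timed as hypothetical helper names.

```python
import subprocess
import time


def per_image_commands(paths, exiftool="exiftool"):
    """Scenario 1: one exiftool process (and one Perl start-up) per image."""
    return [[exiftool, "-json", p] for p in paths]


def batch_command(paths, exiftool="exiftool"):
    """Scenario 2: a single exiftool process that receives every path."""
    return [exiftool, "-json", *paths]


def timed(commands):
    """Run each argv list in sequence and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for argv in commands:
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start
```

With exiftool installed, timed(per_image_commands(paths)) and timed([batch_command(paths)]) can be compared directly on the same list of sample images.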

My results for the 20 sample images from the Acer directory:

Exiftool no batch took 6.37464756469 sec 
Exiftool internal batch took 0.590772722123 sec
Exiftool Stay Open/External batch took 0.575033621959 sec
Exiftool multiprocessing batch took 0.64755278114 sec

For the more complex sample images (more tags to decode) from the Nikon directory:

Exiftool no batch took 80.8621684399 sec
Exiftool internal batch took 3.93503120808 sec
Exiftool Stay Open/External batch took 4.23961249768 sec
Exiftool multiprocessing batch took 4.3239334162 sec

It turns out that using one of the batch modes brings a 10-20 times speedup over one instance per image, while multiprocessing is actually slightly slower than the other batch modes (perhaps because exiftool is mostly I/O-bound). Note that these numbers might differ for node.js, since it is asynchronous by default, while Python is synchronous.
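The speedup factors follow directly from the timings reported above. A small sketch of the arithmetic (the dictionary keys and the speedups helper are just labels chosen here):

```python
# Timings in seconds, copied from the results above.
acer = {
    "no batch": 6.37464756469,
    "internal batch": 0.590772722123,
    "stay_open": 0.575033621959,
    "multiprocessing": 0.64755278114,
}
nikon = {
    "no batch": 80.8621684399,
    "internal batch": 3.93503120808,
    "stay_open": 4.23961249768,
    "multiprocessing": 4.3239334162,
}


def speedups(timings, baseline="no batch"):
    """Return how many times faster each mode is than the baseline."""
    base = timings[baseline]
    return {mode: base / t for mode, t in timings.items() if mode != baseline}


for name, timings in (("Acer", acer), ("Nikon", nikon)):
    for mode, factor in speedups(timings).items():
        print(f"{name} {mode}: {factor:.1f}x")
```

For the internal batch mode this works out to about 10.8x on the Acer images and about 20.5x on the Nikon images.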

Conclusion: it definitely makes sense to use the exiftool stay_open mode in the node.js test scripts instead of firing up one instance per image.

See my Python test script and this pyexiftool issue for more context.

fbuchinger avatar Sep 14 '15 11:09 fbuchinger