
Use multiprocessing for further speedups

Open fbuchinger opened this issue 8 years ago • 3 comments

Currently, pyexiftool uses the stay_open feature of exiftool to get a speedup over invoking one exiftool instance per image. What if we took this further and used the Python multiprocessing module to run multiple exiftool stay_open instances and split the load between them? Here is a script that compares the "internal" batch mode of exiftool, the "external" batch mode using pyexiftool, and the "multiprocessed external" batch mode using multiprocessing:

import os
import timeit
import time
import multiprocessing

import exiftool



metadatadir = r"E:\dev\exiftool\test\sampleImages\Canon" #adapt to your path
result_queue = multiprocessing.Queue(1)

def exiftool_internal_batch():
    """internal batchmode of exiftool - ET itself finds out which files to read"""
    md = os.popen(r"exiftool.exe -j %s\\*.jpg" % metadatadir).read()
    return md

def get_filelist():
    files = []
    for sample in os.listdir(metadatadir):
        files.append(os.path.join(metadatadir, sample))
    return files

def chunk_list(l, n):
    """Return a list of successive n-sized chunks from l."""
    result = []
    # range() works on both Python 2 and 3 (xrange is Python 2 only)
    for i in range(0, len(l), n):
        result.append(l[i:i+n])
    return result


def exiftool_stay_open(files, mp = None):
    """external batchmode of exiftool - ET is invoked by pyexiftool using a stay_open with a filelist
        filelist - list of files to read
        mp - make function multiprocessing aware yes/no (results will be stuffed into a queue)
    """
    with exiftool.ExifTool() as et:
        metadata = et.get_metadata_batch(files)
    if mp is not None:
        result_queue.put(metadata)
    return metadata

if __name__ == '__main__':

    internal_batch_start = time.clock()
    exiftool_internal_batch()
    internal_batch_end = time.clock()
    print ("Exiftool internal batch took %s" % (internal_batch_end - internal_batch_start))

    external_batch_start = time.clock()
    files = get_filelist()
    exiftool_stay_open(files)
    external_batch_end = time.clock()
    print ("Exiftool Stay Open/External batch took %s" % (external_batch_end - external_batch_start))

    """invoke three exiftool stay_open instances at once to split the load between them"""

    mbs_start = time.clock()
    jobs = []
    files = get_filelist()
    fl_chunks = chunk_list(files, 3)

    et1 = multiprocessing.Process(target=exiftool_stay_open, args=(fl_chunks[0],True))
    et2 = multiprocessing.Process(target=exiftool_stay_open, args=(fl_chunks[1],True))
    et3 = multiprocessing.Process(target=exiftool_stay_open, args=(fl_chunks[2],True))

    jobs.append(et1)
    jobs.append(et2)
    jobs.append(et3)

    et1.start()
    et2.start()
    et3.start()

    et1.join()
    et2.join()
    et3.join()


    mbs_end = time.clock()
    print ("Exiftool multiprocessing batch took %s" % (mbs_end - mbs_start))

When I run this script on the Canon sample images of Phil Harvey's metadata repository, I get these results on my machine:

Exiftool internal batch took 14.2978597714 sec
Exiftool Stay Open/External batch took 12.7552563562 sec
Exiftool multiprocessing batch took 0.714086082069 sec

I have to admit that 0.71 seconds for multiprocessing sounds too good to be true, so there might be an error in the script. Nevertheless, a previous version with two workers took 7.77 seconds, so a considerable performance gain seems possible.

Do you plan to add multiprocessing functionality to pyexiftool?

fbuchinger avatar Sep 13 '15 09:09 fbuchinger

@fbuchinger Thanks for your suggestion. Multiprocessing is only useful for CPU-bound computations. Reading EXIF data is almost certainly I/O-bound.

You have already noticed that a speedup by a factor of 18 is unlikely to result from using three CPUs instead of one, and indeed using three CPUs isn't the reason it's faster: Your chunk_list() function is documented correctly as returning chunks of length n, so fl_chunks contains chunks of length 3. The multiprocessing version only operates on the first 9 files, so it's hardly a surprise it's faster. (I have no idea how many files your test directory contains.)
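To make the effect concrete, here is a minimal sketch of the same chunking logic. The file names and the count of 20 files are made up for illustration; the real test directory may hold any number of files.

```python
def chunks_of_size(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical file list; the real directory contents are unknown.
files = ["img%02d.jpg" % i for i in range(20)]
chunks = chunks_of_size(files, 3)

# The script only passes chunks[0], chunks[1] and chunks[2] to workers.
processed = chunks[0] + chunks[1] + chunks[2]

print(len(chunks), len(processed), len(files))  # prints: 7 9 20
```

With 20 files, seven chunks are created but only nine files are ever read, so the timing covers less than half of the work.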

A proper benchmark would also have to take caching into account. When you have already read the metadata of the same files before, chances are that the contents of the directory are already cached by the operating system. In this case, reading the files will be very fast, since no disk access is required, so you are more likely to be able to make use of more than one CPU. In real applications, it's more likely that the image files haven't been pulled into the cache yet. So for a proper benchmark, you would have to make sure that your test files haven't been cached yet.
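One rough way to see whether caching influences a measurement is to time several successive runs and compare the first one with the rest. The helper below is only a sketch: `time_runs` is a made-up name, and it does not clear the operating system cache, so the first run is only "cold" if the files have not been read recently.

```python
import time


def time_runs(func, repeats=3):
    """Return the wall-clock duration of each successive call to func."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

For example, `time_runs(lambda: exiftool_stay_open(files))` would time the stay_open variant three times. If the first duration is much larger than the later ones, the later runs were probably served from the cache.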

smarnach avatar Sep 13 '15 17:09 smarnach

@smarnach: thanks for your hints! I've updated the multiprocessing part of my script and I am now using multiprocessing.Pool and its map method to do the parallelisation. Unfortunately the multiprocessing version now takes even slightly longer than the single-executable version:

Exiftool internal batch took 10.4846533613 sec
Exiftool Stay Open/External batch took 11.6878449432 sec
Exiftool multiprocessing batch took 11.5929283865 sec

I've created a gist of my script: https://gist.github.com/fbuchinger/7b8d68e967e8a0600b79. Just out of interest: does the multiprocessing part look OK, or is there another mistake in it? I monitored the Windows task manager and found only one copy of exiftool actively working (i.e. increasing/decreasing its amount of consumed memory).
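The gist itself is not reproduced here. As a rough sketch of the approach described (multiprocessing.Pool with its map method), something like the following could be used. The function names are invented for illustration, and `get_metadata_batch` is simply the call from the script above; other pyexiftool releases may expose a different API.

```python
import multiprocessing


def round_robin_chunks(files, workers):
    """Deal files out to at most `workers` non-empty chunks."""
    candidates = [files[i::workers] for i in range(workers)]
    return [chunk for chunk in candidates if chunk]


def read_metadata(files):
    """Worker: read one chunk with its own exiftool stay_open instance."""
    import exiftool  # pyexiftool, imported lazily inside the worker

    with exiftool.ExifTool() as et:
        # Same call as in the script above (an assumption for other versions).
        return et.get_metadata_batch(files)


def read_metadata_parallel(files, workers=3):
    """Read metadata for all files using `workers` parallel processes."""
    chunks = round_robin_chunks(files, workers)
    with multiprocessing.Pool(processes=workers) as pool:
        results = pool.map(read_metadata, chunks)
    # Flatten; entries follow chunk order, not the original file order.
    return [entry for chunk_result in results for entry in chunk_result]
```

On Windows, `read_metadata_parallel(files)` has to be called from under an `if __name__ == '__main__':` guard, because child processes re-import the main module.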

fbuchinger avatar Sep 13 '15 18:09 fbuchinger

@fbuchinger I had a quick look. Since you use / 3 as the chunk size, this might leave a few single items at the end of the list due to rounding that require another exiftool run, but the effect of that is likely to be minor.
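To illustrate the rounding effect with made-up numbers: with 10 files and a chunk size of 10 // 3, a fourth chunk holding a single file appears. Splitting by the number of parts instead avoids that. `split_evenly` is a hypothetical helper, not part of pyexiftool or the gist.

```python
def split_evenly(items, parts):
    """Split items into exactly `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


files = list(range(10))  # stand-in for 10 file names

size = len(files) // 3
by_size = [files[i:i + size] for i in range(0, len(files), size)]
even = split_evenly(files, 3)

print([len(c) for c in by_size])  # prints: [3, 3, 3, 1]
print([len(c) for c in even])     # prints: [4, 3, 3]
```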

I don't know much about Windows. Generally I'd look at CPU usage to decide whether a process is doing anything, but since what exiftool does is mostly I/O, this might be misleading in this particular case.

smarnach avatar Sep 13 '15 18:09 smarnach