pigallery2 icon indicating copy to clipboard operation
pigallery2 copied to clipboard

Clean up MetadataLoader.ts

Open bpatrik opened this issue 3 years ago • 12 comments

Pigallery2 is using multiple exif, iptc parers (as I could not find one that can do all at once) to get all the metadata for a photo:

  1. fs.statSync for silesize, creation date (if i do not find any better from exif) https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L107

  2. ts-exif-parser for most of the exif data: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L114

  3. image-size to get the photo dimension if 2) ts-exif-parser fails: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L169 (The image dimension is a must for this app to work properly )

  4. ts-node-iptc for GPS/location, keywords, caption related info: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L177

  5. exifreader for faces and now for orientation as this lib parses that properly https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L206

This class needs a major refactor. Probably. exifreader would be able to handle all cases now. Reducing to only one lib, photo indexing would be probably way faster (instead of 4-5 file open, 1 would do the trick)

bpatrik avatar Apr 26 '21 20:04 bpatrik

cbhushan@ in https://github.com/bpatrik/pigallery2/issues/212#issuecomment-809673428 suggested https://mutiny.cz/exifr/ but that is missing Face region support: https://github.com/MikeKovarik/exifr/issues/59

bpatrik avatar Apr 26 '21 20:04 bpatrik

Hi! If I can do anything to help ease a transition, please let me know. /Maintainer of ExifReader

mattiasw avatar Apr 30 '21 06:04 mattiasw

Thank you, @mattiasw for your library. I think that is the best candidate for pigallry2 to have "one exif reader to rule them all".

So far I'm not planning to put effort in it on my side. I opened this bug if anyone feels the power to implement it :) We already have some unit tests in place for the Metadata loader , so a refactor should be relatively safe.

bpatrik avatar Apr 30 '21 18:04 bpatrik

I "hacked up" a quick benchmark between 3 exif readers. (https://github.com/bpatrik/pigallery2/pull/294#discussion_r629451107 suggests exiftool is not as fast as we hoped)

Disclaimer: I did not double check if all three can proved all the needed metadata. But I thinks so, or at least their developers are quite responsive.

Libs to test:

  • "exiftool-vendored": "14.3.0"
  • "exifr": "7.0.0"
  • "exifreader": "3.15.0"

Test was done on Wind 10, with SSHD (it is possoble that the HDD cached the photos after a while), number of photos: 698, dir size: 3,28GB. Each test was run 10 times and an average was calculated.

Result:

Exifreader: 4116.9349999999995 ms
exiftool: 34532.78095 ms
exifr: 990.25858 ms

Reproduce the test:

MetadaLoader.ts: loadPhotoMetadata(fullPath: string): got modified to:

 /****exiftool*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
 return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const exif = await exiftool.read(fullPath, ['--ifd1:all']);
      resolve({
        size: {width: 1, height: 1},
        orientation: OrientationTypes.TOP_LEFT,
        creationDate: 0,
        fileSize: 0
      });
    });
  }
 /****exifr*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
 return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const exif = await exifr.parse(fullPath);
      resolve({
        size: {width: 1, height: 1},
        orientation: OrientationTypes.TOP_LEFT,
        creationDate: 0,
        fileSize: 0
      });
    });
  }
 /****exifreader test*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const fd = fs.openSync(fullPath, 'r');
      const data = Buffer.allocUnsafe(Config.Server.photoMetadataSize);
      fs.read(fd, data, 0, Config.Server.photoMetadataSize, 0, async (err) => {
        fs.closeSync(fd);
        if (err) {
          return reject({file: fullPath, error: err});
        }
        const exif = ExifReader.load(data); // this is a synchronous call, no await it required
        resolve({
          size: {width: 1, height: 1},
          orientation: OrientationTypes.TOP_LEFT,
          creationDate: 0,
          fileSize: 0
        });
      });
    });
  }

NOTE: for exifreader, not the whole file is read, but only part of it. The exif data is only at the beginning of the file, no need to read the actual photo. NOTE2: I could not find an api for exiftool to read only the beginning of the file NOTE3: It seems to be possbile to read only the beginning of the file with exifr, although the test were done by passing the file path.

Test sandbox: pigallrery2/test.ts

import {DiskMangerWorker} from './src/backend/model/threading/DiskMangerWorker';
import {Config} from './src/common/config/private/Config';
import {ProjectPath} from './src/backend/ProjectPath';

const run = async () => {
  Config.Server.Media.folder =<path to a folder with 698 photos>
  ProjectPath.reset();
  let time = 0;
  const TIMES = 10;
  for (let i = 0; i < TIMES; ++i) {
    console.log('round', i);
    const start = process.hrtime();
    await DiskMangerWorker.scanDirectory('/');
    const end = process.hrtime(start);
    time += (end[0] * 1000 + end[1] / 1000000);
    console.log('round time:', (end[0] * 1000 + end[1] / 1000000));
  }
  console.log(time / TIMES);
};
run().catch(console.error);

bpatrik avatar May 10 '21 16:05 bpatrik

It looks like exiftool is out of the running.

I also think that benchmark might not be fair for exifr vs exifreader: it appears exifr is very fast by default because it only imports IFD0, EXIF, GPS tags, so no IPTC or XMP(?)

I tried a benchmark with loading all tags exifr.parse(file, true) and it was of similar performance to exifreader.

mcdamo avatar May 12 '21 22:05 mcdamo

I made an experimental branch that build bpatrik/pigallery2:experimental-debian-buster

I made multiple experiments: https://github.com/bpatrik/pigallery2/blob/36d5ccbbd4c077d25c99f47c215740a46122cad4/src/backend/model/threading/MetadataLoader.ts#L93-L111

And run my benchmark.

Result:

PiGallery2 v1.8.7 (experimental), 13.05.2021

Version: v1.8.7, built at: Thu May 13 2021 10:31:47 GMT+0000 (Coordinated Universal Time), git commit:36d5ccbbd4c077d25c99f47c215740a46122cad4 System: Raspberry Pi 4 4G Model B, SandisK Mobile Ultra 32Gb CLass10, UHS-I, HDD: Western Digital Elements 1TB (WDBUZG0010BBK)

Gallery: directories: 31, photos: 2036, videos: 35, diskUsage : 22.08GB, persons : 1241, unique persons (faces): 14

Action Average Duration Result
[baseline]Scanning directory (current production method) 10 383.2 ms media: 698, directories: 0
[loadPhotoMetadata=exifr]Scanning directory (does not parse the lib output) 2 502.6 ms media: 698, directories: 0
[loadPhotoMetadata=exifrAll]Scanning directory (does not parse the lib output) 2 454.3 ms media: 698, directories: 0
[loadPhotoMetadata=exifrSelected]Scanning directory (does not parse the lib output) 2 437.0 ms media: 698, directories: 0
[loadPhotoMetadata=exifreader]Scanning directory (does not parse the lib output) 9 595.9 ms media: 698, directories: 0
[loadPhotoMetadata=exiftool]Scanning directory (copied from PR #294) 76 023.4 ms media: 698, directories: 0

*Measurements run 20 times and an average was calculated.

run for : 2316553.0ms


I do not see much difference between the ˛exifr settings. Also for exifreader, I have to read a file, which I just do by feeling. Most likely I read more bytes than absolutely necessary, that is why it is slower (I guess). I could not find an exifreader API that could read the file itself.

Based on this I find exifr the most appealing for this project.

bpatrik avatar May 13 '21 12:05 bpatrik

exifr now support sidecars: https://github.com/MikeKovarik/exifr/issues/62

bpatrik avatar Jul 07 '21 20:07 bpatrik

3. `image-size` to get the photo dimension       

A comment from a veteran user of exiftool on the exiftool forums, indicate that metadata independent determination of image size is a good idea, even if you find data in the exif: https://exiftool.org/forum/index.php?topic=13310.msg71936#msg71936

So perhaps it would be fine to keep image-size around after the cleanup.

grasdk avatar Feb 09 '24 21:02 grasdk

3. `image-size` to get the photo dimension       

A comment from a veteran user of exiftool on the exiftool forums, indicate that metadata independent determination of image size is a good idea, even if you find data in the exif: https://exiftool.org/forum/index.php?topic=13310.msg71936#msg71936

So perhaps it would be fine to keep image-size around after the cleanup.

keeping image-size sounds good to me.

bpatrik avatar Feb 20 '24 22:02 bpatrik

Howdy! I'm the author of exiftool-vendored.

Windows has especially poor child process spawn performance: I've seen Windows take upwards of 10 seconds on a lightly loaded server to just allow the process to spin. I had to increase timeouts to 30 seconds on GHA! For reference, a slow Linux server can spin up a subprocess in 3-100ms.

As exiftool is running in a separate process, it suffers from this OS performance hit, but exiftool allows you to amortize that cost over many thousands of file reads if you use it's "-stay_open" mode (which my library leverages).

I'd update your benchmark to try reading from multiple files concurrently, rather than waiting for every read -- exiftool-vendored will spawn a pool of exiftool child processes (and manage their lifecycle, shutting them down when idle and spinning them up as your demand ebbs and flows).

One possibly surprising thing with my library: in an effort to avoid bug reports saying that exiftool fork-bombed or IO-starved their box, I default the maxProcs setting to be 1/4 the number of CPUs on the current box. In other words, it's not set up to win any benchmark with default settings. Changing maxProcs to be require("node:os").cpus().length should help: in my tests, read performance on Windows is ~<200ms/file, but of course this depends on CPU speed, filesystem performance, and arguments provided to the read command--it turns out that well-written perl is quite fast: on my AMD 5950x, Ubuntu 22, and a gen-3 SSD, read() throughput is 500+ files/second (which backs into ~60ms/read())

Edit: and just to clarify, I'm not asking you to switch EXIF libraries--I'm just wanting to highlight the issues with using exiftool on fork-hostile OSes. If exifr behaves as expected for you, then that may indeed be your best option. :beers:

mceachen avatar Mar 25 '24 19:03 mceachen