pigallery2
Clean up MetadataLoader.ts
Pigallery2 uses multiple exif/iptc parsers (as I could not find one that can do everything at once) to get all the metadata for a photo:

- `fs.statSync` for file size and creation date (if I do not find a better one in the exif): https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L107
- `ts-exif-parser` for most of the exif data: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L114
- `image-size` to get the photo dimensions if `ts-exif-parser` fails: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L169 (the image dimensions are a must for this app to work properly)
- `ts-node-iptc` for GPS/location, keywords, and caption-related info: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L177
- `exifreader` for faces, and now also for orientation, as this lib parses that properly: https://github.com/bpatrik/pigallery2/blob/36ecbcbd3d5d3e659ddf20add456982dcb960d3d/src/backend/model/threading/MetadataLoader.ts#L206
This class needs a major refactor. `exifreader` would probably be able to handle all cases now. Reducing to a single lib would likely make photo indexing much faster: instead of 4-5 file opens per photo, one would do the trick.
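To make the idea concrete, here is a hypothetical sketch of what a single-library loader built on `exifreader` could look like. The `PhotoMetadata` shape is simplified, and the tag keys (`Image Width`, `Orientation`, `DateTimeOriginal`) follow exifreader's naming; none of this is the project's actual implementation.

```typescript
import * as fs from 'fs';

// Simplified stand-in for pigallery2's real PhotoMetadata interface.
interface PhotoMetadata {
  size: { width: number; height: number };
  orientation: number;
  creationDate: number;
  fileSize: number;
}

// Exif stores dates as "YYYY:MM:DD HH:MM:SS"; turn that into a timestamp.
export function exifDateToTimestamp(exifDate: string): number {
  const parseable = exifDate.replace(/^(\d{4}):(\d{2}):(\d{2})/, '$1-$2-$3');
  return new Date(parseable).getTime();
}

export async function loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
  // Lazy require so the pure helper above works without the package installed.
  const ExifReader = require('exifreader');
  const stat = await fs.promises.stat(fullPath); // file size + fallback date
  const tags = ExifReader.load(await fs.promises.readFile(fullPath)); // one parse
  return {
    size: {
      width: tags['Image Width']?.value ?? 1,
      height: tags['Image Height']?.value ?? 1,
    },
    orientation: tags.Orientation?.value ?? 1, // exifreader parses this properly
    creationDate: tags.DateTimeOriginal
      ? exifDateToTimestamp(tags.DateTimeOriginal.description)
      : stat.mtime.getTime(), // fall back to fs date, as the current code does
    fileSize: stat.size,
  };
}
```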
@cbhushan suggested https://mutiny.cz/exifr/ in https://github.com/bpatrik/pigallery2/issues/212#issuecomment-809673428, but it is missing Face region support: https://github.com/MikeKovarik/exifr/issues/59
Hi! If I can do anything to help ease a transition, please let me know. /Maintainer of ExifReader
Thank you, @mattiasw, for your library. I think it is the best candidate for pigallery2 to have "one exif reader to rule them all".
So far I'm not planning to put effort into it on my side; I opened this bug in case anyone feels the power to implement it :) We already have some unit tests in place for the metadata loader, so a refactor should be relatively safe.
I "hacked up" a quick benchmark between 3 exif readers. (https://github.com/bpatrik/pigallery2/pull/294#discussion_r629451107 suggests exiftool is not as fast as we hoped.)
Disclaimer: I did not double-check whether all three can provide all the needed metadata, but I think so; at least their developers are quite responsive.
Libs to test:
- "exiftool-vendored": "14.3.0"
- "exifr": "7.0.0"
- "exifreader": "3.15.0"
The test was done on Windows 10 with an SSHD (it is possible that the HDD cached the photos after a while); number of photos: 698, dir size: 3.28 GB. Each test was run 10 times and an average was calculated.
Result:
- exifreader: 4116.93 ms
- exiftool: 34532.78 ms
- exifr: 990.26 ms
Reproduce the test: `MetadataLoader.ts`'s `loadPhotoMetadata(fullPath: string)` got modified to:
```typescript
/**** exiftool ****/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
  return new Promise<PhotoMetadata>(async (resolve, reject) => {
    const exif = await exiftool.read(fullPath, ['--ifd1:all']);
    resolve({
      size: {width: 1, height: 1},
      orientation: OrientationTypes.TOP_LEFT,
      creationDate: 0,
      fileSize: 0
    });
  });
}

/**** exifr ****/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
  return new Promise<PhotoMetadata>(async (resolve, reject) => {
    const exif = await exifr.parse(fullPath);
    resolve({
      size: {width: 1, height: 1},
      orientation: OrientationTypes.TOP_LEFT,
      creationDate: 0,
      fileSize: 0
    });
  });
}

/**** exifreader ****/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
  return new Promise<PhotoMetadata>(async (resolve, reject) => {
    const fd = fs.openSync(fullPath, 'r');
    const data = Buffer.allocUnsafe(Config.Server.photoMetadataSize);
    fs.read(fd, data, 0, Config.Server.photoMetadataSize, 0, (err) => {
      fs.closeSync(fd);
      if (err) {
        return reject({file: fullPath, error: err});
      }
      const exif = ExifReader.load(data); // synchronous call, no await required
      resolve({
        size: {width: 1, height: 1},
        orientation: OrientationTypes.TOP_LEFT,
        creationDate: 0,
        fileSize: 0
      });
    });
  });
}
```
NOTE: for exifreader, not the whole file is read, only part of it. The exif data is at the beginning of the file, so there is no need to read the actual photo.
NOTE2: I could not find an API for exiftool to read only the beginning of the file.
NOTE3: It seems to be possible to read only the beginning of the file with exifr as well, although the tests were done by passing the file path.
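Reading only the file head can be sketched with a promise-based helper like the one below; the buffer it returns could then be fed to either parser, since both accept a `Buffer`. The 512 KiB size is a guess, not a measured minimum, and `readFileHead` is my name, not a pigallery2 function.

```typescript
import * as fs from 'fs';

const HEAD_BYTES = 512 * 1024; // assumption: enough for the metadata segments

// Read only the first `bytes` of a file, never the whole photo.
export async function readFileHead(fullPath: string, bytes = HEAD_BYTES): Promise<Buffer> {
  const fh = await fs.promises.open(fullPath, 'r');
  try {
    const buffer = Buffer.alloc(bytes);
    const { bytesRead } = await fh.read(buffer, 0, bytes, 0);
    return buffer.subarray(0, bytesRead); // may be shorter than requested
  } finally {
    await fh.close(); // close the descriptor even if the read fails
  }
}

// Usage (requires the packages to be installed):
//   const exif = ExifReader.load(await readFileHead(fullPath));
//   const exif2 = await exifr.parse(await readFileHead(fullPath));
```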
Test sandbox: `pigallery2/test.ts`:

```typescript
import {DiskMangerWorker} from './src/backend/model/threading/DiskMangerWorker';
import {Config} from './src/common/config/private/Config';
import {ProjectPath} from './src/backend/ProjectPath';

const run = async () => {
  Config.Server.Media.folder = <path to a folder with 698 photos>;
  ProjectPath.reset();
  let time = 0;
  const TIMES = 10;
  for (let i = 0; i < TIMES; ++i) {
    console.log('round', i);
    const start = process.hrtime();
    await DiskMangerWorker.scanDirectory('/');
    const end = process.hrtime(start);
    time += (end[0] * 1000 + end[1] / 1000000);
    console.log('round time:', (end[0] * 1000 + end[1] / 1000000));
  }
  console.log(time / TIMES);
};

run().catch(console.error);
```
It looks like exiftool is out of the running.
I also think that benchmark might not be fair for exifr vs exifreader: it appears exifr is very fast by default because it only parses the IFD0, EXIF and GPS tags, so no IPTC or XMP(?). I tried a benchmark loading all tags, `exifr.parse(file, true)`, and it was of similar performance to exifreader.
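The three exifr configurations being compared can be sketched as below. The segment toggles (`ifd0`, `exif`, `gps`, `iptc`, `xmp`) are exifr's documented options; which segments pigallery2 actually needs is my assumption.

```typescript
// Assumed "selected" segment set: core exif + GPS, plus IPTC for keywords/
// captions and XMP, where face regions are often stored.
export const NEEDED_SEGMENTS = {
  ifd0: true,  // basic image tags
  exif: true,  // core exif block
  gps: true,   // GPS / location
  iptc: true,  // keywords, caption
  xmp: true,   // XMP (face regions)
};

export async function parseFast(fullPath: string) {
  const exifr = require('exifr');     // lazy require: npm package "exifr"
  return exifr.parse(fullPath);       // default: TIFF segment only, no IPTC/XMP
}

export async function parseAll(fullPath: string) {
  const exifr = require('exifr');
  return exifr.parse(fullPath, true); // `true` = parse every segment
}

export async function parseSelected(fullPath: string) {
  const exifr = require('exifr');
  return exifr.parse(fullPath, NEEDED_SEGMENTS); // only what we need
}
```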
I made an experimental branch that builds `bpatrik/pigallery2:experimental-debian-buster`.
I made multiple experiments: https://github.com/bpatrik/pigallery2/blob/36d5ccbbd4c077d25c99f47c215740a46122cad4/src/backend/model/threading/MetadataLoader.ts#L93-L111
And ran my benchmark.
Result:
PiGallery2 v1.8.7 (experimental), 13.05.2021
Version: v1.8.7, built at: Thu May 13 2021 10:31:47 GMT+0000 (Coordinated Universal Time), git commit: 36d5ccbbd4c077d25c99f47c215740a46122cad4
System: Raspberry Pi 4 Model B 4GB, SanDisk Mobile Ultra 32GB Class 10, UHS-I; HDD: Western Digital Elements 1TB (WDBUZG0010BBK)
Gallery: directories: 31, photos: 2036, videos: 35, disk usage: 22.08GB, persons: 1241, unique persons (faces): 14
Action | Average Duration | Result
---|---|---
[baseline] Scanning directory (current production method) | 10 383.2 ms | media: 698, directories: 0
[loadPhotoMetadata=exifr] Scanning directory (does not parse the lib output) | 2 502.6 ms | media: 698, directories: 0
[loadPhotoMetadata=exifrAll] Scanning directory (does not parse the lib output) | 2 454.3 ms | media: 698, directories: 0
[loadPhotoMetadata=exifrSelected] Scanning directory (does not parse the lib output) | 2 437.0 ms | media: 698, directories: 0
[loadPhotoMetadata=exifreader] Scanning directory (does not parse the lib output) | 9 595.9 ms | media: 698, directories: 0
[loadPhotoMetadata=exiftool] Scanning directory (copied from PR #294) | 76 023.4 ms | media: 698, directories: 0
*Measurements were run 20 times and an average was calculated.
Total run time: 2316553.0 ms.
I do not see much difference between the `exifr` settings. Also, for exifreader I have to read the file myself, which I currently size by feel; most likely I read more bytes than absolutely necessary, and that is why it is slower (I guess). I could not find an exifreader API that could read the file itself.
Based on this, I find `exifr` the most appealing for this project.
exifr now supports sidecars: https://github.com/MikeKovarik/exifr/issues/62
> 3. `image-size` to get the photo dimension

A comment from a veteran exiftool user on the exiftool forums indicates that metadata-independent determination of the image size is a good idea, even if you find dimensions in the exif: https://exiftool.org/forum/index.php?topic=13310.msg71936#msg71936
So perhaps it would be fine to keep `image-size` around after the cleanup.
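That suggestion could look something like the sketch below: use `image-size` as a metadata-independent cross-check and trust the decoded headers over whatever the exif claims. `resolveDimensions` and the injectable reader parameter are mine (the latter just for testability); by default it calls the `image-size` npm package.

```typescript
type Size = { width: number; height: number };

export function resolveDimensions(
  fullPath: string,
  exifSize: Size | undefined,
  // Defaults to the image-size package; injectable for tests.
  readActualSize: (p: string) => Size = (p) => require('image-size')(p)
): Size {
  const actual = readActualSize(fullPath); // decodes file headers, ignores exif
  if (!exifSize || exifSize.width !== actual.width || exifSize.height !== actual.height) {
    return actual; // exif missing or lying: trust the pixels
  }
  return exifSize;
}
```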
Keeping `image-size` sounds good to me.
Howdy! I'm the author of exiftool-vendored.
Windows has especially poor child-process spawn performance: I've seen Windows take upwards of 10 seconds on a lightly loaded server just to let the process spin up. I had to increase timeouts to 30 seconds on GHA! For reference, a slow Linux server can spin up a subprocess in 3-100ms.
As exiftool runs in a separate process, it suffers from this OS performance hit, but exiftool lets you amortize that cost over many thousands of file reads if you use its `-stay_open` mode (which my library leverages).
I'd update your benchmark to try reading from multiple files concurrently, rather than waiting for every read -- exiftool-vendored will spawn a pool of exiftool child processes (and manage their lifecycle, shutting them down when idle and spinning them up as your demand ebbs and flows).
One possibly surprising thing with my library: in an effort to avoid bug reports saying that exiftool fork-bombed or IO-starved their box, I default the `maxProcs` setting to 1/4 of the number of CPUs on the current box. In other words, it's not set up to win any benchmark with default settings. Changing `maxProcs` to `require("node:os").cpus().length` should help: in my tests, read performance on Windows is ~<200ms/file, but of course this depends on CPU speed, filesystem performance, and the arguments provided to the read command. It turns out that well-written perl is quite fast: on my AMD 5950x, Ubuntu 22, and a gen-3 SSD, `read()` throughput is 500+ files/second (which backs into ~60ms/`read()`).
Edit: and just to clarify, I'm not asking you to switch EXIF libraries -- I'm just wanting to highlight the issues with using exiftool on fork-hostile OSes. If exifr behaves as expected for you, then that may indeed be your best option. :beers:
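Putting the advice from this comment together, a concurrent benchmark variant might look like the sketch below: raise `maxProcs` and overlap the reads so the `-stay_open` process pool is actually kept busy. `ExifTool`, `maxProcs`, `read()` and `end()` are exiftool-vendored's public API; the wrapper function is mine.

```typescript
import * as os from 'os';

export async function readAllConcurrently(paths: string[]): Promise<unknown[]> {
  if (paths.length === 0) {
    return []; // nothing to do; don't spin up child processes
  }
  const { ExifTool } = require('exiftool-vendored'); // lazy require of the package
  // Default maxProcs is ~1/4 of the CPUs; use them all for a benchmark.
  const et = new ExifTool({ maxProcs: os.cpus().length });
  try {
    // Launch every read at once: the pool amortizes the process spawn cost
    // across many files instead of paying per-call latency serially.
    return await Promise.all(paths.map((p) => et.read(p)));
  } finally {
    await et.end(); // always shut the exiftool child processes down
  }
}
```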