
Feature Request: Remember/skip already optimized files

Open · Saijin-Naib opened this issue 4 years ago · 3 comments

FileOptimizer does something similar by recording the path/name of the files it has successfully optimized, but I think a better/more robust approach would be to hash every file as it is added to Caesium and store the hash. After compression, re-hash the file and write that hash out to a store (database? SQLite?) as another optimized hash.

If files are re-added in the future and their hash shows up as already optimized, set their status to optimized and skip them during processing. You'd have to do a lookup against the optimized-hash table.

This would also require the ability to force/re-optimize files.
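
A rough sketch of that flow (the helper names are hypothetical, and Qt types are assumed since Caesium is a Qt application):

```cpp
// Sketch only: the declared helpers are stand-ins for whatever store
// and UI hooks Caesium would actually use.
#include <QCryptographicHash>
#include <QFile>
#include <QString>

bool isAlreadyOptimized(const QByteArray &hash); // lookup in the optimized-hash store
void recordOptimized(const QByteArray &hash);    // add a hash to that store
void markAsSkipped(const QString &path);         // flag the item as "optimized" in the list
void compress(const QString &path);              // the existing compression pipeline

// Hash the file in streaming fashion so large images are never loaded whole.
QByteArray hashFile(const QString &path)
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly))
        return {};

    QCryptographicHash hash(QCryptographicHash::Sha256);
    hash.addData(&file);            // reads the device to the end
    return hash.result().toHex();   // e.g. "9f86d081..."
}

// Decision flow for a single file added to the list.
void processFile(const QString &path)
{
    const QByteArray importHash = hashFile(path);

    if (isAlreadyOptimized(importHash)) {
        markAsSkipped(path);        // already done before: skip it
        return;
    }

    compress(path);

    // Re-hash the compressed output and remember both hashes, so re-adding
    // either the source or the result is recognized next time.
    recordOptimized(hashFile(path));
    recordOptimized(importHash);
}
```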

Saijin-Naib · Mar 15 '21, 21:03

This is an interesting suggestion, and the approach seems reasonable. The lookup may slow the compression process down a bit, but the overall run could be significantly faster if, for example, you compress the same folders each time with only a few newer files added.

Hashing really could be the way to go, as long as it takes into account the various compression parameters you select each time. This could totally be a new feature in the 2.0.0 release. Maybe not the very first one, but I really like it.
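
A hedged sketch of what "taking the compression parameters into account" could look like: fold a fingerprint of the settings into the stored key, so the same file compressed with different settings is not skipped. The CompressionOptions struct and its fields are invented for illustration, not Caesium's real options.

```cpp
#include <QByteArray>
#include <QCryptographicHash>

// Illustrative only: field names are made up.
struct CompressionOptions {
    int  jpegQuality  = 80;
    bool lossless     = false;
    bool keepMetadata = true;
};

// Combine the file's content hash with the settings used to compress it.
// Changing any option changes the key, so the file is compressed again.
QByteArray optimizedKey(const QByteArray &fileHash, const CompressionOptions &opts)
{
    QByteArray settings;
    settings.append(QByteArray::number(opts.jpegQuality));
    settings.append(opts.lossless ? "|lossless" : "|lossy");
    settings.append(opts.keepMetadata ? "|meta" : "|nometa");

    return QCryptographicHash::hash(fileHash + settings,
                                    QCryptographicHash::Sha256).toHex();
}
```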

Lymphatus · Mar 17 '21, 15:03

I'm glad I'm not completely off the mark here, and please take my recommendation as broad-strokes only. I can't program my way out of a paper bag, haha.

Yes, what you've outlined is the main use case I can think of: re-optimizing something like a user's Pictures library after merging in new photos from cameras, cell phones, etc., without having to re-process everything (200GB+ in my case) or hand-pick individual files (meep) to avoid re-processing.

I guess for the hashing there'd have to be a balance between the accuracy of the hash and how long it takes, right? From what I understand, SHA-256 shouldn't produce collisions in practice, so it's the most robust option, but it could take a while to compute... Granted, if you're talking about images that are typically 30MB or less, hashed across multiple threads... could it really add that much time to the initial scan/import?

As for the lookup, I have nothing. All I know is that SQLite should be reasonably fast with an index on the column being searched, but beyond that... I've got nothing.
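
For what it's worth, a minimal sketch of that SQLite side, assuming Qt's SQL module (one option among several): making the hash column the PRIMARY KEY gives it an index, so the lookup is a single indexed SELECT.

```cpp
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QVariant>

// Open (or create) the store once at startup.
bool openStore(const QString &path)
{
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName(path);            // e.g. an "optimized.db" in the app data dir
    if (!db.open())
        return false;

    // PRIMARY KEY gives SQLite an index on the hash column for free.
    QSqlQuery query;
    return query.exec("CREATE TABLE IF NOT EXISTS optimized ("
                      "hash TEXT PRIMARY KEY)");
}

// Indexed lookup: returns true if the hash is already recorded.
bool isAlreadyOptimized(const QByteArray &hash)
{
    QSqlQuery query;
    query.prepare("SELECT 1 FROM optimized WHERE hash = :h");
    query.bindValue(":h", QString::fromLatin1(hash));
    return query.exec() && query.next();
}

// Record a hash after a successful compression.
bool recordOptimized(const QByteArray &hash)
{
    QSqlQuery query;
    query.prepare("INSERT OR IGNORE INTO optimized (hash) VALUES (:h)");
    query.bindValue(":h", QString::fromLatin1(hash));
    return query.exec();
}
```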

Saijin-Naib · Mar 17 '21, 15:03

Totally on the mark, and I was actually thinking of your specific use case - which is mine too; I also like photography, even if I'm definitely better at programming.

That said, I think the strength of the hash algorithm is not that important here. Let's say I use MD5 because it's faster, and that it has a 1 in 1,000,000 collision chance (which is way higher than the actual odds). The only downside is that we'd compress an image again, wasting a bit of time. I'd go for a very fast algorithm rather than a secure one; we don't really need security here.
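
If it helps, a rough way to sanity-check that trade-off on real files (a timing sketch only, not a proper benchmark; the file name is just an example):

```cpp
#include <QCryptographicHash>
#include <QElapsedTimer>
#include <QFile>
#include <QDebug>

// Time one streaming hashing pass over a file with the given algorithm.
qint64 timeHash(const QString &path, QCryptographicHash::Algorithm algo)
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly))
        return -1;

    QElapsedTimer timer;
    timer.start();

    QCryptographicHash hash(algo);
    hash.addData(&file);
    hash.result();

    return timer.elapsed();   // milliseconds
}

// Example: compare MD5 against SHA-256 on the same ~30 MB image.
// qDebug() << timeHash("IMG_0001.tif", QCryptographicHash::Md5);
// qDebug() << timeHash("IMG_0001.tif", QCryptographicHash::Sha256);
```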

As for the storage, I don't really know right now. SQLite is an option, but I'll look into alternatives too.

Lymphatus · Mar 17 '21, 15:03