Cache image dimensions on upload

Open distantnative opened this issue 2 years ago • 25 comments

Kirby Team summary

Status quo

We don't know the image dimensions (of the original as well as of the resulting thumb) when we create job files for thumbs, since we currently don't read the image file at that point, only when processing the job file.

This can lead to problems, such as wrong thumb names (incl. wrong image dimensions) when the job file expects a result larger than the original file. And since we don't upscale, we end up with a thumb that is labeled larger than it actually is.

We need to know the image dimensions beforehand.
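
As a rough illustration of the mismatch (the 3000px request and the resulting filename are made up for this example, the thumb call itself is Kirby's regular API):

// Hypothetical example: the original image is only 1200px wide
$thumb = $image->thumb(['width' => 3000]);

// The job file is written before the original is ever read, so the thumb name
// already claims the requested width, e.g. something like image-3000x.jpg.
// Because we don't upscale, the file later generated from that job is still
// only 1200px wide, so the name and the actual content no longer match.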

Exploration

We tested different ways of caching the image dimensions (e.g. in the content file) and discussed other storage locations. None really felt good so far.

Todo

  • [ ] Measure the performance impact of reading the image dimensions directly from the image file each time we need them

Original post

If we resize an image, we cannot know the end result's dimensions, as we don't read the image file but only create a job file. This can lead to wrongly labeled images (the name claims the resized version is larger than the actual result, because we don't upscale). See https://github.com/getkirby/kirby/issues/2071

One first step to fix this could be to cache/store the image dimensions in the content file.
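
A minimal sketch of that first step, assuming Kirby's file hooks; the plugin name and the width/height field names are placeholders, not a decided format:

// Hypothetical plugin: write the dimensions into the content file on upload and replace
Kirby::plugin('acme/dimensions-cache', [
    'hooks' => [
        'file.create:after' => function ($file) {
            if ($file->type() === 'image') {
                $file->update([
                    'width'  => $file->dimensions()->width(),
                    'height' => $file->dimensions()->height(),
                ]);
            }
        },
        'file.replace:after' => function ($newFile, $oldFile) {
            if ($newFile->type() === 'image') {
                $newFile->update([
                    'width'  => $newFile->dimensions()->width(),
                    'height' => $newFile->dimensions()->height(),
                ]);
            }
        },
    ]
]);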

distantnative avatar Jul 19 '21 15:07 distantnative

It makes the system more complex though. If the image is replaced, we also need to update that field in the content file. And what happens if the image is modified manually in the file system?

lukasbestle avatar Jul 19 '21 19:07 lukasbestle

@lukasbestle What's the alternative? Reading the file directly would be a massive performance hit

distantnative avatar Jul 19 '21 19:07 distantnative

Yep, absolutely. I don't have a better idea, but I'm not 100 % happy with it either.

lukasbestle avatar Jul 19 '21 20:07 lukasbestle

What about hashing each file to identify it? Though we would still need to match the hash on each request or in the job.

afbora avatar Jul 19 '21 20:07 afbora

While caching the dimensions would make the system more complex, it would also benefit panel performance considerably.

bastianallgeier avatar Jul 20 '21 06:07 bastianallgeier

What if we cache the dimensions in a simple SQLite DB in the cache folder? We could use the same DB as a cache for the "immutable IDs for pages" feature later. If the file is deleted, that's no big deal as all information can be regathered from the original files. This would help keep the content dir clean of such cached data.
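
Roughly what that could look like (only a sketch; the table layout, the file name and using the modification time as invalidation key are assumptions for illustration):

// Hypothetical SQLite-backed dimensions cache in the cache folder
$db = new PDO('sqlite:' . $kirby->root('cache') . '/dimensions.sqlite');
$db->exec(
    'CREATE TABLE IF NOT EXISTS dimensions
     (root TEXT PRIMARY KEY, width INTEGER, height INTEGER, modified INTEGER)'
);

$root     = $file->root();
$modified = filemtime($root);

$select = $db->prepare('SELECT width, height FROM dimensions WHERE root = ? AND modified = ?');
$select->execute([$root, $modified]);

if ($row = $select->fetch(PDO::FETCH_ASSOC)) {
    // cache hit: no need to touch the image file at all
    [$width, $height] = [(int)$row['width'], (int)$row['height']];
} else {
    // cache miss (new or changed file): read once, then remember the result
    [$width, $height] = getimagesize($root);
    $insert = $db->prepare('INSERT OR REPLACE INTO dimensions VALUES (?, ?, ?, ?)');
    $insert->execute([$root, $width, $height, $modified]);
}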

lukasbestle avatar Jul 20 '21 19:07 lukasbestle

Based on experience from Retour and the search plugin, there are quite a few servers out there that don't ship with a proper SQLite setup by default. Could cause us some troubles

distantnative avatar Jul 20 '21 20:07 distantnative

Why can't we cache it in the meta file?

bastianallgeier avatar Jul 22 '21 13:07 bastianallgeier

For manual uploads, we could cache them the first time they are read and not yet found in the meta file. That would only be a one-time performance hit for a single visitor; I think that would be fine.
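
For illustration, a read-through version of that idea (the width/height content fields are placeholders, the rest uses Kirby's regular file API):

// Hypothetical lazy cache: use the meta file if the fields are already there,
// otherwise read the image once and store the result for the next request
$cachedWidth  = $file->content()->get('width');
$cachedHeight = $file->content()->get('height');

if ($cachedWidth->isEmpty() || $cachedHeight->isEmpty()) {
    [$width, $height] = getimagesize($file->root());
    $file = $file->update(['width' => $width, 'height' => $height]);
} else {
    $width  = (int)$cachedWidth->value();
    $height = (int)$cachedHeight->value();
}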

distantnative avatar Jul 22 '21 13:07 distantnative

Yes, exactly. Then we only need to update the dimensions on replacement. When you handle files manually, you can easily remove the cached width and height from the meta file when you replace a file to refresh the cache

bastianallgeier avatar Jul 22 '21 13:07 bastianallgeier

I agree with you on the technical side, I just have a gut feeling that this is not as clean as it could be.

So far the meta file is still mostly used for custom content fields. The only ones Kirby uses by default are template and sort. Both of these cannot be stored elsewhere and they are actual meta information that belongs to the file. If, on the other hand, we now start writing simple caches to the meta file ("simple cache" in the sense that the information duplicates what is already stored in the file itself), the meta file becomes more of a mess.

Or in more practical terms: Let's say you version control your content with Git or similar. With every change to an image file the meta file changes as well in the repo, just because of this cache. But generally you wouldn't want to have caches checked in to version control.

It sure is a tiny problem, but it makes me feel we are misusing content files for internal data that doesn't really belong there.

lukasbestle avatar Jul 22 '21 19:07 lukasbestle

@lukasbestle but is there an alternative if reading the original file is so performance heavy?

distantnative avatar Jul 22 '21 21:07 distantnative

Have you thought about adding this information to the filename by appending something like @3000x2000?
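
Just to make the idea concrete (the @WIDTHxHEIGHT naming scheme is purely hypothetical):

// e.g. cat@3000x2000.jpg -> width 3000, height 2000
$filename = 'cat@3000x2000.jpg';

if (preg_match('/@(\d+)x(\d+)\.[a-z0-9]+$/i', $filename, $matches) === 1) {
    $width  = (int)$matches[1];
    $height = (int)$matches[2];
}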

nilshoerrmann avatar Jul 23 '21 04:07 nilshoerrmann

I think we should store the information in the cache dir. This also allows to easily clear all cached dimensions if there is an issue somewhere.

If SQLite isn't an option, we need to look for alternatives. But given that we need some kind of database for the unique page ID feature as well, I wonder if we have much choice.

lukasbestle avatar Jul 23 '21 10:07 lukasbestle

@lukasbestle but isn't creating a whole new place for file information increasing the complexity even more? Right now you have the actual file and the meta file to consider during file operations, which we already handle successfully for many of them. With your suggestion we'd have a whole other part that needs to be considered and must not be overlooked in many operations.

distantnative avatar Jul 23 '21 10:07 distantnative

When you handle files manually, you can easily remove the cached width and height from the meta file when you replace a file to refresh the cache

If we also wrote a modified timestamp to the meta file, we could check it each time we read/process the actual file anyway and correct the dimensions whenever the timestamps of the meta file and the actual file no longer match.
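
A sketch of that check, assuming a hypothetical modified field stored next to the cached width/height:

// Compare the cached timestamp with the file on disk and refresh the cache if they differ
$cachedModified = (int)$file->content()->get('modified')->value();
$diskModified   = filemtime($file->root());

if ($cachedModified !== $diskModified) {
    [$width, $height] = getimagesize($file->root());
    $file->update([
        'width'    => $width,
        'height'   => $height,
        'modified' => $diskModified,
    ]);
}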

distantnative avatar Jul 23 '21 10:07 distantnative

Maybe consider evaluating the use of a library that scans the image sizes from their headers (without opening the whole image). Something like https://github.com/marc1706/fast-image-size

Might be faster than opening and parsing the meta data file.
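
Not that library's code, but a rough illustration of the technique for the simplest case: a PNG stores width and height at a fixed offset in its IHDR chunk, so a handful of bytes are enough (JPEG needs to walk its segments and is messier):

// Read only the first 24 bytes of a PNG: 8-byte signature, IHDR length/type, width, height
function pngSize(string $path): ?array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $header = fread($handle, 24);
    fclose($handle);

    if (strlen($header) < 24 || substr($header, 1, 3) !== 'PNG') {
        return null;
    }

    $size = unpack('Nwidth/Nheight', substr($header, 16, 8));
    return [$size['width'], $size['height']];
}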

rasteiner avatar Jul 26 '21 16:07 rasteiner

@distantnative I think we are talking past each other. What I meant is that storing internal caching data in content files is not Kirby-like. Kirby content files are fully in control of the dev. It really doesn't feel right to me to suddenly store all kinds of "machine data" in them. A timestamp is even worse as it's not even really human-readable. If I were a Kirby user who edits content files manually, I'd be really confused what all these new caching fields are about.

We also need to consider that content files are often managed in Git repos. With transient data you can easily run into merge conflicts, which is really unnecessary if the conflicting data is just a boring cache.

What I'm trying to say is: Caches belong in site/cache, nowhere else. That's what this directory is for and the workflows of our advanced users depend on this property. We shouldn't store caches in other places and we shouldn't store non-caches in site/cache. Like you also wouldn't want to store sessions in site/config all of a sudden (OK, theoretical example, but you get the idea).

I really like @rasteiner's idea. I haven't looked at this particular library, but a solution that doesn't need caching in the first place is so much better than anything else.

lukasbestle avatar Jul 26 '21 20:07 lukasbestle

I ran some first tests with https://github.com/marc1706/fast-image-size with JPGs:

getimagesize
24 kb: 7.5548410415649E-6 ms
3.4 mb: 7.9078197479248E-6 ms

FastImageSize
24 kb: 1.9664580821991E-5 ms
3.4 mb: 9.0913259983063E-5 ms

This doesn't look good for the library. @lukasbestle you'll know best if I have a logic error in the test:

protected function readSize(string $filename, callable $method, int $repetitions = 100000): float
{
    // microtime(true) returns seconds; the result is the average duration per call
    $before = microtime(true);

    for ($i = 0; $i < $repetitions; $i++) {
        $size = call_user_func($method, __DIR__ . '/fixtures/image/' . $filename);
    }

    $after = microtime(true);

    return ($after - $before) / $repetitions;
}


public function testPerformance() {
    // getimagesize
    echo "\n\ngetimagesize\n";
    echo "24 kb: " . $this->readSize('cat.jpg', 'getimagesize') . " ms\n";
    echo "3.4 mb: " . $this->readSize('test.jpeg', 'getimagesize') . " ms";

    // library
    echo "\n\nFastImageSize\n";
    $reader = new FastImageSize();

    echo "24 kb: " . $this->readSize('cat.jpg', [$reader, 'getImageSize']) . " ms\n";
    echo "3.4 mb: " . $this->readSize('test.jpeg', [$reader, 'getImageSize']) . " ms";
}

distantnative avatar Aug 05 '21 19:08 distantnative

Ok, skip the library (even if loading the same image 100000 times probably puts it into some kind of cache at disk, OS or PHP level, which is much faster than an actual disk read, nullifying the advantage of reading only a few bytes [or kB in the worst case of JPEGs] instead of the whole file). Or maybe getimagesize has become smarter in the meantime (the lib is 6 years old). Or you have a disk that really reads at 3.4 MB / 7.9e-9 s ≈ 430 TB/s, in which case congrats for your setup :)

Otherwise, if we were really able to read an image size in 8×10⁻⁶ ms, we could get the size of 125,000 images in 1 ms, so it wouldn't be much of a real performance problem. But it's likely that the real performance hit comes from the syscalls (check if the file exists, open the file, read the file, read the file, read the file, read the file until we get that damned JPEG header, close the file), probably each of them triggering an interrupt request. Therefore we don't really have a reason to believe that parsing a meta data file with KirbyData and SPYC would be much faster (we would still need to load a file where currently we don't).

Much faster could be a cache driver that reads the size from memory (memcached, APCu, etc.) when available and configured. If none is available, using a file in site/cache could still be faster than loading and parsing meta files, because it no longer needs to be a "human readable" format like the current KirbyData and YAML. It could simply be a 0-byte file with the size written in the filename (no need to open it), or it could be a file that groups several image sizes together (like all images of a page) to reduce syscalls and stores them in a format that is faster to parse (I don't know... PHP serialize maybe?)
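
As a toy version of the "0-byte file" idea (directory name, hashing and naming scheme made up for illustration):

// The dimensions live in the cache file's name, so listing the directory is enough
// and the cache file itself never has to be opened
$cacheDir = $kirby->root('cache') . '/dimensions';
@mkdir($cacheDir, 0755, true);

$hash  = sha1($file->root());
$found = glob($cacheDir . '/' . $hash . '-*x*');

if (!empty($found)) {
    // e.g. ".../<hash>-4000x3000"
    [$width, $height] = explode('x', substr(basename($found[0]), strlen($hash) + 1));
} else {
    [$width, $height] = getimagesize($file->root());
    touch($cacheDir . '/' . $hash . '-' . $width . 'x' . $height);
}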

rasteiner avatar Aug 06 '21 11:08 rasteiner

@rasteiner and you perfectly pointed out spots where my testing could be flawed 😄🤷‍♂️

distantnative avatar Aug 06 '21 11:08 distantnative

I think @rasteiner has a point here. If it's really that fast to get the image size, caching the size in a text file on disk would likely even make it slower. Considering the huge complexity issue we discussed previously, it makes me wonder whether it's worth it at all to cache the dimensions.

Regarding the test setup: There's definitely the file system cache involved here and likely PHP caches the file data as well. It could also well be that PHP caches the result of getimagesize(), which would be another explanation for why it is much faster than the library. The only way to test this is to create a directory with literally thousands of different image files and get all of their sizes in a loop. Ideally the two comparisons would be run independently (not in the same PHP process) so that PHP's in-memory caching can't influence the results of the second method that's tested.
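
A sketch of such a test, assuming a fixtures directory with thousands of distinct images and one method per PHP process (selected via an argv switch):

// Run as: php benchmark.php native   and separately   php benchmark.php library
require __DIR__ . '/vendor/autoload.php';

use FastImageSize\FastImageSize;

$files  = glob(__DIR__ . '/fixtures/many-images/*.jpg'); // thousands of different files
$method = $argv[1] ?? 'native';
$reader = new FastImageSize();

$start = microtime(true);

foreach ($files as $file) {
    if ($method === 'native') {
        getimagesize($file);
    } else {
        $reader->getImageSize($file);
    }
}

printf("%s: %.4f s for %d files\n", $method, microtime(true) - $start, count($files));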

lukasbestle avatar Aug 22 '21 12:08 lukasbestle

I feel we are back at square one and won't be able to solve this. Maybe we should really test how much of a performance hit it would be to actually just read the image dimensions from the image when needed.

distantnative avatar Sep 22 '21 19:09 distantnative

I agree.

lukasbestle avatar Sep 22 '21 19:09 lukasbestle

This issue has been automatically marked as stale because it has not had recent activity. This is for us to prioritize issues that are still relevant to our community. It will be closed if no further activity occurs within the next 14 days. If this issue is still relevant to you, please leave a comment.

stale[bot] avatar Aug 10 '22 10:08 stale[bot]