imagetools icon indicating copy to clipboard operation
imagetools copied to clipboard

fix: Consistent image ID hashes across machines

Open AdrianGonz97 opened this issue 1 year ago • 5 comments

This PR fixes an issue that we've been experiencing in https://github.com/huntabyte/shadcn-svelte/pull/978 where the hash for the generated image ID is different when it's computed on different machines. The goal of that PR was to cache the images so that they can be utilized across our GH runners.

The source of the issue is the mtime stat from the image file, which seemingly differs from machine to machine, causing the hash to generate a different id, resulting in a cache MISS. Instead, mtime has been replaced for the size stat, which (in conjunction with the image config and file url) should provide sufficient uniqueness for the hash.


  • Quick Checklist
  • [x] I have read the contributing guidelines
  • [x] I have written new tests, as applicable (for bug fixes / features)
  • [ ] Docs have been added / updated (for bug fixes / features)
  • [x] I have added a changeset, if applicable
  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...) Bug fix

  • What is the new behavior (if this is a feature change)? Implements consistent image ID hashing across machines

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?) No

  • Other information: I wasn't sure the best way to express this change in a test, so I've hardcoded the expected id string. I'm sure there's a better way to do it, so please feel free to modify it!

AdrianGonz97 avatar Apr 01 '24 17:04 AdrianGonz97

🦋 Changeset detected

Latest commit: e43d13f31e1f98e95db02af1f6f045d678117d77

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
vite-imagetools Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

changeset-bot[bot] avatar Apr 01 '24 17:04 changeset-bot[bot]

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes? Or is the mtime set based upon when you check the repo out from git. I'd love to better understand the issue before we switch away from it

benmccann avatar Apr 09 '24 22:04 benmccann

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.48%. Comparing base (8790098) to head (a4aec62). Report is 2 commits behind head on main.

:exclamation: Current head a4aec62 differs from pull request most recent head e43d13f. Consider uploading reports for the commit e43d13f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #711      +/-   ##
==========================================
- Coverage   95.49%   95.48%   -0.01%     
==========================================
  Files          33       33              
  Lines        1288     1286       -2     
  Branches      226      226              
==========================================
- Hits         1230     1228       -2     
  Misses         58       58              
Flag Coverage Δ
imagetools-core 95.48% <100.00%> (-0.01%) :arrow_down:
vite-imagetools 95.48% <100.00%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 09 '24 22:04 codecov[bot]

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes?

It's definitely strange 😄. The OSes are set to be the same as they're running from the same workflow, so I don't think it's that. Perhaps it's more related to timezones? 🤔

Or is the mtime set based upon when you check the repo out from git

I'd be happy to dig into this question when I get a chance!

AdrianGonz97 avatar Apr 09 '24 22:04 AdrianGonz97

Or is the mtime set based upon when you check the repo out from git

It seems like this assumption was correct, @benmccann! After checking out the repo on 2 separate instances (on the same machine), I'm ending up with two different mtime stats.

File stats 1:
{
  stats: Stats {
    dev: 2080,
    mode: 33188,
    nlink: 1,
    uid: 1000,
    gid: 1000,
    rdev: 0,
    blksize: 4096,
    ino: 768440,
    size: 159643,
    blocks: 312,
    atimeMs: 1712710759640.3115,
    mtimeMs: 1712710733650.3157,
    ctimeMs: 1712710733650.3157,
    birthtimeMs: 1712710733650.3157,
    atime: 2024-04-10T00:59:19.640Z,
    mtime: 2024-04-10T00:58:53.650Z,
    ctime: 2024-04-10T00:58:53.650Z,
    birthtime: 2024-04-10T00:58:53.650Z
  }
}
File stats 2:
{
  stats: Stats {
    dev: 2080,
    mode: 33188,
    nlink: 1,
    uid: 1000,
    gid: 1000,
    rdev: 0,
    blksize: 4096,
    ino: 1598355,
    size: 159643,
    blocks: 312,
    atimeMs: 1712711152420.249,
    mtimeMs: 1712711150750.2493,
    ctimeMs: 1712711150750.2493,
    birthtimeMs: 1712711150750.2493,
    atime: 2024-04-10T01:05:52.420Z,
    mtime: 2024-04-10T01:05:50.750Z,
    ctime: 2024-04-10T01:05:50.750Z,
    birthtime: 2024-04-10T01:05:50.750Z
  }
}

AdrianGonz97 avatar Apr 10 '24 01:04 AdrianGonz97

Sorry for the long delay. I'm afraid I'd forgotten about this PR.

I'm nervous about using the file size because you could hit false positives with it. I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I had an idea that we could use mtime first and then fallback to the hash if it's different, but that probably wouldn't work because it would require us storing two hashes.

benmccann avatar Apr 28 '24 21:04 benmccann

I'm nervous about using the file size because you could hit false positives with it.

Fair enough!

I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I agree. Testing locally, I modified it to use imageBuffer instead, which had no perceivable performance impact relative to the rest of build.

I also did some microbenchmarking for it (just in case), testing different image sizes to see their impact:

┌─────────┬───────────────────────────────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name                         │ ops/sec   │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────────────────────────────┼───────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'hash with size - base (500kb)'   │ '231,842' │ 4313.26988522398   │ '±0.63%' │ 231843  │
│ 1       │ 'hash with buffer - base (500kb)' │ '5,979'   │ 167226.18963211094 │ '±0.23%' │ 5980    │
│ 2       │ 'hash with buffer - 11mb'         │ '165'     │ 6036291.090361429  │ '±0.43%' │ 166     │
│ 3       │ 'hash with buffer - 5mb'          │ '311'     │ 3215181.397435922  │ '±0.31%' │ 312     │
│ 4       │ 'hash with buffer - 2.5mb'        │ '1,099'   │ 909751.2799999929  │ '±0.20%' │ 1100    │
│ 5       │ 'hash with buffer - 1mb'          │ '2,599'   │ 384681.17346155015 │ '±0.40%' │ 2600    │
│ 6       │ 'hash with buffer - 500kb'        │ '5,875'   │ 170197.13359428805 │ '±0.28%' │ 5876    │
│ 7       │ 'hash with buffer - 350kb'        │ '6,425'   │ 155626.61889200658 │ '±0.25%' │ 6426    │
│ 8       │ 'hash with buffer - 100kb'        │ '16,379'  │ 61051.895238088415 │ '±0.44%' │ 16380   │
└─────────┴───────────────────────────────────┴───────────┴────────────────────┴──────────┴─────────┘

Here's the repo for it too: https://github.com/AdrianGonz97/imagetools-hashing-benchmarks

Hashing with the size (or mtime) stat is significantly faster, however, I would consider hashing with the image to be sufficiently fast enough for our needs. Especially since this operation doesn't execute hundreds of thousands of times per build.

So in practice, I think it's completely fine to hash with the image itself (even without the mtime optimization you described) as the performance impact is miniscule when compared to the rest of the build.


Side note: While writing the benchmark, I noticed that generateImageID is async when it doesn't need to be. ~~Do you have any idea why that's the case?~~ Scratch that, looking back at the history, it seems like it used to await a promise before the caching PR was introduced. Not a problem, I can fix that.

AdrianGonz97 avatar Apr 29 '24 19:04 AdrianGonz97