native [native_assets_builder] Caching strategy for hook invocations

We need to decide on a robust strategy for caching hook invocations.

Requirements

Some typical use cases and corner cases:

Modifying some files, running the hook and running the hook again should be a no-op the second time (works in the current implementation).
Running the hook, and while running the hook modifying some files, should cause the hook to be rerun afterwards (works in the current implementation).
Reverting files to an older version should cause the hook to rerun (does not work in the current implementation).

Non goal:

Running a hook, modifying a file, running the hook again, and then reverting the file and running the hook should reuse the results from the first hook. (This would require us to infinitely grow the build cache and is a strategy usually only employed in remote build farms.)

Design options

1. Use dependencies last-modified timestamp and the last-build-started timestamp (current implementation).

This is the strategy used by make and ninja.

Downsides:

This doesn't work with files changing in a way that leaves their lastModified on a timestamp that is before the moment the last build was run (use case 1).

2. Use dependencies last-modified, and save the last seen timestamp for each dependency

This strategy is used by flutter_tools with FileStoreStrategy.timestamp.

Downsides:

This doesn't work with files changing in a way that leaves their lastModified on exactly the timestamp that the previous version of the file had.

Flutter uses this only for when dependencies are created by our own tools. For example a package_config.json is the input for generating dart_plugin_registrant.dart.

3. Use file-content hashing, and save the last seen hash for each dependency

This strategy is used by flutter_tools with FileStoreStrategy.timestamp.

Downsides:

This does not support dynamically discovered dependencies, or requires preventing concurrent modification of input files during build. For example invoking dart compile kernel --package=package_config.json --output=foo.dill --depfile=dependencies.d foo.dart and modifying one of the files listed as dynamic dependencies in dependencies.d while dart compile kernel is running goes unnoticed.

Most Targets in flutter tools have static dependencies (List<Source> get inputs). I am not entirely sure how dynamic dependencies are handled (List<String> get depfiles).

4. Use file-content hashing, save the last seen hash for each dependency, and error on last-modified during build

This strategy is used by Bazel/Blaze. (Surfaces as file modified during build.)

Prevents the downside of 3 by checking the last-modified timestamps of dependencies to check if they didn't change since the build started.

Downside:

Doesn't catch reverting a file to an older last-modified timestamp during the build (use case 2 and 3 combined).

5. Use file-content hashing, save the last seen hash for each dependency, and don't cache on last-modified during build (proposed solution)

Solution 4, but don't error, simply don't cache.

Downside:

Doesn't catch reverting a file to an older last-modified timestamp during the build (use case 2 and 3 combined).

Other solutions?

We could consider having to list all dependencies upfront, but that seems too restrictive. We could consider having to output all the file content hashes from the hooks, but that seems to unfriendly for users. Maybe we say that reverting a file to an older last modified during a build is not supported.

@jonahwilliams You wrote some of the caching in flutter, is my understanding correct? Any suggestions? @mkustermann Any other build systems we should investigate? cc @blaugold @HosseinYousefi

Sep 24 '24 15:09 dcharkes

Most Targets in flutter tools have static dependencies (List<Source> get inputs). I am not entirely sure how dynamic dependencies are handled (List<String> get depfiles).

The way this is handled dynamically is that certain targets will output a depfile format during execution, and that depfile contains the set of resolved dependencies. If the depfile is missing, you assume that the target needs to be rerun, if present then the depfile contains the set of all input/output files which lets you discover things dynamically. This is used for the dart dependencies, since we don't want to assume every single dart file as an input.

Sep 24 '24 15:09 jonahwilliams

Thanks so much for your input @jonahwilliams!

Some notes from the discussion:

Flutter doesn't detect modifications of files during the build, it could potentially (going from strategy 3 to 4). But this is an engineering time vs quality tradeoff.
Flutter doesn't deal with concurrent builds (xcode/gradle don't allow it anyways). But with dart run we do need to care about this.
For hot reload, hashing file contents was too slow. So there last modified timestamps are used (at the cost of some changes not being detected). This is relevant for when we want to add data assets and hot reload support. If on current hardware (an old 2020 laptop or so) asset contents hashing of 100 MB of assets is still too slow, we should use file timestamps.
If going for file content hashing, also a hashing is needed for directories (for example from the directories listing). We already use last-modified time stamps in our data assets examples, and depending on the directory rather than files also needs to be supported.

Sep 24 '24 17:09 dcharkes

Can you defined the metric for "too slow" and the hash you're using on the files contents. We don't need sha1 or any other cryptographic hashes for build triggering; something like XXH3 / XXH64 would perform well.

Sep 25 '24 18:09 jtmcdole

@jtmcdole when I did a hot reload on a flutter project with a moderate amount of assets (flutter gallery) in 2020 , it took more than a second to hot reload due to re-hashing. That is too slow.

Sep 25 '24 18:09 jonahwilliams

Notes from discussion with @jtmcdole! 🙏

Option 5 is great, but also log a warning that the build needs to be rerun.
flutter is using md5 in lib/src/build_system/file_store.dart, which could possibly be replaced with https://pub.dev/packages/xxh3
- If so, we could use the same hashing algo in the native assets builder and also start using it in package:dartdev.

Sep 25 '24 19:09 dcharkes

@jtmcdole when I did a hot reload on a flutter project with a moderate amount of assets (flutter gallery) in 2020 , it took more than a second to hot reload due to re-hashing. That is too slow.

I'm not disagreeing that was too much. What I'm looking for is an actual measuring stick so we could validate and make informed decisions. How many files? What sizes?

@dcharkes - I just wrote a silly hash_it program in dart using md5,sha1,sha256, and that xxh4 package. I ran it over 1000 files of varying size from 0 to 32KB in size (totallying 18MB)

for i in {1..1000}; do head -c  ${(l:6::0:)$((${RANDOM}))} /dev/urandom > "$i.dat"; done

I chose to only await all the files data after their bytes has been read, i.e. await Future.wait(filereadFutures)) and then process each set of bytes one by one; I got dart-xxh3 == sha1 in speed. md5 faster than sha1, and sha256 the slowest.

They took 90ms to 140ms

If you await the file read; they are all slower, but the same results: They took 125 to 175ms.

What I have no tried: using ffi to perform xxh3 to compare speeds. Instead I just did some cli magic:

time find -type f | xargs cat | md5
time find -type f | xargs cat | sha1sum
time find -type f | xargs cat | sha256sum
time find -type f | xargs cat | xxhsum -H3

md5: 34ms sha1: 24ms sha256: 52ms xxh3: 20ms

ffi would probably be better than pure dart.

Sep 25 '24 20:09 jtmcdole

Going with a bigger fileset (random*10 ~= 164MB):

native: xxh3: 60ms sha1: 200ms md5: 270ms sha256: 400ms

dart: md5: 680ms xxh3: 860ms sha1: 920ms sha256: 1245ms

The big wins are "use the native libraries"

Sep 25 '24 21:09 jtmcdole

Thanks for the exploration @jtmcdole!

Using native libraries in dartdev/flutter_tools will require a bundling strategy for native libraries. Currently, native assets cannot yet be used by flutter_tools and dartdev itself. The easiest alternative would be to bake in xxh3 into dart:convert. However, adding something to a standard library as a temporary solution seems overkill. Another alternative would be to expose some symbol in the Dart executable and look it up with DynamicLibrary.executable(). But that's very much a hack which we wouldn't want users to start copying.

Maybe it's better to temporarily go with the fasted hashing in Dart (md5) and swap to xxh3 in a later stage when native assets can be used inside the Dart/Flutter commandline tooling.

Sep 26 '24 06:09 dcharkes

The big wins are "use the native libraries"

Although native can be faster, especially if vectorized, I'm surprised by a 10x difference. Maybe our compiler is doing bad job here. /cc @mraleph

Sep 26 '24 07:09 mkustermann

@jtmcdole do you have your benchmark somewhere?

Sep 26 '24 07:09 mraleph

FWIW with a simple fix to the package code XXH3 goes from 8ns/byte on my machine to ~0.6ns/byte in AOT mode. In JIT mode this reveals an ineffecient compilation of the x << 64 - if I fix that I get around 0.7ns/byte.

Sep 26 '24 08:09 mraleph

Ill post it here (it's the dumbest of benches).

~What was the change to get those gains?!~ - was on my phone; not on laptop. Ohhhhh.

Sep 26 '24 13:09 jtmcdole

-    int dataVal = readLE64(input, inputOffset + (8 * i));
+    int dataVal = readLE64(input, inputOffset + (i * 8));

Me in college: "don't try to outsmart the compiler". How does swapping these two make a difference?

Sep 26 '24 13:09 jtmcdole

@jtmcdole the bulk of the improvement is due to readLE64 change: avoiding repeated ByteData.bytelistView factory call. I think compiler should do a better job removing this allocation but for now I just refactored the code itself. The i * 8 change was not for performance, just to align it with the next line. (The compiler will actually automatically flip 8*i to i*8 so it does not matter how you write it).

Sep 26 '24 15:09 mraleph

import 'dart:io';
import 'dart:typed_data';
import 'package:crypto/crypto.dart';
import 'package:xxh3/xxh3.dart';
import 'package:args/args.dart';

const String version = '0.0.1';

ArgParser buildParser() {
  return ArgParser()
    ..addFlag(
      'help',
      abbr: 'h',
      negatable: false,
      help: 'Print this usage information.',
    )
    ..addOption(
      'hash',
      defaultsTo: 'xxh',
      allowed: ['xxh', 'md5', 'sha1', 'sha256'],
    )
;
}

void printUsage(ArgParser argParser) {
  print('Usage: dart hash_it.dart <flags> [arguments]');
  print(argParser.usage);
}

void main(List<String> arguments) async {
  final ArgParser argParser = buildParser();

  late final String hashOption;
  try {
    final ArgResults results = argParser.parse(arguments);

    // Process the parsed arguments.
    if (results.wasParsed('help')) {
      printUsage(argParser);
      return;
    }
    hashOption = results['hash'];
  } on FormatException catch (e) {
    // Print usage information if an invalid argument was provided.
    print(e.message);
    print('');
    printUsage(argParser);
    return;
  }

  final watch = Stopwatch()..start();

  Function(List<int> bytes) hasher;
  switch (hashOption) {
    case 'sha1':
      hasher = (bytes) {
        var digest = sha1.convert(bytes);
        return digest.bytes[1];
      };
    case 'sha256':
      hasher = (bytes) {
        var digest = sha256.convert(bytes);
        return digest.bytes[1];
      };
    case 'md5':
      hasher = (bytes) {
        var digest = md5.convert(bytes);
        return digest.bytes[1];
      };
    case 'xxh':
    default:
      hasher = (bytes) => xxh3(bytes as Uint8List);
  }

  var dir = Directory('data');
  if (true) {
    var futs = <Future<Uint8List>>[];
    await for (var fileEnt in dir.list()) {
      if (fileEnt is File) {
        futs.add(fileEnt.readAsBytes());
      }
    }

    final results = await Future.wait(futs);
    final hashs = List.filled(results.length, 0);
    int counter = 0;

    for (var result in results) {
      final hash = hasher(result);
      hashs[counter++] = hash;
    }
  } else {
    final hashes = [];

    await for (var fileEnt in dir.list()) {
      if (fileEnt is File) {
        final bytes = await fileEnt.readAsBytes();
        final hash = hasher(bytes);
        hashes.add(hash);
      }
    }
  }
  // print(hashs);
  print('done: ${watch.elapsed}');
}

Sep 26 '24 16:09 jtmcdole

@mraleph - do we want to put up a friendly PR to the xxh3 library?

Sep 26 '24 16:09 jtmcdole

dart + patches:

sha1: 900ms xxh3: 100ms

Well that's a win.

Sep 26 '24 16:09 jtmcdole

@mraleph - do we want to put up a friendly PR to the xxh3 library?

https://github.com/SamJakob/xxh3/pull/4

Sep 27 '24 08:09 mraleph

Version 1.1.0 just got published with some other optimizations as well: https://pub.dev/packages/xxh3

Sep 29 '24 13:09 jtmcdole