[native_assets_builder] Caching strategy for hook invocations
We need to decide on a robust strategy for caching hook invocations.
Requirements
Some typical use cases and corner cases:
- Modifying some files, running the hook and running the hook again should be a no-op the second time (works in the current implementation).
- Running the hook, and while running the hook modifying some files, should cause the hook to be rerun afterwards (works in the current implementation).
- Reverting files to an older version should cause the hook to rerun (does not work in the current implementation).
Non goal:
- Running a hook, modifying a file, running the hook again, and then reverting the file and running the hook should reuse the results from the first hook. (This would require us to infinitely grow the build cache and is a strategy usually only employed in remote build farms.)
Design options
1. Use dependencies last-modified timestamp and the last-build-started timestamp (current implementation).
This is the strategy used by make and ninja.
Downsides:
- This doesn't work with files changing in a way that leaves their
lastModifiedon a timestamp that is before the moment the last build was run (use case 1).
2. Use dependencies last-modified, and save the last seen timestamp for each dependency
This strategy is used by flutter_tools with FileStoreStrategy.timestamp.
Downsides:
- This doesn't work with files changing in a way that leaves their
lastModifiedon exactly the timestamp that the previous version of the file had.
Flutter uses this only for when dependencies are created by our own tools. For example a package_config.json is the input for generating dart_plugin_registrant.dart.
3. Use file-content hashing, and save the last seen hash for each dependency
This strategy is used by flutter_tools with FileStoreStrategy.timestamp.
Downsides:
- This does not support dynamically discovered dependencies, or requires preventing concurrent modification of input files during build. For example invoking
dart compile kernel --package=package_config.json --output=foo.dill --depfile=dependencies.d foo.dartand modifying one of the files listed as dynamic dependencies independencies.dwhiledart compile kernelis running goes unnoticed.
Most Targets in flutter tools have static dependencies (List<Source> get inputs). I am not entirely sure how dynamic dependencies are handled (List<String> get depfiles).
4. Use file-content hashing, save the last seen hash for each dependency, and error on last-modified during build
This strategy is used by Bazel/Blaze. (Surfaces as file modified during build.)
Prevents the downside of 3 by checking the last-modified timestamps of dependencies to check if they didn't change since the build started.
Downside:
- Doesn't catch reverting a file to an older last-modified timestamp during the build (use case 2 and 3 combined).
5. Use file-content hashing, save the last seen hash for each dependency, and don't cache on last-modified during build (proposed solution)
Solution 4, but don't error, simply don't cache.
Downside:
- Doesn't catch reverting a file to an older last-modified timestamp during the build (use case 2 and 3 combined).
Other solutions?
We could consider having to list all dependencies upfront, but that seems too restrictive. We could consider having to output all the file content hashes from the hooks, but that seems to unfriendly for users. Maybe we say that reverting a file to an older last modified during a build is not supported.
@jonahwilliams You wrote some of the caching in flutter, is my understanding correct? Any suggestions? @mkustermann Any other build systems we should investigate? cc @blaugold @HosseinYousefi
Most Targets in flutter tools have static dependencies (List<Source> get inputs). I am not entirely sure how dynamic dependencies are handled (List<String> get depfiles).
The way this is handled dynamically is that certain targets will output a depfile format during execution, and that depfile contains the set of resolved dependencies. If the depfile is missing, you assume that the target needs to be rerun, if present then the depfile contains the set of all input/output files which lets you discover things dynamically. This is used for the dart dependencies, since we don't want to assume every single dart file as an input.
Thanks so much for your input @jonahwilliams!
Some notes from the discussion:
- Flutter doesn't detect modifications of files during the build, it could potentially (going from strategy 3 to 4). But this is an engineering time vs quality tradeoff.
- Flutter doesn't deal with concurrent builds (xcode/gradle don't allow it anyways). But with
dart runwe do need to care about this. - For hot reload, hashing file contents was too slow. So there last modified timestamps are used (at the cost of some changes not being detected). This is relevant for when we want to add data assets and hot reload support. If on current hardware (an old 2020 laptop or so) asset contents hashing of 100 MB of assets is still too slow, we should use file timestamps.
- If going for file content hashing, also a hashing is needed for directories (for example from the directories listing). We already use last-modified time stamps in our data assets examples, and depending on the directory rather than files also needs to be supported.
Can you defined the metric for "too slow" and the hash you're using on the files contents. We don't need sha1 or any other cryptographic hashes for build triggering; something like XXH3 / XXH64 would perform well.
@jtmcdole when I did a hot reload on a flutter project with a moderate amount of assets (flutter gallery) in 2020 , it took more than a second to hot reload due to re-hashing. That is too slow.
Notes from discussion with @jtmcdole! 🙏
- Option 5 is great, but also log a warning that the build needs to be rerun.
- flutter is using md5 in lib/src/build_system/file_store.dart, which could possibly be replaced with https://pub.dev/packages/xxh3
- If so, we could use the same hashing algo in the native assets builder and also start using it in
package:dartdev.
- If so, we could use the same hashing algo in the native assets builder and also start using it in
@jtmcdole when I did a hot reload on a flutter project with a moderate amount of assets (flutter gallery) in 2020 , it took more than a second to hot reload due to re-hashing. That is too slow.
I'm not disagreeing that was too much. What I'm looking for is an actual measuring stick so we could validate and make informed decisions. How many files? What sizes?
@dcharkes - I just wrote a silly hash_it program in dart using md5,sha1,sha256, and that xxh4 package. I ran it over 1000 files of varying size from 0 to 32KB in size (totallying 18MB)
for i in {1..1000}; do head -c ${(l:6::0:)$((${RANDOM}))} /dev/urandom > "$i.dat"; done
I chose to only await all the files data after their bytes has been read, i.e. await Future.wait(filereadFutures)) and then process each set of bytes one by one; I got dart-xxh3 == sha1 in speed. md5 faster than sha1, and sha256 the slowest.
They took 90ms to 140ms
If you await the file read; they are all slower, but the same results: They took 125 to 175ms.
What I have no tried: using ffi to perform xxh3 to compare speeds. Instead I just did some cli magic:
time find -type f | xargs cat | md5
time find -type f | xargs cat | sha1sum
time find -type f | xargs cat | sha256sum
time find -type f | xargs cat | xxhsum -H3
md5: 34ms sha1: 24ms sha256: 52ms xxh3: 20ms
ffi would probably be better than pure dart.
Going with a bigger fileset (random*10 ~= 164MB):
native: xxh3: 60ms sha1: 200ms md5: 270ms sha256: 400ms
dart: md5: 680ms xxh3: 860ms sha1: 920ms sha256: 1245ms
The big wins are "use the native libraries"
Thanks for the exploration @jtmcdole!
Using native libraries in dartdev/flutter_tools will require a bundling strategy for native libraries. Currently, native assets cannot yet be used by flutter_tools and dartdev itself. The easiest alternative would be to bake in xxh3 into dart:convert. However, adding something to a standard library as a temporary solution seems overkill. Another alternative would be to expose some symbol in the Dart executable and look it up with DynamicLibrary.executable(). But that's very much a hack which we wouldn't want users to start copying.
Maybe it's better to temporarily go with the fasted hashing in Dart (md5) and swap to xxh3 in a later stage when native assets can be used inside the Dart/Flutter commandline tooling.
The big wins are "use the native libraries"
Although native can be faster, especially if vectorized, I'm surprised by a 10x difference. Maybe our compiler is doing bad job here. /cc @mraleph
@jtmcdole do you have your benchmark somewhere?
FWIW with a simple fix to the package code XXH3 goes from 8ns/byte on my machine to ~0.6ns/byte in AOT mode. In JIT mode this reveals an ineffecient compilation of the x << 64 - if I fix that I get around 0.7ns/byte.
Ill post it here (it's the dumbest of benches).
~What was the change to get those gains?!~ - was on my phone; not on laptop. Ohhhhh.
- int dataVal = readLE64(input, inputOffset + (8 * i));
+ int dataVal = readLE64(input, inputOffset + (i * 8));
Me in college: "don't try to outsmart the compiler". How does swapping these two make a difference?
@jtmcdole the bulk of the improvement is due to readLE64 change: avoiding repeated ByteData.bytelistView factory call. I think compiler should do a better job removing this allocation but for now I just refactored the code itself. The i * 8 change was not for performance, just to align it with the next line. (The compiler will actually automatically flip 8*i to i*8 so it does not matter how you write it).
import 'dart:io';
import 'dart:typed_data';
import 'package:crypto/crypto.dart';
import 'package:xxh3/xxh3.dart';
import 'package:args/args.dart';
const String version = '0.0.1';
ArgParser buildParser() {
return ArgParser()
..addFlag(
'help',
abbr: 'h',
negatable: false,
help: 'Print this usage information.',
)
..addOption(
'hash',
defaultsTo: 'xxh',
allowed: ['xxh', 'md5', 'sha1', 'sha256'],
)
;
}
void printUsage(ArgParser argParser) {
print('Usage: dart hash_it.dart <flags> [arguments]');
print(argParser.usage);
}
void main(List<String> arguments) async {
final ArgParser argParser = buildParser();
late final String hashOption;
try {
final ArgResults results = argParser.parse(arguments);
// Process the parsed arguments.
if (results.wasParsed('help')) {
printUsage(argParser);
return;
}
hashOption = results['hash'];
} on FormatException catch (e) {
// Print usage information if an invalid argument was provided.
print(e.message);
print('');
printUsage(argParser);
return;
}
final watch = Stopwatch()..start();
Function(List<int> bytes) hasher;
switch (hashOption) {
case 'sha1':
hasher = (bytes) {
var digest = sha1.convert(bytes);
return digest.bytes[1];
};
case 'sha256':
hasher = (bytes) {
var digest = sha256.convert(bytes);
return digest.bytes[1];
};
case 'md5':
hasher = (bytes) {
var digest = md5.convert(bytes);
return digest.bytes[1];
};
case 'xxh':
default:
hasher = (bytes) => xxh3(bytes as Uint8List);
}
var dir = Directory('data');
if (true) {
var futs = <Future<Uint8List>>[];
await for (var fileEnt in dir.list()) {
if (fileEnt is File) {
futs.add(fileEnt.readAsBytes());
}
}
final results = await Future.wait(futs);
final hashs = List.filled(results.length, 0);
int counter = 0;
for (var result in results) {
final hash = hasher(result);
hashs[counter++] = hash;
}
} else {
final hashes = [];
await for (var fileEnt in dir.list()) {
if (fileEnt is File) {
final bytes = await fileEnt.readAsBytes();
final hash = hasher(bytes);
hashes.add(hash);
}
}
}
// print(hashs);
print('done: ${watch.elapsed}');
}
@mraleph - do we want to put up a friendly PR to the xxh3 library?
dart + patches:
sha1: 900ms xxh3: 100ms
Well that's a win.
@mraleph - do we want to put up a friendly PR to the xxh3 library?
https://github.com/SamJakob/xxh3/pull/4
Version 1.1.0 just got published with some other optimizations as well: https://pub.dev/packages/xxh3