PathRef for directories is filesystem-dependent and not reproducible
The hashCode of PathRef returns a value that depends on the file listing order of the file system.
https://github.com/com-lihaoyi/mill/blob/7d9eeac445f3153ef2c600534a40cd185d2ee9fc/main/api/src/mill/api/PathRef.scala#L38-L52
os.walk uses DirectoryStream, which returns files in an unspecified order that depends on the file system.
The elements returned by the iterator are in no specific order. Some file systems maintain special links to the directory itself and the directory's parent directory. Entries representing these links are not returned by the iterator.
As a concrete example, if an out directory is copied across machines in a Linux+XFS environment, the cache will be invalidated because the order in which files are stored changes. Note that ext4 hides this problem because it returns a stable file list.
I think Mill should compute the hashCode over a sorted file list instead of relying on the filesystem-dependent listing order.
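To illustrate the proposal, here is a minimal sketch in plain Java NIO. It is not Mill's PathRef code: the class name `SortedDirHash`, the use of MD5, and the choice to hash relative path plus file contents are all assumptions made for the example. The only point is that the file list is sorted before anything is fed into the digest.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only.
public class SortedDirHash {

    /** Hashes all regular files under root, visiting them in sorted relative-path order. */
    public static int hash(Path root) throws IOException, NoSuchAlgorithmException {
        String sep = root.getFileSystem().getSeparator();
        List<String> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        // Compare as '/'-separated strings so the order does not
                        // depend on the platform-specific Path ordering.
                        .map(p -> root.relativize(p).toString().replace(sep, "/"))
                        .sorted()
                        .collect(Collectors.toList());
        }
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (String rel : files) {
            digest.update(rel.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // separator between path and contents
            digest.update(Files.readAllBytes(root.resolve(rel)));
        }
        return Arrays.hashCode(digest.digest());
    }
}
```

With this, two directories holding the same files give the same hash regardless of the order in which the files were created. The cost is that the whole file list is held in memory for sorting.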
File listing order in XFS
$ touch a
$ touch b
$ ls -f
. .. a b
$ touch b
$ touch a
$ ls -f
. .. b a
File listing order in EXT4
$ touch a
$ touch b
$ ls -f
b . a ..
$ touch b
$ touch a
$ ls -f
b . a ..
I think there is no Java API available that returns a sorted recursive directory stream. So we need to fall back to a self-implemented directory traversal, which may be slower and need more memory. Maybe it could be hidden behind a setting (an env variable?) or enabled based on the detected filesystem type and/or OS?
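For reference, a self-implemented traversal along those lines could look roughly like the sketch below. The class name `SortedWalk` and the visitor callback are hypothetical and not part of Mill or os-lib. It sorts each directory's children by file name before descending, so only the entries of the directories on the current path are buffered at any time. (`Files.walk(root).sorted()` would also give a stable order, but it has to buffer the whole tree before emitting anything.)

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
public class SortedWalk {

    /**
     * Visits every entry below dir depth-first. The children of each
     * directory are sorted by file name, so the visiting order does not
     * depend on the order the file system returns them in.
     */
    public static void walk(Path dir, Consumer<Path> visitor) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path child : stream) {
                children.add(child);
            }
        }
        children.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
        for (Path child : children) {
            visitor.accept(child);
            // Do not follow symlinks, to avoid cycles.
            if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                walk(child, visitor);
            }
        }
    }
}
```

The extra cost compared to a plain DirectoryStream walk is one list and one sort per directory, which is probably acceptable for most source and output trees but should be measured before making it the default.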
As this affects Mill cache correctness and reproducibility, we definitely will accept a PR.
Hmm... could this return different hashes in the same run if the file got pulled in two different ways? (Diamond dependency) If so, that might explain the multi-hash thing that I was experiencing here
@Sailsman63 I doubt that, as the Mill dependency graph is a DAG (https://en.wikipedia.org/wiki/Directed_acyclic_graph) which is calculated once at the start of a Mill evaluation, so targets won't run more than once. The hashCode of PathRef isn't unstable per se, but it might not be reproducible, so it only affects change detection for cached targets. That means we may misinterpret some cache hits as cache misses.
I changed my CI to ext4 to mitigate this problem, and now I am not in trouble. I'll try to create a PR when I have time!