coreutils
coreutils copied to clipboard
shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize
-
Add a
--random-seedoption that allows using a string as a seed for deterministic output. For example:shuf --random-seed=🦀 -i1-10.This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever
--random-seedis passed. Details and motivation: https://github.com/uutils/coreutils/blob/e89aaf85b22867be86eb020016fae8032e7ab6b5/src/uu/shuf/src/random_seed.rs#L12-L61Resolves https://github.com/uutils/coreutils/issues/2528.
I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:
--random-seed <STRING> seed with STRING for reproducible output -
Make the behavior of
--random-sourcemostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a#[ignore]test.) -
Use
u64instead ofusizefor--input-rangeand--head-count. It's now possible to generate integers aboveu32::MAXon 32-bit systems. -
Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to
seq | shuf. (The performance is practically the same.) -
Handle
--random-sourceI/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally. -
Flush the
BufWriterat the end. If we don't do this then write errors get ignored. -
Use a larger write buffer (8K → 64K).
-
Use the
itoacrate to format integers. With--repeatand--input-rangethis was a bottleneck.
I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.
GNU testsuite comparison:
Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
GNU testsuite comparison:
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test) thanks
Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.
sorry, i was confusing with shred :( sorry
This MR has conflicts. @blyxxyz are you available to update the patch ?
GNU testsuite comparison:
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Done, thanks for the heads up! I've also turned the new strings into translation keys.
I think the CI failure is spurious/unrelated?
GNU testsuite comparison:
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
CodSpeed Performance Report
Merging #7585 will degrade performances by 15.07%
Comparing blyxxyz:clean-shuf-2 (ed5ae0d) with main (67c7612)
Summary
⚡ 1 improvement
❌ 2 regressions
✅ 123 untouched
⏩ 6 skipped[^skipped]
:warning: Please fix the performance issues or acknowledge them on CodSpeed.
Benchmarks breakdown
| Benchmark | BASE |
HEAD |
Change | |
|---|---|---|---|---|
| ⚡ | shuf_input_range[1000000] |
229.1 ms | 124.8 ms | +83.65% |
| ❌ | shuf_lines[100000] |
26.9 ms | 31.7 ms | -15.07% |
| ❌ | shuf_repeat_sampling[50000] |
4.6 ms | 5.1 ms | -9.94% |
| [^skipped]: 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. |
cargo-deny: https://github.com/uutils/coreutils/pull/9322
GNU testsuite comparison:
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)