coreutils icon indicating copy to clipboard operation
coreutils copied to clipboard

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize

Open blyxxyz opened this issue 8 months ago • 5 comments

  • Add a --random-seed option that allows using a string as a seed for deterministic output. For example: shuf --random-seed=🦀 -i1-10.

    This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever --random-seed is passed. Details and motivation: https://github.com/uutils/coreutils/blob/e89aaf85b22867be86eb020016fae8032e7ab6b5/src/uu/shuf/src/random_seed.rs#L12-L61

    Resolves https://github.com/uutils/coreutils/issues/2528.

    I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:

    --random-seed <STRING> seed with STRING for reproducible output

  • Make the behavior of --random-source mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a #[ignore] test.)

  • Use u64 instead of usize for --input-range and --head-count. It's now possible to generate integers above u32::MAX on 32-bit systems.

  • Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to seq | shuf. (The performance is practically the same.)

  • Handle --random-source I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.

  • Flush the BufWriter at the end. If we don't do this then write errors get ignored.

  • Use a larger write buffer (8K → 64K).

  • Use the itoa crate to format integers. With --repeat and --input-range this was a bottleneck.

I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.

blyxxyz avatar Mar 26 '25 18:03 blyxxyz

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Mar 26 '25 19:03 github-actions[bot]

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Mar 26 '25 20:03 github-actions[bot]

Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test) thanks

sylvestre avatar Mar 27 '25 11:03 sylvestre

Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.

blyxxyz avatar Mar 27 '25 19:03 blyxxyz

sorry, i was confusing with shred :( sorry

sylvestre avatar Mar 30 '25 09:03 sylvestre

This MR has conflicts. @blyxxyz are you available to update the patch ?

RenjiSann avatar Nov 14 '25 23:11 RenjiSann

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Nov 15 '25 08:11 github-actions[bot]

Done, thanks for the heads up! I've also turned the new strings into translation keys.

I think the CI failure is spurious/unrelated?

blyxxyz avatar Nov 15 '25 08:11 blyxxyz

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Nov 18 '25 07:11 github-actions[bot]

CodSpeed Performance Report

Merging #7585 will degrade performances by 15.07%

Comparing blyxxyz:clean-shuf-2 (ed5ae0d) with main (67c7612)

Summary

⚡ 1 improvement
❌ 2 regressions
✅ 123 untouched
⏩ 6 skipped[^skipped]

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
shuf_input_range[1000000] 229.1 ms 124.8 ms +83.65%
shuf_lines[100000] 26.9 ms 31.7 ms -15.07%
shuf_repeat_sampling[50000] 4.6 ms 5.1 ms -9.94%
[^skipped]: 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

codspeed-hq[bot] avatar Nov 18 '25 08:11 codspeed-hq[bot]

cargo-deny: https://github.com/uutils/coreutils/pull/9322

sylvestre avatar Nov 18 '25 08:11 sylvestre

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Nov 18 '25 08:11 github-actions[bot]