coreutils icon indicating copy to clipboard operation
coreutils copied to clipboard

csplit: rewrite with one function per pattern type

Open jfinkels opened this issue 7 months ago • 7 comments

This pull request fixes several bugs in csplit by rewriting the code to have one function per pattern type. The functions are

  • up_to_line_repeat for patterns like 123 {45}
  • up_to_line_always for 123 {*}
  • up_to_match_pos_offset_repeat for /regex/+123 {45}
  • up_to_match_pos_offset_always for /regex/+123 {*}
  • up_to_match_neg_offset_repeat for /regex/-123 {45}
  • up_to_match_neg_offset_always for /regex/-123 {*}
  • skip_to_match_pos_offset_repeat for %regex%+123 {45}
  • skip_to_match_pos_offset_always for %regex%+123 {*}
  • skip_to_match_neg_offset_repeat for %regex%-123 {45}
  • skip_to_match_neg_offset_always for %regex%-123 {*}

The command-line options, global state, and per-chunk state live in the Splitter struct (which replaces the SplitWriter and InputSplitter structs). There are a ton of ugly corner cases where different types of patterns interact. I've tried to call those out with comments.

I added some tests that were already passing, but were missing test coverage, including

  • same regex pattern twice in a row, as in /foo/ /foo/
  • line number out of range on empty input, as in : | csplit - 1,
  • skip to match with negative offset repeating indefinitely, as in csplit %foo%-5 {*}.

Some bugs fixed include:

  • missing a regex match on the final line of a line number pattern, as in seq 20 | csplit - 10 /10/,
  • missing output 0\n when input is a directory,
  • #7286, a missing empty file with --suppress-matched.

Some bugs remain:

  • The GNU test file tests/csplit/csplit-suppress-matched remains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures.
  • Probably related to the previous item, there is some weird behavior of GNU csplit that I could not understand. This was already partially covered by a test named repeat_everything. I've extracted just the relevant part of that into its own test named test_repeat_line_after_match_suppress_matched that has TODO comments for the behavior that differs from GNU csplit.

Fixes #7286

jfinkels avatar Apr 20 '25 22:04 jfinkels

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Apr 20 '25 23:04 github-actions[bot]

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Apr 21 '25 15:04 github-actions[bot]

It is still in draft mode :)

sylvestre avatar Apr 23 '25 06:04 sylvestre

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Apr 27 '25 03:04 github-actions[bot]

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Apr 27 '25 18:04 github-actions[bot]

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Apr 27 '25 21:04 github-actions[bot]

Sorry, I was hoping to find a way to make this easier to review, but I couldn't. I will come back to this and rebase it

jfinkels avatar Jun 26 '25 00:06 jfinkels

it has been draft for a long time, please reopen when ready

sylvestre avatar Oct 06 '25 20:10 sylvestre