Split retain delimiter

Open Zoybean opened this issue 1 year ago • 4 comments

Related problem

I have some data interleaved with headings like this:

2024-01-13
event bla
event foo
event bar
2024-01-14
event baz
2024-01-15

And I would like to split it into sub-tables by heading, like so:

╭───┬────────────┬────────────────╮
│ # │    date    │    entries     │
├───┼────────────┼────────────────┤
│ 0 │ 2024-01-13 │ ╭───┬─────╮    │
│   │            │ │ 0 │ bla │    │
│   │            │ │ 1 │ foo │    │
│   │            │ │ 2 │ bar │    │
│   │            │ ╰───┴─────╯    │
│ 1 │ 2024-01-14 │ ╭───┬─────╮    │
│   │            │ │ 0 │ baz │    │
│   │            │ ╰───┴─────╯    │
│ 2 │ 2024-01-15 │ [list 0 items] │
╰───┴────────────┴────────────────╯

However, as far as I am aware, the only ways to split by a delimiting pattern also discard the matches of that pattern. e.g. `

Describe the solution you'd like

I envision this being solved by a command that splits by a delimiter but also retains that delimiter, like so:

> my-cmd | split-include --regex "^\d+-\d+-\d+$"
╭───┬────────────┬────────────────╮
│ # │ delimiter  │     items      │
├───┼────────────┼────────────────┤
│ 0 │ 2024-01-13 │ ╭───┬─────╮    │
│   │            │ │ 0 │ bla │    │
│   │            │ │ 1 │ foo │    │
│   │            │ │ 2 │ bar │    │
│   │            │ ╰───┴─────╯    │
│ 1 │ 2024-01-14 │ ╭───┬─────╮    │
│   │            │ │ 0 │ baz │    │
│   │            │ ╰───┴─────╯    │
│ 2 │ 2024-01-15 │ [list 0 items] │
╰───┴────────────┴────────────────╯

This would allow me to retain the date information in my log while also meaningfully dividing the log by date.

Describe alternatives you've considered

It would still work, though less elegantly in my case, if the proposed split-include function did not separate the delimiter from the items in the output, e.g.:

> my-cmd | split-include --regex "^\d+-\d+-\d+$"
╭───┬────────────────────╮
│ 0 │ ╭───┬────────────╮ │
│   │ │ 0 │ 2024-01-13 │ │
│   │ │ 1 │ bla        │ │
│   │ │ 2 │ foo        │ │
│   │ │ 3 │ bar        │ │
│   │ ╰───┴────────────╯ │
│ 1 │ ╭───┬────────────╮ │
│   │ │ 0 │ 2024-01-14 │ │
│   │ │ 1 │ baz        │ │
│   │ ╰───┴────────────╯ │
│ 2 │ ╭───┬────────────╮ │
│   │ │ 0 │ 2024-01-15 │ │
│   │ ╰───┴────────────╯ │
╰───┴────────────────────╯

This would require more work on my end to subsequently parse the first item of each list, but it would still be doable.

Another alternative is to do parsing in the same command that splits, e.g.:

> my-cmd | split-parse "{year}-{month}-{day}"
╭───┬────────────────────────────┬────────────────╮
│ # │         delimiter          │     items      │
├───┼────────────────────────────┼────────────────┤
│ 0 │ ╭───┬──────┬───────┬─────╮ │ ╭───┬─────╮    │
│   │ │ # │ year │ month │ day │ │ │ 0 │ bla │    │
│   │ ├───┼──────┼───────┼─────┤ │ │ 1 │ foo │    │
│   │ │ 0 │ 2024 │ 01    │ 13  │ │ │ 2 │ bar │    │
│   │ ╰───┴──────┴───────┴─────╯ │ ╰───┴─────╯    │
│ 1 │ ╭───┬──────┬───────┬─────╮ │ ╭───┬─────╮    │
│   │ │ # │ year │ month │ day │ │ │ 0 │ baz │    │
│   │ ├───┼──────┼───────┼─────┤ │ ╰───┴─────╯    │
│   │ │ 0 │ 2024 │ 01    │ 14  │ │                │
│   │ ╰───┴──────┴───────┴─────╯ │                │
│ 2 │ ╭───┬──────┬───────┬─────╮ │ [list 0 items] │
│   │ │ # │ year │ month │ day │ │                │
│   │ ├───┼──────┼───────┼─────┤ │                │
│   │ │ 0 │ 2024 │ 01    │ 15  │ │                │
│   │ ╰───┴──────┴───────┴─────╯ │                │
╰───┴────────────────────────────┴────────────────╯

But I think this is an unnecessary complication: I'm not sure that parsing will always be necessary, and when it is, it can easily be done afterwards in the version I proposed above.

Finally, I am not sure if it would be better to operate on strings, on lists, or to somehow accept either. That is, whether the invocation would end up looking like:

my-cmd | split-include ...
my-cmd | lines | split-include ...
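
For the "accept either" option, one possible approach is to normalise the input before splitting. This is only a rough sketch: `to-lines` is a hypothetical helper name, not an existing command, and it assumes the input is either a single string or already a list of strings.

```nu
# Hypothetical helper: turn a string into a list of lines,
# and pass a list through unchanged.
def to-lines [] {
  let input = $in
  if (($input | describe) == "string") {
    $input | lines
  } else {
    $input
  }
}
```

With something like this at the start of the command body, both `my-cmd | split-include ...` and `my-cmd | lines | split-include ...` could feed the same splitting logic.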

Additional context and details

No response

Zoybean avatar Feb 02 '24 08:02 Zoybean

This is a little closer, though not exactly what you're looking for. You could probably manipulate these results further to get the output you want.

❯ open test.txt | parse --regex '((?P<date>\d+-\d+-\d+\n)|event(?P<event>\s+\w+\n))' | reject capture0 | str trim
╭─#─┬────date────┬─event─╮
│ 0 │ 2024-01-13 │       │
│ 1 │            │ bla   │
│ 2 │            │ foo   │
│ 3 │            │ bar   │
│ 4 │ 2024-01-14 │       │
│ 5 │            │ baz   │
│ 6 │ 2024-01-15 │       │
╰─#─┴────date────┴─event─╯

fdncred avatar Feb 02 '24 12:02 fdncred

Running your command gives the wrong output (assuming test.txt contains my initial dummy dataset), but the following gives your output:

> open test.txt | lines | parse --regex '^(?:(?P<date>\d+-\d+-\d+)|event\s+(?P<event>\w+))$'

Working forward from the output you showed above, and assuming the lack of a split-include function, I find myself wanting to scan over it like so:

> $your_output | scan --init null {|it, acc| if ($it.date != null) {$acc = $it.date; null} else {{date: $acc, event: $it.event}}}
╭─#─┬────date────┬─event─╮
│ 0 │ 2024-01-13 │ bla   │
│ 1 │ 2024-01-13 │ foo   │
│ 2 │ 2024-01-13 │ bar   │
│ 4 │ 2024-01-14 │ baz   │
╰─#─┴────date────┴─event─╯

so that I can then run group-by to get something more useful:

> $scanned | select date event | group-by --to-table date | each {reject items.date}
╭─#─┬───group────┬─────items─────╮
│ 0 │ 2024-01-13 │ ╭─#─┬─event─╮ │
│   │            │ │ 0 │ bla   │ │
│   │            │ │ 1 │ foo   │ │
│   │            │ │ 2 │ bar   │ │
│   │            │ ╰─#─┴─event─╯ │
│ 1 │ 2024-01-14 │ ╭─#─┬─event─╮ │
│   │            │ │ 0 │ baz   │ │
│   │            │ ╰─#─┴─event─╯ │
╰─#─┴───group────┴─────items─────╯

which is not too far from the desired output:

╭─#─┬────date────┬────entries─────╮
│ 0 │ 2024-01-13 │ ╭───┬─────╮    │
│   │            │ │ 0 │ bla │    │
│   │            │ │ 1 │ foo │    │
│   │            │ │ 2 │ bar │    │
│   │            │ ╰───┴─────╯    │
│ 1 │ 2024-01-14 │ ╭───┬─────╮    │
│   │            │ │ 0 │ baz │    │
│   │            │ ╰───┴─────╯    │
│ 2 │ 2024-01-15 │ [list 0 items] │
╰─#─┴────date────┴────entries─────╯

The main downside is that the output now lacks the row with the empty list of entries. The other downside is that I don't know of a scan function in nu, though I'm sure it can be built in terms of reduce, so that shouldn't be a significant barrier.
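
A minimal sketch of that scan-like step built on `reduce`, assuming the parsed table uses empty strings (rather than null) for the capture group that did not match. The `$parsed` table here is a hand-written stand-in for the output above, and the accumulator layout is my own choice for illustration.

```nu
# Stand-in for the parsed table; empty strings mark unmatched captures.
let parsed = [
  {date: "2024-01-13", event: ""}
  {date: "", event: "bla"}
  {date: "", event: "foo"}
  {date: "", event: "bar"}
  {date: "2024-01-14", event: ""}
  {date: "", event: "baz"}
  {date: "2024-01-15", event: ""}
]

# Carry the most recent date forward and emit one row per event.
let scanned = (
  $parsed | reduce --fold {date: "", rows: []} {|it, acc|
    if ($it.date | is-empty) {
      # Event row: tag it with the date seen most recently.
      {date: $acc.date, rows: ($acc.rows | append {date: $acc.date, event: $it.event})}
    } else {
      # Heading row: remember the date, emit nothing.
      {date: $it.date, rows: $acc.rows}
    }
  } | get rows
)
```

As with the hypothetical `scan`, a date with no events (2024-01-15 here) never produces a row, so it is still missing from `$scanned`.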

Zoybean avatar Feb 03 '24 05:02 Zoybean

I managed to do it with regex alone, and got exactly the output I wanted (modulo simple string ops), but it's not a pretty regex:

> open test.txt | parse -r '(?<date>\d{4}-\d{2}-\d{2})(?:\s+(?<events>(?:event .*\s?)*))?' | update events {|r| $r.events | lines}
╭─#─┬────date────┬──────events───────╮
│ 0 │ 2024-01-13 │ ╭───┬───────────╮ │
│   │            │ │ 0 │ event bla │ │
│   │            │ │ 1 │ event foo │ │
│   │            │ │ 2 │ event bar │ │
│   │            │ ╰───┴───────────╯ │
│ 1 │ 2024-01-14 │ ╭───┬───────────╮ │
│   │            │ │ 0 │ event baz │ │
│   │            │ ╰───┴───────────╯ │
│ 2 │ 2024-01-15 │ [list 0 items]    │
╰─#─┴────date────┴──────events───────╯

Based on this solution, I have defined a rudimentary split-include function that works for my use-case:

def split-include [delim] {
  parse -r $"\(?<delim>($delim))\(?:\\s+\(?<rest>\(?:\(?!($delim)).*\\s?)*))?"
}

Downsides I've identified so far:

  • it only works for regex, not for the default parse syntax
  • you can't use ^ or $ to match line bounds in the delimiter, as it is forced to operate on the whole string at once, not on a per-line basis

Zoybean avatar Feb 04 '24 14:02 Zoybean

I think my current ideal solution would be to have 2 basic versions:

  • a str split-include like what I defined above (but providing --regex as an option, not as the only option)
  • a split-include that operates on tabular data and accepts a closure to determine a delimiting row

The latter form could do the necessary transformation from the simple regex-parsed table e.g.

> $your_output
╭─#─┬────date────┬─event─╮
│ 0 │ 2024-01-13 │       │
│ 1 │            │ bla   │
│ 2 │            │ foo   │
│ 3 │            │ bar   │
│ 4 │ 2024-01-14 │       │
│ 5 │            │ baz   │
│ 6 │ 2024-01-15 │       │
╰─#─┴────date────┴─event─╯
> $your_output | split-include {|r| $r.date != null}
╭─#─┬─────────delim──────────┬─────────rest─────────╮
│ 0 │ ╭───────┬────────────╮ │ ╭─#─┬─date─┬─event─╮ │
│   │ │ date  │ 2024-01-13 │ │ │ 0 │      │ bla   │ │
│   │ │ event │            │ │ │ 1 │      │ foo   │ │
│   │ ╰───────┴────────────╯ │ │ 2 │      │ bar   │ │
│   │                        │ ╰─#─┴─date─┴─event─╯ │
│ 1 │ ╭───────┬────────────╮ │ ╭─#─┬─date─┬─event─╮ │
│   │ │ date  │ 2024-01-14 │ │ │ 0 │      │ baz   │ │
│   │ │ event │            │ │ ╰─#─┴─date─┴─event─╯ │
│   │ ╰───────┴────────────╯ │                      │
│ 2 │ ╭───────┬────────────╮ │ [list 0 items]       │
│   │ │ date  │ 2024-01-15 │ │                      │
│   │ │ event │            │ │                      │
│   │ ╰───────┴────────────╯ │                      │
╰─#─┴─────────delim──────────┴─────────rest─────────╯
> $your_output | split-include {|r| $r.date != null} | select delim.date rest.event
╭─#─┬─delim_date─┬───rest_event───╮
│ 0 │ 2024-01-13 │ ╭───┬─────╮    │
│   │            │ │ 0 │ bla │    │
│   │            │ │ 1 │ foo │    │
│   │            │ │ 2 │ bar │    │
│   │            │ ╰───┴─────╯    │
│ 1 │ 2024-01-14 │ ╭───┬─────╮    │
│   │            │ │ 0 │ baz │    │
│   │            │ ╰───┴─────╯    │
│ 2 │ 2024-01-15 │ [list 0 items] │
╰─#─┴─delim_date─┴───rest_event───╯

I have no idea how I would go about implementing this tabular split-include function, and it honestly seems poorly suited to this example. But I'm thinking there may be a use for it.
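
For what it's worth, a rough sketch of the tabular form seems possible with `reduce`. Everything here is an assumption for illustration: the command name, the `delim`/`rest` column names, and the choice to drop rows that appear before the first delimiter. It also tests for empty strings rather than null, since that is what my hand-written stand-in table contains.

```nu
# Hypothetical tabular split-include: group rows under the most recent
# row for which the closure returns true.
def split-include [is_delim: closure] {
  reduce --fold [] {|row, acc|
    if (do $is_delim $row) {
      # Delimiter row: start a new group with no entries yet.
      $acc | append {delim: $row, rest: []}
    } else if ($acc | is-empty) {
      # No delimiter seen yet: drop the row.
      $acc
    } else {
      # Ordinary row: add it to the group opened last.
      let last = $acc | last
      $acc | drop 1 | append {delim: $last.delim, rest: ($last.rest | append $row)}
    }
  }
}

# Stand-in for the regex-parsed table.
let parsed = [
  {date: "2024-01-13", event: ""}
  {date: "", event: "bla"}
  {date: "", event: "foo"}
  {date: "", event: "bar"}
  {date: "2024-01-14", event: ""}
  {date: "", event: "baz"}
  {date: "2024-01-15", event: ""}
]

let grouped = ($parsed | split-include {|r| not ($r.date | is-empty)})

# Flatten to the date/entries shape from the original request.
let result = ($grouped | each {|g|
  {date: $g.delim.date, entries: ($g.rest | each {|r| $r.event})}
})
```

Because every delimiter row opens a group immediately, a date with no events keeps its row with an empty list. Rebuilding the last group on every row is not efficient, so this is only meant to show the shape of the idea.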

Zoybean avatar Feb 04 '24 14:02 Zoybean