Add `-From` and `-To` parameters to `Select-String`
Add -FromPattern and -ToPattern parameters to Select-String cmdlet.
It is often required to select the lines between two other lines defined by regular expressions, as shown by, for example, these Stack Overflow Q&As:
- Extract specific text and extract
- How to select between multiple lines in power shell?
- How to read lines between 2 special characters in txt-file with Powershell
- Unable to get data between multiple keyword using powershell
- Need Regex to match multiple lines until Match is found between common delimiters
- Extract multiple lines of text between two key words from shell command in powershell
- Extract block of text using regex and powershell
- Need to search a line from a long config file with powershell
- Powershell Regex: Reading a multi-line string between two points
- How can I deleted lines from a certain position?
- Regex match multiple lines from file
- Extract data from log file and copy data to another text file using powershell
- ...
Setting up a function that uses the Select-String cmdlet and properly streams its input and output currently requires keeping track of whether each line needs to be passed through or not. As a result, functions for this often end up stalling the pipeline and running an expensive regular expression over the full multi-line text.
-FromPattern and -ToPattern parameters would allow for a better implementation and an easier syntax to perform this specific common string selection.
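For illustration, the non-streaming workaround described above often looks roughly like the following. This is only a sketch: the sample lines and the variable names ($Lines, $Text, $Block) are made up for the example and are not part of the proposal.

```powershell
# Hypothetical sample input; the names used here are illustrative only.
$Lines = '[One]', '<Start>[Two]', '[Three]', '[Four]<End>', '[Five]'

# Joining the lines means nothing can be emitted until all input has been collected.
$Text = $Lines -join "`n"

# (?s) lets '.' match newlines; '.*?' is lazy, so each match stops at the first <End>.
$Block = [regex]::Matches($Text, '(?s)<Start>(.*?)<End>') |
    ForEach-Object { $_.Groups[1].Value }

# $Block is one multi-line string, not a stream of individual lines.
$Block
```

Because the whole input has to be joined first, this approach cannot stream, which is the drawback the proposed parameters are meant to avoid.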
Include or Exclude?
This proposal raises the question of whether the lines that contain the -FromPattern and -ToPattern matches should be included in or excluded from the output.
As I envision this proposal, the output should start right after the point where the -FromPattern expression is found (outputting the rest of that line) and stop right before the point where the -ToPattern expression is found (outputting any preceding text on that line). This lets the developer select what is passed through by means of the regular expressions themselves, as the examples below show.
Prototype
function SelectString {
    [CmdletBinding()] param(
        [Parameter(ValueFromPipeLine = $True)][String]$String,
        [Regex]$FromPattern,
        [Regex]$ToPattern
    )
    begin {
        $PassThru = !$PSBoundParameters.ContainsKey('FromPattern') -and $PSBoundParameters.ContainsKey('ToPattern')
    }
    process {
        $Start = 0
        do { # There could be (multiple) $FromPattern/$ToPattern matches in one line
            if ($PassThru) {
                $ToMatch = if ($PSBoundParameters.ContainsKey('ToPattern')) { $ToPattern.Match($String, $Start) }
                if ($ToMatch.Success) {
                    $PassThru = $False
                    $Length = $ToMatch.Index - $Start
                    if ($Length) { $String.SubString($Start, $Length) }
                    $Start = $ToMatch.Index + $ToMatch.Length
                }
                else {
                    $String.SubString($Start)
                    $Start = $String.Length
                }
            }
            if (!$PassThru) {
                $FromMatch = if ($PSBoundParameters.ContainsKey('FromPattern')) { $FromPattern.Match($String, $Start) }
                if ($FromMatch.Success) {
                    $PassThru = $True
                    $Start = $FromMatch.Index + $FromMatch.Length
                }
                else {
                    $Start = $String.Length
                }
            }
        } Until ($Start -ge $String.Length)
    }
}
(Note: this prototype is case sensitive.)
Examples
In the examples below, the following list of strings is used for $Test:
$Test = @'
[One]
[Two]
<Start>[Three]
[Four]
[Five][Six]
[Seven]<End>
[Eight]
[Nine]
'@ -Split '[\r\n]+'
Example 1
Select everything between the -From and -To expressions:
$Test | SelectString -From '\<Start\>' -To '\<End\>'
[Three]
[Four]
[Five][Six]
[Seven]
Example 2
Exclude the lines that match the -From and -To expressions:
(By matching the whole line)
$Test | SelectString -From '\<Start\>.*' -To '.*\<End\>'
[Four]
[Five][Six]
Example 3
Include the lines that match the -From and -To expressions:
(By using lookahead and lookbehind assertions)
$Test | SelectString -From '(?=\<Start\>)' -To '(?<=\<End\>)'
<Start>[Three]
[Four]
[Five][Six]
[Seven]<End>
Example 4
Select multiple items in a single line:
(Note: see also "Revisit the behavior of chaining Select-String calls", #14850)
$Test | SelectString -From '\<Start\>' -To '\<End\>' | SelectString -From '\[' -To '\]'
Three
Four
Five
Six
Seven
Caveats
The Select-String cmdlet already has quite a few parameters, and some of them (like -Pattern) should be mutually exclusive with the proposed -FromPattern and -ToPattern parameters.
The -Context parameter might still apply: where the first integer refers to the number of lines before the line where the -FromPattern expression is found and the second integer to the number of lines after the line where the -ToPattern expression is found.
I like the idea; a few comments:
- -From and -To is probably a better pairing, though perhaps, given that -Pattern is already [string[]] typed, a -Between switch is enough (though perhaps then having to enforce exactly 2 arguments in the cmdlet itself is awkward, and the shift in logic would have to be well-documented).
- Whether to include the delimiting patterns or not could be handled similar to the -split operator: exclude by default, except if capture groups are present in the patterns. This would be a simpler alternative to the look-around assertions.
- What about extracting multiple blocks? Following the current behavior, this should happen by default, except if -List is present, which, however, currently only applies to files as input: at most one match per file. -AllMatches relates to multiple matches per line (input object), and should probably not be repurposed in this context.
- What if a multiline string as a single object is provided as the input - should automatic splitting into lines then be performed?
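For reference, the -split behavior mentioned in the second point can be sketched as follows. The sample string and the variable names ($Excluded, $Included) are only illustrative; how Select-String would adopt such a rule is left open here.

```powershell
# Without a capture group, the delimiter is dropped from the result.
$Excluded = 'a,b,c' -split ','      # three elements: a, b, c

# With a capture group, the captured delimiter text is kept as its own element.
$Included = 'a,b,c' -split '(,)'    # five elements: a , b , c

$Excluded.Count
$Included.Count
```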
Great idea.
Another idea is to use a separate parameter set, ByFields, with a -FieldsSeparators parameter. The two patterns are separated by | and the value must contain exactly 2 fields, like this:
$test | select-string -FieldsSeperator '\<Start\>|\<End\>'
You can do this sort of thing using the switch statement. For example, to output everything between "start" and "end" do
$extractedText = switch -regex ($test) { "start" {$print=$true} {$print} {$_} "end" {$print=$false}}
It's a bit more work but this approach is ultimately more powerful (and faster). Now a nice addition to switch would be regex "ranges" like:
switch -regex ($test) { "start","end" {$_}}
which simplifies everything. (The switch statement is roughly modelled on AWK but was never as complete as I wanted it to be.)
@BrucePay
The range extension is powerful and I hope it gets implemented. Also, Select-String needs -FieldsSeparators like in awk.
I hate an operator named -Till - what soil is going to be tilled by it?
Why not -Until?
Thanks for the comments and support,
@mklement0,
-From and -To is probably a better pairing
You are probably right. The reason I used Till is that in my language (Dutch) we use different conjunctions for a verb like Select (from X To Y) than for a verb like Move (from X To Y), and they do indeed have a slightly different meaning.
Anyway, I don't care much which parameter names are eventually chosen. I have changed the original proposal according to this suggestion. (Also note @doctordns' suggestion: Until might also be correct.)
a -Between switch is enough
Besides your own counterargument, I would also like to be able to omit one or the other parameter. The point is that the default expression for both parameters is not an empty expression but rather something like (?=a)b (always false). This works for omitting the -ToPattern parameter, but there is a small pitfall in omitting the -FromPattern parameter: you might think it should default to something like .* to start with the first line, but the result would be that it immediately re-enables pass-through ($PassThru) after the -ToPattern is found. Knowing that, I would also expect to receive multiple From/To blocks when they exist, e.g.:
($Test + $Test) |SelectString -From '\<Start\>.*' -To '.*\<End\>'
[Four]
[Five][Six]
[Four]
[Five][Six]
In other words, to receive just the last block, I would like to be able to do this:
(Select everything from the first End marker onward, and then select everything between Start and End)
($Test + $Test) |SelectString -From '\<End\>.*' |SelectString -From '\<Start\>.*' -To '.*\<End\>'
I have updated the prototype a little so that the -FromPattern or -ToPattern parameter can be omitted.
You can of course still do something similar with a single -Between parameter, but I think that is less clear when omitting the From or To.
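The "always false" default expression mentioned above can be checked directly. The following is only a small sketch; the variable name $Never is illustrative. The lookahead requires the next character to be "a", while the pattern then tries to consume "b" at that same position, so no input can satisfy both.

```powershell
# (?=a) asserts that the next character is 'a'; 'b' must then match at the same position.
$Never = [regex]'(?=a)b'

$Never.IsMatch('ab')   # False
$Never.IsMatch('b')    # False
$Never.IsMatch('')     # False
```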
-List and -AllMatches
As in the example in the response above, I think that these parameters are indeed a little redundant given the flexibility of the proposed parameters. Anyway, if the proposal is accepted, I am happy to give some deeper thought to how these existing parameters could be combined (or made mutually exclusive).
What if a multiline string as a single object is provided as the input - should automatic splitting into lines then be performed?
I don't think they should be split automatically. Instead, I would like to be able to do this:
... |Out-String -Stream |Select-String ...
(According to your issue Out-String -Stream unexpectedly does not split multi-line input strings into individual lines too #14638)
@p0W3RH311,
use -FieldsSeparators parameter
I think that regular expressions are too complicated by themselves to be joined into a single string (what if "|" is part of my search pattern?).
@bpayette,
It's a bit more work but this approach is ultimately more powerful (and faster)
I like your alternative, but I don't see why it is "more powerful (and faster)". Besides, as you mentioned yourself, it is more work and apparently requires more programming skills. Taking your own example, you should initialize $print; if you neglect to do so and $print happens to be defined in a higher scope (with a value that in most cases evaluates to true), it will fail and start outputting from the beginning.
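A minimal sketch of the switch approach with the flag initialized up front might look like this. The sample input is hypothetical; the patterns and the $print and $extractedText names are taken from the example quoted above.

```powershell
# Hypothetical sample input.
$test = '[One]', '<Start>[Two]', '[Three]', '[Four]<End>', '[Five]'

# Initialize the flag explicitly so a $print inherited from an outer scope cannot leak in.
$print = $false

$extractedText = switch -Regex ($test) {
    'start'    { $print = $true }    # turn output on at the start marker
    { $print } { $_ }                # emit the current line while the flag is set
    'end'      { $print = $false }   # turn output off after the end marker
}

$extractedText
```

With the clauses in this order, the lines containing the start and end markers are themselves part of the output.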
@iRon7
I think that regular expressions are too complicated by themselves to be joined into a single string (what if "|" is part of my search pattern?).
The solution is to embed the 2 regexes inside single or double quotes, like:
$test | select-string -FieldsSeperator "'^\|'|'\|$'"
or maybe cleaner and more in the PowerShell way, with a hashtable:
$test | select-string -FieldsSeperator @{
First = 'regex'
Last = 'regex'
}
@p0W3RH311, I don't think we want to refashion Select-String into an awk-like utility, which is both much more open-ended in its purpose while offering convenient features for splitting each line into fields, which doesn't really apply to Select-String.
@iRon7, understood re separate -FromPattern and -ToPattern parameters and splitting multi-line input.
I don't see why it is "more powerful (and faster)".
switch is indeed much more open-ended and flexible than Select-String and, as a language statement that doesn't (typically) process pipeline input object by object, it is generally significantly faster (especially with collections already in memory in full); the only case in which Select-String can (slightly) outperform switch -File is when Select-String is passed a filename rather than object-by-object pipeline input, in which case it reads the file itself, efficiently.
But, as in previous conversations, this is not an either-or scenario:
Both your proposal and @BrucePay's switch enhancement sound like they may be worthwhile.
(If that is the consensus, then the new features should at least work similarly, and one challenge that comes to mind is that the '<regex>', '<regex>' syntax - familiar from both awk and sed - has inclusive logic; that is, the lines matching the patterns are selected too. Also, given the context, the syntax may inspire the expectation that '<regex>', '<regex>' will select any single line matching any one of the (two) specified regexes. switch offers great flexibility already, but does have complexity, and adding to that may be a concern.)
@mklement0
Hello, thanks for the clarification. But the second example is the PowerShell way, and it is much more elegant and flexible than awk or sed; perhaps we can extend it like:
$test | select-string -FieldsSeperator @{
First = 'regex'
Last = 'regex'
Inclusive = $true/$false
}
Select-String is inspired by grep, but it is limited and misses a lot of things.
@bpayette,
@mklement0 is right (as usual😊): "it is not an either-or scenario". Therefore, I hid my previous comment and rewrote it below.
$extractedText = switch -regex ($test) { "start" {$print=$true} {$print} {$_} "end" {$print=$false}}
Using the current switch statement is, as you mentioned yourself (and this is an argument for this proposal), more work and requires more programming skills. Taking your own example, you should initialize $print; if you neglect to do so and $print happens to be defined in a higher scope (with a value that in most cases evaluates to true), it will fail, as it will start outputting from the beginning.
switch -regex ($test) { "start","end" {$_}}
I definitely like the switch range suggestion. One thing to consider, though, is how to omit one or the other pattern (to select everything from the "start" marker to the end of the stream, or everything from the start of the stream to the "end" marker; see also my comment on "a -Between switch is enough").
Anyway, I would recommend opening a separate Feature Request/Idea 🚀 for this.
This issue should have its Needs-Triage label removed and either be marked with something like Resolution-Declined or be put on a kind of wishlist: "Issue we would like to prioritize, but we can't commit we will get to it yet".