Introduce the ability to remove leading indents after each newline in a multiline string.
In Dart, if I create a multiline string like this:
String itemList = """
    1. Item A
    2. Item B
    3. Item C
""";
I expect the output of print(itemList) to be:
1. Item A
2. Item B
3. Item C
But the actual output is:
    1. Item A
    2. Item B
    3. Item C
It would be helpful to introduce a method, called as itemList.trimLeadingIndents(), that produces the expected output by removing the indentation after every newline character.
I would like to contribute a fix for this issue by providing such a method, but first I would like to know whether the community wants such a method in the Dart language.
Any help would be appreciated.
I would much rather change the language to make the initial example work.
Adding a trimming function is "easy", at least if we can agree on what it should do: treat CR, CR+LF and LF as line terminators, or only LF; remove any common prefix consisting of only spaces, or of any whitespace, from each "line"; do not treat a trailing line terminator as introducing an empty line; do remove a final line containing only spaces; maybe add an optional tabSize parameter that allows expanding tabs to spaces. Definitely doable.
It's just a very specific function. If the only use-case for it is to fix literals, then I don't think it carries its own weight. At least not in the SDK, but it's fairly easily added with extension methods if you want it.
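import "dart:convert";

/// Removes the longest prefix of spaces and tabs shared by all lines of [text].
String trimLeadingWhitespace(String text) {
  // Materialize the lines, since they are traversed twice below.
  var lines = LineSplitter.split(text).toList();
  if (lines.isEmpty) return text;

  // Longest common prefix of [a] and [b] consisting only of spaces and tabs.
  String commonWhitespacePrefix(String a, String b) {
    int i = 0;
    for (; i < a.length && i < b.length; i++) {
      int ca = a.codeUnitAt(i);
      int cb = b.codeUnitAt(i);
      if (ca != cb) break;
      if (ca != 0x20 /* spc */ && ca != 0x09 /* tab */) break;
    }
    return a.substring(0, i);
  }

  // Fold from the first line rather than using reduce: with a single line,
  // reduce would return the whole line as the "prefix" and strip everything.
  var prefix = lines.fold(lines.first, commonWhitespacePrefix);
  var prefixLength = prefix.length;
  return lines.map((s) => s.substring(prefixLength)).join("\n");
}

void main() {
  var x = trimLeadingWhitespace("""
      1.x
      2.y
      """);
  print(x);
}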
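For what the extension-method route might look like, here is a minimal sketch. The method name `trimLeadingIndents` is the one suggested in the original request; the extension name `IndentTrimming` is made up for illustration, and nothing like it exists in the SDK. It applies the same common-prefix idea as the function above, and like that function it treats an entirely empty line as having no indentation, so one empty line in the middle means nothing is stripped.

```dart
import 'dart:convert';

/// Hypothetical extension; `IndentTrimming` is an illustrative name only.
extension IndentTrimming on String {
  /// Removes the longest run of spaces/tabs shared by every line.
  String trimLeadingIndents() {
    final lines = LineSplitter.split(this).toList();
    if (lines.isEmpty) return this;

    // Start with the first line and shrink the candidate prefix line by line.
    var prefix = lines.first;
    for (final line in lines) {
      var i = 0;
      while (i < prefix.length &&
          i < line.length &&
          prefix.codeUnitAt(i) == line.codeUnitAt(i) &&
          (line.codeUnitAt(i) == 0x20 || line.codeUnitAt(i) == 0x09)) {
        i++;
      }
      prefix = prefix.substring(0, i);
    }
    return lines.map((l) => l.substring(prefix.length)).join('\n');
  }
}

void main() {
  const itemList = '''
    1. Item A
    2. Item B
    3. Item C
    ''';
  print(itemList.trimLeadingIndents());
}
```

Because the closing quotes are indented too, the last line consists only of the shared indent and becomes empty, so the result still ends with a newline.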
The language change would be something like:
- If the last line of a multiline string literal's content (the line leading up to the """ or ''' quote) contains only whitespace characters (syntactically, no escapes or interpolations),
- then all other lines (not including an entirely empty first line, which is already not included in the resulting string) must start with the same whitespace, and
- then that whitespace is not included in the resulting string.
It does mean that tabs and spaces are treated as different. Dart does not have a canonical way to convert between tabs and spaces, which is why I'd make it an error if the other lines do not match. That ensures that accidental mismatches are caught early.
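To make the proposed rule concrete, here is a rough run-time emulation of it. This is only a sketch: the real change would happen at compile time in the language, `stripLiteralIndent` is a hypothetical name, and the code assumes LF-only line terminators. It also reads the rule literally, so an empty line in the middle counts as a mismatch.

```dart
/// Emulates the proposed literal rule at run time.
/// `stripLiteralIndent` is a hypothetical helper, not an existing API.
String stripLiteralIndent(String content) {
  // Assumption: lines are separated by LF only.
  final lines = content.split('\n');
  final indent = lines.last;

  // The rule only applies if the last line is non-empty and pure whitespace.
  if (!RegExp(r'^[ \t]+$').hasMatch(indent)) return content;

  final result = <String>[];
  for (var i = 0; i < lines.length - 1; i++) {
    // Tabs and spaces are deliberately not interchangeable here.
    if (!lines[i].startsWith(indent)) {
      throw FormatException(
          'Line ${i + 1} does not start with the closing line\'s indentation',
          content);
    }
    result.add(lines[i].substring(indent.length));
  }
  // The whitespace-only last line becomes empty; the newline before it stays.
  result.add('');
  return result.join('\n');
}

void main() {
  print(stripLiteralIndent('  * foo\n    - bar\n  '));
}
```

Throwing on a mismatch mirrors the "make it an error" choice described above; a compile-time version would report the same problem before the program ever runs.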
@lrhn Can we just use a Regex pattern that detects the leading indents after a newline and replaces the matches with ''?
If that's possible, I think it will be more efficient.
(I'm sure I can optimize the code to not split first, but do all the work on the original string, that will make that code more efficient as well).
If you just use one RegExp to detect the leading whitespace, then it cannot check that all the lines have the same leading whitespace.
If you do:
var something = """
    * foo.
    * bar
      - baz
    * qux
    """;
you don't want to remove the extra indent from - baz, only the shared indent that is on all lines.
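A small sketch of the difference, using two hypothetical helpers (`stripAllIndent` and `stripSharedIndent` are names invented here). The second one is handed the shared indent explicitly, which is exactly the part a single simple pattern cannot work out on its own.

```dart
/// Naive approach: removes *all* leading spaces and tabs from every line.
String stripAllIndent(String text) =>
    text.replaceAll(RegExp(r'^[ \t]+', multiLine: true), '');

/// Removes only [indent] from the start of each line that begins with it.
String stripSharedIndent(String text, String indent) =>
    text.replaceAll(RegExp('^${RegExp.escape(indent)}', multiLine: true), '');

void main() {
  const something = '''
    * foo.
    * bar
      - baz
    * qux
    ''';
  // Flattens the list: "- baz" ends up at the same level as the bullets.
  print(stripAllIndent(something));
  // Keeps the nesting: "- baz" retains its two extra spaces.
  print(stripSharedIndent(something, '    '));
}
```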
Let's try:
final RegExp _commonLeadingWhitespaceRE = RegExp(r"([ \t]+)(?![^]*^(?!\1))", multiLine: true);

String trimLeadingWhitespace(String text) {
  var commonWhitespace = _commonLeadingWhitespaceRE.matchAsPrefix(text);
  if (commonWhitespace != null) {
    return text.replaceAll(RegExp("^${commonWhitespace[1]}", multiLine: true), "");
  }
  return text;
}
This can obviously be simplified if we assume that all line terminators are LF characters. Then the final replace would just be:
    return text.replaceAll("\n${commonWhitespace[0]}", "\n");
and we wouldn't have to allocate a new RegExp each time (just a new string).
It's been tested very little, but the logic seems correct :grin:.
It's still not particularly efficient because the RegExp checks each possible length of leading whitespace of the first line against all later line starts. The algorithm above knows to only use the common prefix in later checks.
Hmm, if we require the final line to be only whitespace, and all other lines starting with the same whitespace, then we can change the RegExp to:
final RegExp _commonLeadingWhitespaceRE = RegExp(
    r"(?=[^]*^([ \t]+)$(?![^]))(?![^]*^(?!\1))", multiLine: true);
It would then start by finding the final line of only whitespace, and then check that all lines start with that whitespace. Might be more efficient, but less general. Still not massively efficient, though (I could probably implement that more efficiently in Dart code too).
Do you have a more efficient RegExp-based approach in mind?
(RegExps are not necessarily efficient just because they are compact - and hard to read).
So, just for completeness, I've written a benchmark using the second RegExp above and hand-written code to do the same thing: https://dartpad.dev/?id=701db852e0a0c001786d82f04c87357c (Bigger score is better).
The hand-written code is ~30% faster in dartpad, and 150% faster when run on the VM.
(For good measure, I also added a version using a single RegExp replace, but it's ~two orders of magnitude slower than the other approaches, and it's also a wrong implementation because it allows initial lines with different leading whitespace).
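The benchmark's hand-written code is only available through the DartPad link, so the following is not that code. It is just a rough sketch of how the "final line is only whitespace" rule could be implemented directly on the original string, without splitting it into lines first. The name `trimByLastLine` is hypothetical, LF-only line terminators are assumed, and the input is returned unchanged whenever the rule does not apply.

```dart
/// Hypothetical sketch: strips the indentation of the final, whitespace-only
/// line from every line, scanning the original string instead of splitting it.
String trimByLastLine(String text) {
  // Assumption: lines are separated by LF only.
  final lastLineStart = text.lastIndexOf('\n') + 1;
  final indent = text.substring(lastLineStart);
  if (indent.isEmpty) return text;
  for (var i = 0; i < indent.length; i++) {
    final c = indent.codeUnitAt(i);
    if (c != 0x20 && c != 0x09) return text; // Last line is not pure whitespace.
  }

  // First pass: check that every earlier line starts with the indent.
  for (var start = 0; start < lastLineStart;) {
    if (!text.startsWith(indent, start)) return text;
    start = text.indexOf('\n', start) + 1;
  }

  // Second pass: copy each line, including its LF, without the indent.
  final buffer = StringBuffer();
  for (var start = 0; start < lastLineStart;) {
    final end = text.indexOf('\n', start) + 1;
    buffer.write(text.substring(start + indent.length, end));
    start = end;
  }
  return buffer.toString();
}

void main() {
  print(trimByLastLine('    * foo.\n      - baz\n    '));
}
```

Unlike the RegExp, this only ever compares one candidate indent (the last line) against each line start, which is the property that keeps the work linear in the length of the string.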