carbon-lang
carbon-lang copied to clipboard
toolchain string literal parser mishandles trailing \t in multiline string literals
The toolchain string literal parser incorrectly treats
var s: String = """
\t
""";
as a string with contents "\n" instead of as a string with contents "\t\n", because the \t gets expanded before whitespace at the end of the line is removed.
This code in common/string_helpers.cpp treats a line with just spaces or tabs as empty
static constexpr llvm::StringRef HorizontalWhitespaceChars = " \t";
auto ParseBlockStringLiteral(llvm::StringRef source, const int hashtag_num)
-> ErrorOr<std::string> {
const size_t first_non_ws =
line.find_first_not_of(HorizontalWhitespaceChars);
if (first_non_ws == llvm::StringRef::npos) {
// Empty or whitespace-only line.
line = "";
}
}
I believe the bug is right here:
auto end_of_regular_text = contents.find_if([](char c) {
return c == '\n' || c == '\\' ||
(IsHorizontalWhitespace(c) && c != ' ');
});
result += contents.substr(0, end_of_regular_text);
contents = contents.substr(end_of_regular_text);
if (contents.empty()) {
return result;
}
if (contents.consume_front("\n")) {
// Trailing whitespace before a newline doesn't contribute to the string
// literal value.
while (!result.empty() && result.back() != '\n' &&
IsSpace(result.back())) {
result.pop_back();
}
result += '\n';
// Move onto to the next line.
break;
}
When we hit a \n, we back up over trailing horizontal whitespace to remove it from the line, but we will also back up across tab characters inserted by \t escapes. I think it would suffice to ensure we never back up by more than end_of_regular_text characters.
Hello, I am interested in contributing. This would be my first contribution so I might need some extra guidence, would that be alright?
Hello, I am interested in contributing. This would be my first contribution so I might need some extra guidence, would that be alright?
Yep, this is a good first issue, feel free to ask questions as needed here or in the #toolchain channel of our Discord server.
@zygoloid The design here says: "All trailing whitespace on each line, including the line terminator, is replaced with a single line feed (U+000A) character."
According to whitespace, \t is horizontal whitespace.
Isn't this working as designed? [trailing \t removed] (though maybe it should specify "All trailing horizontal whitespace")
If you're changing this to allow some trailing whitespace, maybe remove the rule from the design entirely? But that sounds a bit like a design change, not just a small issue.
To note here, the answer to my confusion was that the \t is unescaped to content, then the back-search code goes over content and so is examining the unescaped result rather than the original content. Thus the comments about end_of_regular_text, because it should only be looking at the just-added content.
If you're changing this to allow some trailing whitespace, maybe remove the rule from the design entirely? But that sounds a bit like a design change, not just a small issue.
This code is within the ExpandEscapeSequencesAndRemoveIndent() auto. The process appears to require separation of code indentation, from the Carbon developers intended string literal content, especially where the string literal content may have an indented blank line.
Doesn't sound like a design change. Organizing code a little more, starting from this auto may simplify a whole chain of string literal parsing concerns along this line.
To note here, the answer to my confusion was that the
\tis unescaped to content, then the back-search code goes over content and so is examining the unescaped result rather than the original content. Thus the comments about end_of_regular_text, because it should only be looking at the just-added content.
ExpandEscapeSequencesAndRemoveIndent can be split into two autos, ExpandEscapeSequences and RemoveIndent.
Maybe that will help make the discussion easier to follow in this issue, and also help to separate fixes and tests moving forward with this issue.
The removal of indents and the expansion of escape sequences may then be handled in whatever order by an encompassing process.
Additional thought is that the parsing of code blocks, and string literals alike, may both call upon these more well organized functions.
To note here, the answer to my confusion was that the
\tis unescaped to content, then the back-search code goes over content and so is examining the unescaped result rather than the original content.ExpandEscapeSequencesAndRemoveIndent can be split into two autos, ExpandEscapeSequences and RemoveIndent.
The removal of indents and the expansion of escape sequences may then be handled in whatever order by an encompassing process.
The following example should be enough to clear many potential issues. In managed code, splitting ExpandEscapeSequencesAndRemoveIndent() looks like
String ExpandEscapeSequencesAndRemoveIndent(stringLiteral) {
retString = ExpandEscapeSequences(stringLiteral);
return RemoveIndent(retString);
}
where each method is passed a string argument and returns a string argument. I believe the C++ autos are using a referenced string, so the implementation is a little different in C++.
When we hit a
\n, we back up over trailing horizontal whitespace to remove it from the line, but we will also back up across tab characters inserted by\tescapes. I think it would suffice to ensure we never back up by more thanend_of_regular_textcharacters.
This comment features mention of two characters and a character typing. So much talk of individual characters within this one string literal issue.
The proposal for character literal has resurfaced in discussion this week.
https://github.com/carbon-language/carbon-lang/issues/1934
My recent findings on C++ code organizing in this string literal issue, manage to allow a possible conclusion that new architectural code design in character literal will theorize as essential in subtle advances on this string literal issue. Attacking the C++ code autos revealed in this string literal issue is currently relevant and essential considering yet to be tested progress completion.
If string literal and character literal are going to proceed as separate design concepts, combining strings and characters using an advance in architectural code design concurrently persists with design, stlyle, and issue management progress.
While there are probably some opportunities for code sharing with character literal parsing, I think for this issue we're fine with a solution that focuses on just fixing the multiline string literal issue.
Yeah, it would be good to get this issue resolved.
Considering the effort of potentially reassigning this issue, I'm wanting to fix something else. I only want to do a small effort every week. Maybe something a little higher level code.
Until I'm reacquainted with C++ some more, I'm probably better at attacking .h files more aggressively than .cpp files. My work lately has involved a good amount of code architecture and managed code writing.
My analysis of this can be easily translated into talk of C++ header definitions. Thanks for your feedback. 😀
I only want to do a small effort every week.
I think I need to clarify the estimated size of this issue...
The root problem is the literal text \t is turned into a tab character and incorrectly removed by block string literal indentation handling (described in the string literals design). It's a tab character, but the removal as indent is a bug.
I'd estimate a fix as ~10 LoC (preferably also adding a couple regression tests), and zygoloid noted the place to look at. We think this is a good first issue because fixing it only requires an understanding of the string parsing code plus the multi-line string literal indent handling.
(feel free to look at other things, but I want to avoid confusion about the size and scope of this issue)
I only want to do a small effort every week.
I think I need to clarify the estimated size of this issue...
The root problem is the literal text
\tis turned into a tab character and incorrectly removed by block string literal indentation handling (described in the string literals design). It's a tab character, but the removal as indent is a bug.I'd estimate a fix as ~10 LoC (preferably also adding a couple regression tests), and zygoloid noted the place to look at. We think this is a good first issue because fixing it only requires an understanding of the string parsing code plus the multi-line string literal indent handling.
(feel free to look at other things, but I want to avoid confusion about the size and scope of this issue)
Awesome! My mind is in project management and a fresh Ubuntu Carbon install right now. Please give me a chance on at least one other issue, maybe a docs update to design or style; and then I can fire straight here and steal this base if no one else dives in.
Decided to get going on this one. Sorry about referencing from my fork. Didn't know that would show up here.
The draft PR following this comment should be more clean and easier to follow.
Thanks everyone for your comments and suggestions. I'm happy to be working on this issue.
Recent comment by leads recommend solving more than only the \t\n mishandling issue. There's likely some more escape sequences and spacing issues that need to be handled. Regression testing has been recommended also.
I am gradually setting up my development environment to meticulously handle all of these requests to the best of my abilities. I should be able to start doing simple manual tests soon.
I apologize for not having much activity on GitHub lately. In my fork, I have created a roadmap for this long term issue under the guise of an unofficial sub-team, where I am the only sub-team member. I am thinking my GitHub activity will increase when the early steps of this roadmap are completed.
My new draft pr has linked text written that should also handle the rtl version of this issue. Hopefully by the end of 2023, we can make this issue go out of style.
Just noting that I've hidden some discussion as off-topic because it was distracting from this issue. Generally, this issue is open and anyone interested should feel free to take a look at fixing it. Our hope is that it'll be a relatively small and isolated fix in the code.
The toolchain string literal parser incorrectly treats
var s: String = """ \t """;as a string with contents
"\n"instead of as a string with contents"\t\n", because the\tgets expanded before whitespace at the end of the line is removed.
Please edit the triple double-quotes into single-quotes because it maybe confusing for the beginners like me :)
Please edit the triple double-quotes into single-quotes because it maybe confusing for the beginners like me :)
The triple quotes are deliberate; see https://github.com/carbon-language/carbon-lang/tree/trunk/docs/design#string-literals
Please edit the triple double-quotes into single-quotes because it maybe confusing for the beginners like me :)
The triple quotes are deliberate; see https://github.com/carbon-language/carbon-lang/tree/trunk/docs/design#string-literals
I don't think so.. It produces an error here (https://compiler-explorer.com/z/x5enh4aWb) also read this part of "string literal doc" (https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#simple-and-block-string-literals)
I think what happened is: the language was updated, someone updated a part of the doc and missed to update another.. I think it's another problem needs to be solved, the multi-documenting (so to say) for one feature.. I think it's better to be just a one place, to avoid situations like this..
Thanks, that is indeed a doc error that it should use ''' not """. I'll fix that (#2638). https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md is correct.
Thanks for the fix @clavin!