Support control characters in `@CsvSource` and `@CsvFileSource`
Description
I am writing unit tests where test cases have input strings with (non-printable) control characters. These characters generally occupy code points U+0000 through U+001F. When using @CsvSource I am finding that using control character literals in strings behaves differently from printable characters.
For example:
- An unquoted `\u0000` literal is translated to `null`.
- A quoted `\u0000` literal is translated to an empty string `""`.
This behavior is observed with both Eclipse's internal JUnit 5 test runner and with Maven's Surefire plugin. I have considered the possible impact of the nullValues parameter of @CsvSource. This attribute defaults to {}, so translation to null or an empty string is therefore not expected.
Steps to reproduce
The test below should pass, but unexpectedly fails for @CsvSource test cases that have \u0000 literals.
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
class Reproduction {

    @Test
    void proveThatStringWithControlCharacterLiteralIsNotNullAndHasLengthOfOne() {
        assertEquals(1, "\u0000".length());
    }

    @ParameterizedTest
    @CsvSource(delimiterString = "||", textBlock = """
            A || 1
            \u0000 || 1
            B\u0000 || 2
            """)
    void testWithUnquotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }

    @ParameterizedTest
    @CsvSource(delimiterString = "||", textBlock = """
            'A' || 1
            '\u0000' || 1
            'B\u0000' || 2
            """)
    void testWithQuotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }
}
Context
Used versions
- Jupiter 5.11.0-M2
- Platform 1.11.0-M2
Build Tool/IDE
- Eclipse 2024.03
- Maven 3.9.6
- JVM: Java HotSpot(TM) 64-Bit Server VM (build 17+35-LTS-2724, mixed mode, sharing)
Hi @xazap,
Thanks for raising the issue.
I edited your description to clarify quoting vs. escaping.
In addition, I confirmed the behavior you have reported.
This may be an issue with the CSV parsing library that we use to support @CsvSource.
In any case, we'll investigate what our options are.
> This may be an issue with the CSV parsing library that we use to support `@CsvSource`.
That's indeed the case.
It looks like the Univocity CSV parser ignores control characters by default.
When I add the following to our CsvParserFactory.createParserSettings(...) method, all invocations of your testWithUnquotedInput() parameterized test pass.
settings.setSkipBitsAsWhitespace(false);
However, the latter two invocations of testWithQuotedInput() still fail, and I'm not yet sure if we can influence that.
I assumed the skipBitsAsWhitespace would apply to both quoted and unquoted text, but that appears not to be the case.
OK, after a bit more experimentation, I got your Reproduction test cases (and the rest of the JUnit 5 suite) passing with the following additions to our CsvParserFactory.createParserSettings(...) method.
settings.getFormat().setCharToEscapeQuoteEscaping('\\');
settings.setSkipBitsAsWhitespace(false);
Although these changes in the settings do not cause any of the tests in our test suite to fail, I'm a bit hesitant to change them for all users.
We may wish to introduce attributes in @CsvSource and @CsvFileSource to allow users to opt into these features; however, we would ideally like to keep the number of attributes in those annotations to a minimum.
In light of that, we'll discuss this topic during one of our upcoming team calls.
Please note that control characters are ignored in your testWithUnquotedInput() test case, because they are considered leading or trailing whitespace.
Thus, the following passes without any modifications to JUnit Jupiter, since C\u0000D contains \u0000 between other non-whitespace characters.
@ParameterizedTest
@CsvSource(delimiterString = "||", textBlock = """
    A || 1
    C\u0000D || 3
    """)
void testWithUnquotedInput(String testcase, Integer expectedLength) {
    assertNotNull(testcase);
    assertEquals(expectedLength, testcase.length());
}
Similarly, the following also passes without any modifications to JUnit Jupiter by setting ignoreLeadingAndTrailingWhitespace = false and removing all whitespace between columns and the delimiters.
@ParameterizedTest
@CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
    A,1
    \u0000,1
    B\u0000,2
    """)
void testWithUnquotedInput(String testcase, Integer expectedLength) {
    assertNotNull(testcase);
    assertEquals(expectedLength, testcase.length());
}
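To illustrate why `\u0000` disappears at the edges of an unquoted value but survives inside `C\u0000D`, here is a minimal sketch that uses plain `String.trim()` as a stand-in for the parser's trimming. That is an assumption made for illustration: `trim()` removes leading and trailing characters up to and including U+0020, which is consistent with the behavior observed in this issue but is not taken from the Univocity source. The names `LeadingTrailingControlChars` and `trimmedLikeParser` are hypothetical.

```java
import java.util.List;

class LeadingTrailingControlChars {

    // Hypothetical stand-in for the parser's handling of unquoted values:
    // String.trim() drops leading and trailing characters <= U+0020,
    // which includes the NUL control character.
    static String trimmedLikeParser(String rawField) {
        return rawField.trim();
    }

    public static void main(String[] args) {
        // Raw first-column values as they appear before the "||" delimiter.
        for (String raw : List.of("A ", "\u0000 ", "B\u0000 ", "C\u0000D ")) {
            String value = trimmedLikeParser(raw);
            System.out.println(value.length()); // prints 1, 0, 1, 3 (one per line)
        }
    }
}
```

Under this model only the NUL characters at the start or end of a value are dropped, which matches the lengths seen in the failing and passing rows above.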
> Please note that control characters are ignored in your `testWithUnquotedInput()` test case, because they are considered leading or trailing whitespace.
Thank you for explaining! For me it's confusing if @CsvSource has a different definition of whitespace than the Java SDK itself. Since java.lang.Character.isWhitespace('\u0000') returns false, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition. Looking at the Javadoc of @CsvSource it mentions the term whitespace but doesn't clarify which code points it considers whitespace. Maybe this could be added to the Javadoc?
> Thus, the following passes without any modifications to JUnit Jupiter, since `C\u0000D` contains `\u0000` between other non-whitespace characters.
Ah, this works around the issue, but makes the test less readable because I have to deliberately insert characters I don't want to test for.
> Similarly, the following also passes without any modifications to JUnit Jupiter by setting `ignoreLeadingAndTrailingWhitespace = false` and removing all whitespace between columns and the delimiters.
>
>     @ParameterizedTest
>     @CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
>         A,1
>         \u0000,1
>         B\u0000,2
>         """)
>     void testWithUnquotedInput(String testcase, Integer expectedLength) {
>         assertNotNull(testcase);
>         assertEquals(expectedLength, testcase.length());
>     }
This works! The formatting is not as I would like, but it is an acceptable workaround. I am confused about the meaning of ignoreLeadingAndTrailingWhitespace=false though. If leading whitespace is not to be ignored, why is the leading whitespace before A not part of the testcase?
Also, I noticed something odd: if I move the closing """) in your fixed example one tab to the left, the test fails for all three cases:
@ParameterizedTest
@CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
    A,1
    \u0000,1
    B\u0000,2
""")
void testWithUnquotedInput(String testcase, Integer expectedLength) {
    assertNotNull(testcase);
    assertEquals(expectedLength, testcase.length());
}
Why would one less trailing whitespace character matter in this case?
> Thank you for explaining!
You're welcome!
> For me it's confusing if `@CsvSource` has a different definition of whitespace than the Java SDK itself. Since `java.lang.Character.isWhitespace('\u0000')` returns `false`, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition.
I understand how that can be confusing.
To be honest, I was not aware of the difference with the Univocity parser's default behavior, and I doubt anyone else on the JUnit team was aware of that either.
If I understood the documentation correctly, the difference is due to the fact that some databases include control characters in their exported CSV files which are typically ignored when importing or working with those CSV files.
> Looking at the Javadoc of `@CsvSource` it mentions the term whitespace but doesn't clarify which code points it considers whitespace.
As I mentioned above, we were unaware of that difference.
> Maybe this could be added to the Javadoc?
Yes, we can definitely update the Javadoc to make that explicit.
However, I'd first like to discuss these topics within the team before committing to anything concrete.
> > Thus, the following passes without any modifications to JUnit Jupiter, since `C\u0000D` contains `\u0000` between other non-whitespace characters.
>
> Ah, this works around the issue, but makes the test less readable because I have to deliberately insert characters I don't want to test for.
I was not suggesting that you use that as a workaround. Rather, I was merely pointing out how things work with the default CSV parser settings.
> > Similarly, the following also passes without any modifications to JUnit Jupiter by setting `ignoreLeadingAndTrailingWhitespace = false` and removing all whitespace between columns and the delimiters.
>
> This works! The formatting is not as I would like, but it is an acceptable workaround.
I'm glad to hear that's a suitable workaround for you. 👍
> I am confused about the meaning of `ignoreLeadingAndTrailingWhitespace=false` though. If leading whitespace is not to be ignored, why is the leading whitespace before `A` not part of the testcase?
There is no whitespace before A in that example. See below.
> Also, I noticed something odd: if I move the closing `""")` in your fixed example one tab to the left, the test fails for all three cases:
>
> Why would one less trailing whitespace character matter in this case?
If you move the closing """, you have introduced intentional whitespace in the String.
This is simply how text blocks in Java work.
The documentation in the User Guide for @CsvSource states the following:
It is therefore recommended that the closing text block delimiter (""") be placed either at the end of the last line of input or on the following line, left aligned with the rest of the input.
And:
Java’s text block feature automatically removes incidental whitespace when the code is compiled. However other JVM languages such as Groovy and Kotlin do not. Thus, if you are using a programming language other than Java and your text block contains comments or new lines within quoted strings, you will need to ensure that there is no leading whitespace within your text block.
I suggest you read that link which points to the Programmer's Guide to Text Blocks.
Hopefully that clarifies things!
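The effect of the closing delimiter's position can be reproduced without JUnit. The following is a minimal sketch (the class and method names are hypothetical, and four spaces stand in for the "one tab" mentioned above) that compares a text block whose closing `"""` is aligned with the content against one whose closing `"""` is shifted to the left.

```java
class TextBlockIndentation {

    // Closing delimiter aligned with the content: the whole indentation is
    // incidental whitespace and is stripped by the compiler.
    static String aligned() {
        return """
                A,1
                B,2
                """;
    }

    // Closing delimiter moved four spaces to the left: it now determines the
    // minimal indentation, so four spaces remain in front of every row.
    static String shifted() {
        return """
                A,1
                B,2
            """;
    }

    public static void main(String[] args) {
        // Spaces are shown as underscores to make them visible.
        System.out.print(aligned().replace(' ', '_')); // A,1 and B,2
        System.out.print(shifted().replace(' ', '_')); // ____A,1 and ____B,2
    }
}
```

With `ignoreLeadingAndTrailingWhitespace = false`, those four remaining spaces become part of the first column, so a row such as `A,1` yields a value of length 5 instead of 1.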
> I suggest you read that link which points to the Programmer's Guide to Text Blocks.
> Hopefully that clarifies things!
Cheers, there is a lot more to text blocks than I knew! It makes perfect sense now.
Due to #4339, the tests shown in this issue no longer fail.
While the uniVocity implementation sometimes silently removes `\u0000` characters (regardless of their appearance), FastCSV does not.
The following tests demonstrate this behavior. Both succeed in JUnit 6, whereas the `binary0SeparatedByMultiChar` test case fails in JUnit 5.
@ParameterizedTest
@CsvSource(delimiterString = ",", ignoreLeadingAndTrailingWhitespace = false, value = "'\u0000a\u0000b\u0000',5")
void binary0(String testcase, Integer expectedLength) {
    assertEquals(expectedLength, testcase.length());
}
@ParameterizedTest
@CsvSource(delimiterString = "||", ignoreLeadingAndTrailingWhitespace = false, value = "'\u0000a\u0000b\u0000'||5")
void binary0SeparatedByMultiChar(String testcase, Integer expectedLength) {
    assertEquals(expectedLength, testcase.length());
}
I'd like to share more about trimming whitespaces, as this topic has various pitfalls and misunderstandings.
The mechanism (as of JUnit 6.0.0-M1) is as follows:
For quoted fields:
- JUnit configures `trimWhitespacesAroundQuotes` in FastCSV to remove whitespaces before opening and after closing quotes. This allows reading data like `' foo' || 'bar '` as `' foo'||'bar '`, which wouldn't be possible otherwise, as the RFC does not allow whitespaces around quotes. Whitespace in this context is defined as any character <= `U+0020` (the space character) – the same logic as in Java's `String.trim()` method.
- The value inside a quoted field remains unchanged. The two fields in the example above would be read as `_foo` and `bar_`, where `_` represents a whitespace character.
For unquoted fields:
- FastCSV itself does not remove any character from unquoted fields. Reading data like `    foo    ||    bar    ` would result in the values `____foo____` and `____bar____`, where `_` represents a whitespace character.
- If the `ignoreLeadingAndTrailingWhitespace` argument of `@CsvSource` is set to `true` (the default), leading and trailing whitespaces are removed via `String.strip()`, which treats whitespaces as defined by `Character.isWhitespace()`.
Possible unwanted behavior
In JUnit 5, the ignoreLeadingAndTrailingWhitespace argument was used to configure ignoreLeadingWhitespaces and ignoreTrailingWhitespaces in uniVocity. While I couldn't find any documentation about this, it seems that uniVocity uses the "trim logic" rather than the "strip logic" for unquoted fields. While trim simply removes characters <= U+0020, strip removes characters defined as whitespace by Character.isWhitespace(), which is based on the Unicode standard but does not include control characters like \u0000.[^1]
In the vast majority of test cases, the two approaches yield the same result, as the characters to remove are mostly space and tab characters. Neither approach is inherently better, but it is important to understand their differences, as they can lead to unexpected results. Maybe JUnit 6 should switch back to trim() for unquoted fields to maintain compatibility with the previous version. Also, the whitespace treatment should be documented and tested.
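The difference between the two approaches can be seen with plain JDK calls. The sketch below (class and method names are hypothetical) pads the letter `A` once with the NUL control character and once with EM SPACE (U+2003), then compares what `trim()` and `strip()` leave behind.

```java
class TrimVersusStrip {

    // Length left after String.trim(), which drops leading and trailing
    // characters <= U+0020.
    static int trimmedLength(String s) {
        return s.trim().length();
    }

    // Length left after String.strip(), which drops leading and trailing
    // characters for which Character.isWhitespace(int) returns true.
    static int strippedLength(String s) {
        return s.strip().length();
    }

    public static void main(String[] args) {
        String nulPadded = "\u0000A\u0000";     // NUL control character on both sides
        String emSpacePadded = "\u2003A\u2003"; // EM SPACE (U+2003) on both sides

        System.out.println(Character.isWhitespace('\u0000')); // false
        System.out.println(Character.isWhitespace('\u2003')); // true

        System.out.println(trimmedLength(nulPadded));      // 1
        System.out.println(strippedLength(nulPadded));     // 3
        System.out.println(trimmedLength(emSpacePadded));  // 3
        System.out.println(strippedLength(emSpacePadded)); // 1
    }
}
```

So a test that relies on `\u0000` at the edge of an unquoted value sees it removed under the trim logic and preserved under the strip logic, while Unicode spaces above U+0020 behave the other way around.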
@vdmitrienko WDYT?
[^1]: A comprehensive overview of the differences between these two methods can be found here: https://stackoverflow.com/a/79629431.
@osiegmar, thanks for pointing this out!
The core issue here, I believe, is that JUnit and FastCSV use different whitespace detection algorithms. As you mentioned, whitespaces in quoted fields are handled on the FastCSV side, which uses String.trim() to remove them, whereas unquoted fields are handled by JUnit via String.strip().
As a result, whitespace handling becomes inconsistent and depends on whether a field is quoted or not.
I’ve created a PR to switch back to String.trim(), which aligns JUnit’s behavior with FastCSV and restores the original whitespace handling logic.
cc: @marcphilipp @sbrannen
> Maybe JUnit 6 should switch back to `trim()` for unquoted fields to maintain compatibility with the previous version. Also, the whitespace treatment should be documented and tested.
>
> I’ve created a PR to switch back to `String.trim()`, which aligns JUnit’s behavior with FastCSV and restores the original whitespace handling logic.
@sbrannen Are you good with this change since you addressed #4697 today?
> > Maybe JUnit 6 should switch back to `trim()` for unquoted fields to maintain compatibility with the previous version. Also, the whitespace treatment should be documented and tested.
> >
> > I’ve created a PR to switch back to `String.trim()`, which aligns JUnit’s behavior with FastCSV and restores the original whitespace handling logic.
>
> @sbrannen Are you good with this change since you addressed #4697 today?
If we are only switching back to trim() for CSV field parsing support, I'm fine with that (as long as it's documented).
For all other purposes, I think we should use strip() going forward (a la #4697).
I suggest closing this issue as the tests shown no longer fail as 2a52a06 proves.
Assigned to 6.0.0 milestone to ensure we verify this issue can be closed before the release.
> For example:
> - An unquoted `\u0000` literal is translated to `null`.
> - A quoted `\u0000` literal is translated to an empty string `""`.
Due to switching to FastCSV (#4339) and the switch back to using String#trim in #4692, the following is now the behavior in JUnit Jupiter 6.
- An unquoted `\u0000` literal is removed if it is leading or trailing whitespace and otherwise remains `\u0000`.
- A quoted `\u0000` literal remains `\u0000`.
In the original Reproduction test case for this issue, testWithQuotedInput() now passes without modification, whereas testWithUnquotedInput() passes if modified as follows (as explained in the last example in https://github.com/junit-team/junit-framework/issues/3824#issuecomment-2120606802).
@ParameterizedTest
@CsvSource(delimiterString = "||", ignoreLeadingAndTrailingWhitespace = false, textBlock = """
    A||1
    \u0000||1
    B\u0000||2
    """)
void testWithUnquotedInput(String testcase, Integer expectedLength) {
    assertNotNull(testcase);
    assertEquals(expectedLength, testcase.length());
}
Thus, in light of #4339 and the documentation changes in #4692, I am closing this issue as resolved.
Note as well that we now have more thorough and comprehensible tests in place.
https://github.com/junit-team/junit-framework/blob/91ae0269f0c61e8f61f1a1c507816eefb40d4b9f/jupiter-tests/src/test/java/org/junit/jupiter/params/provider/CsvArgumentsProviderTests.java#L116-L193
Since #4692 was resolved in 6.0 M2, I have retroactively assigned this issue to 6.0 M2 as well.