[BUG] Data is wrong when reading CSV with double quotation marks in some cases
Describe the bug When reading a CSV file that contains double quotation marks, the data comes out wrong in some cases, even when I read each whole line as a single string column. If the user expects a single string column from a CSV file, it is surprising that double quotation marks still trigger parsing behavior. The symptoms differ, but they all point to a problem in double-quote handling.
Steps/Code to reproduce bug Here is the C++ code to read the CSV. It simply reads a CSV file with a single-column schema, then prints each line. I created a test case in cpp/tests/io/csv_test.cpp with the code below to run my test.
// read a CSV file with single string column schema
std::string filepath = "test.csv";
cudf::io::csv_reader_options in_opts =
  cudf::io::csv_reader_options::builder(cudf::io::source_info{filepath})
    .header(-1)                            // No header
    .names({"c0"})                         // Single column named c0
    .dtypes({dtype<cudf::string_view>()})  // String type
    .delimiter('\t');                      // Tab delimiter instead of comma
auto result = cudf::io::read_csv(in_opts);
auto const view = result.tbl->view();
std::cout << "Number of columns: " << view.num_columns() << std::endl;
std::cout << "Number of rows: " << view.num_rows() << std::endl;
std::cout << "--- Column Data ---" << std::endl;
for (cudf::size_type col_idx = 0; col_idx < view.num_columns(); ++col_idx) {
  auto const& col = view.column(col_idx);
  std::cout << "Column [" << col_idx << "] "
            << result.metadata.schema_info[col_idx].name << ":" << std::endl;
  if (col.type().id() == type_id::STRING) {
    // For string columns, we can print the data
    auto result = cudf::test::to_strings(col);
    for (size_t line_num = 0; line_num < result.size(); ++line_num) {
      std::cout << result[line_num] << std::endl;
    }
  } else {
    std::cout << " (Type " << cudf::type_to_name(col.type())
              << " - data not printed in this test)" << std::endl;
  }
  std::cout << "Column [" << col_idx << "] end" << std::endl;
}
Case 1: An additional "\n" appears at the end of the row when reading a line with an odd number of double quotes. CSV file content:
lt_qeury=o5K","last_ts" end
The wrong output:
lt_qeury=o5K","last_ts" end \n
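To check which lines of an input file fall into the "odd number of double quotes" category from Case 1, a small standalone helper can count the quotes per line. This is only a sketch using the C++ standard library; `count_double_quotes` and `lines_with_odd_quotes` are hypothetical names for illustration and are not part of cuDF.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a cuDF API): number of '"' characters in one line.
std::size_t count_double_quotes(std::string const& line)
{
  return static_cast<std::size_t>(std::count(line.begin(), line.end(), '"'));
}

// Hypothetical helper (not a cuDF API): 1-based numbers of the lines in
// `content` that contain an odd number of '"' characters.
// Assumes '\n' line endings.
std::vector<std::size_t> lines_with_odd_quotes(std::string const& content)
{
  std::vector<std::size_t> odd_lines;
  std::istringstream stream(content);
  std::string line;
  std::size_t line_number = 0;
  while (std::getline(stream, line)) {
    ++line_number;
    if (count_double_quotes(line) % 2 == 1) { odd_lines.push_back(line_number); }
  }
  return odd_lines;
}
```

For the Case 1 line, `count_double_quotes` returns 3, so `lines_with_odd_quotes` reports line 1.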
Case 2: Two consecutive double quotes are shown as only one double quote, and the following lines go missing until another double quote appears. CSV file content:
"packageName":"test","type":"test","url_scheme":false,"referer":"",test
Below line will be empty
test
test
Until this line with another quote, "test=test" "test"
This line will be shown
The output:
"packageName":"test","type":"test","url_scheme":false,"referer":",test
This line will be shown
Expected behavior cuDF should output the same content as the original CSV file. Pandas outputs the same content as the original CSV files.
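The expected result can be stated as a small reference function: every line of the file becomes one row, byte for byte, with double quotes treated as ordinary characters. The sketch below uses only the C++ standard library; `split_lines_verbatim` is a hypothetical name, it assumes '\n' line endings, and it is not how cuDF or Pandas is implemented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reference helper (not cuDF code): split `content` on '\n' only.
// The '"' character gets no special treatment, and a single trailing newline
// does not produce an extra empty row.
std::vector<std::string> split_lines_verbatim(std::string const& content)
{
  std::vector<std::string> rows;
  std::size_t start = 0;
  while (start < content.size()) {
    std::size_t const end = content.find('\n', start);
    if (end == std::string::npos) {
      // Last line without a terminating newline
      rows.push_back(content.substr(start));
      break;
    }
    rows.push_back(content.substr(start, end - start));
    start = end + 1;
  }
  return rows;
}
```

Comparing the rows returned by a reader against this function's output for the same file gives a direct check of the behavior requested here, whether or not the file ends with a newline.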
Environment overview
- Environment location: [Bare-metal]
- Method of cuDF install: [from source]
Environment details GPU: Titan V Driver: 575.57.08 CUDA: 12.9
Additional context The same problem occurs in spark-rapids, since spark-rapids calls the cuDF API to read CSV. This was reported by one of our customers.
@GaryShen2008 Thank you for opening the issue
Could you please clarify what the content of the test.csv is? I could not find this in the issue.
I put the CSV file content for the 2 cases after the code. You can just copy the content to create "test.csv".
CSV file content: Case 1:
lt_qeury=o5K","last_ts" end
Case 2:
"packageName":"test","type":"test","url_scheme":false,"referer":"",test
Below line will be empty
test
test
Until this line with another quote, "test=test" "test"
This line will be shown
I can't repro this issue. The following test passes:
TEST_F(CsvReaderTest, ReadCsvWithLtQueryAndLastTs)
{
  auto filepath = temp_env->get_temp_dir() + "lt_query_last_ts.csv";
  {
    std::ofstream outfile(filepath, std::ofstream::out);
    outfile << R"(lt_qeury=o5K","last_ts" end)";
  }
  cudf::io::csv_reader_options in_opts =
    cudf::io::csv_reader_options::builder(cudf::io::source_info{filepath})
      .header(-1)
      .names({"c0"})
      .dtypes({dtype<cudf::string_view>()})
      .delimiter('\t');
  auto result = cudf::io::read_csv(in_opts);
  auto const view = result.tbl->view();
  EXPECT_EQ(1, view.num_columns());
  ASSERT_EQ(type_id::STRING, view.column(0).type().id());
  expect_column_data_equal(std::vector<std::string>{R"(lt_qeury=o5K","last_ts" end)"},
                           view.column(0));
}
I checked the hex dump of my file. There is always a 0x0a byte at the end after I save the file. Might that be the problem? Yes, it seems so. In your test code, the bytes look like this:
6c 74 5f 71 65 75 72 79 3d 6f 35 4b 22 2c 22 6c 61 73 74 5f 74 73 22 20 65 6e 64
But in my test, since I wrote the file on Linux, it shows:
6c 74 5f 71 65 75 72 79 3d 6f 35 4b 22 2c 22 6c 61 73 74 5f 74 73 22 20 65 6e 64 0a
But I think we should produce the correct output in both cases, as Pandas does.
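The byte dumps above can be reproduced, and the trailing 0x0a detected, with a few lines of standard C++. This is a minimal sketch; `read_file_bytes`, `to_hex`, and `ends_with_newline` are hypothetical helper names for illustration and not part of cuDF or its test utilities.

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read the whole file as raw bytes (binary mode, so no
// newline translation is applied). Returns an empty string if the file
// cannot be opened.
std::string read_file_bytes(std::string const& path)
{
  std::ifstream in(path, std::ios::binary);
  std::ostringstream buffer;
  buffer << in.rdbuf();
  return buffer.str();
}

// Hypothetical helper: format bytes as space-separated lowercase hex pairs.
std::string to_hex(std::string const& bytes)
{
  std::string out;
  char buf[4];
  for (unsigned char c : bytes) {
    if (!out.empty()) { out += ' '; }
    std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(c));
    out += buf;
  }
  return out;
}

// Hypothetical helper: true if the last byte is '\n' (0x0a).
bool ends_with_newline(std::string const& bytes)
{
  return !bytes.empty() && bytes.back() == '\n';
}
```

Calling `ends_with_newline(read_file_bytes("test.csv"))` distinguishes the two inputs discussed here: the file written by the test without a trailing newline, and the file saved on Linux with one.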