OpenRefine icon indicating copy to clipboard operation
OpenRefine copied to clipboard

Add test case to validate CSV parser handling of quoted fields with commas

Open RAD1AN1TE opened this issue 7 months ago • 3 comments

This PR adds a new test case testCsvParserHandlesCommasCorrectly in ImportingUtilitiesTests to validate that the CSV parser correctly handles fields with commas enclosed in quotes.

Changes Introduced:

  • Simulates CSV input similar to movie datasets used in OpenRefine.
  • Verifies correct parsing of:
    • Quoted fields containing commas (e.g., movie titles).
    • Headers and associated values.
  • Aligns with Apache Commons CSV parsing behavior.

Rationale:

Proper handling of quoted fields with commas is critical for accurate data import, especially in datasets with complex string fields like movie titles, addresses, and product descriptions.

Testing:

  • Verified that the test passes successfully.
  • Confirms correct record count, headers, and field values.

Fixes part of the parsing edge cases but does not address malformed CSV inputs, which can be handled in future improvements if needed.

RAD1AN1TE avatar May 18 '25 22:05 RAD1AN1TE

Does this test anything OpenRefine specific? As I read it, it tests Apache Commons CSV.

Abbe98 avatar May 18 '25 22:05 Abbe98

Does this test anything OpenRefine specific? As I read it, it tests Apache Commons CSV.

While it uses Apache Commons CSV, this test ensures OpenRefine correctly handles quoted fields with commas during imports, preventing misconfigurations or regressions in parsing.

RAD1AN1TE avatar May 18 '25 23:05 RAD1AN1TE

@RAD1AN1TE Welcome to the OpenRefine project!

Thank you for your contribution, but to expand on what @Abbe98 said, this new test doesn't actually test any OpenRefine code, but instead calls the Apache CSV parser directly. Since OpenRefine doesn't use the Apache Commons CSV parser (and never has), this doesn't test functionality which affects users.

Perhaps I shouldn't have introduced the use of this parser in the test code a couple years ago, but it's only used for a quick and dirty record count to make sure that the decompressing of concatenated compressed streams works (for compressors which support concatenated streams).

You can see here that this is the only place in the code base that it's used: https://github.com/search?q=repo%3AOpenRefine%2FOpenRefine%20org.apache.commons.csv&type=code

If you'd like to help improve the CSV importer tests, the place to look is: https://github.com/OpenRefine/OpenRefine/blob/master/main/tests/server/src/com/google/refine/importers/SeparatorBasedImporterTests.java

tfmorris avatar May 19 '25 03:05 tfmorris

Given the feedback above, I think it's time to close this. We can always reopen it later if we want to.

SoryRawyer avatar Jun 30 '25 14:06 SoryRawyer