jackson-dataformat-csv

Two double quotes in a column cause an Unexpected character exception

Open youribonnaffe opened this issue 6 years ago • 7 comments

I have a CSV file with the following content (just a limited extract here):

route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
OCE669711,OCESN,"",""Cars Réguliers ""L 11""  (Nantes - St Gilles Croix de Vie)"",,3,,,

Parsing this CSV content with CsvMapper causes the following error:

com.fasterxml.jackson.core.JsonParseException: Unexpected character ('C' (code 67)): Expected separator ('"' (code 34)) or end-of-line
 at [Source: java.io.StringReader@279ad2e3; line: 2, column: 23]

	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1702)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:558)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:456)
	at com.fasterxml.jackson.dataformat.csv.CsvParser._reportUnexpectedCsvChar(CsvParser.java:1089)
	at com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:838)
	at com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:601)
	at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntry(CsvParser.java:678)
	at com.fasterxml.jackson.dataformat.csv.CsvParser.nextFieldName(CsvParser.java:575)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:505)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:362)
	at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:27)
	at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:277)
	at com.fasterxml.jackson.databind.MappingIterator.readAll(MappingIterator.java:317)
	at com.fasterxml.jackson.databind.MappingIterator.readAll(MappingIterator.java:303)

Here is a unit test to reproduce the issue:

    @Test
    public void doubleQuotes() throws Exception {
        String content =
                "route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color\n" +
                        "OCE669711,OCESN,\"\",\"\"Cars Réguliers \"\"L 11\"\"  (Nantes - St Gilles Croix de Vie)\"\",,3,,,";

        CsvSchema schema = CsvSchema.emptySchema().withHeader();
        MappingIterator<Map<String, String>> it = new CsvMapper().readerFor(Map.class)
                .with(schema)
                .readValues(content);

        assertEquals(1, it.readAll().size());
    }

Is there a way to configure the parser to be more flexible about this usage of quotes? Unfortunately the CSV file is not under my control and I won't be able to change its format.

Parsing this file with OpenCSV worked, but I was hoping to switch to Jackson for better performance.

youribonnaffe, Aug 15 '17 14:08

@youribonnaffe Thank you for reporting this problem. From the code and the example, it seems to me this should just work as is.

Just one question: which version of Jackson are you using? Latest stable versions are 2.9.0 / 2.8.9.

cowtowncoder, Aug 15 '17 17:08

I'm using 2.8.9

youribonnaffe, Aug 15 '17 19:08

Thank you for confirming. That sounds odd as I am pretty sure this functionality has been around and tested for a long time.

cowtowncoder, Aug 15 '17 20:08

Hmmh. Actually, I am not sure this is a bug after all.

The problem is that the first double-quote is taken to mean that the column value is quoted. The second quote is then taken as the end quote because it is NOT doubled: it is followed by 'C', not by another quote. For proper behavior here, there should be 3 double-quotes at the start and at the end of the value (one to open or close the field, plus a doubled pair for the literal quote), which would be interpreted as expected. So it would seem that the code that generated this CSV did not handle this aspect properly, based on my understanding of CSV.
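
To make the behavior above concrete, here is a minimal sketch of how a strict reader might decode one quoted field. It is not Jackson's actual CsvDecoder code; the class and method names are made up for illustration, and it assumes a comma separator and a single line of input.

```java
// Hypothetical illustration only: NOT the jackson-dataformat-csv implementation.
final class QuotedFieldSketch {

    /**
     * Decodes a quoted field whose opening quote is at index 'start'.
     * A doubled quote ("") inside the field becomes one literal quote.
     * A single quote closes the field and must be followed by ',' or end of input.
     */
    static String readQuotedField(String line, int start) {
        if (line.charAt(start) != '"') {
            throw new IllegalArgumentException("field is not quoted");
        }
        StringBuilder value = new StringBuilder();
        int i = start + 1;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c != '"') {
                // Ordinary character inside the quoted field.
                value.append(c);
                i++;
            } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                // Doubled quote: an escaped literal quote.
                value.append('"');
                i += 2;
            } else if (i + 1 == line.length() || line.charAt(i + 1) == ',') {
                // Lone quote followed by a separator or end of input: field is closed.
                return value.toString();
            } else {
                // Lone quote followed by anything else: the case reported in this issue.
                throw new IllegalStateException("Unexpected character '"
                        + line.charAt(i + 1) + "' after closing quote at index " + i);
            }
        }
        throw new IllegalStateException("unterminated quoted field");
    }
}
```

With this sketch, a field starting with two quotes followed by `Cars` is rejected at the `C`, in the same way as the exception in the report, while the same value wrapped in three quotes on each side decodes to a value that begins and ends with a literal quote.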

Having said that, the CSV "specification" is quite loose, as there isn't really an official specification. So I would be interested in finding out whether anything has been said about this behavior. It is possible that I have not considered some corner case.

cowtowncoder, Aug 16 '17 17:08

Ok, reading RFC 4180, I see:

   5.  Each field may or may not be enclosed in double quotes (however
       some programs, such as Microsoft Excel, do not use double quotes
       at all).  If fields are not enclosed with double quotes, then
       double quotes may not appear inside the fields.  For example:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

which I think spells out why the test case is invalid -- the field must be quoted (as it contains double-quotes itself) and each double-quote within it must be doubled.
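
For comparison, this is a small sketch of how a producer could emit a field following rule 7 of the RFC excerpt above. The helper name is hypothetical, and it assumes the intended value really does begin and end with a literal double-quote.

```java
// Hypothetical helper, shown only to illustrate RFC 4180 rule 7.
final class CsvQuotingSketch {

    /** Wraps a raw value in double quotes and doubles every quote inside it. */
    static String quoteField(String raw) {
        return "\"" + raw.replace("\"", "\"\"") + "\"";
    }
}
```

Applied to a value such as `"Cars "L 11" x"` (quotes included), this produces `"""Cars ""L 11"" x"""`, which is the three-quote form described earlier in the thread.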

cowtowncoder, Aug 16 '17 17:08

I agree, the value is probably malformed according to the RFC. Still, do you think there would be interest in supporting such usage if that could be done without breaking the existing implementation?

youribonnaffe, Aug 16 '17 19:08

@youribonnaffe If that could be supported (perhaps via an optional CsvParser.Feature), that could be useful. I have no objections to such support.

cowtowncoder, Aug 16 '17 20:08