jackson-dataformats-text

Catch and report duplicate column names in header line

Open cowtowncoder opened this issue 3 years ago • 2 comments

(note: offshoot of a comment in #285)

It looks like the code does not currently check that the column names in the header line are unique, meaning that one name may occur more than once, in which case the last one is used. This should not be allowed: column names in the header line should not have duplicates.
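The kind of check described above could be sketched as follows. This is a minimal, standalone illustration and not the library's implementation: the class name `HeaderDuplicateCheck` and the method `findDuplicates` are hypothetical, and the naive `split(",")` assumes header names contain no quoted commas.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jackson-dataformat-csv.
class HeaderDuplicateCheck {

    /** Returns the column names that occur more than once, in order of first repeat. */
    public static List<String> findDuplicates(String headerLine) {
        Set<String> seen = new LinkedHashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        // Assumption: simple comma-separated header without quoted commas.
        for (String raw : headerLine.split(",", -1)) {
            String name = raw.trim();
            // Set.add returns false when the name was already present.
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return new ArrayList<>(duplicates);
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates("ID,Name,ID")); // prints [ID]
        System.out.println(findDuplicates("ID,Name"));    // prints []
    }
}
```

A caller could report an error whenever the returned list is non-empty, instead of silently letting the last occurrence win.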

cowtowncoder · Aug 21 '22 03:08

I have a case that may change your mind. This is a real case from work, though I have simplified it.

We receive CSV files from other companies to parse:

ID,Name
1,"Bob"
2,"David"

with a schema like this:

@JsonPropertyOrder({"id", "name"})
class A {
  @JsonProperty("Id")
  @JsonAlias({"Identification", "ID"})
  public int id;

  @JsonProperty("Name")
  @JsonAlias({"Caption", "Title"})
  public String name;
}

This worked fine. Note the aliases: the header names come in several possible variations.

Then one day their schema changed, and a second column named "ID" was added. The CSV files now look like this:

ID,Name,ID
1,"Bob","UA2940"
2,"David","IM3592"

Although there are now two columns named "ID", we can tell the difference. We know exactly what the new column means. Now we want the schema to be like this:

@JsonPropertyOrder({"idNumber", "name", "idString"})
class A2 {
  @JsonProperty("Id")
  @JsonAlias({"Identification", "ID"})
  public int idNumber;

  @JsonProperty("Name")
  @JsonAlias({"Caption", "Title"})
  public String name;

  @JsonProperty("IdString")
  @JsonAlias({"IDString", "ID"})
  public String idString;
}

This causes exceptions: "ID" always maps to idNumber, but the new column is not of type int.

Because two columns share the same name, the order of the headers now matters. We specified the order with @JsonPropertyOrder, hoping it would take effect, but it did not.

The schema of the headers is out of our control; we can only adapt to it. For now, all we can do is preprocess the header in advance and rename the conflicting column names. Is there a better solution for this situation?
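The header preprocessing mentioned above might look roughly like this. It is only a sketch under assumptions: `HeaderDeduplicator` and `dedupe` are hypothetical names, the `_2`, `_3` suffix scheme is an arbitrary choice, and the naive `split(",")` assumes header names contain no quoted commas. It also does not guard against a generated name such as `ID_2` colliding with a column that already has that name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of jackson-dataformat-csv.
class HeaderDeduplicator {

    /** Appends _2, _3, ... to repeated names so every column name becomes unique. */
    public static String dedupe(String headerLine) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> renamed = new ArrayList<>();
        // Assumption: simple comma-separated header without quoted commas.
        for (String raw : headerLine.split(",", -1)) {
            String name = raw.trim();
            // merge returns the updated occurrence count for this name.
            int occurrence = counts.merge(name, 1, Integer::sum);
            renamed.add(occurrence == 1 ? name : name + "_" + occurrence);
        }
        return String.join(",", renamed);
    }

    public static void main(String[] args) {
        System.out.println(dedupe("ID,Name,ID")); // prints ID,Name,ID_2
    }
}
```

With the header rewritten this way before parsing, the second column could be matched by adding the generated name (here "ID_2") to the @JsonAlias list of idString, rather than reusing "ID" for two properties.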

As Jackson CSV is a general-purpose library, I don't think the order of the headers should be ignored entirely. There should be an elegant way to support this.

BTW, we are using jackson-dataformat-csv-2.10.3, which is not the latest version. I'm going to catch up with the newer changes and see whether they help.

workingenius · Sep 15 '22 10:09

Sorry, we ran into this issue because we incorrectly used mapper.schemaWithHeader(). After switching to mapper.schemaFor(xxx), our annotated class worked fine for the case I described.

workingenius · Sep 19 '22 01:09