jackson-dataformats-text icon indicating copy to clipboard operation
jackson-dataformats-text copied to clipboard

FAIL_ON_MISSING_COLUMNS doesn't work with column reordering, header line

Open bjmi opened this issue 4 years ago • 9 comments

When reading given CSV with jackson-dataformat-csv 2.11.4

name
Roger
Chris

using following snippet

CsvMapper csvMapper = new CsvMapper();
csvMapper.configure(CsvParser.Feature.FAIL_ON_MISSING_COLUMNS, true);
CsvSchema csvSchema = CsvSchema.builder().setUseHeader(true).setReorderColumns(true)
        .addColumn("name").addColumn("age").build();
List<Person> persons = csvMapper
        .readerFor(Person.class)
        .with(csvSchema)
        .<Person> readValues(csv)
        .readAll();
...
class Person {
    public String name;
    public int age;
}

doesn't throw a CsvMappingException although age column is missing in CSV.

bjmi avatar Aug 23 '21 09:08 bjmi

The same thing is true of the strict headers feature. I really don't care about the order of the columns, or any additional ones, I just care that the ones I want are present.

This should be a common requirement it is a shame it cannot be enforced

mamos-helloself avatar Nov 24 '21 15:11 mamos-helloself

There is also no way to have an error if a column name is repeated in the header when reordering is in use. It silently chooses one.

mamos-helloself avatar Nov 24 '21 16:11 mamos-helloself

I have hoped to look into this for a while now but for past 3 months my paying job has taken over almost all of my coding time, since I do not get paid to work on Jackson (not part of my daytime job).

So, @mamos-helloself PRs would obviously be welcome.

I may be able to at least add the failing test soon now tho.

cowtowncoder avatar Nov 24 '21 17:11 cowtowncoder

Hmmh. Looking at this now I wonder if the reason is due to "use header" -- header really only has 1 column and if that is to be considered authoritative, there is nothing missing? So perhaps it goes to precedence/definition of how explicit schema should interact with header line...

I'll have to look into this more and will first add a failing test just to reproduce what is seen here.

cowtowncoder avatar Nov 24 '21 19:11 cowtowncoder

Hmmh. Looking at this now I wonder if the reason is due to "use header" -- header really only has 1 column and if that is to be considered authoritative, there is nothing missing?

The one column example was probably to simple. What about this two column CSV:

name,age
Roger,27
Chris,53
CsvMapper csvMapper = new CsvMapper();
csvMapper.configure(CsvParser.Feature.FAIL_ON_MISSING_COLUMNS, true);
CsvSchema csvSchema = CsvSchema.builder().setUseHeader(true).setReorderColumns(true)
        .addColumn("name").addColumn("weight").addColumn("age").build();
List<Person> persons = csvMapper
        .readerFor(Person.class)
        .with(csvSchema)
        .<Person> readValues(csv)
        .readAll();
...
class Person {
    public String name;
    public int weight;
    public int age;
}

doesn't throw a CsvMappingException although weight column is missing in CSV.

bjmi avatar Nov 25 '21 07:11 bjmi

Sorry I misunderstood your last comment, but the example is valid.

bjmi avatar Nov 25 '21 09:11 bjmi

My daily use case has 3 requirements

  • Column order shouln't matter - mapping is done by column headers
  • Fail when columns are missing
  • Ignore superfluous columns i.e. column headers that aren't defined previously - #286

Is there a real chance that this issue will be fixed?

bjmi avatar Jan 07 '22 08:01 bjmi

Yes, if and when someone has time to look into it. I have been hoping to address it but I am hopelessly swamped at work and helping others with PRs with Jackson, releases all that stuff. But I do hope to help here in some form.

It is likely that for this to move in near future, a PR would be needed.

On plus side, as per my earlier comment, a failing test is now included in repo (has to be run via IDE, excluded from default test run).

cowtowncoder avatar Jan 08 '22 22:01 cowtowncoder

Hmmm. I suspect that the issue is with the actual header line only having one column, and thereby "extra" columns of CsvSchema being discarded. I think it is reasonable to expect either that:

  1. If header line has fewer columns than schema, exception would be thrown OR
  2. If header line has fewer columns the "missing" columns from CsvSchema would simply be appended in the order they were given

I think I'll proceed with (2), although I wonder if some configurability might be needed anyway, to alter behavior.

And yes, I am finally circling back here, hoping to resolve this for 2.14(.0).

EDIT: re-reading full discussion, it does sound like (1) is what is wanted actually. So I will try doing that instead; this would also make the test I added pass.

cowtowncoder avatar Aug 08 '22 04:08 cowtowncoder

Ok: changed the behavior to optionally throw an exception if column headers are missing (compared to registered CsvSchema), controller by new

CsvParser.Feature.FAIL_ON_MISSING_HEADER_COLUMNS

which defaults to true (that is, by default this problem is reported).

Will be in 2.14.0 release

cowtowncoder avatar Aug 21 '22 23:08 cowtowncoder

Perhaps I'm misunderstanding something, but I'm not seeing anywhere in which the new FAIL_ON_MISSING_HEADER_COLUMNS feature is actually checked to decide whether this new behavior should occur?

norrisjeremy avatar Nov 10 '22 13:11 norrisjeremy

I.e., should there be something like this?

diff --git a/csv/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvParser.java b/csv/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvParser.java
index b7b050e4..c3621ae9 100644
--- a/csv/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvParser.java
+++ b/csv/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvParser.java
@@ -862,14 +862,16 @@ public class CsvParser
             }
         }
         // [dataformats-text#285]: Are we missing something?
-        int diff = schemaColumnCount - newColumnCount;
-        if (diff > 0) {
-            Set<String> oldColumnNames = new LinkedHashSet<>();
-            _schema.getColumnNames(oldColumnNames);
-            oldColumnNames.removeAll(newSchema.getColumnNames());
-            _reportCsvMappingError(String.format("Missing %d header column%s: [\"%s\"]",
-                    diff, (diff == 1) ? "" : "s",
-                            String.join("\",\"", oldColumnNames)));
+        if (CsvParser.Feature.FAIL_ON_MISSING_HEADER_COLUMNS.enabledIn(_formatFeatures)) {
+            int diff = schemaColumnCount - newColumnCount;
+            if (diff > 0) {
+                Set<String> oldColumnNames = new LinkedHashSet<>();
+                _schema.getColumnNames(oldColumnNames);
+                oldColumnNames.removeAll(newSchema.getColumnNames());
+                _reportCsvMappingError(String.format("Missing %d header column%s: [\"%s\"]",
+                        diff, (diff == 1) ? "" : "s",
+                                String.join("\",\"", oldColumnNames)));
+            }
         }
 
         // otherwise we will use what we got

norrisjeremy avatar Nov 10 '22 13:11 norrisjeremy