simple-excel
simple-excel copied to clipboard
Just re-saving XLSX file without making changes causes cell count mismatch
Sorry for this support-type of question. I have saved an XLSX file as the expected result and the sameWorkbook() assertion passes whem comparing against a programmatically created workbook. However, if I just open and re-safe the XLSX file (without making any changes) with Excel on Windows, I get
java.lang.AssertionError:
Expected: entire workbook to be equal
but: got <2> cell(s) on row <1> expected <6> sheet "Summary",
got <2> cell(s) on row <2> expected <6> sheet "Summary",
got <2> cell(s) on row <5> expected <6> sheet "Summary",
got <2> cell(s) on row <6> expected <6> sheet "Summary",
got <2> cell(s) on row <7> expected <6> sheet "Summary",
got <2> cell(s) on row <8> expected <6> sheet "Summary"
So somehow the number of (non-physical) cells seems to have changed. Any idea why, and how to deal with it? I wanted to update the XLSX file with new test expectations, but now I cannot due to the above issue which always make the test fail.
Looks like is has something to do with merged cells in my case: At the example of row 1, there is cell A1, and cells B1 to F1 are merged into one "visual cell".
Indeed, in the original file generated via Apache POI 3.17, xl/worksheets/sheet1.xml has
<row r="1">
<c r="A1" s="2" t="s">
<v>0</v>
</c>
<c r="B1" s="2" t="s">
<v>1</v>
</c>
</row>
But after resaving with Excel it has
<row r="1" spans="1:6">
<c r="A1" s="2" t="s">
<v>0</v>
</c>
<c r="B1" s="5" t="s">
<v>1</v>
</c>
<c r="C1" s="6"/>
<c r="D1" s="6"/>
<c r="E1" s="6"/>
<c r="F1" s="6"/>
</row>
So it looks like Apache POI is not writing out cell spans correctly 😢 Edit: Or phrased more correctly, Apache POI does not seem to make getLastCellNum() return the correct number in case of this new (?) format of defining cell spans.
Awesome spot. Thanks for posting and apologies for not looking sooner.
So I'm clear, is this an issue with POI over simple-excel? Anything you think we should do in simple-excel now? Think we can get a (java) test in simple-excel to at least demonstrate the issue? I'm never confident in upgrading POI dependency but that might be worth considering?
I believe by now this is an issue in POI rather than in simple-excel, yes. The question is, can simple-excel do anything to work around it? Something like "if there is a cell count mismatch and merged cells are involved do X to double-check", but me not knowing the POI API very well I have not idea what that X could be.
Anyway, it would certainly be interesting to see whether POI 4.0.1 would fix the issue.
From memory, POI was limited with merged cells. I may be wrong though. I'll try to take a look but a JUnit test demonstrating it would be super helpful (wink wink)
I'll try to come up with a test, @tobyweston. As a small preparation for that, see https://github.com/tobyweston/simple-excel/pull/15 for some minor fixes.
Thanks. Perhaps the first thing to look at is a general dependency update in the pom.xml. We're only poi 3.17 from memory and it looks like 4.1.0 is the latest and greatest (https://poi.apache.org/download.html). Not sure what else would need looking at.
I'd also like to split the matcher stuff from the general poi wrapper. If no one finds my DSL/wrapper around poi useful, I'd be happy to remove it and leave this as a Hamcrest Matcher library only... any thoughts?
Actually, my plan was to first try to come up with a test that reproduces my issue and fails initially. And then see what needs to be done to fix the issue; maybe upgrading to POI 4.1.0 would fix it. However, I'm currently running into some strange issue where a test passes that should actually fail. I'll probably create a draft PR to demonstrate the issue and ask for your help there.
Sorry for the delay, @tobyweston. I've now created PR #16 to demonstrate how a test about merged cells succeeds that I'd expect to fail. Could you please have a look at the PR and comment on it?
See my comment in https://github.com/tobyweston/simple-excel/pull/16#issuecomment-494944941 for a potential fix
Tracking on a branch merged_regions (which has @sschuberth PR merged in #16).
Hmmm, rereading your comments. I've been focusing on including merged regions in the sheet comparisons and realise this probably doesn't address the original issue. I'll carry on (simple-excel will now fail when comparing sheets when two sheets have differing merged regions, regardless of content within those regions).
Let me know if that helps / hinders you.
Have you tried raising an issue with Apache POI?
I'm sorry if my comments were setting you on the wrong track, but differences in the way merged regions are specified really were the only difference I could obverse.
No, I haven't raised the issue upstream with Apache POI, mainly because I wasn't sure enough whether it really is an upstream issue, and I did come up with an MWE yet.
No worries, I’ll have a closer look and see what I can do.
Does this fix help you though?
Does this fix help you though?
Unfortunately not. I just tried simple-excel 1.2 with Apache POI 4.1.0 / 4.1.1, but I still get
java.lang.AssertionError:
Expected: entire workbook to be equal
but: got <2> cell(s) on row <1> expected <6> sheet "Summary",
got <2> cell(s) on row <2> expected <6> sheet "Summary",
got <2> cell(s) on row <5> expected <6> sheet "Summary",
got <2> cell(s) on row <6> expected <6> sheet "Summary",
got <2> cell(s) on row <7> expected <6> sheet "Summary",
got <2> cell(s) on row <8> expected <6> sheet "Summary"
for Excel sheets that are actually the same, but where one has been re-saved with Excel.
If you want to reproduce this, you can checkout the excel-poi branch of https://github.com/heremaps/oss-review-toolkit and then
cp reporter/src/funTest/assets/file-counter-expected-scan-report.actual.xlsx reporter/src/funTest/assets/file-counter-expected-scan-report.xlsx./gradlew :reporter:funTest --tests com.here.ort.reporter.reporters.ExcelReporterTest
See how the test succeeds. Then open file-counter-expected-scan-report.xlsx in Excel an re-safe it (e.g. by making a dummy change like removing a character and adding it again). Then again run ./gradlew :reporter:funTest --tests com.here.ort.reporter.reporters.ExcelReporterTest and see how the test fails.
Thanks,
just tried simple-excel 1.2 with Apache POI 4.1.0 / 4.1.1
The tweaks to merged cells is on 1.3 which isn’t released yet. It currently resides in the merged_refions branch. At least we know POI 4.x doesn’t help.
I’ll look in more detail at your case and see if I can help before I actually release 1.3. If the merged regions are an authogonal issue, I may not include that “fix” (as I’m not sure how I feel about it).
The mismatch occurs in CellNumberMatcher.matchesSafely() for the expected row being
<xml-fragment r="1" spans="1:6" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" xmlns:main="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<main:c r="A1" s="2" t="s">
<main:v>0</main:v>
</main:c>
<main:c r="B1" s="5" t="s">
<main:v>1</main:v>
</main:c>
<main:c r="C1" s="6"/>
<main:c r="D1" s="6"/>
<main:c r="E1" s="6"/>
<main:c r="F1" s="6"/>
</xml-fragment>
(for the above 6 cells are reported despite the spans="1:6" attribute) and the actual row being
<xml-fragment r="1" xmlns:main="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<main:c r="A1" t="s" s="2">
<main:v>0</main:v>
</main:c>
<main:c r="B1" t="s" s="2">
<main:v>1</main:v>
</main:c>
</xml-fragment>
(for the above 2 cells are reported). Which is essentially the same that I had found out earlier by looking at the unzipped contents of the xlsx files.
It currently resides in the merged_refions branch.
Ah, right, sorry, will try that branch later.
So I've tried your merged_regions branch and the output is already much more telling:
com.here.ort.reporter.reporters.ExcelReporterTest > com.here.ort.reporter.reporters.ExcelReporterTest > executionError FAILED
java.lang.AssertionError:
Expected: entire workbook to be equal
but: got "B1:F1, B2:F2, B5:F5, B6:F6, B7:F7, B8:F8" as a merged region(s) in sheet "Summary" expected "B8:F8, B1:F1, B2:F2, B5:F5, B6:F6, B7:F7"
Note that the merged region actually are the same, they are just not reported in the same order! That is, in the expected output for some reason B8:F8 does not go last. If you could sort the merged regions before comparison, I believe we would be set 😄
Cool easily done!
I was also thinking that if we need some custom behaviour, like a workaround to some of the matching, we could enable it with a flag so it doesn’t affect anyone unless they opt in...
FYI, I was finally able to solve this by writing my own code which "de-duplicates" cells that are actually part of merged cells before comparing cell contents, see
https://github.com/oss-review-toolkit/ort/blob/40ca07dec80d9b2c49e220975906d6953cd9256a/reporter/src/funTest/kotlin/reporters/ExcelReporterFunTest.kt#L105-L110
Awesome but I feel bad we didn't resolve this in simple-excel. We were close no? I've re-read the trail but am fuzzy about the details. If I ever get some time (unlikely atm) I can remind myself and try and merge the merged_regions branch.
TBH, in retrospect I don't believe the merged_regions branch was on the right track. The solution to the problem was not to compare the merged regions themselves (as these are identical in both original and re-saved workbooks), but to consider the merged regions when iterating over the cells.
Here's an example: Let's assume there's a merged region at B1:D1. In the original workbook generated by Apache POI itself, the cell iterator for the first row would return A1, B1, C1, D1, E1 etc. I.e. it returns the individual cells of a merged region. When re-saving the workbook created by Apache POI in Excel, without doing any modifications, then the cell iterator suddenly returns A1, B1, E1 etc. So cells of a merged region other than the first cell are skipped.
That's why I'm changing the cell iterator to always skip cells of a merged region other than the first cell of a merged region.
In any case, I'd consider this to be a bug in Apache POI.