ObjectExtractorStreamEngine reads beyond the end of stroke paths
ObjectExtractorStreamEngine.java contains this code
while (!pathIterator.isDone()) {
pathIterator.next();
// This can be the last segment, when pathIterator.isDone, but we need to
// process it otherwise us-017.pdf fails the last value.
try {
currentSegment = pathIterator.currentSegment(coordinates);
} catch (IndexOutOfBoundsException ex) {
continue;
}
I was looking at doing a little refactoring to that code[^1] and got curious about what was happening there. Changing the code to not read past pathIterator.isDone() does in fact cause the unit test TestSpreadsheetExtractor.testNaturalOrderOfRectanglesDoesNotBreakContract (which reads us-017.pdf) to fail. The failure suggests it is not reading the bottom right table cell.
I added some System.err.println debugging at the top of that while loop to see what it is doing. The debugging for the first line segment in us-017.pdf looks like:
segment: SEG_LINETO; done? false startPoint=Point2D.Float[91.75, 694.92] endPoint=null last_move=Point2D.Float[91.75, 694.92] coordinates=Point2D.Float[323.17, 694.92]
segment: SEG_LINETO; done? true startPoint=Point2D.Float[323.17, 694.92] endPoint=Point2D.Float[323.17, 694.92] last_move=Point2D.Float[91.75, 694.92] coordinates=Point2D.Float[323.17, 462.2]
From loading up us-017.pdf in the pdfbox debugger, if I am not mistaken that comes from this chunk of pdf:
q
1.1040955 0 0 -1 91.74725 352 cm
0.153 0.145 0.145 RG
1 w
10 M
0 j
0 J
[ ] 0 d
/GS2 gs
0 254.924 m
209.606 254.924 l
S
Q
There is in fact only 1 line to segment in that pdf content (the l operator at the bottom). Of the 73 line segments extracted from that page, for every single one of them the coordinates pulled from pathIterator.currentSegment(_) when pathIterator.isDone() == true are that same point: Point2D.Float[323.17, 462.2]
Where does that coordinate come from? I think it's the second corner in the rectangles drawn (filled) as the background for the header row in the table on that page (page 2 of us-017.pdf).
So...it looks to me like that loop is reading coordinates that it shouldn't (conceptually: it's reading uninitialized memory), and thus also generating lines that don't actually exist in the PDF. However, the comment in the code is correct: fixing the loop apparently breaks the extraction for the us-017.pdf file. I think that means there is a bug elsewhere in the extraction machinery that the extra loop iteration is hiding.
[^1]: I have a pdf where it seems table extraction would be improved by tracking shaded regions, so I was looking to add support for filled rectangles.
It looks like the change to Utils.java proposed in #242 fixes the problem with us-017.pdf that over-reading the path iterator was intended to fix.