tabula-java: when cell content exceeds cell boundaries, the next cell gets messed up (examples)

When the content of 2 of the cells in the PDF continues beyond the cell's boundary, the next cell's content goes "crazy" (i.e. is totally different from what's expected).

In the attached sample:

I assume the PDF source is Excel, where it's common to see long text cut off at the border of the cell. I don't know for sure.

PDF file (download)
TSV output file (The document is RTL, i.e. right to left; therefore, the 2nd cell is the 2nd from the right)

Command line used: java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv

The bogus lines are the ones that start with: 1068, 1103
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103

In the output, I see 2 phenomena:

the wrong text "A10L YCPCT" should've been: "10 CC"
the wrong text "E2U9" should've been: "29". etc.
the word "EUCALYPTUS" is cut off in these lines. This makes sense, since the cut-off part isn't visible in the PDF, and is therefore not a real bug.

In the attached sample.pdf, the 3rd field of the converted text file should've been the text "10 CC".

My setup:

Windows 10
java version "1.8.0_401"
tabula 1.0.5

Feb 20 '24 17:02 shula

Hi @shula, unfortunately this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example AL YPT in line 1068) are present, but under the text for the next cell to the left. So Tabula is correctly extracting the characters.

Feb 20 '24 18:02 jeremybmerrill

I'm running into this problem too, attempting to "free" data locked inside some legal filings.

@jeremybmerrill Can you comment as to whether the data still exists in uncorrupted form inside the PDF, and where in the code I might attempt to patch the extraction logic? Or barring that, suppress the overflow text so it doesn't corrupt its neighbors? I'm experienced in Java, but not PDF or this particular codebase. Thanks.

Dec 14 '24 15:12 apgove

@apgove It's difficult to generalize about PDFs; they're a bit like Tolstoy's idea of families. Happy PDFs are all alike, but unhappy PDFs are unhappy in their own ways. Except it's you that's unhappy, not the PDF.

Unfortunately, there's likely not a generalizable way to "fix" this problem. There's no sense of a table; it's just a bunch of lines that happen to be arranged in a way that humans perceive as a table. Those lines comprising the table are stored totally separately from the text. So the text doesn't know what cell it's in; the cells don't know what text they are supposed to contain. From the PDF's perspective, the only sense in which the text is "in" a cell is whether it is contained within the bounding box created by the lines, and, thus, if text appears to overflow into the "next" cell, it IS in the next cell.
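
To make that concrete, here's a rough sketch of the "whichever box contains it" idea. This is not tabula-java's actual code: the Cell and Glyph classes, the names and the coordinates are all made up for illustration, and it only looks at a single row (x positions only).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CellAssignmentSketch {

    // Hypothetical stand-in for a ruled cell: just the box formed by the lines.
    static final class Cell {
        final String name;
        final double left, right;
        Cell(String name, double left, double right) {
            this.name = name; this.left = left; this.right = right;
        }
        boolean contains(double x) { return x >= left && x < right; }
    }

    // Hypothetical stand-in for a positioned character (not a PDFBox type).
    static final class Glyph {
        final String text;
        final double centerX;
        Glyph(String text, double centerX) { this.text = text; this.centerX = centerX; }
    }

    // Each glyph goes to the cell whose box contains it; nothing else links text to cells.
    static Map<String, String> assign(List<Cell> cells, List<Glyph> glyphs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Cell c : cells) out.put(c.name, "");
        for (Glyph g : glyphs) {
            for (Cell c : cells) {
                if (c.contains(g.centerX)) {
                    out.put(c.name, out.get(c.name) + g.text);
                    break;
                }
            }
        }
        return out;
    }

    static Map<String, String> demo() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("name", 0, 50));
        cells.add(new Cell("qty", 50, 100));

        List<Glyph> glyphs = new ArrayList<>();
        // "ABCDEFG" starts in "name" but is too long: F and G land past the border at x=50.
        String longText = "ABCDEFG";
        for (int i = 0; i < longText.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(longText.charAt(i)), 5 + 10 * i));
        }
        // The real content of "qty".
        glyphs.add(new Glyph("1", 55));
        glyphs.add(new Glyph("0", 65));
        return assign(cells, glyphs);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {name=ABCDE, qty=FG10}
    }
}
```

The overflowing "FG" ends up in "qty" because, geometrically, that is where it is. A real extractor would also sort by position within the cell, which is how the characters get interleaved rather than just appended.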

You could perhaps do something funny with detecting overlapping text and then using language modeling to predict which overlapping character belongs in the "previous" cell, but that would be nondeterministic and probably wouldn't work! In certain cases, you could be lucky enough that you could heuristically determine based on (for instance) text size or Y-position which overlapping characters belong in which cell for your specific document, but that wouldn't generalize.
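
As an illustration of the Y-position idea only, here's a minimal sketch that splits the glyphs found in one cell into groups by baseline. It assumes (and this is a big assumption that may not hold for your PDFs) that the overflowing text sits on a slightly different baseline than the cell's own text. The Glyph class and the tolerance value are hypothetical, not part of tabula-java or PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BaselineSplitSketch {

    // Hypothetical stand-in for a positioned character.
    static final class Glyph {
        final String text;
        final double baselineY;
        Glyph(String text, double baselineY) { this.text = text; this.baselineY = baselineY; }
    }

    // Group glyphs whose baselines are within `tolerance` of the previous one.
    static List<String> splitByBaseline(List<Glyph> glyphs, double tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // List.sort is stable, so reading order is preserved within a baseline.
        sorted.sort(Comparator.comparingDouble((Glyph glyph) -> glyph.baselineY));

        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double lastY = Double.NaN;
        for (Glyph g : sorted) {
            if (current.length() > 0 && g.baselineY - lastY > tolerance) {
                groups.add(current.toString());
                current = new StringBuilder();
            }
            current.append(g.text);
            lastY = g.baselineY;
        }
        if (current.length() > 0) groups.add(current.toString());
        return groups;
    }

    static List<String> demo() {
        // Interleaved the way they might come out of a cell: overflow "FG" mixed with "10".
        List<Glyph> inCell = new ArrayList<>();
        inCell.add(new Glyph("F", 5.8));
        inCell.add(new Glyph("1", 5.0));
        inCell.add(new Glyph("G", 5.8));
        inCell.add(new Glyph("0", 5.0));
        return splitByBaseline(inCell, 0.5);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [10, FG]
    }
}
```

You'd still have to decide which group belongs to the cell, and that decision is exactly the part that's specific to your documents. If the overflow and the cell's own text share a baseline and a font size, this tells you nothing.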

Dec 15 '24 01:12 jeremybmerrill

Thanks for the explanation @jeremybmerrill, it's much appreciated to avoid wasting [even more] time on a wild goose chase. I'm guessing, given that the overlapping characters get intermingled in the output, that each glyph is represented independently, not as strings, making this quite challenging.

Just spitballing (please forgive the uninformed speculation, I've been on the receiving end and it can be really annoying!), but there must be some sort of z-order information stored somewhere in order for the overrunning characters to get hidden; I wonder whether that could be used as a signal that the hidden characters belong in the previous cell, not the one they're contained by. Of course, that would still fail if it overran more than 1 column, but it could be a big improvement. Or... Maybe if we have enough information to know that a character is being suppressed due to overlap, it can just be skipped over and not added to the output for any table cell. Information would be lost, but it would at least avoid corruption and accurately reflect the visual representation of the original PDF.

Anyways, maybe I'll dive down this rabbit hole later, but for now I'll pursue other leads.

Dec 15 '24 02:12 apgove

@apgove Feel free to give it a try! I don't think there's likely to be a free lunch in the general case, but it's possible that there may be heuristics for the PDFs in your use case that would resolve your problem.

TextPosition is the representation of a character (or a string of a few characters) in the underlying PDFBox library. Some PDFs have one TextPosition per glyph; others have a few characters in a single TextPosition. Because PDFs are primarily a display format, not a data transfer format, different PDF generators do different things (and I don't think standards, at least not early ones, defined a right way to do it).

There's no explicit Z-index, although there's an underlying ordering that might correspond to what you're looking for, for your specific documents. Only way to know is to try.
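
If you want to experiment with that ordering, here's a rough sketch of one possible heuristic: walk the glyphs of a cell in the order they were drawn, and when a later glyph overlaps an earlier one, treat the earlier one as hidden and drop it. The "later-drawn glyph wins" rule is purely an assumption and may be wrong for your files. The Glyph class is hypothetical (it's not TextPosition), and the final sort is left-to-right for simplicity, so an RTL document would need the opposite.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DrawOrderSketch {

    // Hypothetical stand-in for a positioned character with a horizontal extent.
    static final class Glyph {
        final String text;
        final double x, width;
        Glyph(String text, double x, double width) {
            this.text = text; this.x = x; this.width = width;
        }
        boolean overlaps(Glyph other) {
            return x < other.x + other.width && other.x < x + width;
        }
    }

    // Input must be in drawing order. ASSUMPTION: a later glyph hides any earlier one it overlaps.
    static String keepTopmost(List<Glyph> glyphsInDrawOrder) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphsInDrawOrder) {
            kept.removeIf(earlier -> earlier.overlaps(g));
            kept.add(g);
        }
        // Put the survivors back in left-to-right order for output.
        kept.sort(Comparator.comparingDouble((Glyph glyph) -> glyph.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : kept) sb.append(g.text);
        return sb.toString();
    }

    static String demo() {
        List<Glyph> drawn = new ArrayList<>();
        // Overflow from the neighbouring cell, drawn first.
        drawn.add(new Glyph("F", 50, 10));
        drawn.add(new Glyph("G", 60, 10));
        // The cell's own text, drawn afterwards on top of it.
        drawn.add(new Glyph("1", 52, 6));
        drawn.add(new Glyph("0", 58, 6));
        return keepTopmost(drawn);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 10
    }
}
```

Note that an overflow glyph that doesn't overlap any of the cell's own glyphs survives this filter, so at best it reduces the corruption rather than removing it.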

I should add this caveat: Unless you've got thousands of PDFs or need to build a pipeline that ingests PDFs into the future, you might spend more time investigating this than it would take to fix the misbehaving cells manually!

Dec 16 '24 18:12 jeremybmerrill