Two horizontally adjacent cells are merged into one
First of all, I apologize for not being able to provide the PDF file.
- PDF content

- Parse result
{'top': 270.97, 'left': 107.18, 'width': 193.15365600585938, 'height': 11.1899995803833, 'text': '1 营业收入'}
{'top': 297.22, 'left': 107.18, 'width': 193.12179565429688, 'height': 11.1899995803833, 'text': '2 营业成本'}
....
{'top': 717.22, 'left': 101.55, 'width': 969.3975219726562, 'height': 11.1899995803833, 'text': '17 其中:总机构分摊应补(退)所得税额(15×总机构分摊比例)%'}
- I read the Java source code and made some modifications, and all tests passed. The parse result after my change:
{'top': 270.97, 'left': 107.18, 'width': 16.875, 'height': 11.1899995803833, 'text': '1'}, {'top': 270.97, 'left': 205.35, 'width': 94.983642578125, 'height': 11.1899995803833, 'text': '营业收入'}
{'top': 297.22, 'left': 107.18, 'width': 16.312896728515625, 'height': 11.1899995803833, 'text': '2'}, {'top': 297.22, 'left': 205.35, 'width': 94.9517822265625, 'height': 11.1899995803833, 'text': '营业成本'}
- Here is the location I modified, followed by the modified code: https://github.com/tabulapdf/tabula-java/blob/eec86f517e9782c7534283cfde9cf0e4bad4bf1f/src/main/java/technology/tabula/extractors/BasicExtractionAlgorithm.java#L81
// i (the row index), elements and columns come from the enclosing code.
int count = 0;                                  // number of chunks placed so far in this row
Integer[] allJ = new Integer[elements.size()];  // column index chosen for each placed chunk
TextChunk previous = null;
for (TextChunk tc : elements) {
    if (tc.isSameChar(Line.WHITE_SPACE_CHARS)) {
        continue;
    }
    int j = 0;
    if (count == 0) {
        previous = elements.get(count);
    } else {
        previous = elements.get(count - 1);
    }
    boolean found = false;
    for (; j < columns.size(); j++) {
        // To decide whether tc belongs in the same cell as the previous chunk,
        // compare it with that chunk: is the horizontal gap below a threshold,
        // and did both land in the same column?
        // Ideally we would check whether a vertical line separates them.
        if (tc.getLeft() <= columns.get(j)) {
            if (!tc.equals(previous) && Math.abs(previous.getRight() - tc.getLeft()) > 12
                    && !tc.getText().equals("") && !previous.getText().equals("")) {
                // Wide gap but same column as the previous chunk: move to the next column.
                if (j < (columns.size() - 1) && Math.abs(j - allJ[count == 0 ? 0 : count - 1]) == 0) {
                    j += 1;
                }
            }
            found = true;
            break;
        }
    }
    table.add(tc, i, found ? j : columns.size());
    allJ[count] = j;
    count += 1;
}
- I am a beginner, so this code is only for reference. I look forward to your reply.
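- For discussion, the gap rule from the modified loop above can be isolated into a small standalone sketch. This is not tabula-java code: `Chunk` is a hypothetical stand-in for `TextChunk`, `assignColumns` and `minGap` are names made up for illustration, and the column edges `300.33` and `600.0` are assumed values, not taken from the real PDF. Unlike the loop above, it remembers the last placed chunk directly instead of looking it up by index. The threshold of 12 is the same hard-coded value used above.

```java
import java.util.ArrayList;
import java.util.List;

public class GapColumnSketch {

    /** Hypothetical stand-in for tabula's TextChunk: only the fields this sketch needs. */
    static final class Chunk {
        final double left;
        final double width;
        final String text;

        Chunk(double left, double width, String text) {
            this.left = left;
            this.width = width;
            this.text = text;
        }

        double right() {
            return left + width;
        }
    }

    /**
     * Returns one column index per non-blank chunk of a single row.
     * columns holds the right edge of each detected column, in ascending order.
     * minGap is an assumed threshold: a wider gap to the previous chunk is
     * treated as a cell boundary.
     */
    static List<Integer> assignColumns(List<Chunk> row, List<Double> columns, double minGap) {
        List<Integer> result = new ArrayList<>();
        Chunk previous = null;    // last non-blank chunk that was placed
        int previousColumn = -1;  // column index given to that chunk
        for (Chunk chunk : row) {
            if (chunk.text.trim().isEmpty()) {
                continue;         // blank chunks are skipped, as in the loop above
            }
            // First column whose right edge is at or beyond the chunk's left edge.
            int j = 0;
            while (j < columns.size() && chunk.left > columns.get(j)) {
                j++;
            }
            boolean wideGap = previous != null
                    && Math.abs(chunk.left - previous.right()) > minGap;
            // Wide gap but same column as the previous chunk: move to the next column.
            if (wideGap && j == previousColumn && j < columns.size() - 1) {
                j++;
            }
            result.add(j);
            previous = chunk;
            previousColumn = j;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> row = new ArrayList<>();
        row.add(new Chunk(107.18, 16.875, "1"));
        row.add(new Chunk(205.35, 94.983642578125, "营业收入"));
        List<Double> columns = new ArrayList<>();
        columns.add(300.33);  // assumed column edge
        columns.add(600.0);   // assumed column edge
        System.out.println(assignColumns(row, columns, 12.0)); // prints [0, 1]
    }
}
```

With the two chunks from the first row, the gap between the right edge of `1` (124.055) and the left edge of `营业收入` (205.35) is about 81, so the second chunk is pushed into the next column. A fixed threshold is fragile across font sizes; checking for a vertical ruling line between the two chunks would probably be more reliable.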