
Two horizontally adjacent cells are merged into one

Open Single430 opened this issue 5 years ago • 0 comments

First of all, I apologize for not being able to provide the PDF file.

  • pdf content

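  • parse result (the row number and its label come back as one merged cell)
    {'top': 270.97, 'left': 107.18, 'width': 193.15365600585938, 'height': 11.1899995803833, 'text': '1 营业收入'}
    {'top': 297.22, 'left': 107.18, 'width': 193.12179565429688, 'height': 11.1899995803833, 'text': '2 营业成本'}
    ....
    {'top': 717.22, 'left': 101.55, 'width': 969.3975219726562, 'height': 11.1899995803833, 'text': '17 其中:总机构分摊应补(退)所得税额(15×总机构分摊比例)%'}
    (营业收入 = operating revenue; 营业成本 = operating cost; row 17 roughly reads "of which: head-office apportioned income tax payable (refundable) (15 × head-office apportionment ratio) %".)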

  • I read the Java source code and made some modifications; all tests still pass. With my change, the number and the label are extracted as separate cells:
    {'top': 270.97, 'left': 107.18, 'width': 16.875, 'height': 11.1899995803833, 'text': '1'}, {'top': 270.97, 'left': 205.35, 'width': 94.983642578125, 'height': 11.1899995803833, 'text': '营业收入'}
    {'top': 297.22, 'left': 107.18, 'width': 16.312896728515625, 'height': 11.1899995803833, 'text': '2'}, {'top': 297.22, 'left': 205.35, 'width': 94.9517822265625, 'height': 11.1899995803833, 'text': '营业成本'}

  • Here is the location I modified, followed by the modified code: https://github.com/tabulapdf/tabula-java/blob/eec86f517e9782c7534283cfde9cf0e4bad4bf1f/src/main/java/technology/tabula/extractors/BasicExtractionAlgorithm.java#L81

// Replaces the per-line chunk loop in BasicExtractionAlgorithm; `elements`,
// `columns`, `table` and the row index `i` come from the surrounding method.
TextChunk previous = null;  // last non-whitespace chunk placed on this line
int previousColumn = -1;    // column index that chunk was assigned to
for (TextChunk tc : elements) {
    if (tc.isSameChar(Line.WHITE_SPACE_CHARS)) {
        continue;
    }

    int j = 0;
    boolean found = false;
    for (; j < columns.size(); j++) {
        if (tc.getLeft() <= columns.get(j)) {
            // To decide whether this chunk belongs to the same cell as the
            // previous one, compare it with the previous chunk: here, whether
            // the horizontal gap exceeds a threshold (12, hard-coded) and
            // whether both would land in the same column. Ideally we would
            // check whether a vertical ruling line separates them.
            boolean wideGap = previous != null
                    && Math.abs(previous.getRight() - tc.getLeft()) > 12
                    && !tc.getText().isEmpty() && !previous.getText().isEmpty();
            if (wideGap && j == previousColumn && j < columns.size() - 1) {
                j += 1;  // push the chunk into the next column
            }
            found = true;
            break;
        }
    }
    table.add(tc, i, found ? j : columns.size());
    // Track the chunk actually placed, so skipped whitespace chunks cannot
    // shift the "previous" lookup or leave it unset.
    previous = tc;
    previousColumn = j;
}
  • I am a beginner, so this code is only meant as a reference. I look forward to your reply.
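For anyone who wants to try the idea without building tabula-java, here is a minimal self-contained sketch of the same gap heuristic. It is not tabula-java's API: the `Chunk` class, the `assignColumns` method and the column boundaries `300.33` and `600.0` are hypothetical stand-ins chosen for illustration. Only the chunk coordinates and the threshold of 12 come from the issue above. It also checks that the merged cell's width is just the span from the left edge of "1" to the right edge of the label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GapColumnSketch {
    /** Hypothetical stand-in for a text chunk: only what the heuristic needs. */
    public static final class Chunk {
        final String text;
        final double left;
        final double width;

        public Chunk(String text, double left, double width) {
            this.text = text;
            this.left = left;
            this.width = width;
        }

        double right() {
            return left + width;
        }
    }

    // Gap threshold taken from the code in the issue; the right value is an open question.
    static final double MIN_GAP = 12.0;

    /** Assigns each chunk (sorted left to right) a column index. */
    public static List<Integer> assignColumns(List<Chunk> chunks, List<Double> columns) {
        List<Integer> result = new ArrayList<>();
        Chunk previous = null;
        int previousColumn = -1;
        for (Chunk c : chunks) {
            // First column whose right boundary is at or beyond the chunk's left edge.
            int j = 0;
            while (j < columns.size() && c.left > columns.get(j)) {
                j++;
            }
            // A wide gap to the previous chunk in the same column suggests a new cell.
            boolean wideGap = previous != null
                    && Math.abs(previous.right() - c.left) > MIN_GAP;
            if (wideGap && j == previousColumn && j < columns.size() - 1) {
                j++;
            }
            result.add(j);
            previous = c;
            previousColumn = j;
        }
        return result;
    }

    public static void main(String[] args) {
        // Coordinates of the first row from the issue; the label text is translated.
        Chunk number = new Chunk("1", 107.18, 16.875);
        Chunk label = new Chunk("operating revenue", 205.35, 94.983642578125);
        List<Double> columns = List.of(300.33, 600.0); // hypothetical boundaries

        System.out.println(assignColumns(List.of(number, label), columns)); // prints [0, 1]

        // Width of the merged cell in the original parse result (about 193.15).
        System.out.println(String.format(Locale.ROOT, "%.2f", label.right() - number.left)); // prints 193.15
    }
}
```

A fixed threshold will misfire on PDFs with different font sizes or tight column spacing, which is why checking for a vertical ruling line between the two chunks would likely be more robust.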

Single430, Apr 30 '20 07:04