PDF not parsed in a proper format when a specific column value spans multiple rows

Open shanky249 opened this issue 6 years ago • 0 comments

Hi,

I have noticed a few issues when using the jar file to convert PDF to CSV. First, if the text contains invisible newline characters or other non-printable characters, the jar does not remove them, so the content of a single row is split across multiple rows. The following lines can be used to get rid of the unwanted characters.

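```java
// Strip all non-ASCII characters (this also drops legitimate accented letters and symbols)
text = text.replaceAll("[^\\x00-\\x7F]", "");

// Remove ASCII control characters, keeping \r, \n and \t
text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

// Remove every character in the Unicode "Other" category (control, format, unassigned, ...).
// Note: this also removes the \r, \n and \t that the previous step kept.
text = text.replaceAll("\\p{C}", "");
```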
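For reference, here is a self-contained variant of the same idea. It is only a sketch, not part of tabula: the class name `TextCleaner` and the method `clean` are hypothetical names chosen for illustration. Unlike the lines above, it keeps non-ASCII text and ordinary line breaks and only removes invisible characters, on the assumption that the real row breaks should survive. It also assumes that Unicode line separators inside a cell should become a single space.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the tabula jar.
final class TextCleaner {
    // ASCII control characters, except tab, line feed and carriage return.
    private static final Pattern CONTROL =
            Pattern.compile("[\\p{Cntrl}&&[^\\r\\n\\t]]");

    // Unicode format characters (category Cf), such as the zero-width space
    // U+200B, the soft hyphen U+00AD and the byte order mark U+FEFF.
    private static final Pattern FORMAT = Pattern.compile("\\p{Cf}");

    // Line breaks that are easy to miss: NEL (U+0085), line separator
    // (U+2028) and paragraph separator (U+2029).
    private static final Pattern UNICODE_BREAKS =
            Pattern.compile("[\\u0085\\u2028\\u2029]");

    private TextCleaner() {
    }

    static String clean(String text) {
        // Assumption: a hidden break inside a cell should read as a space.
        String result = UNICODE_BREAKS.matcher(text).replaceAll(" ");
        result = FORMAT.matcher(result).replaceAll("");
        result = CONTROL.matcher(result).replaceAll("");
        return result;
    }

    public static void main(String[] args) {
        String raw = "Acct\u200BNo\u0000,Amount\u2028due\n1001,25.00\n";
        System.out.print(clean(raw));
    }
}
```

Whether the ordinary `\n` characters should also be removed depends on where the cleanup runs. Applied to a whole page of text, dropping them would merge every row into one, so this sketch leaves them alone.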

Another issue is that when a specific column value spans multiple rows, the output CSV is not formatted properly (for example, column 2 values are shifted into column 1). Please refer to the attached zip file, which contains both the PDF and the resulting CSV produced by the tabula jar.

Please try to resolve these issues as soon as possible, as a fix would be really helpful for everyone.

TSHBD290001.pdf TSHBD290001.zip

Jul 01 '19 07:07 shanky249