tika-python

portions of strings getting cut off with "..."

Open BCorbeek opened this issue 1 year ago • 6 comments

Hi, I've had tika working well for a while parsing PDFs, but I recently realised that paragraphs longer than about 240 characters (including spaces) are getting cut off/truncated. Is there any way to increase the maximum length of the substrings output by parser.from_file()?

Here's an example of my output:

5.8 abcd some words here, the sentence ends now
6.1 xyz a few words here, this is also fine
6.2 This paragraph happens to be more than 200 characters long, but gets cut off at around 240 characters. I need all the characters/words to be included – not excluded, so I can run functions on the output. Right now the regular expressions are not running on the text foll…

Item 6.2 above shows the problem I'm struggling with: I haven't found any way to change the maximum string length that's output.

from tika import parser
parsed = parser.from_file(file)
parsed["content"][:-1]

Adding [:-1] to be explicit doesn't help. I believe the slice applies to the string as a whole, not to the individual paragraphs.

Any help would be greatly appreciated!

BCorbeek, Dec 22 '22 16:12