tika-python
portions of strings getting cut off with "..."
Hi, I've had tika working well for a while parsing PDFs, but I recently realised that paragraphs longer than about 240 characters (including spaces) are getting cut off/truncated. Is there any way to increase the length of the substrings output by parser.from_file()?
Here's an example of my output:
5.8 abcd some words here, the sentence ends now
6.1 xyz a few words here, this is also fine
6.2 This paragraph happens to be more than 200 characters long, but gets cut off at around 240 characters. I need all the characters/words to be included – not excluded, so I can run functions on the output. Right now the regular expressions are not running on the text foll…
The above issue with item 6.2 is what I'm struggling to figure out - I haven't found any way to change the maximum string length that's output.
parsed = parser.from_file(file)
parsed["content"][:-1]
Adding [:-1] to be explicit doesn't work; I believe it slices the string as a whole, not the individual substrings.
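To narrow this down, here is a minimal sketch for checking whether the ellipsis is really part of the returned string or only appears when the text is displayed. The helper name find_truncated_lines and the sample string are made up for illustration; with tika, the input would be parsed["content"].

```python
def find_truncated_lines(text):
    """Return (line_number, length) for every line that ends with an ellipsis."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        # Check both the single-character ellipsis and three plain dots.
        if stripped.endswith("…") or stripped.endswith("..."):
            hits.append((number, len(stripped)))
    return hits


# Stand-in for parsed["content"] (hypothetical sample, not real tika output).
sample = "6.1 xyz a few words here\n6.2 a long paragraph that is cut off foll…\n"

# Slicing acts on the whole string: [:-1] only drops the final character.
assert sample[:-1] == sample[:len(sample) - 1]

print(find_truncated_lines(sample))
```

If this reports no hits on the real content, the full text is probably there and the truncation comes from how it is being printed or viewed; if it does report hits, the ellipsis is in the string itself.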
Any help would be greatly appreciated!