tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Hi there, I see a parsing issue with landscape PDFs. For example, [this one](https://pub-edmonton.escribemeetings.com/filestream.ashx?DocumentId=24237). When I run `parser.from_file("https://pub-edmonton.escribemeetings.com/filestream.ashx?DocumentId=24237")['content']` I get a bunch of short words that look like:...
We added -spawnChild mode to tika-server to defend against catastrophic failures (OOM, infinite loops, etc.). We should make this the default in tika-python. We'll need to make the python...
Thanks a lot for tika-python. It's fast and awesome! 🥇 I suggest the following change to make the command line tool `$ tika-python parse all file.pdf` behave more similarly to...
Closes https://github.com/chrismattmann/tika-python/issues/305 Bump Tika to 1.24.1 Add tests for benchmarking gzip compression
Hello, Tika released [1.24.1](https://tika.apache.org/1.24.1/index.html) which allows gzip compression of input and output streams for tika-server. What do you think of making it the default for the output stream? Since [requests](https://requests.readthedocs.io/en/master/user/quickstart/#binary-response-content)...
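To make the trade-off in the gzip proposal concrete, here is a minimal sketch using only Python's standard `gzip` module. It does not call tika-server or tika-python; the helper names `gzip_payload` and `gunzip_payload` are hypothetical, and the sample text is a stand-in for extracted content. It only illustrates how repetitive text output shrinks under gzip and round-trips losslessly.

```python
import gzip


def gzip_payload(payload: bytes) -> bytes:
    """Compress a request or response body (hypothetical helper)."""
    return gzip.compress(payload)


def gunzip_payload(payload: bytes) -> bytes:
    """Decompress a gzip-encoded body back to the original bytes."""
    return gzip.decompress(payload)


# Stand-in for text extracted from a document; real output will vary.
sample = b"extracted text from a parsed document " * 1000
compressed = gzip_payload(sample)

print(len(sample), len(compressed))
```

In an HTTP setting, a client normally signals these encodings with the standard `Accept-Encoding: gzip` and `Content-Encoding: gzip` headers; whether and how tika-python should set them by default is the open question in the issue.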
Currently the `.from_file()` methods only accept URLs and file paths as input. I would like it to accept file objects as well so applications that have already opened the files don't...
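One possible workaround for the file-object request, sketched under assumptions: normalise the input to bytes before handing it to the parser. The helper name `read_source_bytes` is hypothetical and is not part of tika-python; it simply accepts either a path or an already-open file object.

```python
import io
import os
from typing import BinaryIO, Union


def read_source_bytes(source: Union[str, os.PathLike, BinaryIO]) -> bytes:
    """Return raw bytes from a file path or an already-open file object.

    Hypothetical helper, not part of tika-python.
    """
    if hasattr(source, "read"):
        # Already-open file object: read from its current position.
        data = source.read()
        # Text-mode handles return str; encode so callers always get bytes.
        if isinstance(data, str):
            data = data.encode("utf-8")
        return data
    # Otherwise treat the argument as a filesystem path.
    with open(source, "rb") as handle:
        return handle.read()


# Usage with an in-memory file object.
buffer = io.BytesIO(b"%PDF-1.4 example bytes")
payload = read_source_bytes(buffer)
```

The resulting bytes could then be passed to `parser.from_buffer(payload)`, which needs a reachable Tika server and is therefore left out of the sketch.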
(I can provide a PR with files and pytest cases that correct this behavior.) I've noticed when I call tika.unpack() with a file or buffer that includes an email that contains...
Upon installation, ```sh pip install tika ``` When attempting: ```python In [21]: import tika ...: tika.initVM() ...: from tika import parser In [22]: parsed = parser.from_file(file_path) ``` I get ```sh...
[SHA1 has been deprecated](https://csrc.nist.gov/projects/hash-functions) in FIPS and there are suggested steps to move away from the algorithm, but it is [still supported for "Non-digital-signature applications"](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf) (CTRL-F for SHA-1 to find...
Hi, I've gotten tika to work great for a while parsing PDFs - but realised recently that paragraphs longer than 240 characters or so (including spaces) are getting cut off/truncated....