
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Results 48 tika-python issues

Hi there, I see a parsing issue with landscape PDFs. For example, [this one](https://pub-edmonton.escribemeetings.com/filestream.ashx?DocumentId=24237). When I run

```python
parser.from_file("https://pub-edmonton.escribemeetings.com/filestream.ashx?DocumentId=24237")['content']
```

I get a bunch of short words that look like:...

We added -spawnChild mode to tika-server to defend against catastrophic failures -- oom, infinite loops etc. We should make this the default in tika-python. We'll need to make the python...

enhancement
help wanted
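
As a rough illustration of the request above, the sketch below builds a tika-server launch command with `-spawnChild` enabled by default. The helper name `tika_server_command` and the jar path are hypothetical and not part of tika-python; only the `-spawnChild` flag comes from the issue text.

```python
from typing import List


def tika_server_command(jar_path: str, spawn_child: bool = True) -> List[str]:
    """Build an argv list for launching tika-server (hypothetical helper)."""
    cmd = ["java", "-jar", jar_path]
    if spawn_child:
        # Flag named in the issue: run the parser in a child process so the
        # server can recover from OOMs, infinite loops and similar failures.
        cmd.append("-spawnChild")
    return cmd


# The jar path here is a placeholder, not a real default.
command = tika_server_command("/path/to/tika-server.jar")
```

The resulting list could be handed to `subprocess.Popen`; how tika-python would actually wire this in is left open by the issue.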

Thanks a lot for tika-python. It's fast and awesome! 🥇 I suggest the following change to make the command line tool `$ tika-python parse all file.pdf` behave more similarly to...

Closes https://github.com/chrismattmann/tika-python/issues/305. Bump Tika to 1.24.1. Add tests for benchmarking gzip compression.

enhancement

Hello, Tika released [1.24.1](https://tika.apache.org/1.24.1/index.html) which allows gzip compression of input and output streams for tika-server. What do you think of making it the default for the output stream? Since [requests](https://requests.readthedocs.io/en/master/user/quickstart/#binary-response-content)...

enhancement
help wanted
question
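
To make the gzip proposal above concrete, here is a minimal sketch using only the standard library. It compresses a request body and sets the standard HTTP `Content-Encoding` and `Accept-Encoding` headers. The helper names `gzip_request` and `decode_response` are hypothetical, and whether tika-python would expose gzip this way is an assumption.

```python
import gzip
from typing import Dict, Tuple


def gzip_request(payload: bytes) -> Tuple[bytes, Dict[str, str]]:
    """Return a gzip-compressed body plus headers describing it."""
    body = gzip.compress(payload)
    headers = {
        "Content-Encoding": "gzip",  # the request body is gzip-compressed
        "Accept-Encoding": "gzip",   # ask the server for a gzip response
    }
    return body, headers


def decode_response(raw: bytes, content_encoding: str = "") -> bytes:
    """Decompress a response body only if the server marked it as gzip."""
    if content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Repetitive text compresses well, which is the motivation for the proposal.
sample = b"extracted text " * 200
body, headers = gzip_request(sample)
```

An HTTP client that decodes gzip responses transparently would make `decode_response` unnecessary on the output side; it is shown only to spell out the round trip.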

Currently the `.from_file()` methods only accept URLs and file paths as input. I would like them to accept file objects as well so applications that have already opened the files don't...

enhancement
help wanted
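
Until file objects are accepted directly, one possible workaround is to read the already-open file and pass its bytes to a buffer-based parser. The sketch below assumes a hypothetical helper `parse_file_object` and uses a stand-in parser so it runs without a Tika server; it is not how tika-python implements anything.

```python
import io
from typing import Any, BinaryIO, Callable


def parse_file_object(fileobj: BinaryIO, parse_buffer: Callable[[bytes], Any]) -> Any:
    """Read an open binary file object and pass its bytes to a buffer parser."""
    data = fileobj.read()
    return parse_buffer(data)


def fake_parse_buffer(data: bytes) -> dict:
    """Stand-in for a real buffer parser, used here only for illustration."""
    return {"content": data.decode("utf-8"), "metadata": {}}


parsed = parse_file_object(io.BytesIO(b"hello tika"), fake_parse_buffer)
```

With tika installed and a server available, `tika.parser.from_buffer` could be passed in place of `fake_parse_buffer`. Note that this reads the whole file into memory, which may not suit very large inputs.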

(I can provide a PR with files and pytest cases that correct this behavior). I've noticed when I call tika.unpack() with a file or buffer that includes an email that contains...

bug
enhancement
help wanted

Upon installation,

```sh
pip install tika
```

When attempting:

```python
import tika
tika.initVM()
from tika import parser

parsed = parser.from_file(file_path)
```

I get ```sh...

[SHA1 has been deprecated](https://csrc.nist.gov/projects/hash-functions) in FIPS and there are suggested steps to move away from the algorithm, but it is [still supported for "Non-digital-signature applications"](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf) (CTRL-F for SHA-1 to find...

enhancement
help wanted

Hi, I've had Tika working well for a while parsing PDFs, but I realised recently that paragraphs longer than 240 characters or so (including spaces) are getting cut off/truncated....

bug
enhancement
help wanted
question