tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
With the advent of [TIKA-3329](https://github.com/apache/tika/pull/419/files), we can now have a full translation engine in Tika-Python that supports translating 300+ languages to English. Standardize on this. It requires Tika 2.0 though,...
I wanted to use the library with a file that I get from another server, thus I already had the file in memory. It took me a while to understand...
Hi @chrismattmann, fantastic library! I was wondering if you have any near-term plans or a roadmap to make it compatible with Apache Tika version 2.1.0. I used the `tika-server-standard-2.1.0.jar` file from `https://tika.apache.org/download.html` to...
Even after the Tika server has started, the body of the `while` loop keeps executing until the maximum number of retries is reached. It should break out of the loop upon successful startup.
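The behavior this report asks for can be sketched as a polling loop that returns on the first successful check. This is a minimal illustration, not tika-python's actual startup code: `wait_for_server`, `is_up`, `max_retries`, and `delay` are hypothetical names chosen for the example.

```python
import time


def wait_for_server(is_up, max_retries=3, delay=1.0):
    """Poll is_up() until it returns True or max_retries attempts are used.

    is_up is a hypothetical zero-argument callable that reports whether
    the server has finished starting. Returns (started, attempts_used).
    """
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        if is_up():
            # Startup succeeded: leave the loop immediately instead of
            # running the remaining retries.
            return True, attempts
        if attempts < max_retries:
            # Wait before the next check, but not after the final failure.
            time.sleep(delay)
    return False, attempts
```

With a check that succeeds on the second call, the loop stops after two attempts rather than exhausting `max_retries`.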
Hi, Tika works fine until I restart the machine. After that I need to reinstall it, or I get this error message:
```
2022-01-16 20:09:48,737 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar...
```
This pull request enables headers specification in `unpack.from_file`.
I'm using Apache Tika to OCR a bunch of PDFs. When I use the GUI (by running `java -jar tika-app-1.22.jar`) everything works fine: I go to "Recursive JSON" on the...
I found that when parsing compressed files, the contents of all the files in the subdirectory are mixed together in the content field. E.g. test.zip => test/a.txt, test/b.txt; after `parsed = parser.from_file('test.zip')...
The fixes for #167, #124, #225 and #285 only mask the error and never generate the correct Content-Disposition header. With those fixes: when rfc6266 is installed, we get a TypeError as reported...
Checkboxes from Word documents convert to the text "FORMCHECKBOX" and lose any info about whether or not they are checked. Is it possible to render those differently and ideally maintain...