tika-python
Tika 1.24.1 and gzip compression
Hello, Tika released 1.24.1 which allows gzip compression of input and output streams for tika-server.
What do you think of making it the default for the output stream? Since requests automatically decodes gzip and deflate transfer-encodings, it would only mean adding the header Accept-Encoding: gzip, deflate to the rmeta, tika, and rmeta/text services.
I can provide a PR.
Cheers, Carina
Optionally, we could also enable it by default for input to the rmeta, tika, and rmeta/text services. What do you think?
Something like:
with urlOrPath if _is_file_object(urlOrPath) else open(path, 'rb') as file_obj:
    with NamedTemporaryFile(delete=True) as gzfile:
        # Close the GzipFile before rewinding so the gzip trailer gets written.
        with gzip.GzipFile(fileobj=gzfile, mode="wb") as gz:
            gz.write(file_obj.read())
        gzfile.seek(0, 0)
        response = parse1(gzfile)
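A self-contained sketch of the same idea, assuming only the standard library. `gzip_to_tempfile` is a hypothetical helper name (not part of tika-python); it compresses a readable binary file object into a temporary file and rewinds it so it could be handed to an upload call.

```python
import gzip
import io
import shutil
from tempfile import NamedTemporaryFile


def gzip_to_tempfile(file_obj):
    """Compress a binary file object into a rewound temporary file.

    Hypothetical helper for illustration; not part of tika-python.
    """
    gzfile = NamedTemporaryFile(delete=True)
    # Closing the GzipFile writes the gzip trailer (CRC and size).
    # The temporary file itself stays open because it was passed as fileobj.
    with gzip.GzipFile(fileobj=gzfile, mode="wb") as gz:
        shutil.copyfileobj(file_obj, gz)  # copies in chunks instead of one read()
    gzfile.seek(0)
    return gzfile


if __name__ == "__main__":
    original = b"some document bytes " * 1000
    with gzip_to_tempfile(io.BytesIO(original)) as tmp:
        compressed = tmp.read()
    print(len(original), "->", len(compressed))
    assert gzip.decompress(compressed) == original
```

Copying in chunks avoids holding the whole input in memory, which may matter for the large files where compression is most attractive.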
@carantunes I like your first suggested option. Is the advantage that requests will be faster because they will be using gzip? A PR and a test case showing the improvement would be appreciated for review.
ping @carantunes happy to discuss
Hi,
After some tests and benchmarks I've reconsidered whether it should be changed by default. Gzip compression has the upside of improving transfer speed and bandwidth utilisation (~75%), at the cost of some CPU utilisation. For large files it may be an improvement.
Another difference is that files sent to Tika with compression will have a different Content-Type returned (i.e., from 'application/pdf' to ['application/gzip', 'application/pdf']).
Instead, I believe it would be sufficient to support sending/receiving gzip format by releasing 1.24.1.
Input compression can be achieved with gzip or zlib:
with open(file, 'rb') as file_obj:
    return tika.parser.from_buffer(zlib.compress(file_obj.read()))
...
with open(file, 'rb') as file_obj:
    return tika.parser.from_buffer(gzip.compress(file_obj.read()))
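The two calls above produce different container formats: `gzip.compress` emits a gzip stream, while `zlib.compress` emits a zlib stream (the format HTTP calls `deflate`). A small sketch, assuming nothing beyond the standard library, that tells them apart by their leading bytes. `guess_content_encoding` is a hypothetical helper name, and the check is a heuristic rather than a validator.

```python
import gzip
import zlib


def guess_content_encoding(payload: bytes) -> str:
    """Guess an HTTP content-coding token from the leading bytes.

    Hypothetical helper for illustration; a heuristic, not a validator.
    """
    if payload[:2] == b"\x1f\x8b":  # gzip magic number
        return "gzip"
    # A zlib header is two bytes: the low nibble of the first byte is 8
    # (deflate method) and the 16-bit big-endian value is a multiple of 31.
    if (
        len(payload) >= 2
        and payload[0] & 0x0F == 8
        and int.from_bytes(payload[:2], "big") % 31 == 0
    ):
        return "deflate"
    return "identity"


if __name__ == "__main__":
    raw = b"%PDF-1.4 example bytes"
    for blob in (gzip.compress(raw), zlib.compress(raw), raw):
        print(guess_content_encoding(blob))
```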
And output with the header:
with open(file, 'rb') as file_obj:
    return tika.parser.from_file(file_obj, headers={'Accept-Encoding': 'gzip, deflate'})
A sample of benchmark results (using pytest-benchmark) for a 100 MB ppt file, run first with the default timer and then with --benchmark-timer=time.process_time, which does not include time spent sleeping or waiting for I/O:
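To illustrate why the choice of timer matters, here is a minimal sketch that is not the benchmark from this thread. It times the same function with a wall-clock timer (`time.perf_counter`) and with `time.process_time`. The `timed` and `work` names are hypothetical; the sleep stands in for waiting on tika-server and the compression call stands in for CPU-bound work.

```python
import time
import zlib


def timed(func, timer):
    """Return (result, elapsed) for one call of func measured with timer."""
    start = timer()
    result = func()
    return result, timer() - start


def work():
    time.sleep(0.2)  # stand-in for waiting on the network or tika-server
    return zlib.compress(b"x" * 1_000_000)  # stand-in for CPU-bound work


if __name__ == "__main__":
    _, wall = timed(work, time.perf_counter)
    _, cpu = timed(work, time.process_time)
    print(f"wall clock: {wall:.3f}s, process CPU: {cpu:.3f}s")
```

The wall-clock figure includes the sleep, while the process-time figure counts only CPU time used by the Python process, so it leaves out the sleep and any waiting on a server.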

Awesome @carantunes, I can get going on releasing 1.24.1. I also see you have a great tika/tests/test_benchmark.py. Is this something you could contribute? :)
Another difference is that files sent to Tika with compression will have a different Content-Type returned (ie, from 'application/pdf' to ['application/gzip', 'application/pdf'])
If I understand correctly, if I curl with -H "Content-Encoding: gzip" Tika should only see the PDF.
https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-Transfer-LayerCompression
I added this unit test to confirm this behavior: https://github.com/apache/tika/commit/839d3187b93822dc7b7a8c269f00ac7ebfacddbd
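For reference, a hedged sketch of the request shape described above, built with the standard library and not sent anywhere. The URL assumes a tika-server running locally on its default port, and `build_gzip_put` is a hypothetical helper name.

```python
import gzip
import urllib.request


def build_gzip_put(url: str, body: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request with a gzip-compressed body."""
    return urllib.request.Request(
        url,
        data=gzip.compress(body),
        method="PUT",
        headers={
            "Content-Encoding": "gzip",          # the body is gzip-compressed
            "Accept-Encoding": "gzip, deflate",  # ask for a compressed response
        },
    )


if __name__ == "__main__":
    # Assumed local tika-server address; adjust to your setup.
    req = build_gzip_put("http://localhost:9998/rmeta", b"%PDF-1.4 example bytes")
    print(req.get_method(), req.full_url, req.header_items())
    # urllib.request.urlopen(req) would send it. Unlike requests, urllib
    # does not decompress a gzip response automatically.
```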
@tballison Pardon my delay, I was on vacation. Thanks for the input, I had not noticed there was a difference.
After some debugging, it looks like if I send -H "Content-Encoding: application/gzip" to rmeta, I get a different result than if I send -H "Content-Encoding: gzip" or no header at all. I've created a ticket with more details if you want to investigate or explain it further: https://issues.apache.org/jira/browse/TIKA-3169.