Tika 1.24.1 and gzip compression

Open carantunes opened this issue 5 years ago • 7 comments

Hello, Tika released 1.24.1 which allows gzip compression of input and output streams for tika-server.

What do you think of making it the default for the output stream? Since requests automatically decodes gzip and deflate response bodies, it would just be a matter of adding the header Accept-Encoding: gzip, deflate to the rmeta, tika and rmeta/text services.

I can provide a PR.

Cheers, Carina

carantunes avatar Apr 30 '20 19:04 carantunes

Optionally we can also provide it by default for input to services rmeta, tika, rmeta/text, what do you think?

Something like:

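import gzip
from tempfile import NamedTemporaryFile

with urlOrPath if _is_file_object(urlOrPath) else open(urlOrPath, 'rb') as file_obj:
    with NamedTemporaryFile(delete=True) as gzfile:
        # Close the GzipFile before rewinding so the gzip trailer is written.
        with gzip.GzipFile(fileobj=gzfile, mode="wb") as gz:
            gz.write(file_obj.read())
        gzfile.seek(0, 0)

        response = parse1(gzfile)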

carantunes avatar Apr 30 '20 19:04 carantunes

@carantunes I like your first suggested option. What is the advantage? Is it that requests will be faster since they will be using gzip? A PR and a test case showing any improvement would be appreciated for review.

chrismattmann avatar May 16 '20 17:05 chrismattmann

ping @carantunes happy to discuss

chrismattmann avatar May 24 '20 16:05 chrismattmann

Hi,

After some tests and benchmarks I've reconsidered whether it should be changed by default. Gzip compression has the upside of improving transfer speed and bandwidth utilisation (~75%), at the cost of some CPU utilisation. For large files it may be an improvement.
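
To get a rough feel for the bandwidth side of that trade-off, here is a minimal sketch that measures how many bytes gzip saves on a payload. The helper name gzip_savings and both sample payloads are made up for illustration; they are not the files used in the benchmark, and real documents will land somewhere between these two extremes.

```python
import gzip
import os


def gzip_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing payload."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload)
    return 1.0 - len(compressed) / len(payload)


# Illustrative payloads (assumptions, not the benchmark files).
repetitive = b"Apache Tika extracts text and metadata. " * 10_000
random_like = os.urandom(100_000)  # stands in for already-compressed content

print(f"repetitive text: {gzip_savings(repetitive):.1%} saved")
print(f"random bytes:    {gzip_savings(random_like):.1%} saved")
```

Formats that are already compressed internally tend to behave like the second payload, so the saving depends heavily on the file type.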

Another difference is that files sent to Tika with compression will have a different Content-Type returned (i.e., from 'application/pdf' to ['application/gzip', 'application/pdf']).

Instead, I believe it would be sufficient to support sending/receiving gzip format by releasing 1.24.1.

Input compression can be achieved with gzip or zlib:

    with open(file, 'rb') as file_obj:
        return tika.parser.from_buffer(zlib.compress(file_obj.read()))

...

    with open(file, 'rb') as file_obj:
        return tika.parser.from_buffer(gzip.compress(file_obj.read()))
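
Note that zlib.compress and gzip.compress both use DEFLATE but wrap it in different containers (RFC 1950 and RFC 1952 respectively), so the bytes sent to the server are not interchangeable. A small sketch showing the difference, with a stand-in payload rather than a real file; how tika-server handles each container is not shown here:

```python
import gzip
import zlib

data = b"%PDF-1.7 example payload"  # stand-in for real file bytes

gz = gzip.compress(data)
zl = zlib.compress(data)

# gzip streams start with the two magic bytes 1f 8b (RFC 1952).
print(gz[:2].hex())
# zlib streams start with a CMF byte, 0x78 for the default 32K window (RFC 1950).
print(zl[:1].hex())

# Each format only round-trips through its own decompressor.
assert gzip.decompress(gz) == data
assert zlib.decompress(zl) == data
```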

And output with the header:

    with open(file, 'rb') as file_obj:
        return tika.parser.from_file(file_obj, headers={'Accept-Encoding': 'gzip, deflate'})
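
On the client side nothing else is needed because requests decodes the response body for you. Roughly speaking, that step amounts to something like the sketch below; decode_body is a hypothetical helper written for illustration, not part of requests or tika-python, and the response is simulated so no server is involved.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Decode an HTTP response body according to its Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is the zlib container format (RFC 1950).
        return zlib.decompress(body)
    return body  # no encoding declared: the body is already plain


# Simulated server response (made-up data).
plain = b'{"X-TIKA:content": "hello"}'
print(decode_body(gzip.compress(plain), "gzip"))
print(decode_body(zlib.compress(plain), "deflate"))
```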

A sample of benchmark results (using pytest-benchmark) for a 100 MB ppt, run first with the default timer and then with --benchmark-timer=time.process_time, which doesn't include sleeping time or time spent waiting for I/O: [Screenshot 2020-06-29 at 20 44 41]
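
For anyone unfamiliar with the difference between the two timers, here is a minimal sketch using only the standard library. The measure helper is made up for illustration, and time.sleep stands in for waiting on the network; it is not how the benchmark above was written.

```python
import time


def measure(fn):
    """Run fn once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def wait_like_io():
    # Sleeping stands in for time spent waiting on a server response.
    time.sleep(0.2)


wall, cpu = measure(wait_like_io)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

The wall-clock figure includes the wait, while the process_time figure only counts CPU work done by the process, which is why the two benchmark runs can rank the options differently.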

carantunes avatar Jun 29 '20 19:06 carantunes

Awesome @carantunes, I can get going on releasing 1.24.1. I also see you have a great tika/tests/test_benchmark.py. Is this something you could contribute? :)

chrismattmann avatar Jun 30 '20 22:06 chrismattmann

> Another difference is that files sent to Tika with compression will have a different Content-Type returned (ie, from 'application/pdf' to ['application/gzip', 'application/pdf'])

If I understand correctly, if I curl with -H "Content-Encoding: gzip" Tika should only see the PDF.

https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-Transfer-LayerCompression

I added this unit test to confirm this behavior: https://github.com/apache/tika/commit/839d3187b93822dc7b7a8c269f00ac7ebfacddbd
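
For reference, the curl request described above could be sketched in Python roughly as follows. This only builds the request and does not send it; the helper name build_gzip_put, the localhost URL and the payload are assumptions for illustration.

```python
import gzip
import urllib.request


def build_gzip_put(url: str, raw: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request whose body is gzip-compressed."""
    return urllib.request.Request(
        url,
        data=gzip.compress(raw),
        method="PUT",
        headers={
            # Declares that the body is gzip-compressed at the transfer layer.
            "Content-Encoding": "gzip",
            "Accept": "application/json",
        },
    )


# Hypothetical local tika-server endpoint and stand-in payload.
req = build_gzip_put("http://localhost:9998/rmeta", b"%PDF-1.7 example")
print(req.get_method(), req.full_url, req.header_items())
```

Sending it would be a matter of passing req to urllib.request.urlopen against a running server.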

tballison avatar Jul 20 '20 20:07 tballison

@tballison Pardon my delay, I was on vacation. Thanks for the input, I had not noticed there was a difference.

After some debugging, it looks like if I send -H "Content-Encoding: application/gzip" to rmeta I get a different result than if I send -H "Content-Encoding: gzip" or no header at all. I've created a ticket with more details if you want to investigate or explain it further: https://issues.apache.org/jira/browse/TIKA-3169.

carantunes avatar Aug 17 '20 11:08 carantunes