
ParallelGZIPInputStream

Open simlu opened this issue 9 years ago • 9 comments

I'm currently working on parallelizing this in my own fork. Do you have any advice or resources to share? Didn't get very far unfortunately.

How do you detect how many bytes to "chop off" and pass to a thread for decompression?

simlu, Apr 30 '16 04:04

Making progress... I'm thinking we can store the segment sizes for splitting up the inflation in the FEXTRA header field. The problem is that we can only write them once the file has been written... I wish there was a better way to append information.

My testing works nicely so far, just have to find a better way to communicate the block sizes. Any suggestions?

Edit: This article uses a similar method: http://www.ebaytechblog.com/2015/10/09/gzinga-seekable-and-splittable-gzip/
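A minimal sketch of the FEXTRA idea, under stated assumptions: each block is compressed into a buffer first, so its size is known before the header is emitted, and every block becomes its own gzip member. The class name `SizedGzipMember`, the subfield ID `SZ`, and the 4-byte length layout are hypothetical choices for illustration, not what this library or the linked article does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

final class SizedGzipMember {
    // Hypothetical subfield ID chosen for this sketch; not a registered identifier.
    static final int SI1 = 'S';
    static final int SI2 = 'Z';
    // Fixed overhead: 10-byte header + 2-byte XLEN + 8-byte subfield + 8-byte trailer.
    static final int OVERHEAD = 28;

    /** Writes one complete gzip member whose FEXTRA field records the deflate payload length. */
    static void writeMember(OutputStream out, byte[] block) throws IOException {
        // Compress into memory first so the payload size is known before the header is written.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            payload.write(buf, 0, n);
        }
        deflater.end();

        CRC32 crc = new CRC32();
        crc.update(block);

        // ID1, ID2, CM=8 (deflate), FLG=4 (FEXTRA), MTIME (4 bytes), XFL, OS=255 (unknown).
        out.write(new byte[] {0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 255});
        writeShortLE(out, 8);              // XLEN: 4-byte subfield header + 4-byte value
        out.write(SI1);
        out.write(SI2);
        writeShortLE(out, 4);              // subfield data length
        writeIntLE(out, payload.size());   // the segment size a reader can use to split
        payload.writeTo(out);
        writeIntLE(out, (int) crc.getValue()); // CRC32 of the uncompressed block
        writeIntLE(out, block.length);         // ISIZE
    }

    /** Returns the total length of the member starting at offset, read from its extra field. */
    static int memberLength(byte[] file, int offset) {
        if ((file[offset + 3] & 4) == 0 || file[offset + 12] != SI1 || file[offset + 13] != SI2) {
            throw new IllegalArgumentException("no size subfield at offset " + offset);
        }
        return readIntLE(file, offset + 16) + OVERHEAD;
    }

    private static void writeShortLE(OutputStream out, int v) throws IOException {
        out.write(v & 0xff);
        out.write((v >>> 8) & 0xff);
    }

    private static void writeIntLE(OutputStream out, int v) throws IOException {
        writeShortLE(out, v & 0xffff);
        writeShortLE(out, (v >>> 16) & 0xffff);
    }

    private static int readIntLE(byte[] b, int p) {
        return (b[p] & 0xff) | (b[p + 1] & 0xff) << 8 | (b[p + 2] & 0xff) << 16 | (b[p + 3] & 0xff) << 24;
    }
}
```

Because each member is self-contained, the size only has to be known per block rather than for the whole file, and a plain `GZIPInputStream` should still read the concatenated output since it skips FEXTRA. The trade-off is that blocks no longer share a dictionary, which costs some compression ratio.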

simlu, May 02 '16 04:05

I don't actually know a good answer to your question. TBH, I wrote this as a tutorial exercise for some other programmers in how to write and profile good code in the JUC APIs. However, I have been using it in anger in various projects, and it's been a great boon. Those projects tend to be typified by very large data sets, so one option is just to hand off multimegabyte blocks to something like an FJP and pay the cost of passing the leading/trailing bits between threads... although can a thread "resync" if it's given the second megabyte at random?

Again, I confess, I didn't read the spec, I didn't do dictionary pre-seeding or anything. This was really about 30 minutes' work as a tutorial exercise which turned out to be publishable. I'll help as much as I can.
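A rough sketch of the "hand blocks to an FJP" option, assuming the easy case only: the input is a concatenation of independent gzip members and the member boundaries are already known. It does not attempt to resync inside a single deflate stream, which is the hard part raised above. `ParallelMemberInflater` and the `offsets` array are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

final class ParallelMemberInflater {
    /**
     * Inflates each gzip member on the pool and joins the results in order.
     * offsets[i] is the start of member i; the last entry must equal data.length.
     */
    static byte[] inflateAll(byte[] data, int[] offsets, ForkJoinPool pool)
            throws IOException, InterruptedException, ExecutionException {
        List<Future<byte[]>> parts = new ArrayList<>();
        for (int i = 0; i + 1 < offsets.length; i++) {
            final int start = offsets[i];
            final int len = offsets[i + 1] - offsets[i];
            Callable<byte[]> task = () -> inflateOne(data, start, len);
            parts.add(pool.submit(task));
        }
        // Collect in submission order so the output matches the original byte order.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Future<byte[]> part : parts) {
            out.write(part.get());
        }
        return out.toByteArray();
    }

    /** Inflates a single member stored in data[start, start + len). */
    static byte[] inflateOne(byte[] data, int start, int len) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data, start, len))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

This buffers everything in memory to keep the sketch short; a streaming version would need bounded queues so the consumer can apply back-pressure.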

shevek, May 09 '16 01:05

Makes sense. I really like this paper on the topic: http://prof.icc.skku.ac.kr/~jaewlee/pubs/lctes13_vld.pdf

simlu, May 09 '16 04:05

Right now, gzip decompression is costing me 45 seconds per unit test in one of my products, and my system has 4 cores. The data might or might not have been compressed using this ParallelGZIPOutputStream, but either way, would I love to get that down to 12 seconds? "Yes!"

shevek, May 09 '16 20:05

Same, it would save many hours of run time on my project... I'll try to get the parallel part going next weekend. Boundary guessing is then a separate task.

simlu, May 09 '16 20:05

Also, for linear scaling, the largest boxes I have are a 4-core (8 with HT) E5620 and a similar Core2. It seems not to get much benefit from the HT cores. I'd be very interested in any results from larger boxes. I did get some suggestions from concurrency-interest at some point. That said, my primary use case is saving human time on 4-core laptops, not saving real time on 64-core servers.

shevek, May 10 '16 16:05

:+1:

axelfontaine, Jul 11 '16 15:07

Hi, any progress on this task?

marcadella, Jan 29 '20 07:01

I have solid 64-core hardware now. That's all.

shevek, Jan 29 '20 23:01