Greg Lindahl
I 100% agree that I didn't take this into account when writing the check code! Thanks for the analysis.
We have two projects on our roadmap that do most of what you want. One is a WARC-to-zip tool that will give you a zip file containing ~30,000 webpages...
@ibnesayeed you had a positive comment about compression efficiency. I would love to have some examples.
See also: https://github.com/iipc/warc-specifications/issues/53 (thanks @ikreymer)
I think wrapping the dictionary in a WARC record addresses the part that I pointed out.
@willmhowes are you saying we shouldn't discuss it now? I was working in radio astronomy when this proposal was introduced, and it has not come up until now since I...
> This could also address the long-standing issue of not being able to seek directly into larger, already compressed payloads (like video/audio), without decompressing from the beginning, and could...
I'm not sure how helpful that is for this task; it's more something you might do after I can write down the croissant for 1 dataset.
@benjelloun can this be reviewed before 1.1 ships?
cdx_toolkit is already too fast, not too slow. Also, you really don't want to reuse HTTP/1.1 connections.