Sawood Alam

Results: 409 comments by Sawood Alam

CDX Summary comparison of the two captures (Good and Bad): ``` $ zcat good/indexes/index.cdx.gz | ~/bin/cdxj2cdx.py | cdxsummary Summarizing piped data: STDIN CDX Overview ─────────────────────────────── Total Captures in CDX 202...

There are 84 URLs that are captured in the good one, but not in the bad one and 8 in the bad one, but not in the good one: ```...

URLs that were archived in the Good capture, but not in the Bad one include the following: ``` $ comm -12 /tmp/good-surts.txt /tmp/bad-surts.txt com,ads-twitter,static)/uwt.js com,brave,dict)/edgedl/chrome/dict/en-us-10-1.bdic com,cdn-apple,appleid)/appleauth/static/jsapi/appleid/1/en_us/appleid.auth.js com,google,accounts)/gsi/client com,google,accounts)/gsi/style com,twimg,abs)/favicons/twitter.3.ico com,twimg,abs)/responsive-web/client-serviceworker/serviceworker.c378aaea.js...
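The comparison described in the two excerpts above can be sketched in Python with plain set operations. This is only an illustrative sketch: the function name `compare_surts` and the sample SURT keys are hypothetical and do not come from the original captures. For reference, with `comm` on two sorted files, `-23` prints lines unique to the first file, `-13` prints lines unique to the second, and `-12` prints lines common to both.

```python
def compare_surts(good, bad):
    """Split two collections of SURT keys into three sorted lists."""
    good_set, bad_set = set(good), set(bad)
    return {
        "only_good": sorted(good_set - bad_set),  # like `comm -23 good bad`
        "only_bad": sorted(bad_set - good_set),   # like `comm -13 good bad`
        "common": sorted(good_set & bad_set),     # like `comm -12 good bad`
    }


# Made-up sample keys, for illustration only.
good = ["com,example)/a", "com,example)/b", "com,example)/c"]
bad = ["com,example)/b", "com,example)/d"]

result = compare_surts(good, bad)
for name, keys in result.items():
    print(name, len(keys), keys)
```

In practice the two lists would be read from files such as `/tmp/good-surts.txt` and `/tmp/bad-surts.txt`, one key per line.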

Thanks for looking into this. Being able to store raw blocks of desired sizes has other potential benefits, such as sub-resource deduplication. This ability has been available in the JavaScript library...

> @ibnesayeed you had a positive comment about compression efficiency, I would love to have some examples. We have done some real-world statistical analysis of a use-case where WARCs created...

> Another comment is that the zstd dictionary frame is not a WARC record. That might not be a good choice when some (most?) warc tools don't support zstd, and...

> One comment brought up already is that the possible dictionary frame at the start of every warc might make playback slower. This is a fair point, but there are practical...

> Perhaps there are advantages to wrapping the zstd dictionary in a WARC record because it means metadata about the dictionary can be added? For example, could a unique ID (or...

> I think a better approach, which is a compromise, would be to place the dictionary, if one is used at all, at the beginning of each WARC record. Yes,...