cc_net

Variance in hash file sizes in newer crawls

Open var926 opened this issue 3 years ago • 1 comment

Hello, I noticed that the hash files I produced from the dump of January 21 (and from several other months in 2020) are much smaller (about 100x) than the hashes from the dumps of April and May 2019, even though the original WET files were the same size.

In both cases there are 2 shards per hash file, and all other parameters are the same.

I'm trying to understand why. Thanks :)

var926, Apr 18 '21 05:04

Same here, but for dump 22-05 :) Each of my *_log.err files reaches about 5 GB, and each one repeatedly shows this line (which might be the reason for the small hash sizes): Message: "Can't parse header:". It is probably related to #16. I have not found a solution yet.

chirico85, May 02 '22 09:05
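
One way to check whether the parse failures could explain the small hashes is to count how often the message appears in each log. The following is only a minimal sketch: it assumes the logs are plain text and match the `*_log.err` pattern mentioned above, and the function name `count_parse_errors` and its `log_dir` argument are hypothetical, not part of cc_net.

```python
from pathlib import Path

# Message text as quoted in the comment above.
MESSAGE = "Can't parse header:"


def count_parse_errors(log_dir, pattern="*_log.err", message=MESSAGE):
    """Return {log file name: number of lines containing `message`}.

    `log_dir` and `pattern` are assumptions about where the logs live;
    adjust them to your own output layout.
    """
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        n = 0
        # Read line by line so multi-GB logs are never loaded into memory.
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            for line in f:
                if message in line:
                    n += 1
        counts[path.name] = n
    return counts
```

If the count for a shard is on the order of the number of records in that shard, that would be consistent with most records being skipped, though it would not by itself confirm the cause.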