Greg Lindahl

182 comments by Greg Lindahl

There are multiple possible bugs here. One possibility is that the copy is writing the wrong block digest, perhaps because it changed the block and kept the same digest. If...

Thanks, I didn't notice `input.warc.gz` was a link. The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for...
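A record whose content got one octet shorter while keeping the old digest can be caught by recomputing both values. The sketch below is a minimal, standard-library-only illustration, not warcio's code: it assumes the common `sha1:<base32>` digest form, and `warc_sha1_digest`, `check_block`, and the sample bytes are hypothetical names and data made up for the example.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form often seen in WARC headers."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def check_block(block: bytes, declared_length: int, declared_digest: str) -> list:
    """Compare a block against its declared length and digest; return any problems."""
    problems = []
    if len(block) != declared_length:
        problems.append(
            f"length mismatch: declared {declared_length}, actual {len(block)}"
        )
    if warc_sha1_digest(block) != declared_digest:
        problems.append("digest mismatch")
    return problems


# Hypothetical sample data: an "input" block and a copy that lost its last octet.
original = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
digest = warc_sha1_digest(original)
truncated = original[:-1]

print(check_block(original, len(original), digest))    # []
# Length was updated but the old digest was kept, as in the case described above.
print(check_block(truncated, len(truncated), digest))  # ['digest mismatch']
```

The second call models the situation described here: the length header matches the shorter content, so only the stale digest gives the change away.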

While I'm here I'll also mention that `Transfer-Encoding: chunked` is in both input and output headers, but it's not actually chunked. This is a common problem in warcs. warcio happily...
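One way to tell whether a body labelled `Transfer-Encoding: chunked` is really chunked is to try parsing it as a chunk stream. The following is a rough sketch of such a check using only the standard library; `looks_chunked` is a hypothetical helper, it is deliberately strict (the body must end exactly at the terminating empty line), and it is not how warcio itself decides.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def looks_chunked(body: bytes) -> bool:
    """Return True if body parses as a complete HTTP/1.1 chunked stream."""
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            return False
        # The chunk-size line may carry extensions after ';'. Ignore them.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            return False
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            # Last chunk: skip optional trailer fields up to the empty line.
            while True:
                eol = body.find(b"\r\n", pos)
                if eol == -1:
                    return False
                if eol == pos:
                    return eol + 2 == len(body)
                pos = eol + 2
        # Each chunk's data must be followed by CRLF.
        end = pos + size
        if body[end:end + 2] != b"\r\n":
            return False
        pos = end + 2


print(looks_chunked(b"5\r\nhello\r\n0\r\n\r\n"))  # True
print(looks_chunked(b"<html>hello</html>"))       # False
```

A stricter or looser policy may suit real archives better, for example tolerating trailing bytes after the final chunk.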

@ikreymer I see two choices, and you probably have an opinion: 1. Save the verbatim header bytes while reading, and use them on writing if the user has not modified anything,...

Didn't the header continuation lines originally have whitespace at the start? Can you mention the actual url so we can look at the actual headers?
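For context, a folded (continuation) header line is one that starts with a space or tab and belongs to the header line before it. The sketch below shows one plausible way to unfold such lines, assuming CRLF line endings; `unfold_headers` and the sample header block are hypothetical and are not taken from the record under discussion.

```python
def unfold_headers(raw: bytes) -> list:
    """Join continuation lines (leading space or tab) onto the preceding header."""
    lines = []
    for line in raw.split(b"\r\n"):
        if line[:1] in (b" ", b"\t") and lines:
            # Continuation: append to the previous header, separated by one space.
            lines[-1] += b" " + line.strip()
        elif line:
            lines.append(line)
    return lines


# Hypothetical header block with one folded value.
raw = b"X-Long: first part\r\n second part\r\nHost: example.com\r\n"
print(unfold_headers(raw))
```

If the leading whitespace was lost somewhere along the way, the second line would be read as a malformed header of its own rather than as a continuation, which is why the original bytes matter.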

Requests can have multiple cookies in one header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cookie), but responses cannot. The thing quoted is a response.

I do like the overall idea of having some error recovery. This code depends on read_to_end() working correctly, so it's only going to handle a limited set of possible corruptions....

Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard? There...

Well, if you'd like to look further, the develop branch of warcio supports "warcio check", which can be used to also check digests -- albeit with warcio's current opinion that...

These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices...