Greg Lindahl
The original intent for www\d+ normalization was to fix a problem with load-balancing implementations in the early web. I tried to find some examples in Common Crawl's 2007+ archive, but...
We at Common Crawl want to stop normalizing www\d+ and eventually rebuild all of our indexes that way. I support adding that kind of option across the whole tool ecosystem.
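For anyone unfamiliar with the normalization at issue, here's a minimal sketch of what collapsing www\d+ looks like -- a hypothetical helper, not Common Crawl's actual index code:

```python
import re

# Collapse a leading www\d+ label (www1., www2., ...) when building an
# index key, which is the normalization being discussed above.
WWW_N_RE = re.compile(r"^www\d+\.")

def normalize_host(host: str) -> str:
    """Strip a leading www1., www2., ... label from a hostname."""
    return WWW_N_RE.sub("", host.lower())

print(normalize_host("www2.example.com"))  # example.com
print(normalize_host("blog.example.com"))  # blog.example.com (unchanged)
```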
The streaming interface does the right thing: if you call response.content.read(1000) and then response.close(), it looks like only the first 256k of the response is read (a single read). I observed...
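As a sketch of that access pattern, assuming an aiohttp-style client where response.content is a stream with a .read() method (the client under discussion may differ):

```python
import asyncio
import aiohttp

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Illustrative URL, not a real object path.
        resp = await session.get("https://data.commoncrawl.org/")
        first = await resp.content.read(1000)  # triggers a single buffered read
        resp.close()  # the rest of the body is never pulled off the wire
        print(len(first))

asyncio.run(main())
```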
One thing that's missing is that our filenames in warc.paths.gz are relative to the bucket. There are two possible prefixes: https://data.commoncrawl.org/ and s3://commoncrawl/. This is a pretty common pattern for...
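A minimal sketch of consuming that file, assuming a local copy of warc.paths.gz; pick whichever prefix matches how you're fetching:

```python
import gzip

# Both prefixes resolve to the same objects; the paths file itself
# only contains bucket-relative names.
HTTP_PREFIX = "https://data.commoncrawl.org/"
S3_PREFIX = "s3://commoncrawl/"

with gzip.open("warc.paths.gz", "rt") as f:
    for line in f:
        rel = line.strip()
        print(HTTP_PREFIX + rel)  # or S3_PREFIX + rel for S3 clients
```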
Tessa @tw4l is working on getting rid of the httpbin version dependency in PR 153 https://github.com/webrecorder/warcio/pull/153 -- and she's setting up GitHub Actions so we'll have CI again. Once that's...
I think this suggestion predates RFC 9309, which is the new hotness in the robots.txt space.
@sebastian-nagel is getting good at reading my mind! I was suggesting "rfc9309" as a value, since it's different from "classic".
SemVer has "build metadata" that we can use -- it looks like 1.0.0+1.0.2, which means the version is 1.0.0 and the build is 1.0.2. We would keep 1.0.0 constant...
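A minimal sketch with the third-party python-semver package (an assumed choice of parser; any SemVer-compliant parser behaves the same way):

```python
import semver  # pip install semver

v = semver.VersionInfo.parse("1.0.0+1.0.2")
print(v.major, v.minor, v.patch)  # 1 0 0  <- the spec version stays 1.0.0
print(v.build)                    # 1.0.2  <- the build metadata
# Per the SemVer spec, build metadata is ignored for precedence, so
# two builds of the same version compare as equal:
assert semver.VersionInfo.parse("1.0.0+1.0.2") == semver.VersionInfo.parse("1.0.0+1.0.3")
```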
We are going to use this SemVer feature in our Croissants. @benjelloun I think it would be useful to mention it in the 1.1 🥐 spec.
Sounds great! Thank you.