Greg Lindahl
1.0 and 1.1 specify `labelled-digest = algorithm ":" digest-value`, and `digest-value` is a `token`. "/" and "=" are not valid characters for a token. "/" is in the...
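A minimal sketch of the token check described above, assuming `token` follows the RFC 2616-style grammar (one or more US-ASCII characters, excluding control characters and the separator set). The names `SEPARATORS`, `is_token`, and `invalid_token_chars` are hypothetical helpers for illustration and do not come from any library.

```python
# Separator characters, assumed from the RFC 2616-style token grammar.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(value: str) -> bool:
    """Return True if value is a non-empty run of US-ASCII characters
    that are neither control characters nor separators."""
    if not value:
        return False
    for ch in value:
        code = ord(ch)
        # Reject control characters (0-31), DEL (127), and non-ASCII.
        if code <= 31 or code >= 127:
            return False
        if ch in SEPARATORS:
            return False
    return True


def invalid_token_chars(value: str) -> list:
    """Return the sorted distinct characters in value that a token may not contain."""
    return sorted({ch for ch in value if not is_token(ch)})
```

Under these assumptions a base32 value without padding passes, while base64 values containing "/" or "=" padding do not: `invalid_token_chars("ab/c==")` reports both offending characters.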
I am converting a webpage which is not utf8. There's no way to specify an input encoding to your tool, so only utf8 is permissible. Most of my favorite webpage...
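One possible workaround, sketched under the assumption that the page's real encoding is known in advance: transcode the bytes to UTF-8 before handing them to the tool. The helper name `to_utf8` is hypothetical, and the sketch uses only Python's built-in codecs; it does not detect the encoding.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if raw is not valid in source_encoding.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")
```

For example, `to_utf8(page_bytes, "latin-1")` would produce UTF-8 bytes for a Latin-1 page, where `page_bytes` stands in for the downloaded content.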
Please see https://github.com/commoncrawl/cc-webgraph/issues/18 for an example problem and solution. @sebastian-nagel you were out when I wrote the advice about wget/curl/awscli that's at https://status.commoncrawl.org/
Can you stop attacking the Common Crawl CDX API?
This works: `$ cdxt --cc --limit 1 iter www.pbm.com/* --all-fields` This does not: `$ cdxt --cc --limit 1 iter www.pbm.com/* --all-fields --json`
In all of that flurry of CI work: while the CI explicitly installs setuptools for >=py3.12, ... after we release a new pypi version, are py3.12-using end-users going to have...
Related to #738, I would like to create any new controlled language needed to describe a crawled dataset. I propose: - [ ] I will write up a single...
We recently noticed that someone crawling from idris.fr was impersonating CCBot with the user-agent string. We contacted IDRIS and they said they would stop doing it. It would be good...