Greg Lindahl
The huge pull request https://github.com/newhouse/url-tracking-stripper/pull/71 and the request to let users add custom things to strip https://github.com/newhouse/url-tracking-stripper/issues/54 both cry out for this change.
Just noticed this one; a little googling says it's been around for a while, and that it's common enough that some reddit subs have banned using it: https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare It's not...
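For reference, a quick sketch (plain Python, not the extension's code) of what that link wraps: the `u` parameter is just a percent-encoded youtube.com path, and the `a` token is presumably the tracking part.

```python
from urllib.parse import parse_qs, urlparse

# The 'u' parameter of an attribution_link is a percent-encoded path on
# youtube.com; unwrapping it is enough to recover the plain watch URL.
wrapped = ('https://www.youtube.com/attribution_link'
           '?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare')
u = parse_qs(urlparse(wrapped).query)['u'][0]
print('https://www.youtube.com' + u)
# -> https://www.youtube.com/watch?v=pogq2tZFKKo&feature=share
```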
Looks like the fb_* ones are only relevant to facebook.com? In which case it'd be nice to have that be part of the stripping algorithm, so it wouldn't work on...
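A minimal sketch of what domain-scoped stripping could look like (Python for illustration; the names and the exact fb_* list here are hypothetical, not the extension's actual data structures):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical per-domain rules: fb_* params are only stripped on facebook.com.
DOMAIN_RULES = {
    'facebook.com': {'fbclid', 'fb_ref', 'fb_source', 'fb_action_ids', 'fb_action_types'},
}

def strip_tracking(url):
    parts = urlparse(url)
    host = parts.hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    doomed = DOMAIN_RULES.get(host, set())
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in doomed]
    return urlunparse(parts._replace(query=urlencode(kept)))

# fb_ref is dropped on facebook.com but left alone elsewhere:
print(strip_tracking('https://www.facebook.com/page?id=1&fb_ref=abc'))
print(strip_tracking('https://example.com/page?id=1&fb_ref=abc'))
```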
Thanks to Postel's Law, many webservers emit invalid HTTP. This is similar to many webpages being invalid HTML. Yet browsers display these pages. The HTML5 spec now standardizes how everyone...
I'm happy to do a PR for the above suggestions, but I hesitate to do so in communities where there is no discussion of issues.
The loop could be one iteration... in fact the example you're looking at just loops once (`limit=1`)
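For context, a sketch of the kind of loop being discussed (assuming the usual cdx_toolkit API; the URL here is a stand-in for whatever the example actually queries):

```python
import cdx_toolkit

# With limit=1 the loop body runs exactly once, even though it's written as a loop.
cdx = cdx_toolkit.CDXFetcher(source='cc')
for obj in cdx.iter('commoncrawl.org/*', limit=1):
    print(obj['url'], obj['timestamp'], obj['status'])
```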
Turn up the verbose level and you'll see what's going on -- if you are not limiting your time span, the cdx code has to talk to every Common Crawl...
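A sketch of what limiting the time span looks like (assuming cdx_toolkit's `from_ts`/`to` keywords and its use of the standard `logging` module for verbosity):

```python
import logging
import cdx_toolkit

# Turn up logging to watch which CDX endpoints actually get queried.
logging.basicConfig(level=logging.INFO)

cdx = cdx_toolkit.CDXFetcher(source='cc')
# Restricting the time span means only the matching Common Crawl indexes are
# consulted, instead of every one of them.
for obj in cdx.iter('commoncrawl.org/*', from_ts='20200101', to='20200401', limit=5):
    print(obj['timestamp'], obj['url'])
```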
Can you give some examples? The bug I was complaining about shouldn't affect any real usage.
OK, so Common Crawl is doing the right thing, and the closest-on-wayback issue is a problem on the Internet Archive side, something I can't control.
One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises. That's why warcio's digest verification has a complicated interface, with...
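A toy illustration of the problem (plain Python, not warcio's actual interface): once a generator raises, it is finished, so the caller cannot skip the bad record and keep going.

```python
def records(values):
    # Stand-in for an iterator over WARC records that hits a digest failure.
    for v in values:
        if v == 'bad':
            raise ValueError('digest mismatch')
        yield v

it = records(['a', 'bad', 'c'])
seen = []
try:
    for rec in it:
        seen.append(rec)
except ValueError:
    pass

print(seen)      # ['a']
print(list(it))  # [] -- the generator is exhausted; 'c' can never be reached
```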