Greg Lindahl
The huge pull request https://github.com/newhouse/url-tracking-stripper/pull/71 and the request to let users add custom things to strip https://github.com/newhouse/url-tracking-stripper/issues/54 both cry out for this change.
Just noticed this one; a little googling says it's been around for a while, and that it's common enough that some reddit subs have banned using it: https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare It's not...
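For reference, a quick sketch (plain Python, not the extension's code) of what that link wraps: the `u` parameter is just a percent-encoded youtube.com path, and the `a` token is presumably the tracking part.

```python
from urllib.parse import parse_qs, urlparse

# The 'u' parameter of an attribution_link is a percent-encoded path on
# youtube.com; unwrapping it is enough to recover the plain watch URL.
wrapped = ('https://www.youtube.com/attribution_link'
           '?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare')
u = parse_qs(urlparse(wrapped).query)['u'][0]
print('https://www.youtube.com' + u)
# -> https://www.youtube.com/watch?v=pogq2tZFKKo&feature=share
```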
Looks like the fb_* ones are only relevant to facebook.com? In which case it'd be nice to have that be part of the stripping algorithm, so it wouldn't work on...
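A minimal sketch of what domain-scoped stripping could look like (Python for illustration; the names and the exact fb_* list here are hypothetical, not the extension's actual data structures):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical per-domain rules: fb_* params are only stripped on facebook.com.
DOMAIN_RULES = {
    'facebook.com': {'fbclid', 'fb_ref', 'fb_source', 'fb_action_ids', 'fb_action_types'},
}

def strip_tracking(url):
    parts = urlparse(url)
    host = parts.hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    doomed = DOMAIN_RULES.get(host, set())
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in doomed]
    return urlunparse(parts._replace(query=urlencode(kept)))

# fb_ref is dropped on facebook.com but left alone elsewhere:
print(strip_tracking('https://www.facebook.com/page?id=1&fb_ref=abc'))
print(strip_tracking('https://example.com/page?id=1&fb_ref=abc'))
```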
Thanks to Postel's Law, many webservers emit invalid HTTP. This is similar to many webpages being invalid HTML. Yet browsers display these pages. The HTML5 spec now standardizes how everyone...
I'm happy to do a PR for the above suggestions, but I hesitate to do so in communities where there is no discussion of issues.
The loop could be one iteration... in fact the example you're looking at just loops once (`limit=1`)
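For context, a sketch of the kind of loop being discussed (assuming the usual cdx_toolkit API; the URL here is a stand-in for whatever the example actually queries):

```python
import cdx_toolkit

# With limit=1 the loop body runs exactly once, even though it's written as a loop.
cdx = cdx_toolkit.CDXFetcher(source='cc')
for obj in cdx.iter('commoncrawl.org/*', limit=1):
    print(obj['url'], obj['timestamp'], obj['status'])
```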
Turn up the verbose level and you'll see what's going on -- if you are not limiting your time span, the cdx code has to talk to every Common Crawl...
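A sketch of what limiting the time span looks like (assuming cdx_toolkit's `from_ts`/`to` keywords and its use of the standard `logging` module for verbosity):

```python
import logging
import cdx_toolkit

# Turn up logging to watch which CDX endpoints actually get queried.
logging.basicConfig(level=logging.INFO)

cdx = cdx_toolkit.CDXFetcher(source='cc')
# Restricting the time span means only the matching Common Crawl indexes are
# consulted, instead of every one of them.
for obj in cdx.iter('commoncrawl.org/*', from_ts='20200101', to='20200401', limit=5):
    print(obj['timestamp'], obj['url'])
```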
Can you give some examples? The bug I was complaining about shouldn't affect any real usage.
OK, so Common Crawl is doing the right thing, and the closest-on-wayback issue is a problem on the Internet Archive side, something I can't control.
One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises. That's why warcio's digest verification has a complicated interface, with...
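A toy illustration of the problem (plain Python, not warcio's actual interface): once a generator raises, it is finished, so the caller cannot skip the bad record and keep going.

```python
def records(values):
    # Stand-in for an iterator over WARC records that hits a digest failure.
    for v in values:
        if v == 'bad':
            raise ValueError('digest mismatch')
        yield v

it = records(['a', 'bad', 'c'])
seen = []
try:
    for rec in it:
        seen.append(rec)
except ValueError:
    pass

print(seen)      # ['a']
print(list(it))  # [] -- the generator is exhausted; 'c' can never be reached
```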