Greg Lindahl
People who are new to datasets are probably going to be confused by the copyright/license fields. The vocabulary names we have now don't really get across that the underlying data...
I'm a fan of this idea. Recently, I've been trying to quantify "bot defenses" being used to stop our crawler. There are many patterns used by bot defenses. In the...
I'd guess that Browsertrix or other browser-based tools already generate IPv6 traffic -- what do they do with these addresses? Also, wget?
Appreciate your politeness.
I did share my point of view.
Using aiohttp as a client in a web crawler, here are some invalid cookies I observed being sent out by top-million websites: WARNING:aiohttp.client:Can not load response cookies: Illegal key 'ISAWPLB{381DEC7D-8336-4B7A-B144-62C8A8EBBC2A}'...
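For anyone trying to reproduce the warning above without running a crawler, here is a minimal sketch using only Python's standard-library http.cookies module. It assumes the "Illegal key" text originates from http.cookies.CookieError, which matches the message format in the warning. The helper name try_set_cookie and the placeholder value "x" are made up for illustration; only the cookie name comes from the observed warning.

```python
from http.cookies import CookieError, SimpleCookie
from typing import Optional


def try_set_cookie(name: str, value: str) -> Optional[str]:
    """Return None if the stdlib accepts the cookie name, else the error text.

    Hypothetical helper for illustration; not part of aiohttp.
    """
    jar = SimpleCookie()
    try:
        # Morsel.set() validates the key and raises CookieError on illegal characters.
        jar[name] = value
    except CookieError as exc:
        return str(exc)
    return None


# Cookie name taken from the warning above; braces are not legal key characters.
bad_name = "ISAWPLB{381DEC7D-8336-4B7A-B144-62C8A8EBBC2A}"
bad_result = try_set_cookie(bad_name, "x")   # "x" is a placeholder value
good_result = try_set_cookie("session", "x")

print("bad:", bad_result)
print("good:", good_result)
```

The curly braces are what trip the check, so a crawler that needs these cookies would have to parse the Set-Cookie header itself rather than rely on the strict stdlib parser.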
Thank you for this monkey patch!
I'm not sure how early Stanford WebBase started. While the physical disks still exist, they've been powered down for a long time. So there's a possible source.
This is a little late, but, if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own.
... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor...