Ivan Kozik

135 comments by Ivan Kozik

I do not know which install steps you followed, but it doesn't look like anything from the grab-site README. I can't really support anything but the various install steps documented...

I have added a `--no-dupespotter` option in 7a63a3dcd113e11de218fe6bb0c3ad03153a6954 and I might have it enabled by default in the future.

Indeed, it takes only the last `--wpull-args`. I'll leave this open until I figure out whether they can/should be combined if used multiple times.
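To illustrate the difference between "last one wins" and combining repeated options, here is a minimal sketch using the standard library's `argparse`. This is not grab-site's actual option parser; the function names `parse_last_wins` and `parse_combined` are hypothetical, and splitting each value with `shlex` is an assumption about how the pieces might be joined.

```python
import argparse
import shlex


def parse_last_wins(argv):
    """Default 'store' behaviour: a repeated option keeps only its last value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--wpull-args", default="")
    return shlex.split(parser.parse_args(argv).wpull_args)


def parse_combined(argv):
    """'append' behaviour: every occurrence is kept, then concatenated in order."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--wpull-args", action="append", default=[])
    combined = []
    for chunk in parser.parse_args(argv).wpull_args:
        # Each occurrence may itself hold several shell-style arguments.
        combined.extend(shlex.split(chunk))
    return combined


argv = ["--wpull-args=--retry-connrefused", "--wpull-args=--retry-dns-error"]
last_only = parse_last_wins(argv)
both = parse_combined(argv)
```

With the first parser the earlier occurrence is silently dropped, which matches the behaviour described above; the second shows one way the values could be merged if that turns out to be desirable.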

Are `--retry-connrefused --retry-dns-error` something that grab-site should have on by default?

Can you first check the .warc.gz file with `zgrep -F URL FILE.warc.gz`? You can use `-C N` to display more context lines. It's plausible that the WARC playback software you're...
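If `zgrep` is not available, roughly the same check can be done with Python's standard library. This is only a sketch: the helper name `grep_warc` and its `context` parameter are made up for illustration, and it does a plain substring match on raw lines rather than parsing WARC records.

```python
import gzip
from collections import deque


def grep_warc(path, needle, context=0):
    """Yield (line_number, raw_line) for lines containing `needle`,
    plus up to `context` lines before and after each match."""
    needle_bytes = needle.encode("utf-8")
    before = deque(maxlen=context)  # rolling buffer of preceding lines
    after_remaining = 0             # trailing context lines still to emit
    # gzip.open reads through concatenated gzip members, as found in .warc.gz
    with gzip.open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if needle_bytes in line:
                yield from before
                before.clear()
                yield lineno, line
                after_remaining = context
            elif after_remaining > 0:
                yield lineno, line
                after_remaining -= 1
            else:
                before.append((lineno, line))
```

For example, `for n, line in grep_warc("FILE.warc.gz", "URL", context=2): print(n, line)` prints each matching line with two lines of context on either side.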

And do you know if the crawl actually made requests that would result in those JSON/XMLHttpRequest responses? grab-site/wpull isn't going to execute JavaScript (unless used with the phantomjs mode that...

Unfortunately, I don't think there's a way to grab the same URL twice with different request headers in the same crawl. The database in wpull assumes that one successful response...

wpull probably needs a new hook/API for this. Ideally [`accept_url`](https://github.com/ludios/grab-site/blob/bd375b31f376adacc7324ca5f06265ce7762fa4a/libgrabsite/wpull_hooks.py#L232) could just generate parent URLs and feed them into wpull to be queued. (Actually, maybe I can navigate to the...
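The "generate parent URLs" part could look something like the sketch below. It only computes the parent directory URLs for a given URL; how they would be handed to wpull for queueing is the open question and is not shown. The function name `parent_urls` is hypothetical, and dropping the query string and fragment is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the parent directory URLs of `url`, nearest parent first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing path segment at a time until only "/" is left.
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i])
        if not path.endswith("/"):
            path += "/"
        # Query and fragment are discarded: they rarely apply to a parent.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


example = parent_urls("http://example.com/a/b/c.html?x=1")
```

Here `example` holds `http://example.com/a/b/`, `http://example.com/a/`, and `http://example.com/`, in that order. A URL that is already the site root has no parents and gives an empty list.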

@chfoo, is it safe to call `wpull_hook.factory.get('URLTable').add_many(...)` to feed in extra URLs to crawl?