Ivan Kozik

148 comments by Ivan Kozik

I agree, and this probably isn't that hard to do. Perhaps an optional second (tab-delimited) column in the `ignores` file could specify the required source URL as another regexp.
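To illustrate the proposal (a hypothetical format; grab-site's `ignores` files are currently one regexp per line), an entry could pair the pattern to ignore with a second, tab-separated pattern that the discovering page's URL must match:

```
^https?://forum\.example\.com/.+\?sort=	^https?://forum\.example\.com/thread/
```

The two columns above would be tab-delimited; both regexps are placeholders, not patterns grab-site ships with.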

This wouldn't help you notice a `Segmentation fault` because grab-site can't send a message to gs-server when it segfaults. (ArchiveBot has a different design where a parent process is responsible...

Remembering the last few events for each job wouldn't slow down gs-server. It's probably not too hard to implement either (just use a Python deque).
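A minimal sketch of what that could look like inside gs-server (the dict name, the event shape, and the cap of 100 events are assumptions for illustration):

```python
from collections import deque

MAX_EVENTS = 100          # assumed cap on remembered events per job
recent_events = {}        # job_id -> deque of the most recent events

def remember_event(job_id, event):
    # A bounded deque discards the oldest entry automatically, so memory
    # stays constant no matter how long a job runs.
    recent_events.setdefault(job_id, deque(maxlen=MAX_EVENTS)).append(event)

def last_events(job_id):
    # Hand back the remembered events, e.g. to a newly connected dashboard client.
    return list(recent_events.get(job_id, ()))
```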

I haven't tried it, but I assume so. It should not be necessary to use a tool with so many restrictions though. > As a workaround, you can use the...

My first step would be to review the things on https://github.com/search?q=cloudflare+scrape+fork%3Atrue and see if anyone has solved the problem in a satisfactory manner where it can just be imported from...

Stupid workaround: make a copy of `wpull.db` with `cp` and run `gs-dump-urls` on that. In the rare case that you get an unreadably inconsistent copy, try again.
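If you want to script the workaround, a sketch (the paths, the `todo` status, and the argument order are assumptions; adjust to your crawl directory and the URL status you actually want to dump):

```python
import shutil
import subprocess

# Copy the live database first so gs-dump-urls never reads a file that
# wpull is actively writing to, then dump from the copy.
shutil.copy("wpull.db", "wpull-copy.db")
subprocess.run(["gs-dump-urls", "wpull-copy.db", "todo"], check=True)
```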

grab-site does keep crawling on errors like a connection error. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it...

grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
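For example, a caller that wants to react to this case could check the exit code after the crawl finishes (a sketch; the numeric value 4 is assumed from wpull's ExitStatus convention, which mirrors wget's exit codes, so verify it against your wpull version):

```python
import subprocess

# Run a crawl and inspect the exit status afterwards.  Exit code 4 is
# assumed to correspond to wpull.errors.ExitStatus.network_failure.
result = subprocess.run(["grab-site", "https://example.com/"])
if result.returncode == 4:
    print("At least one request hit a network failure; "
          "the crawl itself may still have run to completion.")
```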

This is a good idea. I think wpull already passes the request into `wait_time` but my hooks don't look at the object.

This is now possible with a custom hook: upgrade to grab-site 1.4.0, then use a custom hook with:

```python
wait_time_grabsite = wpull_hook.callbacks.wait_time

def wait_time(seconds, url_info, record_info, response_info, error_info):
    url = ...
```
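A fuller sketch of what such a hook file might contain (the per-host delay, the hostname, and reading `url_info["url"]` as a plain dict key are assumptions based on wpull's hook API, not something the excerpt above confirms; `wpull_hook` is the object made available to custom hook files when they run):

```python
# custom_hooks.py -- sketch of a wait_time hook following the signature above.
# The delay policy below is illustrative, not grab-site's built-in behaviour.
wait_time_grabsite = wpull_hook.callbacks.wait_time

def wait_time(seconds, url_info, record_info, response_info, error_info):
    url = url_info["url"]
    if "slow-host.example" in url:
        # Assumed policy: space out requests to one fragile host.
        return 10.0
    # Defer to grab-site's normal wait_time logic for everything else.
    return wait_time_grabsite(seconds, url_info, record_info,
                              response_info, error_info)

# Register the replacement hook with wpull.
wpull_hook.callbacks.wait_time = wait_time
```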