
Restarts and incremental crawls

Open shriphani opened this issue 8 years ago • 3 comments

See the confusion in: #22

shriphani, Jun 03 '16 07:06

For what it's worth, I found two problems that contribute to restarts failing.

  1. Factual/durable-queue doesn't correctly restore queues whose names contain periods. Since queue names are constructed as keywords from the host portion of URLs, this was causing a problem. (The restored queue name for http://foo.org/Some/path was "org".) I resolved the problem by replacing the default enqueue pipeline component with one that substitutes _ for . in queue names.
  2. The pipeline workers aren't restored: the cache says that the host has already been visited, so pegasus.queue/setup-queue-worker is never called. I rewrote pegasus.core/start-crawl to take! the first entry from the to-visit queue with a timeout of 0. If it gets something, it constructs a queue worker for the queue name that will be associated with that URL. If it gets nothing, it does the normal seeding.
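
For anyone following along, the two fixes above can be roughly sketched as below. This is only an illustration, not the code from the fork: host->queue-name and resume-or-seed! are hypothetical names, and take-pending!, start-worker! and seed! are stand-in callbacks for the zero-timeout take! on the to-visit queue, the queue-worker setup, and the normal seeding step. None of this is pegasus's actual API.

```clojure
(ns restart-sketch
  (:require [clojure.string :as str]))

(defn host->queue-name
  "Hypothetical helper: derive a queue keyword from the host of a URL,
   substituting _ for . so the name contains no periods."
  [url]
  (-> (java.net.URI. url)
      .getHost
      (str/replace "." "_")
      keyword))

(defn resume-or-seed!
  "Hypothetical restart logic. All three callbacks are assumptions:
     take-pending!  zero-arg fn returning a pending URL or nil
                    (stands in for a take! with 0 timeout)
     start-worker!  fn of a queue name (stands in for worker setup)
     seed!          zero-arg fn that performs the normal seeding"
  [{:keys [take-pending! start-worker! seed!]}]
  (if-let [url (take-pending!)]
    ;; Something was left over from a previous run: rebuild the worker
    ;; for the queue that this URL maps to.
    (do (start-worker! (host->queue-name url))
        [:resumed url])
    ;; Nothing pending: treat this as a fresh crawl.
    (do (seed!)
        [:seeded])))
```

With this sketch, (host->queue-name "http://foo.org/Some/path") gives :foo_org instead of a name containing a period. A real version would also have to make sure the entry it took from the to-visit queue still gets processed or put back, which the sketch leaves out.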

ejschoen, Oct 31 '16 23:10

Solid. Would really appreciate a PR!

shriphani, Nov 02 '16 19:11

Will do. I'm working through a fork of a fork that a colleague made. I'll put some effort into folding the changes into defaults and core.

ejschoen, Nov 03 '16 13:11