pegasus
Restarts and incremental crawls
See the confusion in: #22
For what it's worth, I found two problems contributing to restart failures.
- `Factual/durable-queue` doesn't like to restore queues whose names contain periods. Since queue names are constructed from keywords of the host portion of URLs, this was causing a problem. (The restored queue name for `http://foo.org/Some/path` was `"org"`.) I resolved the problem by replacing the default enqueue pipeline component with one that substitutes `_` for `.` in queue names.
- The pipeline workers aren't restored, because the cache says that the host has been visited, so `pegasus.queue/setup-queue-worker` isn't called. I rewrote `pegasus.core/start-crawl` to `take!` the first entry from the to-visit queue with a 0 timeout. If it gets something, it constructs a queue worker for the queue name that will be associated with that URL. If it gets nothing, it does the normal seeding.
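The queue-name fix from the first bullet can be sketched as a small helper. This is a hypothetical sketch, not the actual patch; `sanitize-queue-name` is a name I made up for illustration:

```clojure
(ns crawl.sanitize
  (:require [clojure.string :as string]))

;; Hypothetical helper: durable-queue fails to restore queues whose
;; names contain periods, so substitute underscores before a host
;; keyword is used as a queue name.
(defn sanitize-queue-name
  "Turns a host like \"foo.org\" into \"foo_org\" so the queue can
  be restored correctly after a restart."
  [host]
  (string/replace host "." "_"))

;; (sanitize-queue-name "foo.org") ;=> "foo_org"
```

In pegasus this would live in a replacement for the default enqueue pipeline component, applied wherever the host keyword is derived.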
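The restart check from the second bullet might look roughly like this. It is a sketch under assumptions: `start-or-resume!`, `seed!`, and `setup-worker!` are placeholder names I introduced, not pegasus functions; only `take!` and `retry!` are real `durable-queue` calls:

```clojure
(ns crawl.restart
  (:require [durable-queue :as dq]))

(defn start-or-resume!
  "Hypothetical sketch of the restart logic described above.
  `q` is the durable-queue queues object, `queue-name` the to-visit
  queue's keyword, and `seed!`/`setup-worker!` are callbacks standing
  in for pegasus's normal seeding and queue-worker setup."
  [q queue-name seed! setup-worker!]
  ;; take! with a 0 ms timeout returns the timeout value (::empty)
  ;; immediately if the restored queue holds nothing.
  (let [entry (dq/take! q queue-name 0 ::empty)]
    (if (= ::empty entry)
      ;; Nothing survived the restart: do the normal seeding.
      (seed!)
      (do
        ;; An entry was restored: put it back on the queue and attach
        ;; a worker to that queue instead of re-seeding, since the
        ;; cache already marks the host as visited.
        (dq/retry! entry)
        (setup-worker! queue-name)))))
```

The 0 timeout is the key trick: it turns `take!` into a non-blocking peek at whether the restored queue has pending work.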
Solid. Would really appreciate a PR!
Will do. I'm working through a fork of a fork that a colleague made. I'll put some effort into folding the changes into defaults and core.