
parser error(?) in flickr crawler

Open kyung-wook opened this issue 6 years ago • 1 comments

When I tell the crawler to fetch 1000 images, I get the message <parser - no more page urls for thread parser-001 to parse> at around the 500th image.

Does that mean there are no more images? When I search on the Flickr site, there are hundreds of thousands more.

kyung-wook, May 30 '18 09:05

If anyone else gets this error, it appears to be caused by the output queue of the FlickrFeeder being sized at 5*num_threads, which means the queue holds at most 5 pages of results, at 100 results per page, before it is full. Because the FlickrFeeder is set not to block on a full queue, it overwrites the contents and then exits once it has fetched its maximum of 40 pages. By that time the consumers of the queue haven't had a chance to do much, so you get at most 500 results.

Changing the line self.output(complete_url, block=False) to self.output(complete_url, block=True) in FlickrFeeder in icrawler/builtin/flickr.py appears to fix the problem.
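
The effect of the block flag on a small bounded queue can be reproduced outside icrawler. The following is a minimal sketch that uses Python's standard queue.Queue as a stand-in, not icrawler's own classes; the names feed, crawl and DONE are hypothetical, and the queue size assumes a single parser thread. Note one difference from the description above: a non-blocking put on a full standard-library queue raises queue.Full, so this sketch drops the page rather than overwriting anything.

```python
import queue
import threading
import time

RESULTS_PER_PAGE = 100  # per the comment above
MAX_PAGES = 40          # per the comment above
QUEUE_SIZE = 5          # 5 * num_threads, assuming a single parser thread
DONE = object()         # hypothetical end-of-feed marker for this sketch


def feed(out_queue, block):
    """Stand-in for the feeder: try to enqueue every page number."""
    for page in range(1, MAX_PAGES + 1):
        try:
            out_queue.put(page, block=block)
        except queue.Full:
            # Raised only when block=False and the queue is full.
            # This sketch simply loses the page.
            pass
    out_queue.put(DONE)  # always blocking, so the consumer knows when to stop


def crawl(block):
    """Return how many results a late-starting consumer ends up with."""
    out_queue = queue.Queue(maxsize=QUEUE_SIZE)
    feeder = threading.Thread(target=feed, args=(out_queue, block))
    feeder.start()

    time.sleep(0.1)  # the consumer is slow to start, as described above

    pages = 0
    while out_queue.get() is not DONE:
        pages += 1
    feeder.join()
    return pages * RESULTS_PER_PAGE


if __name__ == "__main__":
    print("block=False:", crawl(block=False))
    print("block=True: ", crawl(block=True))
```

Under these assumptions the non-blocking run should stop at 500 results, because only the first 5 pages fit in the queue before the consumer starts, while the blocking run should deliver all 40 pages (4000 results), because the feeder waits for free space instead of losing pages.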

stiansel, Jul 24 '18 09:07