icrawler
parser error(?) in flickr crawler
When I ask the crawler to fetch 1000 images, I get the message <parser - no more page urls for thread parser-001 to parse> at around the 500th image.
Does that mean there are no more images? When I search on the Flickr site, there are hundreds of thousands more.
If anyone else gets this error, it appears to be caused by the output queue of the FlickrFeeder being sized at 5*num_threads, which means it holds at most 5 pages' worth of results, at 100 results a page, before it is full. Because the FlickrFeeder is set not to block on a full queue, this causes it to overwrite the contents and then exit once it has fetched all of its maximum 40 pages. By that time the consumers of the queue haven't had a chance to do much yet, so you get at most 500 results (5 pages x 100 results).
Changing the line self.output(complete_url, block=False)
to self.output(complete_url, block=True)
in FlickrFeeder in icrawler/builtin/flickr.py appears to fix the problem.
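For anyone who wants to see the effect in isolation, here is a rough sketch using only Python's standard-library queue module. It is not icrawler code: feed_pages, the 5-slot queue and the 40 pages are stand-ins for the numbers above, and the sleep just simulates a slow parser. Note also that a plain queue.Queue raises queue.Full on a non-blocking put to a full queue rather than overwriting anything, so icrawler's own queue may behave differently in detail. The outcome is the same either way: a feeder that doesn't wait for its consumers loses most of its pages.

```python
import queue
import threading
import time


def feed_pages(total_pages, queue_size, block):
    """Push page numbers into a bounded queue while a slow consumer drains it.

    Hypothetical helper for illustration only (not part of icrawler).
    Returns the list of pages the consumer actually received.
    """
    q = queue.Queue(maxsize=queue_size)
    received = []

    def consumer():
        while True:
            page = q.get()
            if page is None:  # sentinel: the feeder has finished
                return
            time.sleep(0.005)  # simulate slow parsing of one page
            received.append(page)

    worker = threading.Thread(target=consumer)
    worker.start()

    for page in range(1, total_pages + 1):
        try:
            q.put(page, block=block)
        except queue.Full:
            # Standard-library behaviour: the page is simply dropped.
            pass

    q.put(None)  # always block for the sentinel so the consumer can exit
    worker.join()
    return received


if __name__ == "__main__":
    print("block=False:", len(feed_pages(40, 5, block=False)), "pages parsed")
    print("block=True: ", len(feed_pages(40, 5, block=True)), "pages parsed")
```

With block=True the feeder waits whenever the queue is full, so every page reaches the consumer; with block=False only the handful that fit in the queue get through.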