
Parser / Downloader stuck in infinite loop if it did not reach max_num

Open zMynxx opened this issue 11 months ago • 7 comments

When using the crawler I've encountered an issue with max_num. In cases where fewer images are found than the provided "max_num", an infinite loop begins, so the run never completes and the results gathered so far are never actually delivered. The expected behavior is an immediate stop when there is nothing left to download, exiting safely.

The following example uses the greedy crawler with a URL pointing at the Flickr search engine, like so:

  • search_phrase set to "gripper"
  • max_num set to 30
from icrawler.builtin import GreedyImageCrawler

root_dir = "images"  # assumed output directory; the original defines this elsewhere

def test_flicker(search_phrase: str, max_num: int) -> None:
    print("start testing FlickerImageCrawler")
    greedy_crawler = GreedyImageCrawler(parser_threads=4, storage={"root_dir": root_dir})
    greedy_crawler.crawl(f"https://www.flickr.com/search/?q={search_phrase}", max_num=max_num, min_size=(100, 100))
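
Called with the settings listed above:

test_flicker("gripper", 30)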

Result (image #27 downloaded, then the infinite loop):

INFO - downloader - image #27\thttps://combo.staticflickr.com/pw/images/favicons/f>
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - downloader - downloader-001 is waiting for new download tasks
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls
INFO - parser - parser-001 is waiting for new page urls
INFO - parser - parser-002 is waiting for new page urls
INFO - parser - parser-003 is waiting for new page urls
INFO - parser - parser-004 is waiting for new page urls

zMynxx avatar Dec 08 '24 11:12 zMynxx

Thank you for raising this issue, this seems very interesting. I'll look into it.

ZhiyuanChen avatar Dec 10 '24 06:12 ZhiyuanChen

Any updates? If not, could you please introduce a timeout mechanism in the meantime? Say 30s or so. That way the crawler would still be functional. At the moment it is unstable to the point where I cannot use it :(
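
For now the best workaround I can think of is to run the crawl in a separate process and kill it after a deadline, roughly like this (an untested sketch; crawl_with_timeout and the 30-second value are just my placeholders, and test_flicker is the repro function from above):

import multiprocessing

def crawl_with_timeout(timeout_s: float = 30.0) -> None:
    # Run the crawl in a child process so a hung crawler can be terminated
    # without killing the main program.
    proc = multiprocessing.Process(target=test_flicker, args=("gripper", 30))
    proc.start()
    proc.join(timeout_s)  # wait up to timeout_s seconds for a clean exit
    if proc.is_alive():
        # Still running: assume it is stuck in the idle loop and stop it.
        proc.terminate()
        proc.join()

Images already written to root_dir survive the terminate, since each image is saved to disk as it is downloaded.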

zMynxx avatar Dec 22 '24 13:12 zMynxx

Sorry for this late reply.

Can you try if the latest commit fixes your issue?

ZhiyuanChen avatar Jan 02 '25 08:01 ZhiyuanChen

> Sorry for this late reply.
>
> Can you try if the latest commit fixes your issue?

Sorry I missed your reply, I probably missed the notification. I have another issue so I came to check on this one; I'll definitely give it a go.

Regarding v0.6.10: the examples were not updated. I guess that if I want to set max_idle_time explicitly, I add it to the .crawl() invocation as a final argument? E.g. add max_idle_time=120 here? https://github.com/hellock/icrawler/blob/f7f610795c6000f54a0f632cf38bf590d74a06ac/examples/crawl.py#L21
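
Something like this is what I mean (a sketch, assuming crawl() accepts max_idle_time as a keyword argument in v0.6.10):

greedy_crawler.crawl(
    f"https://www.flickr.com/search/?q={search_phrase}",
    max_num=max_num,
    min_size=(100, 100),
    max_idle_time=120,  # stop workers after 120s with no new tasks
)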

zMynxx avatar Feb 25 '25 08:02 zMynxx

Yes, that's correct.

ZhiyuanChen avatar Mar 13 '25 13:03 ZhiyuanChen

It's touch and go really; sometimes it works and other times it loops. I'll scout more info tomorrow, thanks for replying, though.

zMynxx avatar Mar 26 '25 16:03 zMynxx

> It's touch and go really; sometimes it works and other times it loops.

That's strange, please let me know if you find anything -- maybe we can work this out together!

> I'll scout more info tomorrow, thanks for replying, though.

No worries, glad I can help -- that's the only important thing.

ZhiyuanChen avatar Mar 26 '25 18:03 ZhiyuanChen