
Sitemap spider does not resume though JOBDIR is set

Open Arregator opened this issue 5 years ago • 4 comments

Description

A spider inheriting from SitemapSpider parses a site's sitemaps, starting from robots.txt, and has JOBDIR set.

I run it as a CentOS 8.x service with a unit file defined and it runs just fine.

But after I stop the service (memory leaks or something) and run it again, it starts the spider and closes it immediately. It does not resume the job. The only way to start it again is to remove the directory set in the JOBDIR setting, but then it starts parsing the target site all over again.

Steps to Reproduce

  1. Define a SitemapSpider, set a target site:
sitemap_urls = [
    r'https://<target_site>/robots.txt',
]

and define JOBDIR = 'jobs' in settings.py (a minimal sketch of this setup follows the list).
2. Run the spider.
3. Stop the spider.
4. Run the spider again.
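For reference, a minimal sketch of the setup from step 1 and the JOBDIR setting; the spider name and target domain are placeholders:

# settings.py (relevant part)
JOBDIR = 'jobs'

# myspider.py: a minimal SitemapSpider as described above (names are hypothetical)
from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'my_sitemap_spider'
    sitemap_urls = [
        r'https://<target_site>/robots.txt',
    ]

    def parse(self, response):
        # Default callback from SitemapSpider's sitemap_rules; just record the page URL.
        yield {'url': response.url}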

Expected behavior: I would expect the spider to resume its job, perhaps re-parsing a few pages from the previous run, but not having to start parsing the site from the very beginning.

Actual behavior: The spider starts and finishes immediately.

Reproduces how often: The problem is reproduced consistently

Versions

Scrapy 2.0

Output of scrapy version --verbose:

Scrapy       : 2.0.1
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.7.6 (default, Jan  8 2020, 19:59:22) - [GCC 7.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1e  17 Mar 2020)
cryptography : 2.8
Platform     : Linux-4.18.0-80.11.2.el8_0.x86_64-x86_64-with-centos-8.0.1905-Core

Additional context

The log file fragment:

2020-04-09 19:40:57 [scrapy.core.engine] INFO: Spider opened
2020-04-09 19:40:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-09 19:40:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-09 19:40:57 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 50, reanimated: 0, mean backoff time: 0s)
2020-04-09 19:40:57 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://<target_site>/robots.txt> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-04-09 19:40:57 [scrapy.core.engine] INFO: Closing spider (finished)

The contents of the JOBDIR after a first run:

drwxr-xr-x. 2 root root 4,0K apr  9 19:25 requests.queue/
-rw-r--r--. 1 root root 6,9M apr  9 19:26 requests.seen
-rw-r--r--. 1 root root    6 apr  9 19:40 spider.state

This is a service unit file I use to start the spider:

[Unit]
Description=Scrapy spider
Documentation=https://scrapyd.readthedocs.io/en/latest/index.html
After=network.target

[Service]
Type=simple
User=root
Group=root
WorkingDirectory=/var/scrapy/MySpider
Environment="VIRTUAL_ENV=/usr/local/miniconda3/envs/scrap"
Environment="PATH=$VIRTUAL_ENV/bin:$PATH"
ExecStart=/usr/local/miniconda3/envs/scrap/bin/python -O /var/scrapy/MySpider/run.py
Restart=no

[Install]
WantedBy=multi-user.target
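The run.py referenced in ExecStart is not shown in the report; a minimal sketch of what such a launcher might look like (the spider name is hypothetical), using CrawlerProcess with the project settings so that JOBDIR from settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('my_sitemap_spider')  # hypothetical spider name
process.start()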

Arregator · Apr 09 '20 18:04

As a workaround I found the following solution:

I have sitemap_urls like the following:

sitemap_urls = [
    r'https://<target_site>/robots.txt',
]

so I guessed that after the first run the hash of the request for https://<target_site>/robots.txt would be the first line in the JOBDIR/requests.seen file, and that this is what prevents my spider's job from resuming: when I re-run the spider, it fetches https://<target_site>/robots.txt as its first request, finds its hash in JOBDIR/requests.seen, drops it as a duplicate and quits, since no more URLs from sitemap_urls are found.
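One way to check this guess (a sketch; request_fingerprint is what the default RFPDupeFilter used to build the lines of requests.seen in Scrapy 2.x):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

fp = request_fingerprint(Request(r'https://<target_site>/robots.txt'))
print(fp)  # compare against the first line of JOBDIR/requests.seen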

So I opened the JOBDIR/requests.seen file, copied the first hash from it and saved it elsewhere for later use, then removed that first hash from JOBDIR/requests.seen and saved the file.

Then I subclassed my SitemapSpider class and overrode the _parse_sitemap method, modifying the requests it yields for robots.txt and for sitemaps of the sitemapindex type by adding dont_filter=True as an additional parameter to the Request constructor, so the yield looks like this:

yield Request(url, callback=self._parse_sitemap, dont_filter=True)
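A sketch of a lighter variant of the same idea: instead of copying the body of _parse_sitemap, wrap the parent implementation and re-yield its sitemap requests with dont_filter=True (the class and spider names are hypothetical):

from scrapy.spiders import SitemapSpider

class ResumableSitemapSpider(SitemapSpider):
    name = 'resumable_sitemap'  # hypothetical name
    sitemap_urls = [
        r'https://<target_site>/robots.txt',
    ]

    def _parse_sitemap(self, response):
        for request in super()._parse_sitemap(response):
            # Only requests that point back to _parse_sitemap (robots.txt and
            # sitemapindex entries) need to bypass the duplicate filter;
            # content-page requests keep normal filtering.
            if getattr(request, 'callback', None) == self._parse_sitemap:
                request = request.replace(dont_filter=True)
            yield request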

If you are not familiar with class inheritance, you can instead try changing the Scrapy source code a little: in <venv>\lib\site-packages\scrapy\spiders\sitemap.py, lines 44 and 58, add dont_filter=True to the yielded Request calls.

However, this does not solve the problem completely, because I did not find a way to control the very first request: Scrapy issues it internally, and I have no way to tell it not to filter that very first request. So you need to carefully save the hash of the first request and remove it from the JOBDIR/requests.seen file each time you want to resume the job.

(NOTE: after the second run this hash will NOT be the first line in JOBDIR/requests.seen, so you will need to find it in the file; that is why you need to save it after the first run, when it is the first one in the file.)
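A small helper sketching the manual step described above: remove the saved fingerprint from JOBDIR/requests.seen before resuming (the paths and the saved fingerprint value are placeholders):

from pathlib import Path

seen_path = Path('jobs/requests.seen')               # JOBDIR/requests.seen
saved_fingerprint = '<hash copied after the first run>'

lines = seen_path.read_text().splitlines()
kept = [line for line in lines if line.strip() != saved_fingerprint]
seen_path.write_text('\n'.join(kept) + '\n')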

This needs to be fixed in the Scrapy code.

Arregator · Apr 10 '20 08:04

I just ran into the same problem :-(

ggilley · Dec 05 '22 05:12

I think that in the scope of this issue it is worth noting that in the base Spider class, when start_urls is defined, the requests built from start_urls are yielded with dont_filter=True by the base spider's start_requests method, unless other logic is defined in an overridden start_requests of the custom spider: https://github.com/scrapy/scrapy/blob/e71eab693264188fe081ebb260baf00cc6b4dc11/scrapy/spiders/__init__.py#L62-L70

Meanwhile, this logic is not preserved in the sitemap spider for the requests built from sitemap_urls: https://github.com/scrapy/scrapy/blob/e71eab693264188fe081ebb260baf00cc6b4dc11/scrapy/spiders/sitemap.py#L29-L31

Setting dont_filter=False (the default value) on requests produced by the spider's start_requests method can in some cases lead to the side effects mentioned in https://github.com/scrapy/scrapy/issues/3276; taking that into account, it may be what causes this issue.
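Following that observation, one way to mirror the base Spider behaviour in a custom spider (a sketch, not the actual Scrapy source; the class and spider names are hypothetical) is to override start_requests so that the initial sitemap_urls requests are yielded with dont_filter=True:

from scrapy import Request
from scrapy.spiders import SitemapSpider

class UnfilteredStartSitemapSpider(SitemapSpider):
    name = 'unfiltered_start_sitemap'  # hypothetical name
    sitemap_urls = [
        r'https://<target_site>/robots.txt',
    ]

    def start_requests(self):
        # Same requests as SitemapSpider.start_requests, but with dont_filter=True
        # so the initial robots.txt request is not dropped by the dupefilter on resume.
        for url in self.sitemap_urls:
            yield Request(url, callback=self._parse_sitemap, dont_filter=True)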

GeorgeA92 · Jan 26 '23 17:01