
New jobs stuck in pending state

Open andr-c opened this issue 10 months ago • 21 comments

Scrapyd was running for 9 days with the default max_proc settings and was running up to 3 jobs concurrently. After completing 26 jobs it now has 1 job running but can't start new jobs any longer - all jobs I launch are stuck in the pending state. There is no indication of any errors in the logs (like the memory errors mentioned in #279). The system shows only one active pid, corresponding to the running job.
The memory consumption is constant and far below the allocated limits. Restarting scrapyd resolves the issue and makes scrapyd able to run jobs concurrently again, but after a few days of usage it becomes degraded again and unable to run new jobs, sometimes even when there are no active jobs running at all. What could the problem be?

andr-c avatar Feb 19 '25 17:02 andr-c

What version are you running?

Please share your scrapyd.conf or scrapyd.cfg file.

Also, please describe your environment (e.g. VPS server with X CPU and Y RAM, etc.)

jpmckinney avatar Feb 19 '25 20:02 jpmckinney

The memory consumption is constant and far below the allocated limits.

Between Scrapyd start and now, did memory consumption ever peak?

In the development version that you can install from GitHub, I added some log messages to indicate when process slots are ready, occupied or vacated. You should see messages like:

# At startup
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 0 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 1 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 2 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 3 ready
...
# When you call the schedule endpoint
2025-02-19T15:26:23-0500 [scrapyd.launcher#debug] Process slot 0 occupied
...
# When the job completes
2025-02-19T15:26:24-0500 [scrapyd.launcher#debug] Process slot 0 vacated
2025-02-19T15:26:24-0500 [scrapyd.launcher#debug] Process slot 0 ready

If you notice that some slots are never vacated, or don't become ready immediately after being vacated, or don't become occupied after scheduling a job even when they are ready, or any other weird pattern, then we can start debugging.
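If it helps to correlate those messages with what the API reports, you can also poll the daemonstatus.json endpoint while the problem is happening; a minimal monitoring sketch (URL and interval are just examples):

# Poll daemonstatus.json and print the pending/running/finished counts so they
# can be compared against the "Process slot" log messages above.
import time

import requests

SCRAPYD_URL = "http://127.0.0.1:6800/daemonstatus.json"  # adjust for your host

while True:
    status = requests.get(SCRAPYD_URL, timeout=10).json()
    print(f"pending={status['pending']} running={status['running']} finished={status['finished']}")
    time.sleep(60)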

jpmckinney avatar Feb 19 '25 20:02 jpmckinney

Thanks very much for your prompt reply! I'll try with a more verbose debug level next time (I don't want to cancel the current job). I was running Scrapyd 1.5.0 in a Kubernetes pod with a 1G memory limit and a 0.3 CPU limit. The memory consumption has 3 different levels over 7 days, corresponding to running 1-3 spiders concurrently:

[Image: memory consumption graph showing three plateaus over 7 days]

While on the last plateau, it got stuck and could not process new jobs.

My scrapyd.conf:

[scrapyd]
spiderqueue = scrapyd_custom.storages.queue_storage.PostgresSpiderQueue
jobstorage  = scrapyd_custom.storages.job_storage.PostgresJobStorage
bind_address      = 0.0.0.0
logs_dir          = .scrapy/scrapyd/logs
dbs_dir           = .scrapy/scrapyd/dbs
eggs_dir          = .scrapy/scrapyd/eggs

I have a custom implementation of the spiderqueue which stores jobs in a Postgres database. Whenever a new job is submitted, a record shows up in the spiderqueue, but it seems like this record is never pulled from it.
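For context, the custom queue roughly implements Scrapyd's ISpiderQueue interface on top of a Postgres table; a trimmed, illustrative sketch (table and column names are simplified, not the exact production code):

# Simplified sketch of a Postgres-backed spider queue in the spirit of
# scrapyd.interfaces.ISpiderQueue (add/pop/count shown; list/remove/clear are similar).
import json

from sqlalchemy import Column, Float, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class QueuedJob(Base):
    __tablename__ = "spider_queue"  # illustrative table name
    id = Column(Integer, primary_key=True)
    priority = Column(Float, default=0.0)
    message = Column(Text)  # JSON-encoded job: spider name plus arguments

class PostgresSpiderQueue:
    def __init__(self, dsn):
        engine = create_engine(dsn)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def add(self, name, priority=0.0, **spider_args):
        with self.Session() as session, session.begin():
            session.add(QueuedJob(priority=priority,
                                  message=json.dumps({"name": name, **spider_args})))

    def pop(self):
        with self.Session() as session, session.begin():
            job = (session.query(QueuedJob)
                   .order_by(QueuedJob.priority.desc(), QueuedJob.id)
                   .with_for_update(skip_locked=True)
                   .first())
            if job is None:
                return None
            payload = json.loads(job.message)
            session.delete(job)
            return payload

    def count(self):
        with self.Session() as session:
            return session.query(QueuedJob).count()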

The only indication of new jobs submitted to scrapyd was in this log record:

2025-02-20T06:40:48+0000 [twisted.python.log#info] "10.10.125.18" - - [20/Feb/2025:06:40:48 +0000] "POST /schedule.json HTTP/1.1" 200 90 "-" "python-requests/2.32.3"

Apparently I was running scrapyd at the INFO log level.

andr-c avatar Feb 20 '25 08:02 andr-c

The debug messages are not yet in a released version. You would have to install from git to see the new DEBUG messages:

pip install git+https://github.com/scrapy/scrapyd.git

jpmckinney avatar Feb 27 '25 00:02 jpmckinney

My problem persists - I did not try to install the version with DEBUG messages, but my observation is that this issue occurs ONLY when I scrape a website's data based on HTML tags, meaning that I use .css() or .xpath() to find <h1>, <div class="... and other such elements. In these cases, the maximum number of jobs that Scrapyd can run at the same time decreases from its configured value towards zero (in my case from 80 down to 2 or 3 - so far, I have never observed it actually reach zero). Then, I restart the scrapyd service and the maximum number of jobs that can run simultaneously - 80 - is available again.

On the other hand - if I scrape data from some API endpoint, meaning that I consume the data from a JSON structure, then this problem never occurs. It only happens when I parse HTML markup.
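To illustrate the difference, the two kinds of spiders look roughly like this (simplified, hypothetical examples; names, URLs and selectors are made up):

import scrapy

class HtmlSpider(scrapy.Spider):
    # The problematic, CPU-heavier case: parse raw HTML markup with CSS/XPath selectors.
    name = "html_example"
    start_urls = ["https://example.com/page"]

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "body": response.xpath('//div[@class="content"]//text()').getall(),
        }

class ApiSpider(scrapy.Spider):
    # The unproblematic case: consume a JSON API endpoint instead of parsing markup.
    name = "api_example"
    start_urls = ["https://example.com/api/items"]

    def parse(self, response):
        for item in response.json().get("results", []):
            yield {"title": item.get("title")}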

I also tried to upgrade the server, but the fundamental problem is still there, and band-aiding it by buying more hardware is only a temporary fix.

aaronm137 avatar Feb 28 '25 17:02 aaronm137

Hmm, unfortunately I have no hypothesis based on that. Maybe there’s some third cause that’s correlated with those two types of spiders.

jpmckinney avatar Mar 01 '25 00:03 jpmckinney

Small update here to provide more information - after some measurements, I noticed that this happens when the CPUs sit at 100% usage for a long time. When I scrape data by parsing HTML (reading old-school HTML markup), CPU utilization is higher than when I load the data from an API/JSON and then parse and process that information further.

In my case, when I scrape HTML content directly, all of my CPU cores are utilized at 100%. When I scrape HTML content and load JSON from it that I process further, the CPU cores are utilized at ~40%.

When CPU utilization stays at 100% non-stop, over the course of a few hours (about 6) the number of running jobs decreases to 0 and no new jobs are fetched from the queue. After a scrapyd restart I am again able to run the maximum number of jobs (80) specified in my scrapyd.conf:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 500
dbs_dir     = dbs
max_proc    = 80
max_proc_per_cpu = 12
finished_to_keep = 100
#poll_interval = 5.0 # default
#poll_interval = 4.0
poll_interval = 1.5
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

I still don't know how to effectively resolve this situation. Maybe there's something wrong with my config? As plan B, I'm considering spinning up a separate server where I would run the scraping of all strict-HTML websites and restart scrapyd every 3 hours or so (to "allow" running the maximum number of jobs again), but I would like to avoid this scenario at all costs.
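For reference, the plan-B periodic restart could be as simple as a cron entry like this (illustrative only; I'd much rather fix the root cause):

# hypothetical crontab entry: restart scrapyd every 3 hours
0 */3 * * * systemctl restart scrapyd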

aaronm137 avatar Mar 18 '25 12:03 aaronm137

Can you try running the GitHub version of Scrapyd, and sharing any observations from the log file? The log file should follow a pattern like what I showed above (slots should rotate from being ready to occupied, vacated and ready again).

pip install git+https://github.com/scrapy/scrapyd.git

jpmckinney avatar Mar 19 '25 19:03 jpmckinney

@jpmckinney may I ask how I can get access to logs like the ones in your example above?

# At startup
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 0 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 1 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 2 ready
2025-02-19T15:26:18-0500 [scrapyd.launcher#debug] Process slot 3 ready
...
# When you call the schedule endpoint
2025-02-19T15:26:23-0500 [scrapyd.launcher#debug] Process slot 0 occupied
...
# When the job completes
2025-02-19T15:26:24-0500 [scrapyd.launcher#debug] Process slot 0 vacated
2025-02-19T15:26:24-0500 [scrapyd.launcher#debug] Process slot 0 ready

On production, I installed the git version of scrapyd (pip install git+https://github.com/scrapy/scrapyd.git), restarted the service and then ran: journalctl -xefu scrapyd > scrapyd_logs.txt

Then I analyzed scrapyd_logs.txt, but I didn't find output like yours (Process slot 0 ready) anywhere - in fact, there's no mention of "slot" in the output at all. Any advice on how to get the debug data would be welcome. Thanks

aaronm137 avatar Mar 21 '25 15:03 aaronm137

If you just run scrapyd from the command line, what do you see? I see:

2025-03-21T12:27:44-0400 [-] Loading path/to/scrapyd/txapp.py...
2025-03-21T12:27:44-0400 [scrapyd.basicauth#info] Basic authentication disabled as either `username` or `password` is unset
2025-03-21T12:27:44-0400 [scrapyd.app#info] Scrapyd web console available at http://127.0.0.1:6800/
2025-03-21T12:27:44-0400 [-] Loaded.
2025-03-21T12:27:44-0400 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 24.10.0 (path/to/bin/python3 3.12.5) starting up.
2025-03-21T12:27:44-0400 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.selectreactor.SelectReactor.
2025-03-21T12:27:44-0400 [-] Site starting on 6800
2025-03-21T12:27:44-0400 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x103e93a10>
2025-03-21T12:27:44-0400 [Launcher] Scrapyd 1.5.0 started: max_proc=32, runner='scrapyd.runner'
2025-03-21T12:27:44-0400 [scrapyd.launcher#debug] Process slot 0 ready
2025-03-21T12:27:44-0400 [scrapyd.launcher#debug] Process slot 1 ready
2025-03-21T12:27:44-0400 [scrapyd.launcher#debug] Process slot 2 ready
2025-03-21T12:27:44-0400 [scrapyd.launcher#debug] Process slot 3 ready
...

jpmckinney avatar Mar 21 '25 16:03 jpmckinney

@jpmckinney ok, so when I run scrapyd on production, I get this output:

2025-03-22T07:51:10+0000 [-] Loading /usr/local/lib/python3.10/dist-packages/scrapyd/txapp.py...
2025-03-22T07:51:11+0000 [scrapyd.basicauth#info] Basic authentication disabled as either `username` or `password` is unset
2025-03-22T07:51:11+0000 [scrapyd.app#info] Scrapyd web console available at http://127.0.0.1:6800/
2025-03-22T07:51:11+0000 [-] Loaded.
2025-03-22T07:51:11+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 22.1.0 (/usr/bin/python3 3.10.12) starting up.
2025-03-22T07:51:11+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2025-03-22T07:51:11+0000 [-] Site starting on 6800
2025-03-22T07:51:11+0000 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f575eb574c0>
2025-03-22T07:51:11+0000 [Launcher] Scrapyd 1.5.0 started: max_proc=80, runner='scrapyd.runner'

I double-checked and there are no scrapyd.launcher#debug statements - even though I installed the latest version of Scrapyd from git as you mentioned in the comment above.

aaronm137 avatar Mar 22 '25 07:03 aaronm137

scrapyd1.zip Here is the log from my scrapyd installation (using the version from git master). It was able to run spider2 and spider3 concurrently for a while. Then notice a few POST /schedule.json calls starting at 2025-03-17T18:14:29+0000 that resulted in nothing.

andr-c avatar Mar 22 '25 10:03 andr-c

@aaronm137 what is the output of which scrapyd? You might need to run it with the full path to the git installation, in case it is running a different version installed elsewhere.
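If it's unclear which copy is being picked up, commands like these (adjust for your environment) will show the version and file that Python actually imports:

which scrapyd
python3 -c "import scrapyd; print(scrapyd.__version__, scrapyd.__file__)"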

jpmckinney avatar Mar 22 '25 16:03 jpmckinney

@jpmckinney /usr/local/bin/scrapyd

pip show scrapyd:

pip show scrapyd
Name: scrapyd
Version: 1.5.0
Summary: A service for running Scrapy spiders, with an HTTP API
Home-page: https://github.com/scrapy/scrapyd
Author: Scrapy developers
Author-email: [email protected]
License: BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: packaging, scrapy, setuptools, twisted, w3lib, zope.interface
Required-by:

pip installation output pip install git+https://github.com/scrapy/scrapyd.git:

pip install git+https://github.com/scrapy/scrapyd.git
Collecting git+https://github.com/scrapy/scrapyd.git
  Cloning https://github.com/scrapy/scrapyd.git to /tmp/pip-req-build-ls2y_6n1
  Running command git clone --filter=blob:none --quiet https://github.com/scrapy/scrapyd.git /tmp/pip-req-build-ls2y_6n1
  Resolved https://github.com/scrapy/scrapyd.git to commit 786109334ceeb732485262f2c5bc562607099461
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from scrapyd==1.5.0) (24.0)
Requirement already satisfied: scrapy>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scrapyd==1.5.0) (2.11.1)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from scrapyd==1.5.0) (59.6.0)
Requirement already satisfied: twisted>=17.9 in /usr/lib/python3/dist-packages (from scrapyd==1.5.0) (22.1.0)
Requirement already satisfied: w3lib in /usr/local/lib/python3.10/dist-packages (from scrapyd==1.5.0) (2.0.0)
Requirement already satisfied: zope.interface in /usr/local/lib/python3.10/dist-packages (from scrapyd==1.5.0) (6.2)
Requirement already satisfied: cryptography>=36.0.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (42.0.5)
Requirement already satisfied: cssselect>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (1.2.0)
Requirement already satisfied: itemloaders>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (1.2.0)
Requirement already satisfied: parsel>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (1.9.1)
Requirement already satisfied: pyOpenSSL>=21.0.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (24.1.0)
Requirement already satisfied: queuelib>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (1.6.2)
Requirement already satisfied: service-identity>=18.1.0 in /usr/lib/python3/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (18.1.0)
Requirement already satisfied: protego>=0.1.15 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (0.3.1)
Requirement already satisfied: itemadapter>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (0.8.0)
Requirement already satisfied: tldextract in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (5.1.2)
Requirement already satisfied: lxml>=4.4.1 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (5.2.1)
Requirement already satisfied: PyDispatcher>=2.0.5 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.0.0->scrapyd==1.5.0) (2.0.7)
Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=36.0.0->scrapy>=2.0.0->scrapyd==1.5.0) (1.16.0)
Requirement already satisfied: jmespath>=0.9.5 in /usr/local/lib/python3.10/dist-packages (from itemloaders>=1.0.1->scrapy>=2.0.0->scrapyd==1.5.0) (1.0.1)
Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from tldextract->scrapy>=2.0.0->scrapyd==1.5.0) (2.7)
Requirement already satisfied: requests>=2.1.0 in /usr/lib/python3/dist-packages (from tldextract->scrapy>=2.0.0->scrapyd==1.5.0) (2.25.1)
Requirement already satisfied: requests-file>=1.4 in /usr/local/lib/python3.10/dist-packages (from tldextract->scrapy>=2.0.0->scrapyd==1.5.0) (2.0.0)
Requirement already satisfied: filelock>=3.0.8 in /usr/local/lib/python3.10/dist-packages (from tldextract->scrapy>=2.0.0->scrapyd==1.5.0) (3.13.4)
Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=36.0.0->scrapy>=2.0.0->scrapyd==1.5.0) (2.22)

aaronm137 avatar Mar 22 '25 16:03 aaronm137

Same problem here.

At startup, scrapyd runs for several hours and then just stops, with [scrapyd.launcher#debug] Only 0 steps.

On restart, it picks up all pending tasks again, works for several hours, and then gets stuck again.

Running in Docker, CPU: 0.6, RAM: 6 GB. [Image]

I installed the dev version (RUN pip install git+https://github.com/scrapy/scrapyd.git), but it didn't help.

Ralendo avatar Apr 02 '25 07:04 Ralendo

@aaronm137 Can you pip uninstall scrapyd then check that which scrapyd returns nothing, before running pip install git+https://github.com/scrapy/scrapyd.git again? I can't think of anything else that would prevent the debug messages from being printed.
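In other words, something like this sequence (adjust paths and Python as needed for your environment):

pip uninstall scrapyd
which scrapyd    # should now print nothing
pip install git+https://github.com/scrapy/scrapyd.git
which scrapyd    # should point at the freshly installed script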

jpmckinney avatar Apr 02 '25 16:04 jpmckinney

@Ralendo There is nothing in Scrapyd's code that would output [scrapyd.launcher#debug] Only 0 steps, so I'm not sure how you are seeing that.

The dev version just has additional debug statements to help determine the cause of this issue. You would need to let it run in the dev version, wait for the issue to occur, and then upload a log file like andr-c did.

jpmckinney avatar Apr 02 '25 16:04 jpmckinney

@andr-c I'm seeing errors in your log like:

	  File "/app/.scrapy/scrapyd/eggs/scraper/1741266109.egg/scraper/spiders/spider1/spider_pw.py", line 14, in <module>
	ImportError: cannot import name 'set_url_params' from 'scraper.util' (/app/.scrapy/scrapyd/eggs/scraper/1741266109.egg/scraper/util.py)

Or:

	  File "/app/scrapyd_custom/storages/queue_storage.py", line 39, in count
	    return session.query(PriorityQueue).count()
...
	sqlalchemy.exc.OperationalError: (psycopg.OperationalError) connection failed: the database system is shutting down
	FATAL:  the database system is shutting down
	(Background on this error at: https://sqlalche.me/e/20/e3q8)

For a while, each schedule.json call is followed by a "Process started" line, and each cancel.json is followed by a "Process finished" line, as expected.

At the end I see:

2025-03-17T18:44:52+0000 [twisted.python.log#info] "10.10.76.7" - - [17/Mar/2025:18:44:52 +0000] "POST /schedule.json HTTP/1.1" 200 90 "-" "python-requests/2.32.3"
2025-03-17T18:48:36+0000 [twisted.python.log#info] "10.10.10.14" - - [17/Mar/2025:18:48:36 +0000] "POST /cancel.json HTTP/1.1" 200 82 "-" "python-requests/2.32.3"
2025-03-17T18:49:37+0000 [twisted.python.log#info] "10.10.10.14" - - [17/Mar/2025:18:49:36 +0000] "POST /schedule.json HTTP/1.1" 200 90 "-" "python-requests/2.32.3"
2025-03-17T18:59:52+0000 [twisted.python.log#info] "10.10.10.17" - - [17/Mar/2025:18:59:51 +0000] "POST /cancel.json HTTP/1.1" 200 82 "-" "python-requests/2.32.3"
2025-03-18T07:02:24+0000 [twisted.python.log#info] "10.10.141.53" - - [18/Mar/2025:07:02:23 +0000] "POST /schedule.json HTTP/1.1" 200 90 "-" "python-requests/2.32.3"
2025-03-18T07:19:38+0000 [twisted.python.log#info] "10.10.141.53" - - [18/Mar/2025:07:19:37 +0000] "POST /cancel.json HTTP/1.1" 200 82 "-" "python-requests/2.32.3"
2025-03-18T10:33:58+0000 [twisted.python.log#info] "10.10.10.19" - - [18/Mar/2025:10:33:58 +0000] "POST /schedule.json HTTP/1.1" 200 90 "-" "python-requests/2.32.3"

In other words, it looks like you can't schedule new jobs at this point.

The new "slot" debug messages show that all slots were vacated and ready, so that's not the problem. That's kind of expected since launcher.py is quite similar to what it was 4 years ago: https://github.com/scrapy/scrapyd/blame/50f4019463c5f9204e39392e1f8147a7b05b6ff0/scrapyd/launcher.py

I notice that you have 4 slots available, and also that this issue occurs only once you've had the ImportError 4 times.

So, I'll look into how errors are handled, and how that affects the queues. I don't see anything in launcher.py since process.deferred.addBoth is used like it was 4 years ago, and that handles processes with errors.

In the meantime, if you monitor your logs for coding errors in your own spiders, and then restart Scrapyd after fixing those errors, you shouldn't run into this problem anymore.
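For example, something as simple as periodically scanning the job logs for tracebacks would catch this (the path is whatever your logs_dir points to):

grep -rlE "Traceback|ImportError" /path/to/logs_dir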

jpmckinney avatar Apr 02 '25 16:04 jpmckinney

@jpmckinney

On March 28, scrapyd completed a task and did not take any more new ones. I made a request to check that there were tasks (they were pending). I restarted it on April 1 and it immediately took tasks from the queue.

2025-03-28T13:18:43+0000 [scrapyd.launcher#info] Process finished: project='crawler_service' spider='kubsu' job='edu_kubsau_ru_25_03_28_10_27_57' pid=2443 args=['/usr/local/bin/python3.12', '-m', 'scr
2025-03-28T13:18:43+0000 [scrapyd.launcher#debug] Process slot 1 vacated
2025-03-28T13:18:43+0000 [scrapyd.launcher#debug] Process slot 1 ready
2025-04-01T10:06:57+0000 [twisted.python.log#info] "192.168.0.2" - - [01/Apr/2025:10:06:57 +0000] "GET /listjobs.json?project=crawler_service HTTP/1.1" 200 388040 "-" "python-requests/2.32.3"
2025-04-01T10:09:05+0000 [twisted.internet.base#info] Received SIGTERM, shutting down.
2025-04-01T10:09:05+0000 [-] (TCP Port 6800 Closed)
2025-04-01T10:09:05+0000 [twisted.web.server.Site#info] Stopping factory <twisted.web.server.Site object at 0x7a1a827cbc50>
2025-04-01T10:09:05+0000 [twisted.internet.base#info] Main loop terminated.
2025-04-01T10:09:05+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] Server Shut Down.
2025-04-01T10:09:11+0000 [-] Loading /usr/local/lib/python3.12/site-packages/scrapyd/txapp.py...
2025-04-01T10:09:11+0000 [scrapyd.basicauth#info] Basic authentication disabled as either `username` or `password` is unset
2025-04-01T10:09:11+0000 [scrapyd.app#info] Scrapyd web console available at http://0.0.0.0:6800/
2025-04-01T10:09:11+0000 [-] Loaded.
2025-04-01T10:09:11+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 24.11.0 (/usr/local/bin/python3.12 3.12.9) starting up.
2025-04-01T10:09:11+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2025-04-01T10:09:11+0000 [-] Site starting on 6800
2025-04-01T10:09:11+0000 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7fe98f7e7110>
2025-04-01T10:09:11+0000 [Launcher] Scrapyd 1.5.0 started: max_proc=2, runner='scrapyd.runner'
2025-04-01T10:09:11+0000 [scrapyd.launcher#debug] Process slot 0 ready
2025-04-01T10:09:11+0000 [scrapyd.launcher#debug] Process slot 1 ready
2025-04-01T10:09:11+0000 [scrapyd.launcher#info] Process started: project='crawler_service' spider='kubsu' job='cifrek_kubsau_ru_25_03_28_10_28_28' pid=7 args=['/usr/local/bin/python3.12', '-m', 'scra
2025-04-01T10:09:11+0000 [scrapyd.launcher#debug] Process slot 0 occupied
2025-04-01T10:09:11+0000 [scrapyd.launcher#info] Process started: project='crawler_service' spider='mauniver' job='priem_mauniver_ru_25_03_28_10_29_20' pid=8 args=['/usr/local/bin/python3.12', '-m', '
2025-04-01T10:09:11+0000 [scrapyd.launcher#debug] Process slot 1 occupied

#Dockerfile:

FROM python:3.12-slim
ENV PLAYWRIGHT_BROWSERS_PATH=/playwright-browsers

COPY ../scrapyd.conf /etc/scrapyd/scrapyd.conf

RUN apt -y update
RUN apt -y install libpq-dev build-essential git
RUN pip install --upgrade pip
RUN pip install psycopg psycopg2 pandas brotli scrapy-playwright greenlet
RUN pip install git+https://github.com/scrapy/scrapyd.git
RUN playwright install-deps
RUN playwright install chromium
RUN mkdir /logs
EXPOSE 6800

ENTRYPOINT ["scrapyd", "--logfile", "/logs/scrapyd_jobs_logs/scrapyd_main_process.log"]

#scrapyd.conf

[scrapyd]

bind_address      = 0.0.0.0
logs_dir          = ./logs/scrapyd_jobs_logs/
jobs_to_keep      = 9223372036854775807
eggs_dir          = /scrapyd_data/eggs
dbs_dir           = /scrapyd_data/dbs
max_proc          = 2

Ralendo avatar Apr 03 '25 11:04 Ralendo

Thanks @Ralendo. If it happens again, please share the log file (or, at least the end of the log file, if it is very long), along with the time at which crawls stopped being scheduled, so that I can compare that time to the timestamps in the log.

jpmckinney avatar Apr 03 '25 13:04 jpmckinney

@jpmckinney

From the logs, I have no idea what the error is. It just sits there as dead weight without any errors. Yesterday, as part of testing, it was loaded with tasks, and it stalled during the night.

Today, something new happened. When requesting listjobs, I got some errors, but they hardly relate to my problem.

I am sending the logs from the moment the tasks were started up to the current moment, and also listjobs.json, where you can see that the tasks are still in the pending state (I removed other tasks in the pending and finished states to make it clearer).

response_1743758665345.zip

Ralendo avatar Apr 04 '25 09:04 Ralendo

I also had a lot of jobs stuck in pending state, but it seems to have been caused by the partition that scrapyd was installed in running out of disk space.

pfrenssen avatar Jul 22 '25 05:07 pfrenssen

@pfrenssen Thanks for reporting. I think many of these issues are caused by running Scrapyd with limited resources, and running out of resources (whether disk space, memory, etc.), or otherwise not running a functional system.


@andr-c In your log file, there is:

	sqlalchemy.exc.OperationalError: (psycopg.OperationalError) connection failed: the database system is shutting down
	FATAL:  the database system is shutting down
	(Background on this error at: https://sqlalche.me/e/20/e3q8)

You had mentioned:

I have custom implementation of spiderqueue which stores jobs in PostgresDB

Obviously, if PostgreSQL is down, the spiderqueue that is stored in PostgreSQL cannot be accessed, and so Scrapyd cannot get the pending jobs from that queue, and will need to be restarted (once PostgreSQL is back online).
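If you keep the PostgreSQL-backed queue, one possible mitigation (a sketch, not something Scrapyd provides) is to make the queue tolerant of transient database outages, e.g. by enabling connection health checks and retrying once on a dropped connection:

# Illustrative resilience sketch for a PostgreSQL-backed queue (DSN and table
# name are examples). pool_pre_ping tests pooled connections before use, so a
# restarted database doesn't leave the pool full of dead connections.
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("postgresql+psycopg://user:pass@db/queue", pool_pre_ping=True)

def count_pending():
    # Retry once, so a brief database restart doesn't wedge Scrapyd's poller.
    for attempt in (1, 2):
        try:
            with engine.connect() as conn:
                return conn.execute(text("SELECT count(*) FROM spider_queue")).scalar()
        except OperationalError:
            if attempt == 2:
                raise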


@Ralendo For your listjobs error, I've pushed code that will print a more specific error message. As far as I can tell, this error is caused by scheduling a job with too many spider arguments. There is nothing inherent in Scrapyd that could cause that error. See here: https://github.com/scrapy/scrapyd/blob/c5af2e29f60373044ed5c64293de5683f6fa1397/scrapyd/launcher.py#L69-L72

Otherwise, the log file shows slot 1 being ready, occupied, then vacated. But at 2025-04-03T21:00:20+0000, slot 0 became ready, then never got a job scheduled. And at 2025-04-03T23:22:39+0000, slot 1 became ready for the last time.

You don't mention at what time you called the listjobs endpoint, so I'm unable to debug further, as this information is necessary to reconcile the timeline.

Your log file doesn't show Scrapyd startup, so it has been truncated at least at the start, if not at the end. I can't make much progress with incomplete information.

From your Dockerfile, it looks like you're using postgres, so maybe like @andr-c there is an issue with that integration.


I've been running Scrapyd for my own needs more or less continuously for many years now, only restarting for version upgrades, and I have not encountered this issue. I have tried to reproduce with e.g. a spider that immediately errors (in case this was a potential cause), but Scrapyd still runs.

For the cases with known causes, they are external to Scrapyd (no disk space, spiderqueue in PostgreSQL becomes inaccessible, etc.). So, for now, I will assume that is the explanation.

Feel free to contribute new, complete information if you are encountering this issue.

jpmckinney avatar Jul 25 '25 02:07 jpmckinney