Starting scrapyd docker container with eggs included
Hi, I've been experimenting a little with scrapyd on Docker and did the following:
- in the config file, I specified a different directory for eggs:
eggs_dir = /src/eggs
- in the Dockerfile, I added the prebuilt projects to this directory:
ADD eggs /src/eggs
At first glance it looked like it was working, but when I made a POST request to schedule.json, it returned an error:
{"node_name": "295e305bea8e", "status": "error", "message": "Scrapy 1.4.0 - no active project\n\nUnknown command: list\n\nUse "scrapy" to see available commands\n"}
I could type anything into the project and spider fields and the result was the same. How can I fix this issue?
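For reference, the request I'm making looks roughly like this (project and spider names are placeholders):
# schedule a spider run via scrapyd's API
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider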
Hi, as far as I remember this message means that the Scrapy project is not deployed. I'm not working with Docker, but could you just run the scrapyd command inside the container? If that helps, it means you have a problem with file-system permissions. Also, sometimes (I don't know why; I only have this on my local machine) if the scrapyd service is stopped and restarted, the project can still show as active on 127.0.0.1:6800, but when you try to run it you get this error, and after restarting the server the list of active projects is empty (maybe that's just my limited knowledge).
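For example, a rough check could look like this (the image name is just a placeholder):
# open a shell in the image
docker run -it --rm your-scrapyd-image bash
# inside the container, check that the eggs directory exists and is readable
ls -l /src/eggs
# then start scrapyd by hand and watch for permission errors
scrapyd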
I think it's rather because scrapyd, when handling an addversion request, does more than just add the egg file to eggs_dir; it performs some other steps that activate the project. I have even seen these functions in the code, but I'm not able to reproduce them. I also searched the SQLite database that scrapyd uses for any data about eggs, but unfortunately there wasn't any, and I'm stuck.
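For reference, the step I'm trying to replicate offline is the usual addversion.json upload, roughly (placeholder names):
# upload an egg for a given project and version to a running scrapyd
curl http://localhost:6800/addversion.json -F project=myproject -F version=1_0 -F egg=@myproject.egg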
I'd recommend you use scrapyd-client for the deploy, and run the scrapyd server after deploying.
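That is, the ordinary scrapyd-client flow, roughly (assuming a [deploy] target with a url is configured in scrapy.cfg; note that scrapyd-deploy uploads the egg to a running scrapyd over HTTP):
# start scrapyd so it is reachable at the url from scrapy.cfg
scrapyd &
# from the Scrapy project root (where scrapy.cfg lives), build and upload the egg
scrapyd-deploy -p myproject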
scrapyd-client is fine at a small scale. I want to have a Docker image with the eggs and the daemon already configured, so that I can launch it right away, without using scrapyd-client or the scrapyd API.
@VanDavv, the project name in "available projects" shouldn't have the version, the Python version and the egg extension in it. There's definitely something wrong there. We need more info: your configuration files, the commands you type with their output, and the logs. Possibly also the Docker image, but unfortunately I don't have the time to dig that far right now.
@Digenis here is all the information you requested. Also, the project name in "available projects" is the same as the name of the egg file stored in eggs_dir.
Dockerfile
FROM python:3.6
MAINTAINER [email protected]
RUN set -xe \
&& apt-get update \
&& apt-get install -y autoconf \
build-essential \
curl \
git \
libffi-dev \
libssl-dev \
libtool \
libxml2 \
libxml2-dev \
libxslt1.1 \
libxslt1-dev \
python \
python-dev \
vim-tiny \
&& apt-get install -y libtiff5 \
libtiff5-dev \
libfreetype6-dev \
libjpeg62-turbo \
libjpeg62-turbo-dev \
liblcms2-2 \
liblcms2-dev \
libwebp5 \
libwebp-dev \
zlib1g \
zlib1g-dev \
&& curl -sSL https://bootstrap.pypa.io/get-pip.py | python \
&& pip install git+https://github.com/scrapy/scrapy.git \
git+https://github.com/scrapy/scrapyd.git \
git+https://github.com/scrapy/scrapyd-client.git \
git+https://github.com/scrapinghub/scrapy-splash.git \
git+https://github.com/scrapinghub/scrapyrt.git \
git+https://github.com/python-pillow/Pillow.git \
&& curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \
&& curl -sL https://deb.nodesource.com/setup_6.x | bash - \
&& apt-get install -y nodejs \
&& echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \
&& apt-get purge -y --auto-remove autoconf \
build-essential \
libffi-dev \
libssl-dev \
libtool \
libxml2-dev \
libxslt1-dev \
python-dev \
&& apt-get purge -y --auto-remove libtiff5-dev \
libfreetype6-dev \
libjpeg62-turbo-dev \
liblcms2-dev \
libwebp-dev \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
RUN npm install -g phantomjs-prebuilt
COPY ./scrapyd.conf /etc/scrapyd/
VOLUME /etc/scrapyd/ /var/lib/scrapyd/
EXPOSE 6800
ADD requirements.txt .
RUN pip install -r requirements.txt
ADD . .
ADD eggs /src/eggs
CMD ["scrapyd", "--pidfile="]
Config file
[scrapyd]
eggs_dir = /src/eggs
logs_dir = /var/lib/scrapyd/logs
items_dir = /var/lib/scrapyd/items
dbs_dir = /var/lib/scrapyd/dbs
jobs_to_keep = 5
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
The eggs are built in a separate step (which is tested and works fine); we can assume I have a couple of egg files, created by running python setup.py bdist_egg on some Scrapy projects, stored in the eggs directory. The container is then simply run with docker run scrapy-deamon-eggs.
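Roughly, the build looks like this (myproject stands for any of the Scrapy projects):
# build an egg for each project and collect it into the eggs directory
cd myproject && python setup.py bdist_egg && cd ..
cp myproject/dist/*.egg eggs/
# build the image and start the daemon
docker build -t scrapy-deamon-eggs .
docker run scrapy-deamon-eggs   # add -p 6800:6800 to reach the web console from the host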
Logs from scrapyd when run:
2017-07-02T20:40:16+0000 [-] Loading /usr/local/lib/python3.6/site-packages/scrapyd/txapp.py...
2017-07-02T20:40:16+0000 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-07-02T20:40:16+0000 [-] Loaded.
2017-07-02T20:40:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.1.0 (/usr/local/bin/python 3.6.1) starting up.
2017-07-02T20:40:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-07-02T20:40:16+0000 [-] Site starting on 6800
2017-07-02T20:40:16+0000 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f3f293cde48>
2017-07-02T20:40:16+0000 [Launcher] Scrapyd 1.2.0 started: max_proc=32, runner='scrapyd.runner'
@VanDavv Do you have a nice solution for this problem by now? I'm also interested in deploying scrapyd using Docker, and even though I only have one scraper to deploy, I would much prefer to have everything built locally and sent to AWS in one nice package, rather than having to upload the Docker image first and then use scrapyd-client to deploy my scraper.
@omrihar I abandoned this project; the farthest I got was to include the eggs in the image and, after scrapyd startup, upload them via scrapyd-client.
The other solution, launching scrapyd, uploading the spiders and then doing a docker commit and pushing that image, also worked, but it wasn't what I wanted.
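For completeness, that second workaround looked roughly like this (image and project names are placeholders):
# run a plain scrapyd container
docker run -d --name scrapyd-base -p 6800:6800 my-scrapyd-image
# deploy the spiders into it from the host
scrapyd-deploy -p myproject
# snapshot the container, now containing the eggs, into a new image and push it
docker commit scrapyd-base my-registry/scrapyd-with-eggs:latest
docker push my-registry/scrapyd-with-eggs:latest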
Maybe @Digenis could help us handle this case (and maybe remove the "insufficient info" label?)
I managed to get around this by running a background deploy after my scrapyd instance has started. Not sure it's the best way, but it works for me for now.
Dockerfile
FROM python:3.6
COPY requirements.txt /requirements.txt
RUN pip install -r requirements.txt
COPY docker-entrypoint /usr/local/bin/
RUN chmod 0755 /usr/local/bin/docker-entrypoint
COPY . /scrapyd
WORKDIR /scrapyd
ENTRYPOINT ["/usr/local/bin/docker-entrypoint"]
Entrypoint script
#!/bin/bash
bash -c 'sleep 15; scrapyd-deploy' &
scrapyd
scrapy.cfg
[settings]
default = scraper.settings
[deploy]
url = http://localhost:6800
project = projectname
This assumes you are copying your Scrapy project folder into /scrapyd and have a requirements.txt with all your dependencies (including the scrapyd server).
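Once the container is up, the background deploy can be verified after the 15-second delay, e.g. (the spider name is a placeholder):
# the project from scrapy.cfg should be listed
curl http://localhost:6800/listprojects.json
# and its spiders can be scheduled
curl http://localhost:6800/schedule.json -d project=projectname -d spider=myspider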
After reading @radyz's comment, I could also run a container with a deployed spider in the following way.
Dockerfile:
FROM vimagick/scrapyd:py3
COPY myspider /myspider/
COPY entrypoint1.sh /myspider
COPY entrypoint2.sh /myspider
COPY wrapper.sh /myspider
RUN chmod +x myspider/entrypoint1.sh
RUN chmod +x myspider/entrypoint2.sh
RUN chmod +x myspider/wrapper.sh
WORKDIR /myspider
CMD ./wrapper.sh
wrapper.sh:
#!/bin/bash
# turn on bash's job control
set -m
# Start the primary process and put it in the background
./entrypoint1.sh &
# Start the helper process
./entrypoint2.sh
# the my_helper_process might need to know how to wait on the
# primary process to start before it does its work and returns
# now we bring the primary process back into the foreground
# and leave it there
fg %1
entrypoint1.sh:
scrapyd
entrypoint2.sh:
sleep 15;scrapyd-deploy
My scrapyd project resides in the myspider folder.
Refer to: https://docs.docker.com/config/containers/multi-service_container/
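To try it out, something like this should work (the image tag is just an example):
docker build -t myspider-scrapyd .
docker run -d -p 6800:6800 myspider-scrapyd
# after the 15-second sleep in entrypoint2.sh, the project should be deployed
curl http://localhost:6800/listprojects.json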
@VanDavv @iamprageeth @radyz I managed to solve the problem without using the API. Unfortunately, there is no way to completely avoid egg files when deploying Scrapy projects (the only way would be to override some scrapyd components), so you'll need a simple deployment script:
build.sh:
#!/bin/sh
set -e
# The alternative way to build eggs is to use setup.py
# if you already have it in the Scrapy project's root
scrapyd-deploy --build-egg=myproject.egg
# your docker container build commands
# ...
Dockerfile:
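# note: the relative eggs/ path below assumes eggs_dir resolves to ./eggs
# (the scrapyd default) in the container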
RUN mkdir -p eggs/myproject
COPY myproject.egg eggs/myproject/1_0.egg
CMD ["scrapyd"]
That's all! So instead of deploying myproject.egg into the eggs folder directly, you have to create the following structure: eggs/myproject/1_0.egg, where myproject is your project name and 1_0 is the version of your project in scrapyd.
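A quick way to confirm that scrapyd picked up the pre-baked egg once the container is running:
# the project shows up under its directory name
curl http://localhost:6800/listprojects.json
# and the version under the egg's file name (without the .egg extension)
curl "http://localhost:6800/listversions.json?project=myproject"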
Experimenting with the above approach, I ended up with a two-stage build. The first stage builds the egg without installing the (otherwise unnecessary) scrapyd-client in the final container. The resulting image with Alpine as the base is about 100 MB.
FROM python as builder
RUN pip install scrapyd-client
WORKDIR /build
COPY . .
RUN scrapyd-deploy --build-egg=scraper.egg
FROM python:alpine
RUN apk add --update --no-cache --virtual .build-deps \
gcc \
libffi-dev \
libressl-dev \
libxml2 \
libxml2-dev \
libxslt-dev \
musl-dev \
&& pip install --no-cache-dir \
scrapyd \
&& apk del .build-deps \
&& apk add \
libressl \
libxslt
VOLUME /etc/scrapyd/ /var/lib/scrapyd/
COPY ./scrapyd.conf /etc/scrapyd/
RUN mkdir -p /src/eggs/scraper
COPY --from=builder /build/scraper.egg /src/eggs/scraper/1_0.egg
EXPOSE 6800
ENTRYPOINT ["scrapyd", "--pidfile="]
Not fully tested yet, but seems operational.
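For what it's worth, a minimal smoke test could look like this (the image tag is just an example):
docker build -t scraper-scrapyd .
docker run -d -p 6800:6800 scraper-scrapyd
# check that the daemon answers and that the "scraper" project is registered
curl http://localhost:6800/daemonstatus.json
curl "http://localhost:6800/listspiders.json?project=scraper"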