grab-site
Add Dockerfile to simplify installation
I'm deploying a couple of instances of grab-site to a CoreOS cluster, so I made a Dockerfile. Hopefully this is a bit easier to use than pip/virtualenv. This uses the larger python:3.4-slim image (rather than python:3.4-alpine) because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.
This PR still needs docs, so it's a work-in-progress right now.
After starting the container, you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>.
I haven't used Docker, so bear with me...
- Why COPY to /app/ if you subsequently run pip3 install . anyway? After pip3 install . the grab-site, gs-server, etc. scripts should be installed somewhere, right?
- Can you make the script in .travis.yml test that this Dockerfile works? (Probably after all the existing stuff.)
Thanks for working on this!
No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container, and the COPY directive copies the code from your working directory into the container's file-system. Once the code is in the container (at /app), we run pip3 install to get all the deps and set everything up.
This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.
As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/
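For the Travis side, a smoke test along these lines might be enough to start with. This is an untested sketch: the function name and the grab-site-ci tag are made up for illustration, and it assumes the image lets you pass a command to run (no conflicting ENTRYPOINT).

```shell
# Hypothetical CI helper: build the image from the current checkout, then
# check that the grab-site entry point is installed inside it.
smoke_test_image() {
    tag="${1:-grab-site-ci}"   # illustrative tag name, not from the PR
    # Build the Dockerfile in the current directory.
    docker build -t "$tag" . || return 1
    # Run a throwaway container; --help should exit 0 if the install worked.
    docker run --rm "$tag" grab-site --help
}
```

Calling smoke_test_image at the end of the script section in .travis.yml would fail the build if either step fails.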
pip3 install . should install grab-site in addition to the dependencies, though. pip3 install puts things in /usr/local/bin while pip3 install --user puts things in ~/.local/bin, unless there's some extra configuration doing something else. Would it make sense to use the installed grab-site scripts in one of those paths rather than duplicate some pip functionality with the COPY lines?
Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?
pip3 install . is being run within the context of the Docker container (not the host OS), so you need to COPY the files into the container for pip to work.
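To make the order concrete, here is a rough sketch of that flow, written as a shell helper that writes a Dockerfile into a directory you name. The file contents are an assumption for illustration, not a copy of the Dockerfile in this PR, and write_sketch_dockerfile is a made-up name. It also leaves out any build tools (compiler, headers) that the dependencies may need.

```shell
# Hypothetical helper: write an illustrative Dockerfile into directory $1.
# The contents are an assumption, not the PR's actual Dockerfile.
write_sketch_dockerfile() {
    mkdir -p "$1" || return 1
    cat > "$1/Dockerfile" <<'EOF'
FROM python:3.4-slim
# Copy the checkout from the build context (host) into the image.
COPY . /app
WORKDIR /app
# Runs inside the image, so it only sees files that COPY put there.
RUN pip3 install .
EOF
}
```

From a grab-site checkout, running write_sketch_dockerfile sketch and then docker build -f sketch/Dockerfile -t grab-site-sketch . would exercise it. Without the COPY line, the pip3 install step would have no source tree to install from.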
Oh, that explains it :-)
There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.
> Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.
I haven't tried running grab-site, but it seems like installing py-lmdb works on python:3.4-alpine with this:
FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev
RUN pip install lmdb
You're right about it working on Alpine - I was just missing libffi-dev. Now it's down to 112.4 MB (37 MB when compressed). Also, I added instructions to the README, so I'm going to remove the "[wip]" from this.
Thanks for the fixes.
I am currently somewhat busy and under-dockered, can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)
I had Docker already, but it's worth linking to the installation instructions: https://docs.docker.com/engine/installation/
I tried it with:
Ubuntu: 12.04.5 LTS, x86_64, 3.8.0-44-generic
Docker: Docker version 1.7.1, build 786b29d
Ran (with sudo for the docker commands because I skipped this step: https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):
sudo docker pull slang800/grab-site
sudo docker run --detach -p 29000:29000 -v ~/grab-site-data:/data --name warcfactory slang800/grab-site
Web UI worked on http://localhost:29000/
sudo docker exec warcfactory grab-site --no-offsite-links http://xkcd.com/
Crawl finished successfully!
I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child). You sometimes need a terminal attached to a grab-site process to 1) see which URL is currently being grabbed (the dashboard only shows finished responses) and 2) look at segfaults and websocket connection problems, which don't get reported to the dashboard either.
Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that docker exec tmux attach works. If this does work, the documentation should also be updated.
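If tmux were installed in the image, the workflow might look like the sketch below. It is untested: the helper names and the session name are invented here, and it assumes the container has a tmux binary and that docker exec -it gives tmux a usable terminal.

```shell
# Hypothetical helpers; assume tmux is installed inside the container.

# start_crawl_in_tmux CONTAINER SESSION URL
# Start grab-site inside a detached tmux session so its terminal can be
# attached to later.
start_crawl_in_tmux() {
    # The command is passed to tmux as one string; the URL is single-quoted
    # so characters like ? and & are not interpreted by the shell tmux starts.
    # (A URL containing a single quote would break this.)
    docker exec "$1" tmux new-session -d -s "$2" "grab-site '$3'"
}

# attach_crawl CONTAINER SESSION
# Attach a terminal to that session; -i and -t give tmux an interactive TTY.
attach_crawl() {
    docker exec -it "$1" tmux attach -t "$2"
}
```

For example, start_crawl_in_tmux warcfactory xkcd http://xkcd.com/ followed by attach_crawl warcfactory xkcd. Detaching with tmux's detach key (ctrl-b d by default) leaves the crawl running. One caveat: by default the session closes when grab-site exits, so catching segfault output would probably also need tmux's remain-on-exit option.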
Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well. grab-site processes are designed to stay running even if gs-server crashes or is taken down for an upgrade. Maybe gs-server (and each grab-site) should run in its own container instead.
> Maybe gs-server (and each grab-site) should run in its own container instead.
Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.
> Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well.
Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where gs-server dies? If so, that would be a decent temporary fix.
> I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child).
You could run docker exec without detaching, but this whole setup could be simplified by splitting up the processes into their own containers. Then you'd be able to use docker logs and pass signals in a sane manner.
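With one container per crawl, that could look roughly like this. It is a sketch under assumptions: the helper names are invented, the image is assumed to accept grab-site as its command, and how each crawl would reach gs-server is left out because that part hasn't been worked out yet.

```shell
# Hypothetical helpers for a one-container-per-crawl layout.

# start_crawl NAME URL: run one crawl as the main process of its own container.
start_crawl() {
    docker run --detach --name "$1" -v "$HOME/grab-site-data:/data" \
        slang800/grab-site grab-site "$2"
}

# follow_crawl NAME: stream what the crawl writes to stdout/stderr.
follow_crawl() {
    docker logs --follow "$1"
}

# interrupt_crawl NAME: send SIGINT (what ctrl-c sends) to the crawl process.
interrupt_crawl() {
    docker kill --signal=INT "$1"
}
```

For example, start_crawl crawl-xkcd http://xkcd.com/ and then follow_crawl crawl-xkcd. Because the crawl is the container's main process, its output goes to docker logs and docker kill delivers the signal to it directly.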
hey people! what is the status of this PR? I could give a hand.
For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix.
So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stay interested in maintaining and testing it, I might take a PR in the future.
> what is the status of this PR?
@semente I've been using it pretty often for my own projects, and it works fine, but I haven't rebased it since 2016. I'll try rebasing and pushing a new image to the Docker hub.
> For now, I would like someone else to be the Dockerized grab-site upstream
Ok, I'll keep an image updated over here: https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site
@notslang Thank you for all this work. Can you confirm that your fork still works fine? I am curious if you ran into any issues or discovered anything of note.
> https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site
It says updated 3 years ago, any plans to update it?
Or any plans to officially ship a Dockerfile for this?
FYI this third party grab-site Dockerfile currently works as of this comment being posted: https://github.com/Nold360/docker-grab-site.
https://registry.hub.docker.com/r/nold360/grab-site/