
Add Dockerfile to simplify installation

Open notslang opened this issue 7 years ago • 20 comments

I'm deploying a couple of instances of grab-site to a CoreOS cluster, so I made a Dockerfile. Hopefully this is a bit easier to use than pip/virtualenv. It uses the larger python:3.4-slim image (rather than python:3.4-alpine) because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

This PR still needs docs, so it's a work-in-progress right now.

After starting the container you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>

notslang avatar Sep 05 '16 23:09 notslang

I haven't used Docker, so bear with me...

  1. Why COPY to /app/ if you still subsequently do a pip3 install .? If you pip3 install ., then grab-site, gs-server, etc should be installed somewhere, right?

  2. Can you make the script in .travis.yml test that this Dockerfile works? (Probably after all the existing stuff.)

Thanks for working on this!

ivan avatar Sep 05 '16 23:09 ivan

No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container, and the COPY directive handles copying the code from your working directory into the container's file-system. Once the code is in the container (at /app), we run pip3 install to get all the deps and set everything up.

This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.

As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/

notslang avatar Sep 06 '16 10:09 notslang

pip3 install . should install grab-site in addition to the dependencies, though. pip3 install puts things in /usr/local/bin while pip3 install --user puts things in ~/.local/bin, unless there's some extra configuration doing something else. Would it make sense to use the installed grab-site scripts in one of those paths rather than duplicate some pip functionality with the COPY lines?

ivan avatar Sep 06 '16 12:09 ivan
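
The two install locations mentioned above can be checked on any given system. A minimal sketch, assuming python3 is on the PATH and a POSIX layout (where user-level scripts live in the bin directory under the user base); the variable names are only illustrative:

```shell
# Directory where a plain `pip3 install` places console scripts
# for this interpreter (e.g. grab-site, gs-server).
system_scripts=$(python3 -c 'import sysconfig; print(sysconfig.get_path("scripts"))')

# Directory where `pip3 install --user` places them on a POSIX layout:
# the "bin" directory under the per-user base.
user_scripts="$(python3 -m site --user-base)/bin"

echo "system-wide scripts: $system_scripts"
echo "per-user scripts:    $user_scripts"
```

Inside a container this runs against the container's own Python, so the paths it prints are paths in the image, not on the host.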

Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?

ivan avatar Sep 06 '16 13:09 ivan

pip3 install . is being run within the context of the Docker container (not the host OS) so you need to COPY the files into the container for pip to work.

notslang avatar Sep 06 '16 13:09 notslang

Oh, that explains it :-)

ivan avatar Sep 06 '16 13:09 ivan

There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.

notslang avatar Sep 06 '16 13:09 notslang

Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

I haven't tried running grab-site, but it seems like installing py-lmdb works on python:3.4-alpine with this:

FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev
RUN pip install lmdb

igorbrigadir avatar Sep 06 '16 13:09 igorbrigadir

You're right about it working on Alpine - I was just missing libffi-dev. Now it's down to 112.4 MB (37 MB when compressed). Also, I added instructions to the README, so I'm going to remove the "[wip]" from this.

notslang avatar Sep 13 '16 05:09 notslang
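
Pulling together the pieces mentioned so far in this thread, the image could look roughly like the sketch below. This is not the Dockerfile from the PR: the file name Dockerfile.sketch, the /data working directory, and the EXPOSE/CMD lines are assumptions inferred from the commands and comments in this thread, and grab-site may need packages beyond the two listed.

```shell
# Write a sketch Dockerfile (hypothetical file name) into the grab-site checkout.
cat > Dockerfile.sketch <<'EOF'
# Base image and build packages reported in this thread to work for py-lmdb
FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev

# pip runs inside the image, not on the host, so the source is copied in first
COPY . /app
WORKDIR /app
RUN pip3 install .

# Assumption: crawl data lives in /data and the dashboard listens on 29000
WORKDIR /data
EXPOSE 29000

# Assumption: gs-server is the container's main process (PID 1)
CMD ["gs-server"]
EOF
```

With Docker available, something like docker build -f Dockerfile.sketch -t grab-site-sketch . would build it (the image tag is also just a placeholder).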

Thanks for the fixes.

I am currently somewhat busy and under-dockered, can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)

ivan avatar Sep 13 '16 06:09 ivan

I had Docker already, but it's worth linking to the installation instructions at https://docs.docker.com/engine/installation/

I tried it with:

Ubuntu: 12.04.5 LTS, x86_64, 3.8.0-44-generic
Docker: Docker version 1.7.1, build 786b29d

Ran (sudo for docker commands because I skipped this step: https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):

sudo docker pull slang800/grab-site
sudo docker run --detach -p 29000:29000 -v ~/grab-site-data:/data --name warcfactory slang800/grab-site

Web UI worked on http://localhost:29000/

sudo docker exec warcfactory grab-site --no-offsite-links http://xkcd.com/

Crawl finished successfully!

igorbrigadir avatar Sep 13 '16 10:09 igorbrigadir

I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child). The reason that you sometimes need a terminal attached to a grab-site process is to 1) see which URL is currently being grabbed (this information is not reported to the dashboard, only finished responses) and 2) look at segfaults and websocket connection problems that don't get reported to the dashboard either.

Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that docker exec tmux attach works. If this does work, the documentation should also be updated.

ivan avatar Sep 19 '16 07:09 ivan
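
The tmux idea above might look roughly like this. It is an untested sketch: it assumes tmux has been added to the image (for example with apk add tmux on Alpine, subject to the version caveat above), that the URL contains no characters the shell would interpret, and the function names and the DOCKER override are hypothetical helpers, not part of grab-site.

```shell
# Dry-run friendly: set DOCKER=echo to print the commands instead of running them.
DOCKER=${DOCKER:-docker}

# Start a crawl inside a detached tmux session in a running container.
# $1 = container name, $2 = tmux session name, $3 = URL to crawl
start_crawl() {
    "$DOCKER" exec "$1" tmux new-session -d -s "$2" "grab-site $3"
}

# Attach the current terminal to that session (detach again with Ctrl-b d).
# -i and -t give docker exec an interactive TTY, which tmux needs to attach.
attach_crawl() {
    "$DOCKER" exec -it "$1" tmux attach-session -t "$2"
}
```

For example, start_crawl warcfactory xkcd http://xkcd.com/ followed later by attach_crawl warcfactory xkcd. Note that the tmux session ends when grab-site exits, so output from a crash is only visible while the session is still alive.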

Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well. grab-site processes are designed to stay running even if gs-server crashes or is taken down for an upgrade. Maybe gs-server (and each grab-site) should run in its own container instead.

ivan avatar Sep 19 '16 07:09 ivan

Maybe gs-server (and each grab-site) should run in its own container instead.

Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.

Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well.

Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where gs-server dies? If so, that would be a decent temporary fix.

I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child).

You could run docker exec without detaching, but this whole setup could be simplified by splitting up the processes into their own containers... Then you'd be able to use docker logs and pass signals in a sane manner.

notslang avatar Oct 20 '16 05:10 notslang
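
A rough sketch of the one-container-per-crawl idea, using only standard docker subcommands. It assumes the image lets the default command be overridden with grab-site and that crawl data belongs under /data; how such a container would reach gs-server is not addressed here, since that is the open question mentioned above. The function names, the IMAGE default, and the DOCKER override are hypothetical.

```shell
# Dry-run friendly: set DOCKER=echo to print the commands instead of running them.
DOCKER=${DOCKER:-docker}
# Assumption: the image published earlier in this thread.
IMAGE=${IMAGE:-slang800/grab-site}

# Run one crawl as the main process of its own detached container.
# $1 = container name, $2 = URL to crawl
run_crawl() {
    "$DOCKER" run --detach --name "$1" -v "$HOME/grab-site-data:/data" "$IMAGE" grab-site "$2"
}

# Follow the crawl's stdout/stderr without needing an attached terminal.
follow_crawl() {
    "$DOCKER" logs --follow "$1"
}

# Deliver SIGINT to the crawl's main process.
interrupt_crawl() {
    "$DOCKER" kill --signal=INT "$1"
}
```

Because each crawl is the container's main process here, docker logs and docker kill act on it directly rather than on gs-server.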

hey people! what is the status of this PR? I could give a hand.

semente avatar Nov 07 '18 01:11 semente

For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix.

So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stay interested in maintaining and testing it, I might take a PR in the future.

ivan avatar Nov 07 '18 02:11 ivan

what is the status of this PR?

@semente I've been using it pretty often for my own projects, and it works fine, but I haven't rebased it since 2016. I'll try rebasing and pushing a new image to the Docker hub.

For now, I would like someone else to be the Dockerized grab-site upstream

Ok, I'll keep an image updated over here: https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

notslang avatar Jan 15 '19 22:01 notslang

@notslang Thank you for all this work. Can you confirm that your fork still works fine? I am curious if you ran into any issues or discovered anything of note.

gabefair avatar Apr 19 '20 20:04 gabefair

https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

It says updated 3 years ago, any plans to update it?

Or any plans to officially ship a Dockerfile for this?

818S avatar May 20 '21 21:05 818S

FYI this third party grab-site Dockerfile currently works as of this comment being posted: https://github.com/Nold360/docker-grab-site.

https://registry.hub.docker.com/r/nold360/grab-site/

brandongalbraith avatar Mar 26 '22 02:03 brandongalbraith