Docker
I wrote a Dockerfile for the current version(s). Maybe you want to look into it and integrate it here.
It works for me, but I only have some simple use cases (like API tests with python3), so I do not know how it performs under stress or whether users require more configuration options. (They could, in theory, bind-mount other files if required.)
See Docker-Hub: https://hub.docker.com/r/ekoerner/heritrix
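For anyone who wants to script simple API tests like the ones mentioned above, here is a minimal Python sketch using only the standard library. It is not taken from the image itself. The `/engine/job/<name>` path, the `action` form field, and the use of HTTP digest authentication are assumptions based on the Heritrix 3 REST API, and the helper names `build_action_request` and `make_opener` are made up for illustration. Certificate verification is switched off on the assumption that the container serves a self-signed certificate, which is only acceptable against a local test container.

```python
import ssl
import urllib.parse
import urllib.request


def build_action_request(base_url: str, job: str, action: str) -> urllib.request.Request:
    """Build a POST request that sends `action` to a job resource.

    Assumption: jobs live under /engine/job/<name> and accept an
    `action` form field (for example "build" or "launch").
    """
    url = f"{base_url.rstrip('/')}/engine/job/{urllib.parse.quote(job, safe='')}"
    body = urllib.parse.urlencode({"action": action}).encode("ascii")
    return urllib.request.Request(
        url, data=body, headers={"Accept": "application/xml"}, method="POST"
    )


def make_opener(base_url: str, username: str, password: str) -> urllib.request.OpenerDirector:
    """Create an opener with digest auth that skips TLS verification."""
    # Assumption: self-signed certificate, so verification is disabled.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, username, password)
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPDigestAuthHandler(passwords),
    )
```

With a container started from the compose file in this comment (credentials `admin`/`admin`), a call could look like `make_opener(base, "admin", "admin").open(build_action_request(base, "myjob", "launch"), timeout=30)` with `base = "https://localhost:8443"`.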
My Dockerfile (currently in a private repository, so I can't provide a link, just the content here):
ARG java=11-jre
FROM openjdk:${java}
ARG version="3.4.0-20210923"
ARG contrib=0
ARG user="heritrix"
ARG userid=1000
LABEL version=${version}
LABEL contrib=${contrib}
LABEL user=${user}/$userid
# create user
RUN \
    groupadd -g $userid $user && \
    useradd -r -u $userid -g $user $user
# install other requirements (for contrib)
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            youtube-dl && \
        rm -rf /var/lib/apt/lists/* ; \
    fi
WORKDIR /opt
# download latest version according to:
# https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20210923
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        wget -O heritrix-contrib-${version}-dist.tar.gz https://repo1.maven.org/maven2/org/archive/heritrix/heritrix-contrib/${version}/heritrix-contrib-${version}-dist.tar.gz && \
        tar xvfz heritrix-contrib-${version}-dist.tar.gz && \
        rm heritrix-contrib-${version}-dist.tar.gz && \
        mv heritrix-contrib-${version} heritrix ; \
    else \
        wget -O heritrix-${version}-dist.zip https://repo1.maven.org/maven2/org/archive/heritrix/heritrix/${version}/heritrix-${version}-dist.zip && \
        unzip heritrix-${version}-dist.zip && \
        rm heritrix-${version}-dist.zip && \
        mv heritrix-${version} heritrix ; \
    fi && \
    chmod u+x heritrix/bin/heritrix && \
    chown -R $user:$user /opt/heritrix
# create a run script to allow dynamic configuration of credentials
RUN printf '%s\n' \
    '#!/bin/bash' \
    '' \
    '_JOBARGS="-b /"' \
    '' \
    '# set credentials (require both USERNAME and PASSWORD)' \
    '# -a "${USERNAME}:${PASSWORD}"' \
    'if [[ ! -z "$USERNAME" ]] && [[ ! -z "$PASSWORD" ]]; then' \
    '    echo "${USERNAME}:${PASSWORD}" > ${HERITRIX_HOME}/credentials.txt' \
    '    _JOBARGS="$_JOBARGS -a @${HERITRIX_HOME}/credentials.txt"' \
    'elif [[ ! -z "$CREDSFILE" ]]; then' \
    '    _JOBARGS="$_JOBARGS -a @${CREDSFILE}"' \
    'else' \
    '    >&2 echo "No USERNAME and/or PASSWORD environment var set!"' \
    'fi' \
    '' \
    '# check if -r mode' \
    'if [[ ! -z "$JOBNAME" ]]; then' \
    '    >&2 echo "Found JOBNAME envvar, just running job: $JOBNAME"' \
    '    _JOBARGS="$_JOBARGS -r $JOBNAME"' \
    '    if [ ! -f "/opt/heritrix/jobs/$JOBNAME/crawler-beans.cxml" ]; then' \
    '        >&2 echo "Did not find any '"'"'crawler-beans.cxml'"'"' for job '"'"'$JOBNAME'"'"'!"' \
    '    fi' \
    'fi' \
    '' \
    '# run' \
    'exec ${HERITRIX_HOME}/bin/heritrix $_JOBARGS' \
    '' \
    > heritrix.sh && \
    chmod +x heritrix.sh && \
    chown $user:$user heritrix.sh
WORKDIR /opt/heritrix
USER $user
ENV HERITRIX_HOME=/opt/heritrix
# let it run in the foreground, required for docker
ENV FOREGROUND=true
# standard webport
# NOTE: the web UI is only available via HTTPS!
EXPOSE 8443
CMD ["/opt/heritrix.sh"]
Build it:
docker build --build-arg version=3.4.0-20210923 -t heritrix .
Build heritrix-contrib (requires Java 8; with Java 11 (JRE/JDK) I get some JNI error, maybe related to #265?)
docker build --build-arg version=3.4.0-20210923 --build-arg contrib=1 --build-arg java=8-jre -t heritrix-contrib .
Example docker-compose.yml (also on DockerHub currently)
version: "3.7"
services:
  heritrix:
    build: .
    container_name: "heritrix"
    # TEST: keeps the container running without doing anything (for inspections)
    # entrypoint: bash -c 'while :; do :; done & kill -STOP $$! && wait $$!'
    # env_file: .env
    environment:
      - USERNAME=admin
      - PASSWORD=admin
      # optional jobname to run (will only run this single job and exit!)
      # - JOBNAME=myjob
      # - JAVA_OPTS=-Xmx1024M
    init: true
    ports:
      # if you want to use a .env file with `PORT=8443` for example
      # - ${PORT}:8443
      - 8443:8443
    restart: unless-stopped
    volumes:
      # where jobs will be stored
      - job-files:/opt/heritrix/jobs
      # or if JOBNAME envvar is used (mount just the single job folder)
      # jobfolder in the container needs to have the same name as in JOBNAME
      # (relative host path; compose does not expand shell syntax like $(pwd))
      # - ./host_myjob:/opt/heritrix/jobs/myjob
volumes:
  job-files:
UPDATE: I added the -r <jobname> option to my image on Docker Hub. Simply set the JOBNAME=jobname environment variable to run the job jobname. Take care to mount the (preconfigured) job folder into the container, see above. This only works from version 3.4.0-20210803 onwards, see pull request #406.
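Since the job folder inside the container has to carry the same name as JOBNAME, a small host-side check before starting the container can catch a wrong mount early. The following Python sketch is only an illustration: `job_volume_mapping` is a hypothetical helper and not part of the image. It assumes the container path `/opt/heritrix/jobs` from the Dockerfile in this comment and returns a `host:container` string that could be passed to `docker run -v`.

```python
from pathlib import Path

# Container path where the Dockerfile in this comment stores jobs.
JOBS_ROOT = "/opt/heritrix/jobs"


def job_volume_mapping(host_job_dir: str, jobname: str) -> str:
    """Return a `host:container` volume string for one preconfigured job.

    Raises FileNotFoundError if the host folder has no crawler-beans.cxml,
    which is the same condition the entrypoint script warns about.
    """
    host_path = Path(host_job_dir).resolve()
    if not (host_path / "crawler-beans.cxml").is_file():
        raise FileNotFoundError(f"No crawler-beans.cxml in {host_path}")
    # The folder name inside the container must equal JOBNAME.
    return f"{host_path}:{JOBS_ROOT}/{jobname}"
```

For example, `job_volume_mapping("./host_myjob", "myjob")` yields the absolute host path followed by `:/opt/heritrix/jobs/myjob`, to be used together with `-e JOBNAME=myjob`.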
UPDATE2: I added a contrib image that uses heritrix-contrib. For now it only includes youtube-dl as an extra dependency, and it only works with the Java 8 JRE. The contrib image is only available from version 3.4.0-20210923.
UPDATE3: Added a custom user to make it a bit more secure (e.g., no package installs possible anymore). Note that -b / is required to make the web UI visible in the docker image.
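To make the entrypoint's decision logic easier to follow, here is a Python sketch that mirrors how the run script in the Dockerfile assembles the Heritrix command-line arguments from environment variables. It is an illustration only: `heritrix_args` is a hypothetical function, and unlike the shell script it does not write the credentials file or print warnings.

```python
from typing import List, Mapping


def heritrix_args(env: Mapping[str, str], home: str = "/opt/heritrix") -> List[str]:
    """Mirror the argument assembly of the entrypoint script."""
    # Always passed; per the note above it is required for the web UI
    # to be visible from outside the container.
    args = ["-b", "/"]

    username = env.get("USERNAME", "")
    password = env.get("PASSWORD", "")
    creds_file = env.get("CREDSFILE", "")
    if username and password:
        # The shell script first writes "user:pass" into this file.
        args += ["-a", f"@{home}/credentials.txt"]
    elif creds_file:
        # Use a credentials file supplied by the user (e.g. bind-mounted).
        args += ["-a", f"@{creds_file}"]
    # Otherwise no -a option is added (the script prints a warning).

    jobname = env.get("JOBNAME", "")
    if jobname:
        # Run only this job instead of just starting the web UI.
        args += ["-r", jobname]
    return args
```

USERNAME and PASSWORD take precedence over CREDSFILE, and CREDSFILE is only consulted when at least one of the two is missing or empty.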
+1
Just noting that if anyone would like to see a Dockerfile merged, please submit it as a pull request and include the documentation/examples you feel are appropriate. I'm willing to merge it and connect it to Docker Hub under the IIPC group, but I don't use Docker much myself, so you'll need to do the legwork and testing. :-)
I find myself unable to really stress-test my own docker image. It works for some toy samples, but I'm not sure about more involved scenarios and how docker handles them. Mine was meant more for short-term crawls with a low URL count. 😃 I also think the configuration handling can be improved a lot. In my use case I just needed the most basic things, but I saw use cases on the internet that did much more. So, I'm not sure whether my image would make a good "official" image. (But I will still update my Docker Hub images with each new release here. And the code above is my most current version.)
I added the -r <jobname> flag to my image. This option is really nice and makes automation easier.
I updated the first comment of the issue.
So, after a request I added a heritrix-contrib docker image (same Docker Hub URL, just the :contrib tag). But I had difficulties finding any documentation about the contrib stuff. I found the javadocs, but nowhere did they mention how to set it up, what other requirements there are (e.g. for the various extractors, ...) and so on. I also found that it only worked with Java 8 and not with Java 11.
Now my Dockerfile is getting to the point where it might make sense to create a pull request. What exactly would be required? I'm especially puzzled about tests, since I can do some manual tests, but how would I do automated stuff?
All I had in mind was a pull request that adds the Dockerfile itself and maybe a section named something like 'Running Heritrix under Docker' with some brief usage instructions to docs/operating.rst. By testing I just meant manually verifying that the instructions work, not automated tests. :-)
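Manual verification is all that is asked for here, but part of it can be scripted cheaply. As a rough sketch, the hypothetical helper `wait_for_port` below polls until the published port (8443 in the compose example) accepts TCP connections. It only shows that something is listening on the port, not that the Heritrix web UI is fully usable, so it complements the manual check rather than replacing it.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once `host:port` accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the port is open; close it right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

After `docker compose up -d`, something like `wait_for_port("localhost", 8443)` could be run before opening https://localhost:8443 in a browser.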
Ok. I'm working on it.
I extracted the entrypoint script into a separate file, so it is a bit easier to edit. There is also a separate Dockerfile for the heritrix-contrib image.
And I added a Makefile to create the images.
I have not yet added a description of how to build the docker image. Would a README.md in the docker folder be enough, or should it be a wiki page (currently in my fork only)?
I would suggest running docker with the official images, so the image build process uses the maven releases and does not build from the sources again.
I found the following Docker Hub users:
- iipc (mentioned in comments above)
- internetarchive
Whichever one is chosen should then also be used in the documentation (instead of just heritrix).
Thanks. That looks great.
I've merged it and pushed the main and contrib images to iipc/heritrix. I had intended to automate this with the autobuilder but it seems the free tier of that has been discontinued. I'll look into alternative options but I guess it's not too difficult to build them manually after each release.
I used the IIPC Docker org because the Heritrix "interim" releases are currently maintained by some members of the IIPC community and several of us (including someone from IA) have access to that org.
I can take a look at using GH Actions. It seems to me that the tags correspond to the releases. So, the docker image could be built after a new tag is pushed or a new release (tag) is added. I think it should be possible to extract the current or latest tag and supply it as the build arg. Alternatively, we could manually update the default release number in the Dockerfile for each release.
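As a rough sketch of the tag-to-build-arg idea: on a tag push, GitHub Actions sets the `GITHUB_REF` environment variable to `refs/tags/<tag>`, so the version can be cut out of it and handed to `docker build`. The helper names below are hypothetical, the `-contrib` tag suffix is an assumed naming scheme, and the build args (`version`, `contrib`, `java`) are those of the single Dockerfile posted at the top of this issue, not necessarily of the merged one.

```python
from typing import List

TAG_PREFIX = "refs/tags/"


def version_from_ref(github_ref: str) -> str:
    """Extract the release version from a ref like refs/tags/3.4.0-20210923."""
    if not github_ref.startswith(TAG_PREFIX):
        raise ValueError(f"not a tag ref: {github_ref!r}")
    version = github_ref[len(TAG_PREFIX):]
    if not version:
        raise ValueError("empty tag name")
    return version


def docker_build_command(github_ref: str, image: str = "iipc/heritrix",
                         contrib: bool = False) -> List[str]:
    """Assemble a `docker build` argument list for the given tag ref."""
    version = version_from_ref(github_ref)
    cmd = ["docker", "build", "--build-arg", f"version={version}"]
    tag = f"{image}:{version}"
    if contrib:
        # The contrib build needs Java 8 according to this thread.
        cmd += ["--build-arg", "contrib=1", "--build-arg", "java=8-jre"]
        tag += "-contrib"  # assumed naming scheme
    return cmd + ["-t", tag, "."]
```

In a workflow step, the resulting list could be passed to `subprocess.run(cmd, check=True)` with `github_ref` read from `os.environ["GITHUB_REF"]`.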
Then, we can probably also transfer all the old images from my hub account to the iipc one, if necessary? I will later clear out my hub repo to remove confusion. But no concrete time plan yet.
And thanks about the IIPC explanation. :-)
As for the tags, I used a -jre suffix in case a -jdk base image is added later on, so that users who build custom images can base them on either one, depending on their requirements and the software they want to install.
Then, I also added the Docker wiki page. If anyone plans to rename it, please update the link in docker/README.md.
I updated the wiki page: HOWTO Ship a Heritrix Release.