tika-docker

[TIKA-3420] Set tesseract ocr languages as docker build args

Open mhf-ir opened this issue 3 years ago • 9 comments

Adds the ability for users to build the Docker image with a list of tesseract-ocr-[lang] packages passed as build args.

mhf-ir avatar Dec 15 '20 08:12 mhf-ir

It seems docker-build.sh must also be updated to accept <TESSERACT_LANGUAGES>.

mhf-ir avatar Dec 15 '20 08:12 mhf-ir

I changed the build to handle the rest of the parameters and fixed the echo problem.

I also made some changes to README.md, both to satisfy Markdown lint and to document these changes. Feel free to edit my PR if it needs grammar or other fixes.

Thanks for your attention.

mhf-ir avatar Jan 01 '21 17:01 mhf-ir

@dameikle Could you please look at this? Are there any problems, issues, or changes needed? This simple patch would help me a lot.

mhf-ir avatar May 24 '21 09:05 mhf-ir

It seems TIKA_JAR_NAME was also added; please check it.

mhf-ir avatar May 26 '21 19:05 mhf-ir

@mhf-ir OK I tried out the new patch today

./docker-tool.sh build 1.26 tika-1.27-tesseract-french.jar tesseract-ocr-fra

...

 => ERROR [dependencies 1/2] RUN DEBIAN_FRONTEND=noninteractive apt-get -y install openjdk-14-jre-headless gdal-bin tesseract-ocr 'tesseract-ocr-fra'  1.4s
------
 > [dependencies 1/2] RUN DEBIAN_FRONTEND=noninteractive apt-get -y install openjdk-14-jre-headless gdal-bin tesseract-ocr 'tesseract-ocr-fra':
#6 0.293 Reading package lists...
#6 1.103 Building dependency tree...
#6 1.268 Reading state information...
#6 1.383 E: Unable to locate package 'tesseract-ocr-fra'
------
executor failed running [/bin/sh -c DEBIAN_FRONTEND=noninteractive apt-get -y install $JRE gdal-bin tesseract-ocr $TESSERACT_LANGUAGES]: exit code: 100

Is this the correct way to invoke ./docker-tool.sh's build command?

lewismc avatar Jun 05 '21 17:06 lewismc
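
The quote characters around the package name in the apt error above suggest that the name reached apt-get with literal quotes in it. A minimal sketch of that shell behavior, assuming the script wrapped each language package in quotes before handing it to the build (an assumption; the patch itself is not shown in this thread). `show_words` is a hypothetical helper used only for illustration:

```shell
#!/bin/sh
# Print each word the shell produces, one per line, wrapped in brackets.
# show_words is a hypothetical helper for illustration only.
show_words() {
    for w in "$@"; do
        printf '[%s]\n' "$w"
    done
}

# Quotes stored inside a variable are data, not syntax: they survive
# expansion and become part of the package name.
quoted="'tesseract-ocr-fra'"
# shellcheck disable=SC2086
show_words $quoted

# Without embedded quotes, word splitting yields clean package names.
plain="tesseract-ocr-fra tesseract-ocr-fas"
# shellcheck disable=SC2086
show_words $plain
```

The first call prints the name with its quotes still attached, which is the form apt-get would fail to find; the second prints two bare package names.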

@lewismc It seems there is a problem with passing multiple package names as build args. I will try to find a better way to do that.

mhf-ir avatar Jun 06 '21 11:06 mhf-ir
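
One possible way to handle several package names is to collapse them into a single space-separated value before passing it as one build arg. This is only a sketch, not the code from the patch: `join_languages` is a hypothetical helper, and the argument order (version, jar name, then languages) is assumed from the commands shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: collapse every argument after <version> and <jar-name>
# into one space-separated string suitable for a single build-arg value.
join_languages() {
    [ "$#" -ge 2 ] || return 1   # need at least <version> and <jar-name>
    shift 2                      # drop <version> and <jar-name>
    printf '%s' "$*"             # "$*" joins the rest with single spaces
}

langs=$(join_languages 1.26 tika-server tesseract-ocr-fra tesseract-ocr-fas)

# Quote the whole KEY=VALUE pair so docker receives it as one argument, e.g.:
#   docker build --build-arg "TESSERACT_LANGUAGES=$langs" ...
printf 'TESSERACT_LANGUAGES=%s\n' "$langs"
```

Inside the Dockerfile the value would then be expanded unquoted, so that apt-get sees each package as a separate word.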

@lewismc Try this; it should work:

./docker-tool.sh build 1.26 jar-alt-name tesseract-ocr-fra tesseract-ocr-fas

mhf-ir avatar Jun 07 '21 16:06 mhf-ir

This doesn't work either @mhf-ir

./docker-tool.sh build 1.26 jar-alt-name tesseract-ocr-fra
...
#9 9.069 --2021-06-07 19:53:48--  https://www.apache.org/dyn/closer.cgi/tika/jar-alt-name-1.26.jar?filename=tika/jar-alt-name-1.26.jar&action=download
#9 9.070 Resolving www.apache.org (www.apache.org)... 207.244.88.140, 95.216.26.30, 2a01:4f9:2a:1a61::2
#9 9.178 Connecting to www.apache.org (www.apache.org)|207.244.88.140|:443... connected.
#9 9.400 HTTP request sent, awaiting response... 404 Not Found
#9 10.63 2021-06-07 19:53:49 ERROR 404: Not Found.
#9 10.63
#9 10.64 --2021-06-07 19:53:49--  https://archive.apache.org/dist/tika/jar-alt-name-1.26.jar
#9 10.64 Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
#9 10.74 Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
#9 12.28 HTTP request sent, awaiting response... 404 Not Found
#9 12.50 2021-06-07 19:53:51 ERROR 404: Not Found.
#9 12.50
------
executor failed running [/bin/sh -c DEBIAN_FRONTEND=noninteractive apt-get -y install gnupg2 wget     && wget -t 10 --max-redirect 1 --retry-connrefused -qO- https://downloads.apache.org/tika/KEYS | gpg --import     && wget -t 10 --max-redirect 1 --retry-connrefused $NEAREST_TIKA_SERVER_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar ]" || wget $ARCHIVE_TIKA_SERVER_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar ]" || exit 1     && wget -t 10 --max-redirect 1 --retry-connrefused $DEFAULT_TIKA_SERVER_ASC_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc  || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc ]" || wget $ARCHIVE_TIKA_SERVER_ASC_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc ]" || exit 1;]: exit code: 1

lewismc avatar Jun 07 '21 19:06 lewismc
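
The failing step above tries a mirror URL first and then the archive, and gives up when neither has the file. A rough sketch of that try-in-order idea as a shell function, not the Dockerfile's actual logic: `fetch_first_available` and `wget_fetch` are hypothetical names, and only the wget flags are taken from the log.

```shell
#!/bin/sh
# Hypothetical helper: try each URL in order with the given fetch command,
# stop at the first success, and remove partial output after a failure.
fetch_first_available() {
    fetcher=$1
    dest=$2
    shift 2
    for url in "$@"; do
        if "$fetcher" "$url" "$dest"; then
            return 0
        fi
        rm -f "$dest"
    done
    return 1
}

# Fetcher that reuses the wget flags seen in the log above.
wget_fetch() {
    wget -t 10 --max-redirect 1 --retry-connrefused "$1" -O "$2"
}

# Example call (not executed here because it needs network access):
#   fetch_first_available wget_fetch /tika-server-1.26.jar \
#       "$NEAREST_TIKA_SERVER_URL" "$ARCHIVE_TIKA_SERVER_URL" || exit 1
```

Passing the fetch command as a parameter keeps the fallback logic testable without a network connection.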

It seems the $jar variable is the problem; its default is tika-server (https://github.com/apache/tika-docker/blob/master/docker-tool.sh#L61). That is not my modification; I just resolved the conflict the same way as master.

gpg: Total number processed: 7
gpg:               imported: 7
gpg: no ultimately trusted keys found
--2021-06-08 04:16:59--  https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.26.jar?filename=tika/tika-server-1.26.jar&action=download
Resolving www.apache.org (www.apache.org)... 207.244.88.140, 95.216.26.30, 2a01:4f9:2a:1a61::2
Connecting to www.apache.org (www.apache.org)|207.244.88.140|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://archive.apache.org/dist/tika/tika-server-1.26.jar [following]
--2021-06-08 04:17:01--  https://archive.apache.org/dist/tika/tika-server-1.26.jar
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79903002 (76M) [application/java-archive]
Saving to: '/tika-server-1.26.jar'

     0K .......... .......... .......... .......... ..........  0%  171K 7m36s
    50K .......... .......... .......... .......... ..........  0%  283K 6m5s
   100K .......... .......... .......... .......... ..........  0%  483K 4m57s
   150K .......... .......... .......... .......... ..........  0%  496K 4m22s
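
To illustrate why the jar name matters: the download URL is built from the jar name and the version, so a placeholder such as jar-alt-name leads to a file that does not exist upstream. The sketch below assumes the jar name is the second argument after the version and defaults to tika-server; `jar_file_name` is a hypothetical helper, not the code in docker-tool.sh.

```shell
#!/bin/sh
# Hypothetical helper: build the jar file name from <version> and an
# optional <jar-name>, falling back to tika-server when it is omitted.
jar_file_name() {
    version=$1
    jar_name=${2:-tika-server}   # default used when $2 is unset or empty
    printf '%s-%s.jar\n' "$jar_name" "$version"
}

jar_file_name 1.26                # tika-server-1.26.jar
jar_file_name 1.26 jar-alt-name   # jar-alt-name-1.26.jar (the 404 case above)
```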

Try this:

./docker-tool.sh build 1.26 tika-server tesseract-ocr-fra tesseract-ocr-fas
 ---> 594d05b32156
Step 22/26 : COPY --from=fetch_tika /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar
 ---> 74da0f6e2136
Step 23/26 : USER $UID_GID
 ---> Running in 27cd65503cc4
Removing intermediate container 27cd65503cc4
 ---> 3087a7429f40
Step 24/26 : EXPOSE 9998
 ---> Running in 85d927ba4187
Removing intermediate container 85d927ba4187
 ---> bc6b467c7eed
Step 25/26 : ENTRYPOINT [ "/bin/sh", "-c", "exec java -jar /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar -h 0.0.0.0 $0 $@"]
 ---> Running in 22f85e5b9e83
Removing intermediate container 22f85e5b9e83
 ---> 16c47b78d7e9
Step 26/26 : LABEL maintainer="Apache Tika Developers [email protected]"
 ---> Running in d670b34497d3
Removing intermediate container d670b34497d3
 ---> 6f04502585ad
Successfully built 6f04502585ad
Successfully tagged apache/tika:1.26-full
sweb@sweb-laptop:/sweb/tmp/tika-d-mhf$ TZ=UTC date && docker images | grep tika
Tue 08 Jun 2021 04:20:15 AM UTC
apache/tika                             1.26-full         6f04502585ad   2 minutes ago   690MB
apache/tika                             1.26              65ea0073c1e2   7 minutes ago   408MB

mhf-ir avatar Jun 08 '21 04:06 mhf-ir