🔧 Fixed missing dependencies for additional file parsers in Dockerfile and some refactoring

Open SpaceShaman opened this issue 1 year ago • 2 comments

This update introduces docker-compose support for easier project setup and deployment.

Added

Docker Support

  • Added a docker-compose.yml file to simplify running the project in a Docker environment.

Refactor

  • Renamed .env to .env.default for improved security and to prevent accidental exposure of sensitive information. Updated documentation to reflect the new environment configuration process.
  • Included Gradio server settings in the .env.default file for easier configuration.

Documentation

  • Updated the README with instructions for using docker-compose and improved command syntax highlighting.

Fixed

  • Added missing dependencies for additional file parsers in Dockerfile
  • Added unstructured[all-docs] to pip install for extended document support

Without these dependencies, parsing documents in any format other than PDF raised various errors.

SpaceShaman, Aug 31 '24 10:08

@SpaceShaman thanks for your contribution. From my side, the PR looks quite good. I have just two concerns, both related to Docker, and we may need @taprosoft to weigh in as well:

  • Should we change the entrypoint in the Dockerfile from gradio launch.py to python launch.py? In my view, the first command should only be used during development.
  • I see you uncommented some apt packages such as ffmpeg, tesseract, ... to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help make a decision on this point?

phv2312, Aug 31 '24 17:08

@phv2312

I see you uncommented some apt packages such as ffmpeg, tesseract, ... to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help make a decision on this point?

In addition to uncommenting the apt dependencies, I also added libmagic-dev and added unstructured[all-docs] to the pip install step, because without them I couldn't add CSV and TXT files.

We could use a multi-stage build to define a basic version and an extended version with additional dependencies for other file types, like below:

FROM python:3.10-slim AS basic
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
WORKDIR /app

# apt-get has a stable CLI for scripts; -y keeps autoremove non-interactive
RUN apt-get update -qqy \
  && apt-get install -y \
  ssh git \
  gcc g++ \
  poppler-utils \
  libpoppler-dev \
  && apt-get autoremove -y \
  && apt-get clean

COPY . /app

RUN --mount=type=ssh pip install -e "libs/kotaemon[all]"
RUN --mount=type=ssh pip install -e "libs/ktem"
RUN pip install graphrag future
RUN pip install "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements"

ENTRYPOINT ["python", "app.py"]

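FROM basic AS full

# Refresh the package lists in case the basic layer was cached
RUN apt-get update -qqy \
  && apt-get install -y \
  tesseract-ocr \
  tesseract-ocr-jpn \
  libsm6 \
  libxext6 \
  ffmpeg \
  libmagic-dev \
  && apt-get clean

# Quoted so the shell does not treat the brackets as a glob pattern
RUN pip install "unstructured[all-docs]"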
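
With a Dockerfile split into stages like this, each variant could be built by selecting its stage with `docker build --target`. The helper below is only a sketch: the `build_image` function, the `DOCKER_BIN` and `IMAGE_NAME` variables, and the `kotaemon:<stage>` tag scheme are hypothetical names chosen for illustration, not part of this PR. It assumes BuildKit is enabled and an SSH agent is running, since the `RUN --mount=type=ssh` steps need `--ssh default`.

```shell
#!/bin/sh
# Hypothetical helper; the names and tag scheme are assumptions, not from the PR.
DOCKER_BIN="${DOCKER_BIN:-docker}"     # set to "echo" for a dry run
IMAGE_NAME="${IMAGE_NAME:-kotaemon}"   # assumed image name

# Build one stage of the multi-stage Dockerfile and tag it after the stage name.
# --ssh default forwards the local SSH agent to the RUN --mount=type=ssh steps.
build_image() {
  target="$1"
  case "$target" in
    basic|full) ;;
    *) echo "unknown target: $target" >&2; return 1 ;;
  esac
  "$DOCKER_BIN" build --ssh default --target "$target" \
    -t "${IMAGE_NAME}:${target}" .
}
```

Calling `build_image basic` would produce the smaller image, and `build_image full` the one with the OCR, media, and unstructured dependencies, so users who only need PDF support do not pay for the larger image.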

SpaceShaman, Sep 01 '24 16:09