kotaemon 🔧 Fixed missing dependencies for additional file parsers in Dockerfile and some refactoring

This update introduces docker-compose support for easier project setup and deployment.

Added

Docker Support

Added a docker-compose.yml file to simplify running the project in a Docker environment.

Refactor

Renamed .env to .env.default for improved security and to prevent accidental exposure of sensitive information. Updated documentation to reflect the new environment configuration process.
Included Gradio server settings in the .env.default file for easier configuration.

Documentation

Updated the README with instructions for using docker-compose and improved command syntax highlighting.

Fixed

Added missing dependencies for additional file parsers in Dockerfile
Added unstructured[all-docs] to pip install for extended document support

Without these dependencies, parsing documents in any format other than PDF failed with a variety of errors.

Aug 31 '24 10:08 SpaceShaman

@SpaceShaman thanks for your contribution. From my side, the PR looks quite good. I just have two concerns, both related to Docker, and we may need @taprosoft to weigh in as well:

Should the entrypoint in the Dockerfile change from gradio launch.py to python launch.py? In my view, the first command should only be used during development.
I see you uncommented some apt packages such as ffmpeg, tesseract, ... in order to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help decide on this point?
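
For illustration, the two entrypoint options in the first concern could be written as below. This is a sketch, not the project's actual Dockerfile: it assumes the script is named launch.py as in the comment above. The gradio CLI starts the app in reload mode, which watches source files and restarts on changes, which is why it suits development rather than a deployed container.

```dockerfile
# Only the last ENTRYPOINT in a stage takes effect, so keep exactly one active.

# Development: the gradio CLI runs the app with auto-reload on file changes.
# ENTRYPOINT ["gradio", "launch.py"]

# Deployment: run the script directly with the Python interpreter.
ENTRYPOINT ["python", "launch.py"]
```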

Aug 31 '24 17:08 phv2312

@phv2312

I see you uncommented some apt packages such as ffmpeg, tesseract, ... in order to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help decide on this point?

In addition to uncommenting the apt dependencies, I also added libmagic-dev there and added unstructured[all-docs] to the pip installation, because without them I couldn't add CSV and TXT files.

We could use a multi-stage build to define a basic version and an extended version with additional dependencies for other file types, like below:

FROM python:3.10-slim AS basic
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
WORKDIR /app

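# Base system packages (poppler provides the PDF utilities).
# apt-get is used throughout because the apt CLI is not meant for scripts.
RUN apt-get update -qqy \
  && apt-get install -y \
  ssh git \
  gcc g++ \
  poppler-utils \
  libpoppler-dev \
  && apt-get clean \
  && apt-get autoremove -y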

COPY . /app

RUN --mount=type=ssh pip install -e "libs/kotaemon[all]"
RUN --mount=type=ssh pip install -e "libs/ktem"
RUN pip install graphrag future
RUN pip install "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements"

ENTRYPOINT ["python", "app.py"]

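# Extended image: everything in "basic" plus OCR, media, and file-type detection packages.
FROM basic AS full

# Refresh the package index here so this stage does not rely on lists cached in "basic".
RUN apt-get update -qqy \
  && apt-get install -y \
  tesseract-ocr \
  tesseract-ocr-jpn \
  libsm6 \
  libxext6 \
  ffmpeg \
  libmagic-dev \
  && apt-get clean

# Quote the extra so the shell does not treat the brackets as a glob pattern.
RUN pip install "unstructured[all-docs]"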
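
One consequence of this layout: a plain docker build uses the last stage in the file, so the proposal above would produce the extended image unless a target is named. Below is a rough sketch of how the two variants could be selected, assuming the stage names basic and full from the proposal. The image tags and the trailing alias stage are hypothetical and not part of the PR.

```dockerfile
# Stage selection happens at build time via the --target flag of `docker build`:
#   docker build --target basic -t kotaemon:basic .   # lean image (tag is hypothetical)
#   docker build --target full  -t kotaemon:full  .   # extended image (tag is hypothetical)
#
# Optional, hypothetical alias stage: placed last, it makes the lean image the
# default when no --target is given. With BuildKit (already required by the
# RUN --mount lines above), stages the target does not depend on are skipped,
# so "full" would not be built in that case.
FROM basic AS default
```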

Sep 01 '24 16:09 SpaceShaman