🔧 Fixed missing dependencies for additional file parsers in Dockerfile and some refactoring
This update introduces docker-compose support for easier project setup and deployment.
Added
Docker Support
- Added a docker-compose.yml file to simplify running the project in a Docker environment.
Refactor
- Renamed .env to .env.default for improved security and to prevent accidental exposure of sensitive information. Updated documentation to reflect the new environment configuration process.
- Included Gradio server settings in the .env.default file for easier configuration.
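As a rough illustration of the Gradio settings mentioned above, the entries in .env.default might look like the following. GRADIO_SERVER_NAME and GRADIO_SERVER_PORT are environment variables that Gradio reads at launch; the values shown are assumptions for a containerized setup, not copied from the actual file.

```shell
# Sketch of Gradio entries in .env.default (values are assumptions, not the file's actual contents).
# Bind to all interfaces so the app is reachable from outside the container.
GRADIO_SERVER_NAME=0.0.0.0
# Port the Gradio server listens on (7860 is Gradio's default).
GRADIO_SERVER_PORT=7860
```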
Documentation
- Updated the README with instructions for using docker-compose and improved command syntax highlighting.
Fixed
- Added missing dependencies for additional file parsers in Dockerfile
- Added unstructured[all-docs] to pip install for extended document support
Without these dependencies, parsing documents in any format other than PDF failed with various errors.
@SpaceShaman thanks for your contribution. From my side, I think your PR is quite good. I just have two concerns, both related to Docker, and we may need @taprosoft to weigh in as well:
- The entrypoint in the Dockerfile: should we change it from gradio launch.py to python launch.py? In my view, the first command should only be used during development.
- I see you uncommented some apt packages, such as ffmpeg and tesseract, in order to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help decide on this point?
@phv2312
> I see you uncommented some apt packages, such as ffmpeg and tesseract, in order to install unstructured. This handles more file types, but the trade-off is a larger Docker image (if I remember correctly from my last experiment, it adds around 2 GB). @taprosoft, can you help decide on this point?
In addition to uncommenting the apt dependencies, I also added libmagic-dev there and added unstructured[all-docs] to the pip installation, because without them I couldn't add CSV and TXT files.
We could use a multi-stage build to define a basic version and an extended version with additional dependencies for other file types, like below:
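```dockerfile
# Basic image: core dependencies only (PDF parsing).
FROM python:3.10-slim AS basic

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8

WORKDIR /app

# System packages needed to build and run the core libraries.
RUN apt-get update -qqy \
    && apt-get install -y \
        ssh git \
        gcc g++ \
        poppler-utils \
        libpoppler-dev \
    && apt-get clean \
    && apt-get autoremove -y

COPY . /app

# The ssh mounts require BuildKit.
RUN --mount=type=ssh pip install -e "libs/kotaemon[all]"
RUN --mount=type=ssh pip install -e "libs/ktem"
RUN pip install graphrag future
RUN pip install "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements"

ENTRYPOINT ["python", "app.py"]

# Full image: adds the packages needed to parse other file types.
FROM basic AS full

# Refresh the package lists in case the cached base layer is stale.
RUN apt-get update -qqy \
    && apt-get install -y \
        tesseract-ocr \
        tesseract-ocr-jpn \
        libsm6 \
        libxext6 \
        ffmpeg \
        libmagic-dev \
    && apt-get clean

# Quote the extra so the shell does not treat the brackets as a glob pattern.
RUN pip install "unstructured[all-docs]"
```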
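With a Dockerfile split into stages like this, either variant could be selected at build time with `docker build --target`. The snippet below is only a sketch of that idea: the image name `kotaemon` and its tags are hypothetical, and the helper just prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: pick a stage of the multi-stage Dockerfile with `docker build --target`.
# The image name "kotaemon" and its tags are hypothetical, chosen for illustration.

# Print the build command for one stage; only "basic" and "full" are defined above.
build_command() {
    case "$1" in
        basic|full)
            echo "docker build --target $1 -t kotaemon:$1 ."
            ;;
        *)
            echo "unknown stage: $1" >&2
            return 1
            ;;
    esac
}

# Print both commands; run them by hand or pipe the output to sh.
build_command basic
build_command full
```

Building without `--target` produces the last stage in the file, which here would be the full image. The `RUN --mount=type=ssh` steps need BuildKit, and `--ssh default` can be added to the command if those installs need access to an SSH agent.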