MIDAS
Should either Dockerize or better specify dependencies
I'm running Ubuntu 18.04 and so created the following initial Dockerfile to get around the CMake version requirement that prevents me from following the steps listed in the Demo section of the README:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install --yes \
build-essential \
cmake \
python-is-python3 \
&& apt-get clean \
&& rm --recursive --force \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/*
RUN mkdir /src
WORKDIR /src
COPY CMakeLists.txt ./
RUN mkdir --parents build/release \
&& cp CMakeLists.txt build/release/
COPY example ./example
COPY src ./src
COPY temp ./temp
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
&& cmake --build build/release --target Demo
I then build it via
# Wouldn't need to use `sudo` on macOS
sudo docker build . --tag midas
and run the compiled Demo app via
sudo docker run \
--tty \
--interactive \
--rm \
--volume $PWD/data:/src/data \
midas \
build/release/Demo
which, when shelling out to the Python scripts, aborts with the following
Traceback (most recent call last):
File "/src/util/EvaluateScore.py", line 20, in <module>
from pandas import read_csv
ModuleNotFoundError: No module named 'pandas'
since pandas is not available.
To better avoid the need for local environment debugging, my personal preference would be for a known-working Dockerfile.
Hi, about the minimum CMake version: the one given in CMakeLists.txt:17 is merely the version installed on my machine. I think it's safe to lower it a bit to fit your version.
Also, I notice that the CMake provided by apt is version 3.16.3, which is enough to satisfy the version requirement. Or, if you don't have sudo access, you can download CMake directly from its website; it can be installed locally without sudo.
And pandas is a Python package used for efficiently reading the output file. Maybe you can add one command to install it in your Docker image.
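For instance, a sketch of that one addition to the original Dockerfile might look like the following (package names are Ubuntu 20.04's; this is an assumption, not tested against this repo):

```dockerfile
# Add python3-pip alongside the existing packages, then install pandas,
# which util/EvaluateScore.py imports.
RUN apt-get update \
 && apt-get install --yes python3-pip \
 && pip3 install pandas \
 && apt-get clean \
 && rm --recursive --force /var/lib/apt/lists/* /tmp/* /var/tmp/*
```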
Dependencies are also specified in the README.md, under the section Demo.
By the way, the change at https://github.com/Stream-AD/MIDAS/commit/5c8563169e7adc454c8920a969b8b8a929245a4d#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL21-R21 makes the above Dockerfile fail earlier.
With that change reverted, the following Dockerfile succeeds:
FROM ubuntu:20.04 AS build
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install --yes \
build-essential \
cmake \
&& apt-get clean \
&& rm --recursive --force \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/*
RUN mkdir /src
WORKDIR /src
COPY CMakeLists.txt ./
RUN mkdir --parents build/release \
&& cp CMakeLists.txt build/release/
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
&& cmake --build build/release --target Demo
FROM python:3.9
RUN pip install pandas sklearn
COPY --from=build /src /src
WORKDIR /src
ENTRYPOINT ["build/release/Demo"]
Run via
# again...no need to use sudo if running on macOS
sudo docker build --tag midas .
sudo docker run \
--tty \
--rm \
--volume $PWD/data:/src/data \
--volume $PWD/temp:/src/temp \
midas
If you want, I can open a PR for that. The one piece I'd want to modify is how I install sklearn. Ideally I'd grab a pre-existing ML Python image to avoid the expensive build of that dependency.
With a Dockerfile, it would be trivial to set up CI, which would automatically inform you of test failures induced by newly opened PRs.
I found there's a missing header in Demo.cpp and made a commit to fix it.
Also, I wrote a Dockerfile based on yours, but it installs Python using apt, so the python image is not needed.
I tested it on my Windows 10 machine, using a command similar to yours.
Can you help test whether the Dockerfile works on your machine? Also, feel free to give suggestions on the Dockerfile, as I feel the initial build speed is a bit slow.
Also I wrote a Dockerfile based on yours, but it installs python using apt, so the python image is not needed.
It can end up being a style preference, but I structured it as a multi-stage build so that if you needed to tweak any of the dependencies, you could adjust the C++ deps or the Python deps independently without invalidating the rest of the cached Docker layers. If you prefer to install the C++ and Python deps in the same image, I'd recommend moving the RUN pip3 ... line up to immediately after you symlink python to python3. Since installing scikit-learn takes as long as it does, you want to do it prior to copying in your source files, since those are more likely to change (and changes to them invalidate all subsequently cached layers) than the Python dependencies.
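In single-image form, the ordering I mean looks roughly like this (a sketch; the exact packages and the symlink step are assumed from the discussion above, not copied from the actual Dockerfile):

```dockerfile
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND noninteractive

# Toolchain and Python first: these steps rarely change, so they stay cached.
RUN apt-get update \
 && apt-get install --yes build-essential cmake python3 python3-pip \
 && ln --symbolic /usr/bin/python3 /usr/local/bin/python

# Slow Python dependencies next, still before any source files are copied.
RUN pip3 install pandas scikit-learn

# Source files last: editing them only invalidates the layers below this point.
WORKDIR /src
COPY CMakeLists.txt ./
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
 && cmake --build build/release --target Demo
```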
Can you help test if the Dockerfile can work on your machine?
The build seems to succeed. However, the docker run doesn't make it through all of the demo steps. If I shell in (rather than relying on the built-in entrypoint), running Demo manually shows a segfault:
> sudo docker run \
--tty \
--interactive \
--rm \
--volume $PWD/data:/src/data \
--volume $PWD/temp:/src/temp \
--entrypoint bash \
midas
root@bd63e46c5d18:/MIDAS# build/release/Demo
Seed = 1605667365 // In case of reproduction
Segmentation fault (core dumped)
Also feel free to give suggestions to the Dockerfile, as I feel the initial build speed is a bit slow.
Definitely. It's much slower than I'd want for a build that I might need to run semi-regularly. The first improvement is usually rearranging the steps to keep the slow-to-build but infrequently changing steps (e.g., installing scikit-learn) from being ejected from the cache by earlier steps that are likely to change frequently (e.g., COPY src ./src, which is invalidated every time a file in src/ is modified).
The other thing to do is to use a publicly available base image that has the expensive steps already included. A number of ML/data science Python base images are available on Docker Hub that have common deps like scikit-learn and pandas already installed. You might want to keep the multi-stage structure (using multiple FROM declarations) so that you can swap out the base image in the future if necessary.
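Structurally, that would look something like the sketch below; the runtime base image named here is only a placeholder for whichever pre-built analytics image ends up being recommended:

```dockerfile
# Build stage: C++ toolchain only.
FROM ubuntu:20.04 AS build
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install --yes build-essential cmake
WORKDIR /src
COPY CMakeLists.txt ./
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
 && cmake --build build/release --target Demo

# Runtime stage: swap this base for any image that already ships pandas and
# scikit-learn; the choice stays an implementation detail of this stage only.
FROM jupyter/scipy-notebook
COPY --from=build /src /src
WORKDIR /src
ENTRYPOINT ["build/release/Demo"]
```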
I'll ask around to see if anyone recommends a particular Python analytics-centric base image.
Thanks for your advice. To be honest, I'm not quite familiar with docker, it would be very helpful if you can open a PR for it. I'll later test it on my Windows machine.
About the segmentation fault, I've met this several times, but those were because the code tried to read a data file that failed to open. As you might notice, there's almost no defensive code in most source files; this is on purpose, so other researchers can focus more on the core of the algorithm.
I'd be happy to open a PR. Will try to do that tomorrow evening or this weekend. (Today was a really long day.)