MIDAS
Should either Dockerize or better specify dependencies
I'm running Ubuntu 18.04 and so created the following initial Dockerfile to get around the CMake version requirement that prevents me from following the steps listed in the Demo section of the README:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install --yes \
build-essential \
cmake \
python-is-python3 \
&& apt-get clean \
&& rm --recursive --force \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/*
RUN mkdir /src
WORKDIR /src
COPY CMakeLists.txt ./
RUN mkdir --parents build/release \
&& cp CMakeLists.txt build/release/
COPY example ./example
COPY src ./src
COPY temp ./temp
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
&& cmake --build build/release --target Demo
I then build it via
# Wouldn't need to use `sudo` on macOS
sudo docker build . --tag midas
and run the compiled Demo app via
sudo docker run \
--tty \
--interactive \
--rm \
--volume $PWD/data:/src/data \
midas \
build/release/Demo
which, when shelling out to the Python scripts, aborts with the following
Traceback (most recent call last):
File "/src/util/EvaluateScore.py", line 20, in <module>
from pandas import read_csv
ModuleNotFoundError: No module named 'pandas'
since pandas is not available.
To better avoid the need for local environment debugging, my personal preference would be for a known-working Dockerfile.
Hi, about the minimum CMake version: the one given in CMakeLists.txt:17 is merely the version installed on my machine. I think it's safe to lower it a bit to fit your version.
Also, I notice that the CMake provided by apt is version 3.16.3, which is enough to satisfy the version requirement. Or, if you don't have sudo access, you can download CMake directly from its website; it can be installed locally without sudo.
And pandas is a Python package used for efficiently reading the output file. Maybe you can add one command to install it in your Docker image.
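For instance, a sketch of that one addition to the original Dockerfile might look like the following (package names are Ubuntu 20.04's; this is an assumption, not tested against this repo):

```dockerfile
# Add python3-pip alongside the existing packages, then install pandas,
# which util/EvaluateScore.py imports.
RUN apt-get update \
 && apt-get install --yes python3-pip \
 && pip3 install pandas \
 && apt-get clean \
 && rm --recursive --force /var/lib/apt/lists/* /tmp/* /var/tmp/*
```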
Dependencies are also specified in the README.md, under the section Demo.
By the way, the change at https://github.com/Stream-AD/MIDAS/commit/5c8563169e7adc454c8920a969b8b8a929245a4d#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL21-R21 makes the above Dockerfile fail earlier.
With that change reverted, the following Dockerfile succeeds:
FROM ubuntu:20.04 AS build
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install --yes \
build-essential \
cmake \
&& apt-get clean \
&& rm --recursive --force \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/*
RUN mkdir /src
WORKDIR /src
COPY CMakeLists.txt ./
RUN mkdir --parents build/release \
&& cp CMakeLists.txt build/release/
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
&& cmake --build build/release --target Demo
FROM python:3.9
RUN pip install pandas sklearn
COPY --from=build /src /src
WORKDIR /src
ENTRYPOINT ["build/release/Demo"]
Run via
# again...no need to use sudo if running on macOS
sudo docker build --tag midas .
sudo docker run \
--tty \
--rm \
--volume $PWD/data:/src/data \
--volume $PWD/temp:/src/temp \
midas
If you want, I can open a PR for that. The one piece I'd want to modify is how I install sklearn. Ideally I'd grab a pre-existing ML Python image to avoid the expensive build of that dependency.
With a Dockerfile, it would be trivial to set up CI, which would automatically inform you of test failures induced by newly opened PRs.
I found there's a missing header in Demo.cpp and made a commit to fix it.
Also, I wrote a Dockerfile based on yours, but it installs Python using apt, so the python image is not needed.
I tested it on my Windows 10 machine, using a command similar to yours.
Can you help test whether the Dockerfile works on your machine? Also, feel free to give suggestions on the Dockerfile, as I feel the initial build speed is a bit slow.
Also I wrote a Dockerfile based on yours, but it installs python using apt, so the python image is not needed.
It can end up being a style preference, but I structured it as a multi-stage build so that if you needed to tweak any of the dependencies, you could adjust the C++ deps or the Python deps independently without invalidating the rest of the cached Docker layers. If you prefer to install the C++ and Python deps in the same image, I'd recommend moving the RUN pip3 ... line up to immediately after you symlink python to python3. Since installing scikit-learn takes as long as it does, you want to do it prior to copying in your source files, since those are more likely to change (and changes to them invalidate all subsequently cached layers) than the Python dependencies.
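In single-image form, the ordering I mean looks roughly like this (a sketch; the exact packages and the symlink step are assumed from the discussion above, not copied from the actual Dockerfile):

```dockerfile
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND noninteractive

# Toolchain and Python first: these steps rarely change, so they stay cached.
RUN apt-get update \
 && apt-get install --yes build-essential cmake python3 python3-pip \
 && ln --symbolic /usr/bin/python3 /usr/local/bin/python

# Slow Python dependencies next, still before any source files are copied.
RUN pip3 install pandas scikit-learn

# Source files last: editing them only invalidates the layers below this point.
WORKDIR /src
COPY CMakeLists.txt ./
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
 && cmake --build build/release --target Demo
```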
Can you help test if the Dockerfile can work on your machine?
The build seems to succeed. However, the docker run doesn't make it through all of the demo steps. If I shell in (rather than relying on the built-in entrypoint), running Demo manually shows a segfault:
> sudo docker run \
--tty \
--interactive \
--rm \
--volume $PWD/data:/src/data \
--volume $PWD/temp:/src/temp \
--entrypoint bash \
midas
root@bd63e46c5d18:/MIDAS# build/release/Demo
Seed = 1605667365 // In case of reproduction
Segmentation fault (core dumped)
Also feel free to give suggestions to the Dockerfile, as I feel the initial build speed is a bit slow.
Definitely. It's much slower than I'd want for a build that I might need to run semi-regularly. The first improvement is usually rearranging the steps to keep the slow-to-build but infrequently changing steps (e.g., installing scikit-learn) from being ejected from the cache by earlier steps that are likely to change frequently (e.g., COPY src ./src, which is invalidated every time a file in src/ is modified).
The other thing to do is to use a publicly available base image that has the expensive steps already included. A number of ML/data science Python base images are available on Docker Hub that have common deps like scikit-learn and pandas already installed. You might want to keep the multi-stage structure (using multiple FROM declarations) so that you can swap out the base image in the future if necessary.
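Structurally, that would look something like the sketch below; the runtime base image named here is only a placeholder for whichever pre-built analytics image ends up being recommended:

```dockerfile
# Build stage: C++ toolchain only.
FROM ubuntu:20.04 AS build
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install --yes build-essential cmake
WORKDIR /src
COPY CMakeLists.txt ./
COPY example ./example
COPY src ./src
COPY util ./util
RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
 && cmake --build build/release --target Demo

# Runtime stage: swap this base for any image that already ships pandas and
# scikit-learn; the choice stays an implementation detail of this stage only.
FROM jupyter/scipy-notebook
COPY --from=build /src /src
WORKDIR /src
ENTRYPOINT ["build/release/Demo"]
```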
I'll ask around to see if anyone recommends a particular Python analytics-centric base image.
Thanks for your advice. To be honest, I'm not quite familiar with docker, it would be very helpful if you can open a PR for it. I'll later test it on my Windows machine.
About the segmentation fault, I've met this several times, but those were because the code tried to read a data file that failed to open. As you might notice, there's almost no defensive code in most source files; this is on purpose, so other researchers can focus more on the core of the algorithm.
I'd be happy to open a PR. Will try to do that tomorrow evening or this weekend. (Today was a really long day.)