
Docker in docker support

Open de-code opened this issue 7 years ago • 4 comments

It's great to be able to specify a docker image, which keeps things quite simple and portable. Following the docker approach of separating each part into its own container, I would want to be able to use an existing image as a container within docker, e.g. via docker compose, similar to Kubernetes pods.

I understand that's currently not possible because the docker container is not privileged. That could easily be rectified locally, but I'm not sure about running it in the cloud.
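Roughly what that would need, expressed with plain docker commands; the image name and compose file are just placeholders, and it assumes the image can run its own docker daemon plus docker-compose (the docker-in-docker setup):

docker run --privileged my-pipeline-image \
  docker-compose -f /workspace/docker-compose.yml up

So the job container would itself start the sibling containers, which is what requires --privileged.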

Is that something that will be supported or do you have another recommendation to achieve that?

de-code avatar Oct 18 '18 19:10 de-code

Can you describe in more detail what it is that you are looking to do?

The pipelines.run API indeed does not support running with --privileged. I can't speak for the Cloud Health team that develops the Pipelines API, but it seems unlikely to be a feature supported without a compelling use-case.

mbookman avatar Oct 19 '18 00:10 mbookman

I'm mainly looking at options for bulk pre-processing and conversion as part of our ScienceBeam project. In general we found it more manageable to use separate containers rather than trying to build one container with everything in it. For example, we would want to:

  • start a conversion tool docker container
  • for each document (in the worker queue):
    • retrieve document from bucket
    • submit document to REST API
    • save to bucket
  • stop container etc.

The docker container should stay up across documents, as loading a machine learning model can itself take time.
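As a rough sketch of that per-document loop (bucket paths, port and endpoint are placeholders, and it assumes gsutil and curl are available on the worker):

for doc in $(gsutil ls gs://my-bucket/input/*.pdf); do
  gsutil cp "$doc" /tmp/input.pdf                        # retrieve document from bucket
  curl --data-binary @/tmp/input.pdf \
    http://localhost:8080/convert > /tmp/output.xml      # submit document to the REST API
  gsutil cp /tmp/output.xml \
    "gs://my-bucket/output/$(basename "$doc" .pdf).xml"  # save result to bucket
done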

Conversion tools could be, for example, GROBID, Science Parse, or in the future our own experimental model. We would use their already published docker containers. There may be additional steps we want to chain in front, e.g. converting a Word document to PDF or performing an XSLT transformation.

Currently we are using Apache Beam / Dataflow for bulk tasks, but it has the same limitation of not being able to use a docker container within a Dataflow worker (none of these tools are available as packages via the system package manager, and they are written in different languages). There is also the deprecated dockerflow, which now recommends dsub; I understand dockerflow was likewise using the Pipelines API to run docker tasks.

Options so far to run it in the cloud in parallel:

  • create a container with everything in it, which could get messy; this would work with dsub but not with Dataflow
  • create and tear down a cluster running the docker containers we want to use; we could then call their APIs from dsub or Dataflow, but it's another layer of complexity

In general I could see other use-cases when treating containers as the applications we want to use as part of our pipeline. It is possible that I am looking at it the wrong way around.

de-code avatar Oct 19 '18 10:10 de-code

Hi @de-code!

dsub has thus far tried to really focus on simplicity with its interface (the command line and optional TSV file) to make it easier for scientists and computational biologists to develop algorithms locally and quickly transition to running those algorithms at scale (on Cloud). Some of the simplicity has also been forced by the architecture of the original Google Pipelines API (v1alpha2), which only supported running a single container.

With the new Pipelines v2alpha1, we now have the ability to run a series of containers (multiple Actions in a single Pipeline), including running containers in the background. We have developed the google_v2 provider to support the new API and can now look more at how features of v2alpha1 can be made available in dsub.

We have recently been discussing how best to enable the execution of multiple containers within the same dsub task while not overly complicating the interface for users.

Some options/use cases:

(1) Enable multiple scripts/commands

If we take the approach of simply making the following possible:

1. Localize inputs
2. Run a --command or --script with --image
3. If (2) succeeds, run another --command or --script with --image (repeat)
4. Delocalize outputs

Then we could support something like:

dsub \
  ...
  --input INPUT=gs://bucket/path/input.txt \
  --output OUTPUT=gs://bucket/path/output.txt \
  --image my-image-1 \
  --command my-command-1 \
  --image my-image-2 \
  --command my-command-2 \
  *etc*

or

dsub \
  ...
  --input INPUT=gs://bucket/path/input.txt \
  --output OUTPUT=gs://bucket/path/output.txt \
  --image my-image-1 \
  --script my-script-1.sh \
  --image my-image-2 \
  --script my-script-2.sh \
  *etc*

this would be fairly straightforward to do.

However, one immediately starts to think of other features.

(2) Enable multiple scripts/commands and support post-step delocalization

What I might really want is:

1. Localize inputs
2. Run a --command or --script with --image
3. If (2) succeeds, de-localize some files to GCS
4. Run another --command or --script with --image
5. If (4) succeeds, de-localize some files to GCS (repeat)

The idea here is that after a given step completes, it produces some output that we'd like to checkpoint. If a subsequent step fails, we don't want to lose all of the previous output. This seems reasonable but then ... what you want next is to be able to restart that same task; have it localize those outputs from that previous step and try the failed step again.

This might be worth enabling (post-step delocalization), even without support for automated retries. Might.
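Today this can be approximated inside a single --script by copying intermediate files explicitly; a minimal sketch, assuming gsutil is available in the image and with placeholder step commands:

step-one --in "${INPUT}" --out /tmp/intermediate.dat
gsutil cp /tmp/intermediate.dat gs://bucket/path/checkpoints/intermediate.dat   # explicit checkpoint

step-two --in /tmp/intermediate.dat --out "${OUTPUT}"                           # OUTPUT delocalized by dsub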

(3) Enable multiple scripts/commands and support running commands in the background

I believe that this is what you are requesting. Something like this might not be too bad:

dsub \
  ...
  --input INPUT=gs://bucket/path/input.txt \
  --output OUTPUT=gs://bucket/path/output.txt \
  --image my-image-1 \
  --flags RUN_IN_BACKGROUND \
  --script my-script-1.sh \
  --image my-image-2 \
  --script my-script-2.sh \
  *etc*

I'm unclear whether there is cross-container communication currently supported by the Pipelines API v2alpha1. I believe that it is, but will need to inquire with the Cloud Health team on the implementation details and then can determine whether the command-line needs to be expanded to support it.
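If the containers do turn out to share a network, the two scripts above could stay quite small; a sketch with placeholder service name, port, and health endpoint:

# my-script-1.sh -- runs in the background container; starts the service
my-service --port 8080

# my-script-2.sh -- waits for the service, then does the per-document work
until curl -sf http://localhost:8080/health; do sleep 5; done
process-documents --service-url http://localhost:8080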

Does this sound like it would satisfy your use case?

BTW, there is always the option of having a task YAML file rather than putting everything on the command-line:

my-task.yaml

steps:
- image: my-image-1
  flags: RUN_IN_BACKGROUND
  script: my-script-1.sh
- image: my-image-2
  script: my-script-2.sh

and then just add:

--task-file my-task.yaml

But again, we are striving for a certain simplicity.

mbookman avatar Oct 26 '18 00:10 mbookman

Hi @mbookman, thank you for the response.

Keeping it simple is a good strategy.

I think I am still a bit hazy about what is dsub and what is the Pipelines API, but looking at the API helps a bit more.

I think I have two use-cases:

  1. Run secondary docker containers on-demand and communicate with them via a REST API (the use case described in the initial issue description)
  2. Run many short lived tasks

Use case 2 was something I used as an initial test case. The issue there was that it created a separate VM for every single short-lived task. I was thinking of maybe passing in a file list instead, but then it wouldn't work so well with the current transparent mounts, as I would want to upload the results of each task separately.
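Roughly what I mean by the file-list variant, with the script doing its own per-document upload instead of relying on the transparent output mounts (placeholder names, and assuming gsutil is available in the image):

while read -r doc; do
  process-one "$doc" /tmp/result.xml
  gsutil cp /tmp/result.xml "gs://my-bucket/output/$(basename "$doc" .pdf).xml"
done < "${FILE_LIST}"   # FILE_LIST passed in via --input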

The idea here is that after a given step completes, it produces some output that we'd like to checkpoint. If a subsequent step fails, we don't want to lose all of the previous output.

That part of your option 2 would address the main issue with use case 2.

It could also potentially allow doing some setup in the first step (e.g. adding dependencies), which might reduce the need to build a separate image.

This seems reasonable but then ... what you want next is to be able to restart that same task; have it localize those outputs from that previous step and try the failed step again.

That is another challenge I saw with the current Pipelines API. I'm comparing it with Dataflow, which may not be fair. With Dataflow I can have metric counters, and scanning through the logs seems easier because it combines the logs from all workers and lets me filter them. I believe Dataflow has some retry built in, but so far I have opted to log failures instead (and use the metric counters). Most of my errors wouldn't be resolved by re-running them without fixing something. I would then later run the pipeline again with a --resume flag (app specific: it checks which output files are already present). Of course that depends very much on the use case, but for this to work well I'd need to be able to fix something and then run the tasks again, and I would also need some sort of overview of what went wrong. Using the Pipelines API for use case 2 would allow me to build my own docker container (which can also be a burden for some projects), whereas I think Dataflow currently caters better for that use case in terms of monitoring, visualisation and distributing the load.
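For context, the app-specific resume check is roughly something like this (placeholder paths; convert-document stands in for the actual conversion, and gsutil stat returns non-zero when the object does not exist):

if gsutil -q stat "gs://my-bucket/output/${name}.xml"; then
  echo "skipping ${name}, output already present"
else
  convert-document "${name}"
fi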

I'm unclear whether there is cross-container communication currently supported by the Pipelines API v2alpha1. I believe that it is, but will need to inquire with the Cloud Health team on the implementation details and then can determine whether the command-line needs to be expanded to support it.

If the cross-container communication works, then that could indeed satisfy my first use-case. I would probably want to start and stop a service. But it would probably also be okay if the service container was just stopped in the end, without issuing a separate command.

Just to illustrate that for my use case:

steps:
- image: service-image-1 
  flags: RUN_IN_BACKGROUND
  command: start my-service
- image: service-image-2
  flags: RUN_IN_BACKGROUND
  command: start my-service
- image: my-image
  script: my-script-1.sh
- image: service-image-1 
  flags: RUN_IN_BACKGROUND
  command: stop my-service  # optional

To run it in parallel, I would then need to split the list of things to do and create separate task.yaml files, I guess. Unless the task.yaml file could contain the information about what needs to be run in parallel.
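(With the interface as it is today, I suppose the closest thing would be to generate a TSV for --tasks with one row per document, roughly:

printf '%s\t%s\n' '--input INPUT' '--output OUTPUT' > tasks.tsv
for doc in $(gsutil ls gs://my-bucket/input/*.pdf); do
  name=$(basename "$doc" .pdf)
  printf '%s\t%s\n' "$doc" "gs://my-bucket/output/${name}.xml" >> tasks.tsv
done
dsub ... --tasks tasks.tsv

but that is back to one VM per document, which is use case 2 again.)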

de-code avatar Oct 29 '18 15:10 de-code