
Container accessing Docker API and mounting Azure File Storage breaks whole machine

Open manixx opened this issue 8 years ago • 4 comments

We have a 5-node cluster (3 managers, 2 workers) and I'm working on a small helper image to view the container logs nicely. In theory my container makes some HTTP requests to the Docker API to get the IDs of the tasks, and mounts the Azure File Storage which holds the actual log files. Inspired by the editions_logger (image docker4x/logger-azure:17.06.0-ce-azure1), I also want to mount the actual storage right inside the container.

In my case the script isn't finished yet, so please don't judge the script itself. :) I wrote a simple Node.js app which mounts the storage and fetches the tasks.

This is my Dockerfile:

FROM node:8-alpine

ENV APP_DIR            /app
ENV DOCKER_HOST        /var/run/docker.sock
ENV DOCKER_API_VERSION v1.30

RUN apk add --update cifs-utils

RUN mkdir -p $APP_DIR
WORKDIR $APP_DIR

COPY package* $APP_DIR/
RUN npm install
COPY . $APP_DIR

CMD ["npm", "start"]

To do requests to the Docker API:

const path = require('path');
const http = require('http');

/*
 * This is used to do requests against the Docker API.
 */
module.exports = (method, uri, data) => {
    if(!process.env.DOCKER_HOST || !process.env.DOCKER_API_VERSION) {
        throw Error('Please provide DOCKER_HOST and DOCKER_API_VERSION to contact Docker API properly.');
    }

    const options = {
        socketPath: process.env.DOCKER_HOST,
        method,
        headers: { 'Content-Type': 'application/json' },
        path: path.join('/', process.env.DOCKER_API_VERSION, uri)
    };
    let rawData = '';

    return new Promise((resolve, reject) => {
        const req = http.request(options, res => {
            res.setEncoding('utf8');
            res.on('error', reject);
            res.on('data', chunk => { rawData += chunk; });
            res.on('end', () => {
                if ([200, 201].indexOf(res.statusCode) === -1) {
                    return reject(Error(`[${res.statusCode}] ${options.path} (${JSON.stringify(data)}) failed: ${rawData}`));
                }
                resolve(JSON.parse(rawData));
            });
        });
        req.on('error', reject); // without this, socket errors crash the process
        req.end(JSON.stringify(data));
    });
}

And the actual script:

const request = require('./request');
const fs = require('fs');
const { execSync } = require('child_process');

const storage = '//xxx.file.core.windows.net/xxx';
const logmountFolder = '/logmnt';
const username = 'xxx';
const password = 'xxx';

if(!fs.existsSync(logmountFolder)) {
    fs.mkdirSync(logmountFolder);
}
execSync(`mount -t cifs ${storage} ${logmountFolder} -o vers=2.1,username=${username},password=${password},dir_mode=0777,file_mode=0777,uid=0,gid=0`);
const files = fs.readdirSync(logmountFolder);

request('get', '/tasks?filters={"label":["com.docker.stack.namespace=production"]}')
.then(tasks => {
    tasks.forEach(task => {
        console.log('task', task.ID);

        files.forEach(file => {
            if(file.indexOf(task.ID) !== -1) {
                console.log('file', file);
            }
        });
    });
})
.catch(console.error);
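One detail worth flagging in the request above: the filters JSON is spliced into the query string unencoded. A sketch of building the same URI with encodeURIComponent (the label value is the one from my stack; everything else is illustrative):

```javascript
// Sketch: URL-encode the filters map instead of embedding raw JSON
// in the query string, so braces and quotes survive the transport.
const filters = { label: ['com.docker.stack.namespace=production'] };
const uri = `/tasks?filters=${encodeURIComponent(JSON.stringify(filters))}`;

console.log(uri);
// → /tasks?filters=%7B%22label%22%3A%5B%22com.docker.stack.namespace%3Dproduction%22%5D%7D
```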

Expected behavior

I used this command to run it on a manager machine:

docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --privileged infra-log

And it works without any trouble, but only on the first run.

Actual behavior

The second time, the whole machine breaks and is unable to rejoin the cluster after a restart. About 3-5 minutes after each restart the whole machine breaks again, continuously. After a number of restarts Azure deallocates the machine and creates a new one in the scale set (or reimages the broken machine; I can't really tell).

In the past I have also reimaged the broken machine and rejoined it to the cluster by hand.

Information

I ran docker-diagnose after Azure created the new machine:

swarm-manager000001:~$ docker-diagnose
curl: (7) Failed to connect to 10.0.0.7 port 44554: Connection refused
OK hostname=swarm-manager000002 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-manager000003 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-worker000000 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-worker000001 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
Done requesting diagnostics.
Your diagnostics session ID is 1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
Please provide this session ID to the maintainer debugging your issue.

I also got the docker.log file from the broken machine after a bunch of restarts, but I'm not going to post it here because it may contain sensitive information. I can send it to you directly.

manixx avatar Jul 18 '17 14:07 manixx

@manixx Feel free to join our Docker Community slack channel in order to share some of the logs: https://blog.docker.com/2016/11/introducing-docker-community-directory-docker-community-slack/

One thing to keep in mind is that you are running the container as privileged, which means it has access to all devices on the host: https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities I would limit the devices and capabilities to what the container actually needs, see: https://github.com/moby/moby/issues/22197#issuecomment-212506571
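For reference, a less-privileged invocation might look like the sketch below. SYS_ADMIN plus DAC_READ_SEARCH are the capabilities suggested in the linked moby issue for in-container CIFS/NFS mounts; treat this as a starting point rather than a verified fix (infra-log is the image name from above):

```shell
# Sketch: drop --privileged, grant only the capabilities the CIFS
# mount needs (per moby/moby#22197).
docker run --rm -ti \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --cap-add SYS_ADMIN \
  --cap-add DAC_READ_SEARCH \
  infra-log
```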

FrenchBen avatar Jul 19 '17 18:07 FrenchBen

@FrenchBen Thanks for the reply. I will try to minimize the devices and try again.

manixx avatar Jul 20 '17 08:07 manixx

@FrenchBen I tried it with the --cap-add flag and added only SYS_ADMIN:

docker run -v /var/run/docker.sock:/var/run/docker.sock --cap-add SYS_ADMIN infra-log

which leads to the same result.

manixx avatar Jul 24 '17 07:07 manixx

@manixx Do you have a repo, you can share that has all of the above, so that I can take a look at your build and test it as well on my end?

FrenchBen avatar Jul 24 '17 17:07 FrenchBen