
Feature: Add the capability to pin CPU(s) with docker service create in the same way this can be done with docker run --cpuset-cpus

Open piersharding opened this issue 8 years ago • 23 comments

Hi -

I am working on an HPC project that needs CPU pinning for containers, so that specific tasks run on the appropriate CPU and can speak directly to the host network adapter (a topic covered well in #25303). This could be achieved by having a head-node task (perhaps on the swarm manager node) that uses Docker Swarm to launch an agent container on each compute node as a service; each agent container could then launch the appropriately configured task containers locally using docker run --cpuset-cpus="1,2" etc. However, it would be much cleaner/nicer to manage the entire deployment through Docker Swarm, so that orchestration features would not need to be reimplemented in the agent nodes - something that would simplify the entire orchestration control structure.

On this basis, please consider adding the --cpuset-cpus parameter to docker service create.
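To make the request concrete: the run-level flag already exists, while the service-level flag below is the hypothetical addition being asked for (image and service names are illustrative):

```shell
# Works today: pin a standalone container to cores 1 and 2
docker run --cpuset-cpus="1,2" my-hpc-image

# Requested, does NOT exist at the time of writing (hypothetical flag):
docker service create --cpuset-cpus="1,2" --name hpc-task my-hpc-image
```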

Thanks, Piers Harding.

piersharding avatar Jan 26 '17 18:01 piersharding

Could you explain why these tasks have to be pinned to specific CPUs, and why --limit-cpu / --reserve-cpu is not sufficient for your use case?
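For context, those two flags take fractional CPU counts rather than core IDs (values here are illustrative); they reserve and cap capacity but do not tie a task to particular cores:

```shell
# Reserve 1 CPU's worth of capacity and cap usage at 2 CPUs,
# without pinning the task to any specific core:
docker service create --reserve-cpu 1 --limit-cpu 2 --name my-svc nginx
```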

also, /cc @crosbymichael

thaJeztah avatar Jan 27 '17 08:01 thaJeztah

Hi -

Yes - I think for most use cases they would be sufficient; I just had not found enough background on them to understand them (I'm hoping that http://stackoverflow.com/questions/38200028/docker-service-limits-and-reservations is correct). This should do what I need - thanks. However, I'm wondering if there are other cases in heterogeneous systems where CPUs with different instruction sets, or co-processors, are addressable by CPU id only (rather than by --device, as for GPGPUs etc.)?

Cheers, Piers Harding.

piersharding avatar Jan 30 '17 22:01 piersharding

@thaJeztah, processor affinity is sometimes very important for achieving high (and predictable) performance. It is critical for projects like Aeron and LMAX Disruptor. Having support for processor affinity at the swarm-mode level is highly desirable in some use cases, e.g. when running Aeron's media driver as a service.

barorion avatar Feb 16 '17 05:02 barorion

@thaJeztah there is still another problem to resolve: even if we add a processor-affinity feature to Docker, we still can't use it well, because we can't allocate one specific replica to a specific NUMA node. Certainly, you could bind all replicas of one service to one NUMA node and another service to another NUMA node, but that introduces other problems: each service has a different level of hardware resource requirements, so in some cases we can't utilize a physical host efficiently. So I still think Docker Swarm should allow assigning a different option value to each replica, e.g.: docker service update --cpuset=0
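The per-replica idea above could be sketched as a helper that maps a replica's slot number to a NUMA node's cpuset, round-robin (hypothetical helper and core layout, not part of Docker):

```shell
# Hypothetical topology: two NUMA nodes, 8 cores each (adjust to your host).
numa_cpusets="0-7 8-15"

# Map a replica slot (1-based) to a NUMA node's cpuset, round-robin.
cpuset_for_replica() {
  slot=$1
  set -- $numa_cpusets          # split the list into positional params
  n=$#                          # number of NUMA nodes
  idx=$(( (slot - 1) % n + 1 )) # 1-based node index for this slot
  eval "echo \"\${$idx}\""
}
```

With this layout, replicas 1 and 2 land on different NUMA nodes, and replica 3 wraps back to the first node's cores.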

yonger1516 avatar Mar 17 '17 00:03 yonger1516

any plans for this?

barorion avatar Jan 24 '18 15:01 barorion

+1. Many workloads that would like to run in a container also would like to maintain their CPU affinity.

qlyoung avatar Jan 08 '20 07:01 qlyoung

Any news? It's 2020, guys. Just say something if you have any plan for this. Thanks.

erfansahaf avatar Jan 08 '20 12:01 erfansahaf

AFAIK, I haven't seen anyone working on this, but feel free to work on a design/proposal. Things to take into account: not all nodes in the cluster may be equal (nodes can have different numbers of CPUs/cores), so the scheduler would likely have to account for that somehow (a task for a service with --cpuset-cpus="3,4" should only land on a (worker) node that actually has CPUs 3 and 4).
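That scheduling constraint can be sketched as a feasibility check (hypothetical helper, not actual swarm scheduler code): a node is only eligible if every CPU index in the requested cpuset exists on it.

```shell
# Return success iff a node with $2 CPUs (indices 0..$2-1) can satisfy
# the cpuset expression in $1 (e.g. "3,4" or "0-2,5").
node_satisfies_cpuset() {
  cpuset=$1
  ncpus=$2
  max=0
  for part in $(echo "$cpuset" | tr ',' ' '); do
    hi=${part##*-}               # upper bound of a range, or the bare index
    if [ "$hi" -gt "$max" ]; then max=$hi; fi
  done
  [ "$max" -lt "$ncpus" ]        # highest requested index must exist
}
```

With --cpuset-cpus="3,4", a 4-CPU node (indices 0-3) is rejected, while an 8-CPU node qualifies.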

thaJeztah avatar Jan 08 '20 14:01 thaJeztah

@thaJeztah thanks for the reply. Slightly off topic, but as I understand it, it should be possible to set CPU affinity, a la taskset, from within the container itself. However, this requires the container to be privileged, and swarm does not support privileged containers, so this strategy is not viable with swarm either. Same deal with niceness. Have I got this right? I definitely see the challenges with CPU pinning in a cluster-oriented environment like swarm, given heterogeneous node hardware, but I don't see why swarm can't allow privileged containers. If you can shed any light on the challenges there, I'd appreciate it.

qlyoung avatar Jan 08 '20 18:01 qlyoung

+1 here - just surprised such an obvious feature is not implemented yet. Out of curiosity, is swarm still being developed?

nevgeniev avatar May 11 '20 13:05 nevgeniev

Just wanted to add my use case: I need to limit both the CPUs and the memory nodes a process can utilize, to eliminate any accidental cross-NUMA memory-bandwidth traffic, using the --cpuset-cpus AND --cpuset-mems arguments to docker run.
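For reference, the two run-level flags mentioned here work together like this (core and node numbers are illustrative; both flags exist for docker run but not for docker service create):

```shell
# Keep both CPU and memory allocation on NUMA node 0:
# cores 0-7 and memory node 0 only, avoiding cross-NUMA traffic.
docker run --cpuset-cpus="0-7" --cpuset-mems="0" my-numa-sensitive-app
```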

d3matt avatar Aug 12 '20 22:08 d3matt

Hello,

Docker Swarm should include support for NUMA (--cpuset-cpus, --cpuset-mems). There are many time-sensitive applications out there (market-data feed handlers, pricing engines) that require this feature. Kubernetes (k8s) has supported this with its 'CPU Manager' since 2018.

Many applications in HFT and algorithmic trading use NUMA, and Docker already leverages this when you use 'docker run'; it should be available at the Swarm level.

jnunezgts avatar Oct 20 '20 09:10 jnunezgts

AFAIK, I haven't seen anyone working on this, but feel free to work on a design/proposal. Things to take into account: not all nodes in the cluster may be equal (nodes can have different numbers of CPUs/cores), so the scheduler would likely have to account for that somehow (a task for a service with --cpuset-cpus="3,4" should only land on a (worker) node that actually has CPUs 3 and 4).

Kubernetes has this feature.

jnunezgts avatar Oct 20 '20 09:10 jnunezgts

Kubernetes has this feature.

Yes, it does, and this is what made me have to ditch Swarm for k8s. Unfortunately.

qlyoung avatar Jan 02 '21 07:01 qlyoung

OK guys. The time has come - time to put an end to this madness. I am going to develop this feature.

Before we actually start to discuss the possible drawbacks and corner cases of this feature, I want to develop a PoC for the very specific case when all nodes are equal in terms of resources. This is very easy IMO: just pass cpuset-cpus down to container creation and see what blows up ;)

Right now I am trying to trace the path from docker service create --limit-cpu [...] --reserve-cpu [...] down to container creation.

Limits and reservations are defined in SwarmSpec (api/types/swarm/task.go:96) and travel down to the CreateService function (daemon/cluster/services.go:182). This is where I'm currently stuck. Any idea where it goes from there? I am trying to track down the place where the ContainerSpec is derived from the SwarmSpec, because I suppose that will be where most of the work needs to be done.

Best regards Alex, StreamVX

aleek avatar Feb 08 '22 20:02 aleek

Also, I feel like 80% of you need the simplest case - 1 swarm node, or N identical nodes. Am I correct?

@jnunezgts @d3matt @nevgeniev @qlyoung @piersharding

I can implement the simple case, and then we can work out how to handle the corner cases.

aleek avatar Feb 08 '22 20:02 aleek

Hello Aleksander,

In our scenario we were not contemplating multiple swarm nodes.

--Jose

jnunezgts avatar Feb 08 '22 20:02 jnunezgts

In my case the nodes can be heterogeneous. My use case was to inspect the available CPUs and select one to pin against from within the container.

qlyoung avatar Feb 08 '22 20:02 qlyoung

+1 for the feature, I would really need it. I am running an HPC cluster with multiple identical nodes, and I would like my containers to be limited to a specific set of cores depending on which GPUs they utilize.

carloalbertobarbano avatar Mar 30 '22 11:03 carloalbertobarbano

+1 for the feature, I would really need it. I am running an HPC cluster with multiple identical nodes, and I would like my containers to be limited to a specific set of cores depending on which GPUs they utilize.

I already have a working PoC. Give me a bit more time; I need to figure out a few other things, and then I'll publish the code.

Alex

aleek avatar Mar 31 '22 06:03 aleek

While waiting for @aleek's patch, we developed a small utility that runs as a swarm service and takes care of automatically assigning CPU cores based on the required GPUs (if GPU computing is your use case). Documentation is lacking, but the idea is very simple and could perhaps be adapted to different use cases: https://github.com/EIDOSlab/swarm-cpupin

carloalbertobarbano avatar Apr 10 '22 20:04 carloalbertobarbano

Could you explain why these tasks have to be pinned to specific CPUs, and why --limit-cpu / --reserve-cpu is not sufficient for your use case?

We use Docker / Docker Compose for development and CI, which I think is a very common use case. Then we use a proper orchestrator in production (we are slowly migrating our physical VMs to k8s).

Our use case is self-hosted GitHub Actions runners, which are very sensitive to an overloaded CPU. They need a dedicated CPU to ensure the runner does not disconnect and break the GitHub Actions job. Since we are using Docker Compose to run our CI tests, being able to pin the containers to specific CPUs solves the issue, and it does work using the 2.4 Compose specification. If we have 3 containers and each is given 8 cores, that does not mean all 3 containers combined will use a max of 8 cores, but rather a max of 24 cores. So the choice is either to seriously harm the parallelization of our tests (give each container (max cores - 2) / 3 cores) or to CPU-pin so that the 3 containers share the same cores and can never use all of them (pin each to 0 through (max cores - 2)).
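Under the 2.4 Compose file format, that pinning looks roughly like this (service name, image, and core range are illustrative):

```yaml
version: "2.4"
services:
  tests:
    image: my-ci-image     # hypothetical image name
    cpuset: "0-21"         # each such service shares cores 0-21
```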

EDIT: It actually looks like my issue can be better solved using cset, which takes Docker Compose out of the equation:

sudo apt install -y cpuset
sudo cset set --cpu=0-21 --set=docker  # assuming 24 CPUs, reserve 2
sudo systemctl restart docker

AngellusMortis avatar Apr 12 '22 14:04 AngellusMortis

One use case for me: I have a 96-core build machine and I want to run 8 agents on it. If I use reservations and limits, nproc still returns 96 cores, and some build systems will use that to run nproc jobs. Running 96 jobs with a CPU limit of 8 is of course suboptimal. With cpusets, however, nproc reports the proper number of available cores. So in the end, all I want is for nproc to report max(1, floor(limit-cpu)) cores. I don't necessarily need or want CPU pinning.

But a nice way of doing cpu pinning would be to have cpuset be computable, like this: {{- .Task.ID*8 }}-{{- ((.Task.ID+1)*8)-1 }},64-96
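The computable-cpuset idea could be prototyped outside of swarm templates (which cannot do arithmetic today) with a hypothetical helper that gives each task slot its own block of cores:

```shell
# Print the cpuset for a given task slot (1-based), with $2 cores per agent.
cpuset_for_slot() {
  slot=$1
  per=$2
  lo=$(( (slot - 1) * per ))
  hi=$(( slot * per - 1 ))
  echo "${lo}-${hi}"
}
```

Slot 1 gets cores 0-7, slot 2 gets 8-15, and so on, mirroring the template sketch above (which is 0-based on .Task.ID).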

KarstenB avatar Oct 08 '22 07:10 KarstenB

Any news about this?

lordraiden avatar Oct 03 '23 21:10 lordraiden

Any movement on this? ScyllaDB gets a massive performance boost from cpu pinning...

AustEcon avatar Nov 21 '23 17:11 AustEcon