ubc-cluster-goodies

queue_cc for multi-gpu setup

Open kmyi opened this issue 6 years ago • 3 comments

Implement a new version of queue_cc that runs multiple jobs on a node, dedicating each to a GPU. This requires on-the-fly creation of job scripts.

kmyi avatar Dec 07 '18 05:12 kmyi

Hey @weiweisun2018, not urgent, but it would be great if you could do this. Basically, the system currently prefers jobs that can take full nodes. Since our jobs can be batched together, it'd be a good idea to

  1. grab N jobs (4 for cedar, 2 for graham)
  2. create a new bash file for the job that has

     #!/bin/bash
     CUDA_VISIBLE_DEVICES=XXX jobscript1.sh &
     CUDA_VISIBLE_DEVICES=YYY jobscript2.sh &
     ...
     wait

     which would be a meta job that consumes a full node.
  3. queue these jobs.
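
The three steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions: `GPUS_PER_NODE`, `chunk`, and `build_meta_job` are hypothetical names rather than part of queue_cc, GPU indices are assumed to start at 0, each job script is assumed to be runnable with `bash`, and the queueing step is left out because the thread does not specify the submission command. The generated script ends with the bash builtin `wait` so that the meta job stays alive until every background job has finished.

```python
# Hypothetical sketch; none of these names exist in queue_cc.
GPUS_PER_NODE = {"cedar": 4, "graham": 2}  # values from the comment above


def chunk(job_scripts, size):
    """Split the list of job scripts into groups of at most `size`."""
    return [job_scripts[i:i + size] for i in range(0, len(job_scripts), size)]


def build_meta_job(job_scripts):
    """Return the text of a bash script that runs each job on its own GPU."""
    lines = ["#!/bin/bash"]
    for gpu_id, script in enumerate(job_scripts):
        # Assumption: GPU indices on the node start at 0.
        lines.append(f"CUDA_VISIBLE_DEVICES={gpu_id} bash {script} &")
    lines.append("wait")  # block until every background job has finished
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    jobs = [f"jobscript{i}.sh" for i in range(1, 7)]
    for group in chunk(jobs, GPUS_PER_NODE["cedar"]):
        print(build_meta_job(group))
```

A final group smaller than N would leave some GPUs idle; whether to submit it anyway or hold it back until more jobs arrive is a policy choice this sketch does not make.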

kmyi avatar Dec 07 '18 17:12 kmyi

Sure, my pleasure to do it. But I have a couple of questions:

  1. Do you mean submitting the jobs in the array style? Could you please point me to the specific documentation for the new way of arranging jobs?
  2. From my understanding, taking cedar as an example:

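import threading

GPUS_PER_NODE = 4  # cedar
batch_jobs = []
batch_ready = threading.Condition()  # guards batch_jobs

# job_ids, batch_job_is_running, get_next_batch_job and
# send_batch_jobs_to_cedar are placeholders, not yet defined.

def next_batch_job(job_id):
    # Return the next batch job for job_id, or None if one is still running.
    if batch_job_is_running(job_id):
        return None
    return get_next_batch_job(job_id)

def schedule_batch_jobs():
    while True:
        for job_id in job_ids:
            job = next_batch_job(job_id)
            if job is None:
                continue
            with batch_ready:
                batch_jobs.append(job)
                if len(batch_jobs) >= GPUS_PER_NODE:
                    batch_ready.notify()  # wake the grabber

def grab_batch_jobs():
    while True:
        with batch_ready:
            batch_ready.wait_for(lambda: len(batch_jobs) >= GPUS_PER_NODE)
            group = batch_jobs[:GPUS_PER_NODE]
            del batch_jobs[:GPUS_PER_NODE]
        yield group

def submit_full_node_job():
    threading.Thread(target=schedule_batch_jobs, daemon=True).start()
    for group in grab_batch_jobs():
        # My question here is how to submit such a full-node job: should I request 4 GPUs and more memory?
        send_batch_jobs_to_cedar(group)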

wsunid avatar Dec 07 '18 20:12 wsunid

So, for example, it'll be a single job containing multiple job executions inside. We want to do it this way so that we can assign each job a specific GPU. Array submission probably can't support that.

In short, we submit a fake job that uses all four GPUs (e.g. on cedar), which internally just runs four jobs in parallel, one per GPU.
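
On the resource question, the fake full-node job might carry a header like the one built below. This is a sketch that assumes the clusters are scheduled with Slurm, which this thread does not state; `full_node_header` is a hypothetical helper, and the memory and time values in the usage lines are illustrative only, not recommendations.

```python
# Hypothetical sketch, assuming a Slurm scheduler (not stated in this thread).
GPUS_PER_NODE = {"cedar": 4, "graham": 2}  # values from the first comment


def full_node_header(cluster, mem, time_limit):
    """Build #SBATCH lines that request every GPU on one node of `cluster`."""
    gpus = GPUS_PER_NODE[cluster]
    return "\n".join([
        "#!/bin/bash",
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",    # one GPU per inner job
        f"#SBATCH --mem={mem}",          # must cover all inner jobs together
        f"#SBATCH --time={time_limit}",  # must cover the longest inner job
    ]) + "\n"


if __name__ == "__main__":
    # Illustrative values only.
    print(full_node_header("cedar", mem="120G", time_limit="03:00:00"))
```

The per-GPU `CUDA_VISIBLE_DEVICES=... jobscript.sh &` lines and the closing `wait` would follow this header in the same file.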

kmyi avatar Dec 08 '18 05:12 kmyi