ubc-cluster-goodies
queue_cc for multi-gpu setup
Implement a new version of queue_cc that runs multiple jobs on a node, dedicating each to a GPU. This requires on-the-fly creation of job scripts.
Hey @weiweisun2018, not urgent, but if you can do this it would be great. Basically, the scheduler currently prefers jobs that can take full nodes. Since our jobs can be batched together, it'd be a good idea to:
1. Grab N jobs (4 for cedar, 2 for graham).
2. Create a new bash file for the job that has
#!/bin/bash
CUDA_VISIBLE_DEVICES=XXX jobscript1.sh &
CUDA_VISIBLE_DEVICES=YYY jobscript2.sh &
...
wait  # block until all background jobs finish
which would be a meta-job that consumes a full node.
3. Queue these jobs (a sketch of the script generation follows below).
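For concreteness, here is a minimal sketch of the on-the-fly script generation in Python, assuming SLURM's sbatch is the submission command (as on the Compute Canada clusters). The submit_meta_job helper name and the #SBATCH resource values are illustrative assumptions, not the final queue_cc interface:

import subprocess
import tempfile

def submit_meta_job(job_scripts):
    # Hypothetical helper: bundle one job script per GPU into a single
    # full-node job and submit it with sbatch.
    lines = [
        "#!/bin/bash",
        "#SBATCH --gres=gpu:4  # assumption: 4 GPUs per cedar GPU node",
    ]
    # Pin each job script to its own GPU via CUDA_VISIBLE_DEVICES
    # and launch them all in the background.
    for gpu_id, script in enumerate(job_scripts):
        lines.append(f"CUDA_VISIBLE_DEVICES={gpu_id} bash {script} &")
    # Hold the allocation until every background job has finished.
    lines.append("wait")

    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write("\n".join(lines) + "\n")
        meta_script = f.name
    subprocess.run(["sbatch", meta_script], check=True)

It would be called as, e.g., submit_meta_job(["jobscript1.sh", "jobscript2.sh", "jobscript3.sh", "jobscript4.sh"]) for a four-GPU cedar node, or with two scripts for graham.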
Sure, it would be my pleasure. But I have a couple of questions:
1. Do you mean submitting the jobs in array style? Could you please point me to documentation about this new way of arranging jobs?
2. Here is my understanding, taking cedar as an example:
import threading

batch_jobs = []

def check_ready_and_return_next_batch_job(job_id):
    # If no batch job for this job_id is running, hand back the next one.
    if not batch_job_is_running(job_id):
        return next_batch_job(job_id)

def schedule_batch_jobs():
    while True:
        for job_id in job_ids:
            job = check_ready_and_return_next_batch_job(job_id)
            if job is not None:
                batch_jobs.append(job)
            if len(batch_jobs) == 4:
                notify_grabber()

def grab_four_batch_jobs():
    while True:
        wait_for_scheduler()
        yield batch_jobs[:4]
        del batch_jobs[:4]

def submit_full_node_job():
    scheduler = threading.Thread(target=schedule_batch_jobs)
    scheduler.start()
    for four_jobs in grab_four_batch_jobs():
        send_to_cedar(four_jobs)  # My question here: to submit such a full-node job, should I request 4 GPUs and more memory?
So, for example, it'll be a single job containing multiple job executions inside. We want to do it this way so that we can assign each job a specific GPU; array submission probably can't support that.
In short, we submit a fake job that uses all four GPUs (e.g., on cedar), which internally just runs four jobs in parallel, one per GPU.
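One detail worth noting: CUDA_VISIBLE_DEVICES remaps device numbering per process, so each job script sees only its assigned GPU, exposed as device 0, and the job scripts themselves need no changes. A quick sanity check, assuming PyTorch is available on the node (the check_gpu.py filename is just for illustration):

import os
import torch

# Launched as: CUDA_VISIBLE_DEVICES=2 python check_gpu.py
# The process sees exactly one GPU, exposed as device 0, even though
# it is physically the third GPU on the node.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # -> "2"
print(torch.cuda.device_count())               # -> 1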