
Multiple GPU on the same server

Open surajkamal opened this issue 8 years ago • 2 comments

This handy tool is a great addition to dask-distributed, especially for those who are using distributed TensorFlow. It would be much better if there were an option to specify the GPU resources used by each Dask task, for situations where multiple workers reside on the same machine. When a TensorFlow server initializes, it grabs all of the available GPU RAM on every GPU card on the machine by default (as mentioned here: https://stackoverflow.com/a/34776814). I have managed to achieve this by just adding:

import os
from queue import Queue

import tensorflow as tf

def start_and_attach_server(spec, job_name=None, task_index=None, dask_worker=None):
    # Restrict this TensorFlow server to a single GPU so that several
    # workers sharing the same machine don't each grab all GPU memory.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(task_index)
    server = tf.train.Server(spec, job_name=job_name, task_index=task_index)
    dask_worker.tensorflow_server = server
    dask_worker.tensorflow_queue = Queue()
    return 'OK'

but this is rudimentary: it does not check whether the tasks are actually on the same node, which is essential, and there is no clean interface for specifying that either. Is it possible to add such functionality in a better way, so that many TensorFlow tasks can share the same node?

surajkamal avatar Jul 14 '17 07:07 surajkamal

Do these docs suffice? http://distributed.readthedocs.io/en/latest/resources.html
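
For context, a minimal sketch of the pattern those docs describe, assuming one Dask worker is started per GPU and pinned with CUDA_VISIBLE_DEVICES (the scheduler address and train_on_gpu function below are placeholders):

# Start one worker per GPU, each advertising a "GPU" resource, e.g.:
#     CUDA_VISIBLE_DEVICES=0 dask-worker scheduler-address:8786 --resources "GPU=1"
#     CUDA_VISIBLE_DEVICES=1 dask-worker scheduler-address:8786 --resources "GPU=1"

from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder scheduler address

def train_on_gpu(x):
    # placeholder for work that needs exactly one GPU
    return x

# The scheduler only runs this task on a worker with a free "GPU" resource,
# so several TensorFlow tasks can share a node without contending for devices.
future = client.submit(train_on_gpu, 1, resources={'GPU': 1})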

mrocklin avatar Jul 17 '17 21:07 mrocklin

Excellent, yes this will do. I was only looking at having dask-tensorflow, or rather the TensorFlow code itself, constrain resources inline. However, constraining resources at the scheduler level is much more straightforward. Nice work, thank you.

surajkamal avatar Jul 20 '17 09:07 surajkamal