Platform to run this application
Where do I run this application? I am currently using Windows with a single GPU, so in one terminal I am running the server and in two separate terminals I am running the examples. If the wait argument is passed, the program doesn't move forward; if I remove the wait argument, it moves, but the UI doesn't show the allocation or anything else.
Some clarification would be very much appreciated.
Hi there, thank you very much for using this software~ Currently I only test the functionalities on Linux (Ubuntu distribution), but it should work on Windows.
I'm not sure about the wait argument you are using. Is it the WatchClient.wait() method?
You mentioned there's only one GPU in the environment; was there any GPU memory consumption, or were any processes running on that GPU? Currently a GPU can only be allocated to a client if it is completely free:
https://github.com/Spico197/watchmen/blob/6b2456755b6fba91f21c86fec72815d90a05c794/watchmen/listener.py#L16-L18
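The linked check essentially treats a GPU as allocatable only when it reports no memory used and no running processes. A minimal sketch of that rule (the dict shape imitates gpustat's per-GPU output; the field names here are assumptions for illustration, not watchmen's actual code):

```python
# Sketch of the "completely free" condition: a GPU is allocatable only when
# it reports zero memory used and has no running processes.
# The dict layout imitates gpustat JSON output; field names are assumptions.
def is_completely_free(gpu: dict) -> bool:
    return gpu.get("memory.used", 0) == 0 and not gpu.get("processes", [])

idle = {"index": 0, "memory.used": 0, "processes": []}
busy = {"index": 0, "memory.used": 1024, "processes": [{"pid": 4242}]}
print(is_completely_free(idle))  # → True
print(is_completely_free(busy))  # → False
```

So on a single-GPU machine, any leftover process holding GPU memory would prevent the server from ever handing that GPU to a client.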
Besides, was the client registered on the server successfully? You may find the relevant information in the printed log.
If you have some time, we could connect regarding this. I am really looking forward to making a contribution to this repo and adding it to my thesis, so do let me know if you can spare some time for this. My email: [email protected]
Hey, here is how I am running your application:
- Terminal 1: running `watchmen.server`
- Terminal 2: `python single_card_mnist.py --id="single" --cuda=0 --wait --wait_mode="query"`
- Terminal 3: `python single_card_mnist.py --id="single_schedule" --cuda=0 --wait --wait_mode="schedule"`
When I run all three terminals, a terminal stays in its state for a while even after it says the training is completed and the final accuracy is printed; only after quite a while does the other terminal's processing start. (I presume a duplicate process is being created.) The same happens for the other terminal as well. Correct me if I'm wrong, but the frontend is not being updated even though the terminal says the training is completed.
I would also like to know more about `--wait_mode="queue"` and `--wait_mode="schedule"`: what difference does it make?
And I have no idea about the wait argument that you mentioned in the README file.
Hi there, thanks for your valuable feedback!
- Ideally, the server checks GPU status every second and tries to assign available GPUs to clients every 5 seconds. A client pings the server every 10 seconds to see if there are available GPUs.
- Clients in `queue` mode would wait until the specific GPU is free to use (in the example, `cuda:0`). Clients in `schedule` mode may be assigned to another GPU (one GPU from 0, 2, and 3).
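The difference between the two modes can be sketched as follows (a toy illustration, not watchmen's actual code; the function name and the allowed-GPU set are taken from the example above):

```python
# queue mode: wait for one specific GPU; schedule mode: accept any GPU from
# an allowed set. Purely illustrative sketch of the behavior described above.
def pick_gpu(mode, requested, available, allowed=(0, 2, 3)):
    """Return a GPU id to run on, or None if the client must keep waiting."""
    if mode == "queue":
        # Only the requested GPU will do (cuda:0 in the example above).
        return requested if requested in available else None
    if mode == "schedule":
        # Any free GPU from the allowed set is acceptable.
        return next((g for g in allowed if g in available), None)
    raise ValueError(f"unknown wait mode: {mode}")

print(pick_gpu("queue", 0, available={2, 3}))     # → None (keep waiting)
print(pick_gpu("schedule", 0, available={2, 3}))  # → 2
```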
Test case 1: queue vs. schedule
Test case 2: queue and schedule on the same GPU
> it stays in that state for a while even after the terminal says the training is completed
I've understood what's going on from your point of view. You said the scheduling server would still be waiting even though the job had finished. However, a client is not designed or required to send an "I'm finished" signal to the server, so the server waits until `queue_timeout` is triggered.
Here's why we need the `queue_timeout` mechanism: a job may take time to start (downloading datasets, pretrained models, etc.), so the GPU is not occupied immediately when the job starts training. We therefore set a `queue_timeout` so the server waits long enough for the client to load models or datasets onto the GPU, which indicates it is running. When a job finishes, the server still waits for another `queue_timeout` (10 minutes by default), and this causes the time gap you observed.
A possible solution may be adding another client status, RUNNING: when the GPU becomes available again after a job has run on it, the server could skip the `queue_timeout` and directly assign the GPU to the next job.
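The proposed RUNNING status could look roughly like this (a sketch of the idea, not an implementation inside watchmen; the status names and helper are mine, and the 600-second default comes from the discussion above):

```python
from enum import Enum

# Sketch of the proposed fix: track whether a client's job has actually
# started consuming GPU, so a freed GPU can be reassigned without waiting
# out the grace period again.
class ClientStatus(Enum):
    QUEUED = "queued"    # registered, still waiting for a GPU
    RUNNING = "running"  # proposed new status: job is occupying its GPU

QUEUE_TIMEOUT = 600  # default grace period: 10 minutes

def reassignment_delay(status: ClientStatus) -> int:
    """Seconds the server keeps a freed GPU reserved before reassigning it."""
    if status is ClientStatus.RUNNING:
        # The job already reached the GPU; once the GPU frees up, the job
        # must be done, so the server can skip queue_timeout entirely.
        return 0
    # A queued job may still be downloading data/models, so keep waiting.
    return QUEUE_TIMEOUT

print(reassignment_delay(ClientStatus.RUNNING))  # → 0
print(reassignment_delay(ClientStatus.QUEUED))   # → 600
```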
Or maybe we could optimize the logic in `watchmen/server/check_work.py` to make an instant GPU assignment without waiting out the last job's queue time.
Thanks for your great explanation and for taking the time to test my scenario.
Some more queries from my side:
- What scheduling algorithm have you implemented in the application?
- I am very willing to test the application using multiple GPUs (can you suggest a platform like the one you are running on?).
- Do you have multiple GPUs in your system, or are you running the application on an instance in AWS or another cloud?
- When I run it on an EC2 instance, the URL opens (I used Flask-Ngrok), but neither the processes nor the other details are shown; instead I see a red "Error" with empty table values.

Help me out with regard to this.
And thanks once again for your time. Really grateful! Let me know if I can be of any help to you.
- Well, the implementation is rather primitive here. It just loops and checks whether the GPUs are available for a job.
https://github.com/Spico197/watchmen/blob/6b2456755b6fba91f21c86fec72815d90a05c794/watchmen/server.py#L252
2-3. I'm running the experiments on a local cluster in my lab. However, I suggest not renting GPUs from cloud providers if you care about the cost. You could hook some fake APIs into the functions below to build a testing environment, which is the least costly way to test new functions.
https://github.com/Spico197/watchmen/blob/6b2456755b6fba91f21c86fec72815d90a05c794/watchmen/server.py#L17-L22
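For example, the GPU-query helper could be stubbed with canned data so the scheduling logic can be exercised on a machine without any GPUs (the field names imitate gpustat's JSON and are assumptions here, not watchmen's exact schema):

```python
# Stub replacement for a real gpustat query: pretend the machine has two
# GPUs, one idle and one busy. Field names are assumptions for illustration.
def fake_gpu_query():
    return [
        {"index": 0, "memory.used": 0, "memory.total": 11019, "processes": []},
        {"index": 1, "memory.used": 9000, "memory.total": 11019,
         "processes": [{"pid": 1234, "command": "python"}]},
    ]

def free_gpu_indices(gpus):
    """Indices of GPUs with no memory used and no running processes."""
    return [g["index"] for g in gpus
            if g["memory.used"] == 0 and not g["processes"]]

print(free_gpu_indices(fake_gpu_query()))  # → [0]
```

Swapping the real query for a stub like this lets you simulate GPUs becoming free or busy and watch how the scheduler reacts, with zero cloud cost.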
- Do you mean the website UI shows up fine, but the status is "Error"? That means the frontend is not connecting to the backend. You could try `curl http://localhost:62333/api` to test the connectivity.
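If curl isn't handy on the instance, a small Python probe does the same job (a hedged sketch; the `/api` path and port come from the message above, and the helper name is mine):

```python
# Minimal connectivity probe mirroring the curl check above: any HTTP
# response from the backend counts as "reachable".
from urllib.error import URLError
from urllib.request import urlopen

def backend_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP at `url` within `timeout`."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (URLError, OSError):
        return False
```

If this returns False on the EC2 instance while the UI still loads, one common cause is that the tunnel (e.g. ngrok) forwards only the frontend port, not the backend's.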