AVT
-l seems to go into deadlock
Hi, I'm able to train models with -g, but when I try to train on my local machine with multiple GPUs using -l, the model doesn't start training after the hydra commands are printed, and several jobs are spawned, all on GPU 0. It seems to be some deadlock caused by submitit/hydra. Do you know what is going on here? Any help is greatly appreciated!
Can you check your RAM? Maybe the loaded videos are filling it up completely.
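For reference, something like the following (just a rough sketch, Linux-only since it reads /proc/meminfo) can be left running in another shell to watch free host RAM while the job appears stuck:

# Rough sketch: poll available host RAM while the job looks stuck.
# Linux-only (reads /proc/meminfo); run from a separate shell.
import time

def available_ram_gib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # value is reported in kB
    return float("nan")

while True:
    print(f"MemAvailable: {available_ram_gib():.1f} GiB")
    time.sleep(5)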
Hi @haofengac, apologies for the delay in responding. When running locally, no logs will be printed to the shell -- they get stored in the output directory (which should be somewhere under BASE_RUN_DIR in launch.py). However, all the GPUs should still get used; maybe look at the stored stdout/stderr logs to see if something went wrong?
Hi @rohitgirdhar, thanks for the reply! I launched with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python launch.py -c expts/09_ek55_avt.txt -l. In the logs, I only got these 4 warnings:
[2022-02-01 15:22:12,259][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
warnings.warn(message)
[2022-02-01 15:22:12,259][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
warnings.warn(message)
[2022-02-01 15:22:12,548][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
warnings.warn(message)
[2022-02-01 15:22:12,553][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
warnings.warn(message)
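(For what it's worth, those warnings only concern the torchvision video backend and may or may not be related to the hang. A quick check like the sketch below shows which backend is actually active:)

# Sketch: confirm which torchvision video backend is active in this environment.
import torchvision

print(torchvision.get_video_backend())        # typically 'pyav' when video_reader isn't compiled in
torchvision.set_video_backend("video_reader")  # reproduces the warning above if it is unavailable
print(torchvision.get_video_backend())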
The output of nvidia-smi is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   4127MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    773590      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773591      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773592      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773593      C   ...onda3/envs/avt/bin/python     1031MiB |
+-----------------------------------------------------------------------------+
It seems that all four jobs are spawned on the first GPU even though 4 are available, and it stays stuck like this forever. Do you know what's going on here? Thanks!
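For context, every rank allocating memory on GPU 0 is the usual symptom of spawned processes defaulting to cuda:0 instead of being pinned to their own device. In a generic PyTorch DDP setup, that pinning looks roughly like the sketch below (illustrative only -- this is not the actual AVT/submitit launch code):

# Illustrative sketch of per-rank GPU pinning in a generic PyTorch DDP setup.
# NOT the AVT/submitit code path -- just the pattern whose absence (or failure)
# makes every spawned rank allocate its tensors on cuda:0.
import torch
import torch.distributed as dist

def setup_device(local_rank: int) -> torch.device:
    # Pin this process to its own GPU *before* creating any CUDA tensors.
    torch.cuda.set_device(local_rank)
    # Rank and world size come from environment variables set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    return torch.device("cuda", local_rank)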
@haofengac you can check the logs in the .submitit folder (inside the OUTPUT directory): log.err for errors and log.out for everything else. As for the GPU usage, I'm not sure why the distributed training is not working :/
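For example, something along these lines (a sketch; OUTPUT_DIR is a placeholder for wherever launch.py put this run) dumps the tail of every *.err / *.out log it finds, including the log.err / log.out files mentioned above:

# Sketch: print the last lines of every submitit log under the run's output dir.
# OUTPUT_DIR is a placeholder -- point it at the experiment's output folder.
from pathlib import Path

OUTPUT_DIR = Path("/path/to/experiment/output")

for log in sorted(OUTPUT_DIR.rglob("*.err")) + sorted(OUTPUT_DIR.rglob("*.out")):
    print(f"===== {log} =====")
    print("\n".join(log.read_text(errors="replace").splitlines()[-20:]))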
Hi @haofengac, unfortunately I don't have a good sense of what might be going on either. Maybe try submitting a 1-GPU job and see if that at least works (prints logs and starts training)?
@haofengac Hi, I'm facing the same issue as you. I've tried installing torchvision from source, but ran into another problem (I would have to install the pytorch v1.12.1-rc5 release candidate, and I have no idea how to do that). Did you solve your problem? Training locally with several GPUs is essential for me, but I can't get past this.
@haofengac @rohitgirdhar hi, I've solved this issue by changing num_workers to 0. If you change workers: 10 to workers: 0 in AVT/conf/data/default.yaml, you can use multiple GPUs locally. But you still have to check the logs in the OUTPUT directory.