
-l seems to go into deadlock

Open haofengac opened this issue 3 years ago • 7 comments

Hi, I'm able to train models with -g, but when I try to train on my local machine with multiple GPUs using -l, the model doesn't start training after the hydra commands are printed, and a few jobs are spawned only on GPU 0. It looks like a deadlock caused by submitit/hydra. Do you know what is going on here? Any help is greatly appreciated!

haofengac avatar Jan 27 '22 01:01 haofengac

Can you check your RAM? Maybe the loaded videos are filling it up completely.

Anirudh257 avatar Jan 30 '22 17:01 Anirudh257

Hi @haofengac, apologies for the delay in responding. When running locally, no logs will be printed to the shell -- they get stored in the output directory (which should be somewhere under BASE_RUN_DIR in launch.py). However, all the GPUs should still get used; maybe look at the stored stdout/stderr logs to see if something went wrong?

rohitgirdhar avatar Jan 31 '22 16:01 rohitgirdhar

Hi @rohitgirdhar, thanks for the reply! I ran the command CUDA_VISIBLE_DEVICES=0,1,2,3 python launch.py -c expts/09_ek55_avt.txt -l. In the logs, I only got these 4 warnings:

[2022-02-01 15:22:12,259][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
  warnings.warn(message)

[2022-02-01 15:22:12,259][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
  warnings.warn(message)

[2022-02-01 15:22:12,548][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
  warnings.warn(message)

[2022-02-01 15:22:12,553][py.warnings][WARNING] - /sailhome/haofeng/anaconda3/envs/avt/lib/python3.7/site-packages/torchvision/__init__.py:78: UserWarning: video_reader video backend is not available. Please compile torchvision from source and try again
  warnings.warn(message)

The output of nvidia-smi is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   4127MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    773590      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773591      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773592      C   ...onda3/envs/avt/bin/python     1031MiB |
|    0   N/A  N/A    773593      C   ...onda3/envs/avt/bin/python     1031MiB |
+-----------------------------------------------------------------------------+

It seems that all four jobs are spawned on the first GPU even though 4 are available, and it stays stuck like this indefinitely. Do you know what's going on here? Thanks!
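For reference, this symptom (every spawned process allocating memory on GPU 0) is often a sign that each worker process never pins itself to its own device. Below is a minimal, hypothetical sketch -- not AVT's actual launcher code -- of how per-rank GPU pinning usually looks in a submitit/DDP-style setup; the rendezvous address and the toy model are made up for illustration.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def worker_main(local_rank: int, world_size: int):
    # Pin this process to its own GPU *before* any CUDA work; if this step
    # is skipped (or local_rank is always 0), every rank defaults to cuda:0,
    # which matches the nvidia-smi output above (four processes on GPU 0).
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # hypothetical rendezvous address
        rank=local_rank,
        world_size=world_size,
    )
    model = torch.nn.Linear(16, 16).cuda(local_rank)  # toy model for illustration
    model = DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop would go here ...
    dist.destroy_process_group()

Whether this is actually what goes wrong in the local launch path here is not confirmed; it is just the usual explanation for this nvidia-smi pattern.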

haofengac avatar Feb 01 '22 23:02 haofengac

@haofengac you can check the logs in the .submitit folder (inside the OUTPUT directory): log.err for errors and log.out for other logs. As for the GPU usage, I'm not sure why the distributed training is not working :/

sanketsans avatar Feb 03 '22 12:02 sanketsans

Hi @haofengac, unfortunately I don't have a good sense of what might be going on either. Maybe try submitting a 1-GPU job and see if that at least works (prints logs and starts training)?

rohitgirdhar avatar Feb 06 '22 22:02 rohitgirdhar

@haofengac Hi, I'm facing the same issue as you. Since I tried installing torchvision from source, I'm now facing another issue (I have to install PyTorch version v1.12.1-rc5 but have no idea how to do that). Did you solve your problem? Training locally with several GPUs is essential for me, but I can't solve this problem.

jinwooahnKHU avatar Sep 30 '22 12:09 jinwooahnKHU

@haofengac @rohitgirdhar hi, I solved this issue by changing num_workers to 0. If you change workers: 10 to 0 in AVT/conf/data/default.yaml, you can use multiple GPUs locally. But you still have to read the logs from the OUTPUT directory.
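For context, a minimal, self-contained sketch (not AVT's code) of what this setting controls: with num_workers=0 the DataLoader loads batches in the main process instead of forking worker processes, which can sidestep hangs that occur when workers are spawned inside an already-multiprocess launch. The tiny TensorDataset below is just a stand-in for the real video dataset built from the config.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real config would build the video dataset instead.
dataset = TensorDataset(torch.randn(32, 3), torch.randint(0, 2, (32,)))

# num_workers=0 corresponds to the `workers: 0` change above: all loading
# happens in the main process, so no extra worker processes are forked.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for inputs, labels in loader:
    pass  # a training step would consume the batch here

Note that num_workers=0 trades away data-loading parallelism, so it may slow training down; it is a workaround rather than a root-cause fix.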

wasabipretzel avatar Oct 17 '22 01:10 wasabipretzel