
Expect epoch time for training

Open callumparr opened this issue 3 years ago • 2 comments

I am running bonito train test on a node that has 2x Tesla T4 GPUs. I tried running with --multi-gpu, but it would run out of memory even when I reduced the batch size, and watching nvidia-smi it never seemed to fill the available memory (2x 16 GB).

In any case, it's now running on one card. Does this epoch time seem reasonable? I've done some comparisons on benchmark sites between the T4 and V100 and I think it's OK. I guess there isn't much else I can maximize, as it's filling the CUDA cores and memory right now:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   68C    P0    65W /  70W |  14200MiB / 15109MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
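(For continuous monitoring, a standard nvidia-smi query can log per-GPU memory and utilization at a fixed interval; the field list and the 5-second interval below are just an example, not anything bonito-specific.)

nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 5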

(bonito) callum@dgt-gpu2:~$ bonito train test --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1
[loading data]
[loading model]
[1221470/1221470]: 100%|###################################################| [15:16:24, loss=0.1108]
[epoch 1] directory=test loss=0.1126 mean_acc=95.209% median_acc=95.789%
[1221470/1221470]: 100%|###################################################| [14:35:58, loss=0.0956]
[epoch 2] directory=test loss=0.0960 mean_acc=95.938% median_acc=96.520%
[1221470/1221470]: 100%|###################################################| [14:28:41, loss=0.0856]
[epoch 3] directory=test loss=0.0880 mean_acc=96.306% median_acc=96.886%
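(For concreteness, a sketch of the two-card invocation described above; the --batch flag name and the value 32 are assumptions, so check bonito train --help for the options available in your install.)

bonito train test \
    --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1 \
    --multi-gpu \
    --batch 32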

callumparr · Apr 01 '21

Hi Callum,

I noticed that you have the same CUDA/driver versions as I have. You also appear to be using a Conda environment. I’m having trouble getting my installation to work. Would you mind sharing what you did to make the environment?

Thanks, Torben


tnn111 · Apr 01 '21


I remember having various issues with conflicts from pip, so I set up a conda environment with Python 3.8 and it seemed to work OK from then on. Another issue was that I had to use a node with CUDA >= 11.0.

After that, basecalling worked fine. For training, I had an issue with the download command, but I used convert to get the inputs for training and then it worked.
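(A minimal sketch of that kind of setup, assuming a fresh conda environment and a pip install of bonito; the package name ont-bonito and the convert paths below are assumptions rather than the exact commands used in this thread.)

# Fresh environment with Python 3.8, as described above
conda create -n bonito python=3.8
conda activate bonito
pip install ont-bonito    # pin a version here if pip reports dependency conflicts

# If `bonito download` misbehaves, convert existing training data instead;
# the exact arguments vary by bonito version, see `bonito convert --help`
bonito convert /path/to/raw-training-data /path/to/converted-data

# Train as in the thread above, pointing --directory at the converted data
bonito train test --directory /path/to/converted-data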

callumparr · Apr 02 '21