Expected epoch time for training
I am running `bonito train test`
on a node with 2 x Tesla T4 GPUs. I tried running with `--multi-gpu`, but it ran out of memory even after reducing the batch size, and watching nvidia-smi it never seemed to fill the available memory (2 x 16 GB).
In any case, it's running on one card. Does this time seem reasonable? I've done some comparisons on benchmark sites between the T4 and V100 and I think it's OK. I guess there isn't much else I can maximize, as it's saturating the CUDA cores and memory right now:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 68C P0 65W / 70W | 14200MiB / 15109MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:86:00.0 Off | 0 |
| N/A 31C P8 9W / 70W | 3MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
(bonito) callum@dgt-gpu2:~$ bonito train test --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1
[loading data]
[loading model]
[1221470/1221470]: 100%|###################################################| [15:16:24, loss=0.1108]
[epoch 1] directory=test loss=0.1126 mean_acc=95.209% median_acc=95.789%
[1221470/1221470]: 100%|###################################################| [14:35:58, loss=0.0956]
[epoch 2] directory=test loss=0.0960 mean_acc=95.938% median_acc=96.520%
[1221470/1221470]: 100%|###################################################| [14:28:41, loss=0.0856]
[epoch 3] directory=test loss=0.0880 mean_acc=96.306% median_acc=96.886%
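For reference, a two-GPU attempt with a reduced batch size can be sketched roughly as below. The `--batch` value and the `CUDA_VISIBLE_DEVICES` setting are illustrative assumptions, not commands taken from the thread; check `bonito train --help` for the exact flags in your installed version.

```shell
# Expose both T4s to the training process (assumed device ordering)
export CUDA_VISIBLE_DEVICES=0,1

# Same invocation as above, with multi-GPU enabled and a smaller
# batch size; halve --batch again if CUDA still runs out of memory
bonito train test \
    --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1 \
    --multi-gpu \
    --batch 32
```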
Hi Callum,
I noticed that you have the same CUDA/driver versions as I have. You also appear to be using a Conda environment. I'm having trouble getting my installation to work. Would you mind sharing what you did to set up the environment?
Thanks, Torben
I remember having various issues with conflicts from pip, so I set up a conda environment with Python 3.8 and it seemed to work OK from then on. Another requirement was using a node with CUDA >= 11.0.
From there, basecalling worked fine. For training, I had an issue with the download command, but I then used convert to get the inputs for training and it worked.
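The setup described above can be sketched as follows. This is my best guess at the commands involved, not a transcript: the pip package name, the `bonito download`/`bonito convert` arguments, and all paths are assumptions to verify against `bonito --help` for your version.

```shell
# Fresh conda environment with Python 3.8 to avoid pip dependency conflicts
conda create -n bonito python=3.8 -y
conda activate bonito
pip install ont-bonito   # run on a node with CUDA >= 11.0

# Fetch the pretrained models, then convert existing training chunks
# into the format bonito expects (arguments are illustrative)
bonito download --models
bonito convert /path/to/chunks.hdf5 /path/to/training-data
```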