train_folder_ff does not utilize GPU
I am trying to train a force-field model using a variation of the following command from the README, adapted to match my directories:
train_folder_ff.py --root_dir "alignn/examples/sample_data_ff" --config "alignn/examples/sample_data_ff/config_example_atomwise.json" --output_dir=temp
However, training is very slow and does not seem to utilize the GPU at all. This can be confirmed by running nvidia-smi during training and viewing the output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 42C P8 13W / 170W | 71MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                           56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                          5MiB |
+---------------------------------------------------------------------------------------+
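As a quick sanity check (a minimal sketch, assuming the same conda environment that runs the training script), the following confirms whether PyTorch can see the GPU and run work on it at all:

```python
import torch

# Does the PyTorch build in this environment see a CUDA device at all?
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Run a small matmul on the GPU so utilization briefly shows up in nvidia-smi
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print("Memory allocated (MB):", torch.cuda.memory_allocated() / 1e6)
```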
If I am training a model that does not use force fields, the GPU is used.
For example, running train_folder.py --root_dir "alignn/examples/sample_data" --config "alignn/examples/sample_data/config_example.json" --output_dir=temp
and simultaneously running nvidia-smi gives the following output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 46C P2 62W / 170W | 921MiB / 12288MiB | 39% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                           56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                          5MiB |
|    0   N/A  N/A     29095      C   .../miniconda3/envs/version/bin/python      848MiB |
+---------------------------------------------------------------------------------------+
I have done my best to check that all the dependencies are compatible, and I can confirm that the device is set to cuda in the train_folder_ff.py script.
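For context, a minimal sketch of this kind of device check; the nn.Linear model here is only a placeholder for the ALIGNN force-field model that the script builds, and the variable names are assumptions:

```python
import torch
from torch import nn

# Placeholder model; in train_folder_ff.py the ALIGNN force-field model
# would be the object being checked instead.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 8).to(device)

# If this prints "cpu", the model was never moved to the GPU.
print("parameter device:", next(model.parameters()).device)

# Batches from the dataloader also need to be moved; a tensor left on the
# CPU will either raise a device-mismatch error or keep the work on the CPU.
batch = torch.randn(4, 8)
print("batch device after .to():", batch.to(device).device)
```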
Hi @rashigeek
What is the batch_size that you are using? Lower batch_size tends to under-utilize GPUs.
I have tried a wide range of batch sizes, even very large ones such as 1028, but performance was unaffected. I also tried passing batch_size as an argument, and the problem still persisted.
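For completeness, a minimal sketch of bumping batch_size in the example config before launching training; the "batch_size" key is assumed to match the example configs shipped with alignn, so verify it against your installed version:

```python
import json

# Hypothetical helper: raise batch_size in the sample force-field config.
cfg_path = "alignn/examples/sample_data_ff/config_example_atomwise.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["batch_size"] = 64  # larger batches generally improve GPU utilization
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```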
I'm having the same problem: the GPU is not used when running run_alignn_ff.py.