ISeeCube triggering an exception
Describe the bug
Loading a model via from_config() fails with RuntimeError: operator torchvision::nms does not exist. The error is raised while importing the recently added ISeeCube model, which is pulled in when get_all_grapnet_classes imports every graphnet submodule.
To Reproduce
Run a script that calls the get_all_grapnet_classes function; for example, running
graphnet/examples/04_training/03_train_dynedge_from_config.py
triggers the error for me.
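For reference, a minimal sketch of the failing code path (the exact argument list of get_all_grapnet_classes is my assumption; the call mirrors what StandardModel.from_config ends up doing):

# Sketch, not necessarily the exact graphnet API: collecting all graphnet
# classes imports every submodule, which pulls in ISeeCube and, through it,
# torchscale -> timm -> torchvision.
import graphnet.models
from graphnet.utilities.config.parsing import get_all_grapnet_classes

# On the broken install this raises:
# RuntimeError: operator torchvision::nms does not exist
classes = get_all_grapnet_classes(graphnet.models)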
Expected behavior
No error message; the script runs as usual.
Error message
Exception has occurred: RuntimeError
operator torchvision::nms does not exist
Additional context
The error occurs during the from torchscale.architecture.encoder import Encoder call in the relatively recently added ISeeCube model. It happens when running model.from_config(), which collects all classes via the get_all_grapnet_classes function (graphnet/src/graphnet/utilities/config/parsing.py). I suspect a missing dependency or something of the sort, but I cannot say for sure, and the error message is somewhat ambiguous.
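The chain can also be checked outside of graphnet; a quick sketch (same environment as above) is to run the imports from the failing call one at a time and see where it breaks:

# Each import narrows down where the failure comes from; on my setup the very
# first one already raises
# RuntimeError: operator torchvision::nms does not exist
import torchvision                                   # torchvision's own __init__ fails
from timm.models.layers import drop_path             # what torchscale imports
from torchscale.architecture.encoder import Encoder  # what iseecube.py imports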
I also just noticed the typo in get_all_grapnet_classes ...
I have also checked that I do not have conflicting versions of the NVIDIA libraries:
nvidia-cublas-cu11        11.11.3.6
nvidia-cuda-cupti-cu11    11.8.87
nvidia-cuda-nvrtc-cu11    11.8.89
nvidia-cuda-runtime-cu11  11.8.89
nvidia-cudnn-cu11         8.7.0.84
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.3.0.86
nvidia-cusolver-cu11      11.4.1.48
nvidia-cusparse-cu11      11.7.5.86
nvidia-ml-py              12.550.52
nvidia-nccl-cu11          2.19.3
nvidia-nvtx-cu11          11.8.86
And the torch libraries:
torch            2.2.0+cu118
torch_cluster    1.6.3+pt22cu118
torch_geometric  2.5.3
torch_scatter    2.1.2+pt22cu118
torch_sparse     0.6.18+pt22cu118
torchmetrics     1.4.0
torchscale       0.2.0
torchvision      0.17.0+rocm5.7
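(A small sketch of how the installed builds can be re-checked from package metadata, without having to import torchvision:)

# Read the installed wheel versions from package metadata; this is how the
# list above can be reproduced.
from importlib.metadata import version

for pkg in ("torch", "torchvision"):
    print(pkg, version(pkg))  # e.g. torch 2.2.0+cu118 / torchvision 0.17.0+rocm5.7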
Hi, I'm able to reproduce the bug, and here's the full error report. I'm working on it.
graphnet [MainProcess] WARNING 2024-05-15 09:26:04 - has_icecube_package - `icecube` not available. Some functionality may be missing.
graphnet [MainProcess] INFO 2024-05-15 09:26:08 - NodesAsPulses.__init__ - Writing log to logs/graphnet_20240515-092608.log
-------------------------------------------------------------------------------
03_train_dynedge_from_config.py 153 <module>
main(
03_train_dynedge_from_config.py 49 main
model: StandardModel = StandardModel.from_config(model_config, trust=True)
model.py 108 from_config
return source._construct_model(trust, load_modules)
model_config.py 126 _construct_model
namespace_classes = get_all_grapnet_classes(
parsing.py 59 get_all_grapnet_classes
submodules = list_all_submodules(*packages)
parsing.py 38 list_all_submodules
return list(
parsing.py 49 list_all_submodules
module = __import__(module_name, fromlist="dummylist")
__init__.py 3 <module>
from .iseecube import ISeeCube
iseecube.py 14 <module>
from torchscale.architecture.encoder import Encoder
encoder.py 16 <module>
from torchscale.component.droppath import DropPath
droppath.py 5 <module>
from timm.models.layers import drop_path
__init__.py 2 <module>
from .models import create_model, list_models, is_model, list_modules, model_entrypoint, \
__init__.py 1 <module>
from .byoanet import *
byoanet.py 15 <module>
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
__init__.py 7 <module>
from .loader import create_loader
loader.py 12 <module>
from .transforms_factory import create_transform
transforms_factory.py 9 <module>
from torchvision import transforms
__init__.py 6 <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
_meta_registrations.py 164 <module>
def meta_nms(dets, scores, iou_threshold):
library.py 440 inner
handle = entry.abstract_impl.register(func_to_register, source)
abstract_impl.py 30 register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError:
operator torchvision::nms does not exist
The problem is with torchvision 0.17.0+rocm5.7. I re-installed the CUDA version and now it works:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118
Please refer to this issue.
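After the reinstall, a quick sanity check (a sketch) is that both packages report matching +cu118 builds and the previously missing operator imports cleanly:

# Both wheels should now report CUDA builds and the nms operator should be registered.
import torch
import torchvision
from torchvision.ops import nms  # previously failed to register

print(torch.__version__)        # 2.2.0+cu118
print(torchvision.__version__)  # 0.17.0+cu118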
This is the output now after running graphnet/examples/04_training/03_train_dynedge_from_config.py, which looks correct:
graphnet [MainProcess] WARNING 2024-05-15 10:14:36 - has_icecube_package - `icecube` not available. Some functionality may be missing.
graphnet [MainProcess] INFO 2024-05-15 10:14:39 - NodesAsPulses.__init__ - Writing log to logs/graphnet_20240515-101439.log
graphnet [MainProcess] WARNING 2024-05-15 10:14:40 - _validate_and_set_transforms - Setting one of `transform_target` and `transform_inference`, but not the other.
graphnet [MainProcess] INFO 2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 == 0
graphnet [MainProcess] INFO 2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 == 1
graphnet [MainProcess] INFO 2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 > 1
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
graphnet [MainProcess] INFO 2024-05-15 10:14:40 - StandardModel._create_default_callbacks - EarlyStopping has been added with a patience of 5.
graphnet [MainProcess] INFO 2024-05-15 10:14:40 - StandardModel._print_callbacks - Training initiated with callbacks: ProgressBar, EarlyStopping, ModelCheckpoint
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/setup.py:187: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
2024-05-15 10:14:41.374318: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 10:14:41.374379: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 10:14:41.376053: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-15 10:14:42.538051: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
| Name | Type | Params
-------------------------------------------------
0 | _graph_definition | KNNGraph | 0
1 | backbone | DynEdge | 1.4 M
2 | _tasks | ModuleList | 129
-------------------------------------------------
1.4 M Trainable params
0 Non-trainable params
1.4 M Total params
5.515 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Epoch 0: 100% 2/2 [00:01<00:00, 1.97 batch(es)/s, lr=5.95e-5]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/1 [00:00<?, ? batch(es)/s]
Validation DataLoader 0: 0% 0/1 [00:00<?, ? batch(es)/s]
Validation DataLoader 0: 100% 1/1 [00:00<00:00, 11.92 batch(es)/s]
Epoch 0: 100% 2/2 [00:01<00:00, 1.67 batch(es)/s, lr=5.95e-5, val_loss=0.0482, train_loss=0.0361]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100% 2/2 [00:01<00:00, 1.60 batch(es)/s, lr=5.95e-5, val_loss=0.0482, train_loss=0.0361]
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - StandardModel.fit - Best-fit weights from EarlyStopping loaded.
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - main - Writing results to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - StandardModel.save_state_dict - Model state_dict saved to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example/state_dict.pth
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - StandardModel.save - Model saved to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example/model.pth
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - main - config.target: ['total_energy']
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - main - prediction_columns: ['energy_pred']
graphnet [MainProcess] INFO 2024-05-15 10:14:45 - StandardModel.predict_as_dataframe - Column names for predictions are:
['energy_pred']
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100% 1/1 [00:00<00:00, 6.34 batch(es)/s]
Maybe the requirements folder should be modified somehow? I followed https://graphnet-team.github.io/graphnet/installation/install.html#quick-start (PyTorch 2.2.*, Linux, 11.8) and reproduced this bug. @RasmusOrsoe
I found that adding the line torchvision==0.17.0+cu121 to the requirements file (here the CUDA 12.1 version) enforces the CUDA build of torchvision.
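For illustration, such a pin could look like the following requirements snippet (the --extra-index-url line is my assumption about how it could be wired in, not necessarily how the graphnet requirements files are set up):

# hypothetical requirements entries forcing the CUDA 12.1 build of torchvision
--extra-index-url https://download.pytorch.org/whl/cu121
torchvision==0.17.0+cu121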
Wonderful.
I can confirm that this error happened for a small subset of participants at the workshop, and it appeared that the installation process (for some still unknown reason) defaulted to the ROCm version of torchvision. Most people did not encounter this issue. @Aske-Rosted I think your solution (specifically pinning torchvision in the requirements files) is the right solution to the problem. Would you mind making a PR with your changes?
Issue closed by #719