
ISeeCube triggering an exception

Aske-Rosted opened this issue 1 year ago • 3 comments

Describe the bug Importing the recently added ISeeCube model (triggered whenever graphnet collects all classes for from_config) raises RuntimeError: operator torchvision::nms does not exist.

To Reproduce Run a script that calls the get_all_grapnet_classes function. For example, running graphnet/examples/04_training/03_train_dynedge_from_config.py triggers the error for me.

Expected behavior No error message; the script runs as usual.

Error message

Exception has occurred: RuntimeError
operator torchvision::nms does not exist

Additional context The error occurs during the from torchscale.architecture.encoder import Encoder call in the relatively recently added ISeeCube model. It happens when running model.from_config(), which goes through the get_all_grapnet_classes function (graphnet/src/graphnet/utilities/config/parsing.py). I suspect a missing dependency or something of the sort, but cannot say for sure; the error message is somewhat ambiguous.
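
For reference, the relevant pattern in parsing.py is roughly the following (a simplified sketch of what list_all_submodules does, not the exact implementation): it walks every graphnet submodule and imports it eagerly, so an import-time failure anywhere (here, iseecube.py pulling in torchscale -> timm -> torchvision) aborts the whole traversal, even though the model being loaded never uses ISeeCube.

import pkgutil

def list_all_submodules(package):
    """Recursively import every submodule of `package`.

    Each __import__ executes the module's top-level code, so any
    module whose imports fail raises here, regardless of whether
    the caller ever uses that module.
    """
    prefix = package.__name__ + "."
    return [
        __import__(module_name, fromlist="dummylist")
        for _, module_name, _ in pkgutil.walk_packages(package.__path__, prefix)
    ]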

Also, I just noted the typo in get_all_grapnet_classes ...

I have also checked that I do not have conflicting versions of the NVIDIA libraries:

nvidia-cublas-cu11        11.11.3.6
nvidia-cuda-cupti-cu11    11.8.87
nvidia-cuda-nvrtc-cu11    11.8.89
nvidia-cuda-runtime-cu11  11.8.89
nvidia-cudnn-cu11         8.7.0.84
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.3.0.86
nvidia-cusolver-cu11      11.4.1.48
nvidia-cusparse-cu11      11.7.5.86
nvidia-ml-py              12.550.52
nvidia-nccl-cu11          2.19.3
nvidia-nvtx-cu11          11.8.86

And the torch libraries:

torch            2.2.0+cu118
torch_cluster    1.6.3+pt22cu118
torch_geometric  2.5.3
torch_scatter    2.1.2+pt22cu118
torch_sparse     0.6.18+pt22cu118
torchmetrics     1.4.0
torchscale       0.2.0
torchvision      0.17.0+rocm5.7
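
A backend mismatch like the one above can be spotted without importing torchvision (which crashes in the broken environment), since the wheel versions encode the compute backend in their local suffix. A small diagnostic sketch of my own, not part of graphnet:

from importlib.metadata import version

import torch

# torch reports the backend it was compiled against; exactly one
# of these should be non-None.
print("torch CUDA build:", torch.version.cuda)  # e.g. "11.8"
print("torch ROCm build:", torch.version.hip)   # None on CUDA wheels

# Installed wheel versions, read from package metadata so that
# torchvision itself is never imported. The local suffixes
# ("+cu118" vs "+rocm5.7") should agree; otherwise torchvision's
# compiled operators (such as torchvision::nms) may never register.
print("torch:      ", version("torch"))        # 2.2.0+cu118
print("torchvision:", version("torchvision"))  # 0.17.0+rocm5.7 <- mismatch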

Aske-Rosted avatar May 15 '24 05:05 Aske-Rosted

Hi, I'm able to reproduce the bug, and here's the full error report. I'm working on it.

graphnet [MainProcess] WARNING  2024-05-15 09:26:04 - has_icecube_package - `icecube` not available. Some functionality may be missing.
graphnet [MainProcess] INFO     2024-05-15 09:26:08 - NodesAsPulses.__init__ - Writing log to logs/graphnet_20240515-092608.log

-------------------------------------------------------------------------------
03_train_dynedge_from_config.py 153 <module>
main(

03_train_dynedge_from_config.py 49 main
model: StandardModel = StandardModel.from_config(model_config, trust=True)

model.py 108 from_config
return source._construct_model(trust, load_modules)

model_config.py 126 _construct_model
namespace_classes = get_all_grapnet_classes(

parsing.py 59 get_all_grapnet_classes
submodules = list_all_submodules(*packages)

parsing.py 38 list_all_submodules
return list(

parsing.py 49 list_all_submodules
module = __import__(module_name, fromlist="dummylist")

__init__.py 3 <module>
from .iseecube import ISeeCube

iseecube.py 14 <module>
from torchscale.architecture.encoder import Encoder

encoder.py 16 <module>
from torchscale.component.droppath import DropPath

droppath.py 5 <module>
from timm.models.layers import drop_path

__init__.py 2 <module>
from .models import create_model, list_models, is_model, list_modules, model_entrypoint, \

__init__.py 1 <module>
from .byoanet import *

byoanet.py 15 <module>
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

__init__.py 7 <module>
from .loader import create_loader

loader.py 12 <module>
from .transforms_factory import create_transform

transforms_factory.py 9 <module>
from torchvision import transforms

__init__.py 6 <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils

_meta_registrations.py 164 <module>
def meta_nms(dets, scores, iou_threshold):

library.py 440 inner
handle = entry.abstract_impl.register(func_to_register, source)

abstract_impl.py 30 register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):

RuntimeError:
operator torchvision::nms does not exist

chenlinear avatar May 15 '24 09:05 chenlinear

The problem is with torchvision 0.17.0+rocm5.7: I re-installed the CUDA version and now it works:

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118

Please refer to this issue.
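
As a quick sanity check that the operator is registered again, one can import torchvision and call nms directly (a minimal snippet of my own; before the fix, the import alone raised the RuntimeError):

import torch
from torchvision.ops import nms

# Two overlapping boxes in (x1, y1, x2, y2) format; NMS should keep
# the higher-scoring box and suppress the other (IoU ~ 0.68 > 0.5).
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                      [1.0, 1.0, 11.0, 11.0]])
scores = torch.tensor([0.9, 0.8])
print(nms(boxes, scores, iou_threshold=0.5))  # tensor([0])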

This is the output now after running graphnet/examples/04_training/03_train_dynedge_from_config.py, which looks correct:

graphnet [MainProcess] WARNING  2024-05-15 10:14:36 - has_icecube_package - `icecube` not available. Some functionality may be missing.
graphnet [MainProcess] INFO     2024-05-15 10:14:39 - NodesAsPulses.__init__ - Writing log to logs/graphnet_20240515-101439.log
graphnet [MainProcess] WARNING  2024-05-15 10:14:40 - _validate_and_set_transforms - Setting one of `transform_target` and `transform_inference`, but not the other.
graphnet [MainProcess] INFO     2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 == 0
graphnet [MainProcess] INFO     2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 == 1
graphnet [MainProcess] INFO     2024-05-15 10:14:40 - StringSelectionResolver.resolve - Resolving selection: event_no % 5 > 1
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
graphnet [MainProcess] INFO     2024-05-15 10:14:40 - StandardModel._create_default_callbacks - EarlyStopping has been added with a patience of 5.
graphnet [MainProcess] INFO     2024-05-15 10:14:40 - StandardModel._print_callbacks - Training initiated with callbacks: ProgressBar, EarlyStopping, ModelCheckpoint
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/setup.py:187: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

2024-05-15 10:14:41.374318: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 10:14:41.374379: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 10:14:41.376053: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-15 10:14:42.538051: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

  | Name              | Type       | Params
-------------------------------------------------
0 | _graph_definition | KNNGraph   | 0     
1 | backbone          | DynEdge    | 1.4 M 
2 | _tasks            | ModuleList | 129   
-------------------------------------------------
1.4 M     Trainable params
0         Non-trainable params
1.4 M     Total params
5.515     Total estimated model params size (MB)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Epoch  0: 100% 2/2 [00:01<00:00,  1.97 batch(es)/s, lr=5.95e-5]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0% 0/1 [00:00<?, ? batch(es)/s]
Validation DataLoader 0:   0% 0/1 [00:00<?, ? batch(es)/s]
Validation DataLoader 0: 100% 1/1 [00:00<00:00, 11.92 batch(es)/s]
Epoch  0: 100% 2/2 [00:01<00:00,  1.67 batch(es)/s, lr=5.95e-5, val_loss=0.0482, train_loss=0.0361]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch  0: 100% 2/2 [00:01<00:00,  1.60 batch(es)/s, lr=5.95e-5, val_loss=0.0482, train_loss=0.0361]
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - StandardModel.fit - Best-fit weights from EarlyStopping loaded.
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - main - Writing results to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - StandardModel.save_state_dict - Model state_dict saved to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example/state_dict.pth
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - StandardModel.save - Model saved to /content/data/examples/output/train_model/prometheus-events/dynedge_total_energy_example/model.pth
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - main - config.target: ['total_energy']
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - main - prediction_columns: ['energy_pred']
graphnet [MainProcess] INFO     2024-05-15 10:14:45 - StandardModel.predict_as_dataframe - Column names for predictions are: 
 ['energy_pred']
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100% 1/1 [00:00<00:00,  6.34 batch(es)/s]

chenlinear avatar May 15 '24 10:05 chenlinear

Maybe the requirements folder should be modified somehow? I followed https://graphnet-team.github.io/graphnet/installation/install.html#quick-start (PyTorch 2.2.*, Linux, 11.8) and reproduced this bug. @RasmusOrsoe

chenlinear avatar May 15 '24 10:05 chenlinear

> Maybe the requirements folder should be modified somehow? I followed https://graphnet-team.github.io/graphnet/installation/install.html#quick-start (PyTorch 2.2.*, Linux, 11.8) and reproduced this bug. @RasmusOrsoe

I found that adding the line torchvision==0.17.0+cu121 to the requirements file (here the CUDA 12.1 version) enforces the CUDA build of torchvision.
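
For a +cu121 local version to resolve, pip also needs the PyTorch wheel index visible to the requirements file; something along these lines (a sketch of the idea, not the actual layout of graphnet's requirements folder):

--extra-index-url https://download.pytorch.org/whl/cu121
torchvision==0.17.0+cu121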

Aske-Rosted avatar May 16 '24 03:05 Aske-Rosted

Wonderful.

I can confirm that this error happened for a small subset of participants at the workshop, and it appears that the installation process (for some still-unknown reason) defaulted to the ROCm version of torchvision. Most people did not encounter this issue. @Aske-Rosted I think your solution (specifically, pinning torchvision in the requirement files) is the right fix. Would you mind making a PR with your changes?

RasmusOrsoe avatar May 16 '24 06:05 RasmusOrsoe

Issue closed by #719

RasmusOrsoe avatar May 17 '24 08:05 RasmusOrsoe