
errors while running krisp code

Open jiny419 opened this issue 3 years ago • 5 comments

❓ Questions and Help

Hi !

I am running the KRISP project code in mmf, but I ran into some errors.

  1. The torch-sparse module is missing from the KRISP project's requirements.txt.
  2. When I installed a torch-sparse build matching CUDA 10.2, I got the error below:

2021-06-22T16:32:10 | mmf.utils.configuration: Overriding option config to ./projects/krisp/configs/krisp/okvqa/train_val.yaml
2021-06-22T16:32:10 | mmf.utils.configuration: Overriding option run_type to train_val
2021-06-22T16:32:10 | mmf.utils.configuration: Overriding option datasets to okvqa
2021-06-22T16:32:10 | mmf.utils.configuration: Overriding option model to krisp
2021-06-22T16:32:14 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:14 | mmf.utils.distributed: Distributed Init (Rank 3): tcp://localhost:12572
2021-06-22T16:32:14 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:14 | mmf.utils.distributed: Distributed Init (Rank 4): tcp://localhost:12572
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 1): tcp://localhost:12572
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 0): tcp://localhost:12572
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 2): tcp://localhost:12572
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 2
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 5): tcp://localhost:12572
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 5
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 7): tcp://localhost:12572
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 7
2021-06-22T16:32:15 | mmf.utils.distributed: XLA Mode:False
2021-06-22T16:32:15 | mmf.utils.distributed: Distributed Init (Rank 6): tcp://localhost:12572
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 6
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 3
2021-06-22T16:32:15 | root: Added key: store_based_barrier_key:1 to store for rank: 4
2021-06-22T16:32:16 | root: Added key: store_based_barrier_key:1 to store for rank: 1
2021-06-22T16:32:16 | root: Added key: store_based_barrier_key:1 to store for rank: 0
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 0
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 2
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 5
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 3
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 6
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 7
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 4
2021-06-22T16:32:16 | mmf.utils.distributed: Initialized Host 4eb3a36d858c as Rank 1
2021-06-22T16:32:21 | mmf: Logging to: ./save/train.log
2021-06-22T16:32:21 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=./projects/krisp/configs/krisp/okvqa/train_val.yaml', 'run_type=train_val', 'dataset=okvqa', 'model=krisp'])
2021-06-22T16:32:21 | mmf_cli.run: Torch version: 1.8.1+cu102
2021-06-22T16:32:21 | mmf.utils.general: CUDA Device 0 is: GeForce RTX 2080 Ti
2021-06-22T16:32:21 | mmf_cli.run: Using seed 21664516
2021-06-22T16:32:21 | mmf.trainers.mmf_trainer: Loading datasets
okvqa/defaults/annotations/annotations/graph_vocab/graph_vocab.pth.tar /home/aimaster/.cache/torch/mmf/data
2021-06-22T16:32:27 | mmf.datasets.multi_datamodule: Multitasking disabled by default for single dataset training
2021-06-22T16:32:27 | mmf.datasets.multi_datamodule: Multitasking disabled by default for single dataset training
2021-06-22T16:32:27 | mmf.datasets.multi_datamodule: Multitasking disabled by default for single dataset training
2021-06-22T16:32:27 | mmf.trainers.mmf_trainer: Loading model
Import error with KRISP dependencies. Fix dependencies if you want to use KRISP
Traceback (most recent call last):
  File "/home/aimaster/anaconda3/envs/mmf/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf_cli/run.py", line 129, in run
    nprocs=config.distributed.world_size,
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf_cli/run.py", line 52, in main
    trainer.load()
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/trainers/mmf_trainer.py", line 42, in load
    super().load()
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/trainers/base_trainer.py", line 33, in load
    self.load_model()
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/trainers/mmf_trainer.py", line 96, in load_model
    self.model = build_model(attributes)
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/utils/build.py", line 87, in build_model
    model = model_class(config)
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/models/krisp.py", line 39, in __init__
    self.build()
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/models/krisp.py", line 75, in build
    from projects.krisp.graphnetwork_module import GraphNetworkModule
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/projects/krisp/graphnetwork_module.py", line 21, in <module>
    from torch_geometric.nn import BatchNorm, GCNConv, RGCNConv, SAGEConv
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_sparse/__init__.py", line 36, in <module>
    from .storage import SparseStorage  # noqa
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_sparse/storage.py", line 21, in <module>
    class SparseStorage(object):
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_script.py", line 974, in script
    _compile_and_register_class(obj, _rcb, qualified_name)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_script.py", line 67, in _compile_and_register_class
    torch._C._jit_script_class_compile(qualified_name, ast, defaults, rcb)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_recursive.py", line 757, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_script.py", line 990, in script
    qualified_name, ast, _rcb, get_default_args(obj)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_recursive.py", line 757, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/_script.py", line 986, in script
    ast = get_jit_def(obj, obj.__name__)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/frontend.py", line 271, in get_jit_def
    return build_def(ctx, fn_def, type_line, def_name, self_name=self_name)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/frontend.py", line 293, in build_def
    param_list = build_param_list(ctx, py_def.args, self_name)
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/jit/frontend.py", line 316, in build_param_list
    raise NotSupportedError(ctx_range, _vararg_kwarg_err)
torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults:
  File "/home/aimaster/lab_storage/jinyeong/VQA/mmf/mmf/utils/distributed.py", line 340
    def warn(*args, **kwargs):
             ~~~~~~~ <--- HERE
        force = kwargs.pop("force", False)
        if is_master or force:
'get_layout' is being compiled since it was called from 'SparseStorage.set_value'
  File "/home/aimaster/anaconda3/envs/mmf/lib/python3.7/site-packages/torch_sparse/storage.py", line 210
                  layout: Optional[str] = None):
        if value is not None:
            if get_layout(layout) == 'csc':
               ~~~~~~~~~~~~~~~~~ <--- HERE
                value = value[self.csc2csr()]
            value = value.contiguous()
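
For context, the NotSupportedError at the bottom of this traceback is a general TorchScript restriction rather than something specific to torch-sparse: TorchScript refuses to compile any function that takes `*args`/`**kwargs`. A minimal, self-contained sketch of that restriction (the function here is made up for illustration and is not from mmf or torch-sparse):

```python
import torch

# TorchScript cannot compile functions with *args/**kwargs signatures,
# which is what the NotSupportedError in the traceback above reports.
def takes_varargs(*args, **kwargs):
    return len(args) + len(kwargs)

try:
    torch.jit.script(takes_varargs)
except Exception as e:
    # Expected: NotSupportedError: Compiled functions can't take variable
    # number of arguments or use keyword-only arguments with defaults
    print(type(e).__name__, e)
```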

jiny419 avatar Jun 22 '21 07:06 jiny419

Hi @jiny419, thanks for using mmf,

Do you mind sharing the command you use to run? Tagging @KMarino to help with Krisp related issues.

hackgoofer avatar Jun 22 '21 18:06 hackgoofer

Yes, I didn't include the PyTorch Geometric dependencies because they are system- and CUDA-version-dependent. See the PyTorch Geometric installation instructions for how to install them on your system.

https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
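
For reference, the prebuilt torch-scatter / torch-sparse / torch-geometric wheels on that page are keyed to the exact torch and CUDA versions, which you can check from Python before picking an install command. A small sketch (the versions in the log above were torch 1.8.1 with CUDA 10.2; your output may differ):

```python
import torch

# Check the torch and CUDA versions before choosing wheels from the page above;
# the wheel index is version-specific.
print(torch.__version__)   # e.g. "1.8.1+cu102", matching the log above
print(torch.version.cuda)  # e.g. "10.2"; None for a CPU-only build
```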

KMarino avatar Jun 22 '21 20:06 KMarino

@ytsheng I ran "mmf_run config=./projects/krisp/configs/krisp/okvqa/train_val.yaml run_type=train_val dataset=okvqa model=krisp" with the proper path for my project. @KMarino I installed the dependencies such as torch-sparse and torch-geometric for my CUDA version, but I hit the error above, specifically "torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults" in mmf's distributed.py.

I think the warn function in distributed.py conflicted with the get_layout function of torch-sparse, and I have now solved it! Thank you!

jiny419 avatar Jun 23 '21 05:06 jiny419

> @ytsheng I ran "mmf_run config=./projects/krisp/configs/krisp/okvqa/train_val.yaml run_type=train_val dataset=okvqa model=krisp" with the proper path for my project. @KMarino I installed the dependencies such as torch-sparse and torch-geometric for my CUDA version, but I hit the error above, specifically "torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults" in mmf's distributed.py.
>
> I think the warn function in distributed.py conflicted with the get_layout function of torch-sparse, and I have now solved it! Thank you!

Could you elaborate on the solution to the above conflict? Thank you!

ChanningPing avatar Jun 23 '21 21:06 ChanningPing

> @ytsheng I ran "mmf_run config=./projects/krisp/configs/krisp/okvqa/train_val.yaml run_type=train_val dataset=okvqa model=krisp" with the proper path for my project. @KMarino I installed the dependencies such as torch-sparse and torch-geometric for my CUDA version, but I hit the error above, specifically "torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults" in mmf's distributed.py. I think the warn function in distributed.py conflicted with the get_layout function of torch-sparse, and I have now solved it! Thank you!
>
> Could you elaborate on the solution to the above conflict? Thank you!

The warn function in mmf's distributed.py conflicts with the get_layout function of torch-sparse; just comment out the warn function.
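
To make that workaround concrete, here is a hedged sketch of the interaction; names like `is_master` and `builtin_warn` are illustrative stand-ins, the real override lives in mmf/utils/distributed.py as shown in the traceback, and the exact mmf code may differ:

```python
import warnings

is_master = False                # stand-in for mmf's rank check
builtin_warn = warnings.warn     # keep a handle to the original

def warn(*args, **kwargs):
    # TorchScript cannot compile a *args/**kwargs signature like this one.
    force = kwargs.pop("force", False)
    if is_master or force:
        builtin_warn(*args, **kwargs)

# The monkey-patch that leaks into torch_sparse: get_layout() calls
# warnings.warn(), so when TorchScript compiles SparseStorage.set_value it
# recursively tries to compile this patched warn() and fails.
warnings.warn = warn

# Workaround from this thread: comment out the override in
# mmf/utils/distributed.py, or restore the original warnings.warn before
# importing torch_geometric / torch_sparse:
warnings.warn = builtin_warn
```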

AndersonStra avatar Oct 28 '21 14:10 AndersonStra