
Why did I get this error message?

zhuchenxi opened this issue 1 year ago · 0 comments

After running `./images/init_venv.sh`, I run this command:

```shell
python3.10 -m torch.distributed.run --standalone --nnodes 1 --nproc_per_node 1 \
  tml/projects/twhin/run.py \
  --config_yaml_path="tml/projects/twhin/config/local.yaml" \
  --save_dir="tml/model"
```

It fails with the following error:

```
Traceback (most recent call last):
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 105, in <module>
    app.run(main)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 97, in main
    run(
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 70, in run
    ctl.train(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 192, in train
    outputs = train_step_fn()
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 57, in step_fn
    outputs = pipeline.progress(data_iterator)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/train_pipeline.py", line 582, in progress
    (losses, output) = cast(Tuple[torch.Tensor, Out], self._model(batch_i))
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 147, in forward
    outputs = self.model(batch)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 255, in forward
    return self._dmp_wrapped_module(*args, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 39, in forward
    outs = self.large_embeddings(batch.nodes)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/common/modules/embedding/embedding.py", line 55, in forward
    pooled_embs = self.ebc(sparse_features)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/types.py", line 594, in forward
    dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embeddingbag.py", line 424, in input_dist
    input_dist(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/sharding/rw_sharding.py", line 303, in forward
    ) = bucketize_kjt_before_all2all(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embedding_sharding.py", line 169, in bucketize_kjt_before_all2all
    ) = torch.ops.fbgemm.block_bucketize_sparse_features(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA'
backend. This could be because the operator doesn't exist for this backend, or was omitted during the
selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on
mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
'fbgemm::block_bucketize_sparse_features' is only available for these backends: [CPU, BackendSelect, Python,
FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView,
AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy,
Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode,
FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
```

Here is my CUDA version (`nvidia-smi` output):

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
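Not part of the original report, but one thing worth ruling out when reading the output above: the driver reports CUDA 11.4 while the pip-installed torch 1.13.1 bundles CUDA 11.7 libraries (`nvidia-cuda-runtime-cu11 11.7.99` in the package list below). Within CUDA 11.x this combination is normally fine, because NVIDIA's minor-version compatibility lets an 11.x runtime run on any driver from 450.80.02 onward. A minimal sketch of that driver check (the function name and the hard-coded threshold are my own, taken from NVIDIA's compatibility documentation, not from this issue):

```python
def driver_supports_cuda11(driver_version: str) -> bool:
    """CUDA 11 minor-version compatibility requires driver >= 450.80.02."""
    major, minor, *_ = (int(part) for part in driver_version.split("."))
    return (major, minor) >= (450, 80)

# The driver version shown by nvidia-smi above:
print(driver_supports_cuda11("470.129.06"))  # True
```

So the driver/runtime pairing is unlikely to be the cause here; the error points at the operator registration instead.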

Here is my pip list:

```
Package                        Version
absl-py                        1.4.0
aiofiles                       22.1.0
aiohttp                        3.8.3
aiosignal                      1.3.1
appdirs                        1.4.4
arrow                          1.2.3
asttokens                      2.2.1
astunparse                     1.6.3
async-timeout                  4.0.2
attrs                          22.1.0
backcall                       0.2.0
black                          22.6.0
cachetools                     5.3.0
cblack                         22.6.0
certifi                        2022.12.7
cfgv                           3.3.1
charset-normalizer             2.1.1
click                          8.1.3
cmake                          3.25.0
Cython                         0.29.32
decorator                      5.1.1
distlib                        0.3.6
distro                         1.8.0
dm-tree                        0.1.6
docker                         6.0.1
docker-pycreds                 0.4.0
docstring-parser               0.8.1
exceptiongroup                 1.1.0
executing                      1.2.0
fbgemm-gpu-cpu                 0.3.2
filelock                       3.8.2
fire                           0.5.0
flatbuffers                    1.12
frozenlist                     1.3.3
fsspec                         2022.11.0
gast                           0.4.0
gcsfs                          2022.11.0
gitdb                          4.0.10
GitPython                      3.1.31
google-api-core                2.8.2
google-auth                    2.16.0
google-auth-oauthlib           0.4.6
google-cloud-core              2.3.2
google-cloud-storage           2.7.0
google-crc32c                  1.5.0
google-pasta                   0.2.0
google-resumable-media         2.4.1
googleapis-common-protos       1.56.4
grpcio                         1.51.1
h5py                           3.8.0
hypothesis                     6.61.0
identify                       2.5.17
idna                           3.4
importlib-metadata             6.0.0
iniconfig                      2.0.0
iopath                         0.1.10
ipdb                           0.13.11
ipython                        8.10.0
jedi                           0.18.2
Jinja2                         3.1.2
keras                          2.9.0
Keras-Preprocessing            1.1.2
libclang                       15.0.6.1
libcst                         0.4.9
Markdown                       3.4.1
MarkupSafe                     2.1.1
matplotlib-inline              0.1.6
moreorless                     0.4.0
multidict                      6.0.4
mypy                           1.0.1
mypy-extensions                0.4.3
nest-asyncio                   1.5.6
ninja                          1.11.1
nodeenv                        1.7.0
numpy                          1.22.0
nvidia-cublas-cu11             11.10.3.66
nvidia-cuda-nvrtc-cu11         11.7.99
nvidia-cuda-runtime-cu11       11.7.99
nvidia-cudnn-cu11              8.5.0.96
oauthlib                       3.2.2
opt-einsum                     3.3.0
packaging                      22.0
pandas                         1.5.3
parso                          0.8.3
pathspec                       0.11.0
pathtools                      0.1.2
pexpect                        4.8.0
pickleshare                    0.7.5
pip                            23.1
platformdirs                   3.0.0
pluggy                         1.0.0
portalocker                    2.6.0
portpicker                     1.5.2
pre-commit                     3.0.4
prompt-toolkit                 3.0.36
protobuf                       3.20.2
psutil                         5.9.4
ptyprocess                     0.7.0
pure-eval                      0.2.2
pyarrow                        10.0.1
pyasn1                         0.4.8
pyasn1-modules                 0.2.8
pydantic                       1.9.0
pyDeprecate                    0.3.2
Pygments                       2.14.0
pyparsing                      3.0.9
pyre-extensions                0.0.27
pytest                         7.2.1
pytest-mypy                    0.10.3
python-dateutil                2.8.2
pytz                           2022.6
PyYAML                         6.0
requests                       2.28.1
requests-oauthlib              1.3.1
rsa                            4.9
scikit-build                   0.16.3
sentry-sdk                     1.16.0
setproctitle                   1.3.2
setuptools                     65.5.0
six                            1.16.0
smmap                          5.0.0
sortedcontainers               2.4.0
stack-data                     0.6.2
stdlibs                        2022.10.9
tabulate                       0.9.0
tensorboard                    2.9.0
tensorboard-data-server       0.6.1
tensorboard-plugin-wit         1.8.1
tensorflow                     2.9.3
tensorflow-estimator           2.9.0
tensorflow-io-gcs-filesystem   0.30.0
termcolor                      2.2.0
toml                           0.10.2
tomli                          2.0.1
torch                          1.13.1
torchmetrics                   0.11.0
torchrec                       0.3.2
torchsnapshot                  0.1.0
torchx                         0.3.0
tqdm                           4.64.1
trailrunner                    1.2.1
traitlets                      5.9.0
typing_extensions              4.4.0
typing-inspect                 0.8.0
urllib3                        1.26.13
usort                          1.0.5
virtualenv                     20.19.0
wandb                          0.13.11
wcwidth                        0.2.6
websocket-client               1.4.2
Werkzeug                       2.2.3
wrapt                          1.14.1
yarl                           1.8.2
zipp                           3.12.1
```
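One detail in the list above is likely relevant to the traceback: the installed FBGEMM wheel is `fbgemm-gpu-cpu 0.3.2`, which is the CPU-only build of FBGEMM, so its operators (including `fbgemm::block_bucketize_sparse_features`) are registered only for the CPU backend, matching the `NotImplementedError` above. A hedged diagnostic sketch for checking which FBGEMM variant an environment has, using only the standard library (the helper name `cpu_only_fbgemm` is my own, not from the project):

```python
from importlib.metadata import distributions

def cpu_only_fbgemm(dist_name: str) -> bool:
    # PyPI publishes the CUDA build as "fbgemm-gpu" and the CPU-only
    # build as "fbgemm-gpu-cpu"; only the suffix distinguishes them.
    return dist_name.lower().endswith("-cpu")

# Scan the current environment for installed fbgemm distributions.
installed = sorted(
    d.metadata["Name"]
    for d in distributions()
    if d.metadata["Name"] and "fbgemm" in d.metadata["Name"].lower()
)
for name in installed:
    variant = "CPU-only" if cpu_only_fbgemm(name) else "CUDA"
    print(f"{name}: {variant} build")
```

In the environment shown above this would report `fbgemm-gpu-cpu: CPU-only build`, which is consistent with the missing CUDA kernel.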

Thank you very much for helping me with this configuration problem!

zhuchenxi · Apr 20 '23 03:04