After running ./images/init_venv.sh, I ran this command:
python3.10 -m torch.distributed.run --standalone --nnodes 1 --nproc_per_node 1 tml/projects/twhin/run.py --config_yaml_path="tml/projects/twhin/config/local.yaml" --save_dir="tml/model"
It failed with this error:
Traceback (most recent call last):
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 105, in <module>
app.run(main)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 97, in main
run(
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 70, in run
ctl.train(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 192, in train
outputs = train_step_fn()
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 57, in step_fn
outputs = pipeline.progress(data_iterator)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/train_pipeline.py", line 582, in progress
(losses, output) = cast(Tuple[torch.Tensor, Out], self._model(batch_i))
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 147, in forward
outputs = self.model(batch)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 255, in forward
return self._dmp_wrapped_module(*args, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 39, in forward
outs = self.large_embeddings(batch.nodes)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/common/modules/embedding/embedding.py", line 55, in forward
pooled_embs = self.ebc(sparse_features)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/types.py", line 594, in forward
dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embeddingbag.py", line 424, in input_dist
input_dist(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/sharding/rw_sharding.py", line 303, in forward
) = bucketize_kjt_before_all2all(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embedding_sharding.py", line 169, in bucketize_kjt_before_all2all
) = torch.ops.fbgemm.block_bucketize_sparse_features(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/_ops.py", line 442, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::block_bucketize_sparse_features' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
Here is my nvidia-smi output, showing the driver and CUDA version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Here is the output of pip list in the virtual environment:
Package Version
absl-py 1.4.0
aiofiles 22.1.0
aiohttp 3.8.3
aiosignal 1.3.1
appdirs 1.4.4
arrow 1.2.3
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.1.0
backcall 0.2.0
black 22.6.0
cachetools 5.3.0
cblack 22.6.0
certifi 2022.12.7
cfgv 3.3.1
charset-normalizer 2.1.1
click 8.1.3
cmake 3.25.0
Cython 0.29.32
decorator 5.1.1
distlib 0.3.6
distro 1.8.0
dm-tree 0.1.6
docker 6.0.1
docker-pycreds 0.4.0
docstring-parser 0.8.1
exceptiongroup 1.1.0
executing 1.2.0
fbgemm-gpu-cpu 0.3.2
filelock 3.8.2
fire 0.5.0
flatbuffers 1.12
frozenlist 1.3.3
fsspec 2022.11.0
gast 0.4.0
gcsfs 2022.11.0
gitdb 4.0.10
GitPython 3.1.31
google-api-core 2.8.2
google-auth 2.16.0
google-auth-oauthlib 0.4.6
google-cloud-core 2.3.2
google-cloud-storage 2.7.0
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.4.1
googleapis-common-protos 1.56.4
grpcio 1.51.1
h5py 3.8.0
hypothesis 6.61.0
identify 2.5.17
idna 3.4
importlib-metadata 6.0.0
iniconfig 2.0.0
iopath 0.1.10
ipdb 0.13.11
ipython 8.10.0
jedi 0.18.2
Jinja2 3.1.2
keras 2.9.0
Keras-Preprocessing 1.1.2
libclang 15.0.6.1
libcst 0.4.9
Markdown 3.4.1
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
moreorless 0.4.0
multidict 6.0.4
mypy 1.0.1
mypy-extensions 0.4.3
nest-asyncio 1.5.6
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.2
opt-einsum 3.3.0
packaging 22.0
pandas 1.5.3
parso 0.8.3
pathspec 0.11.0
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1
platformdirs 3.0.0
pluggy 1.0.0
portalocker 2.6.0
portpicker 1.5.2
pre-commit 3.0.4
prompt-toolkit 3.0.36
protobuf 3.20.2
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.9.0
pyDeprecate 0.3.2
Pygments 2.14.0
pyparsing 3.0.9
pyre-extensions 0.0.27
pytest 7.2.1
pytest-mypy 0.10.3
python-dateutil 2.8.2
pytz 2022.6
PyYAML 6.0
requests 2.28.1
requests-oauthlib 1.3.1
rsa 4.9
scikit-build 0.16.3
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 65.5.0
six 1.16.0
smmap 5.0.0
sortedcontainers 2.4.0
stack-data 0.6.2
stdlibs 2022.10.9
tabulate 0.9.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.9.3
tensorflow-estimator 2.9.0
tensorflow-io-gcs-filesystem 0.30.0
termcolor 2.2.0
toml 0.10.2
tomli 2.0.1
torch 1.13.1
torchmetrics 0.11.0
torchrec 0.3.2
torchsnapshot 0.1.0
torchx 0.3.0
tqdm 4.64.1
trailrunner 1.2.1
traitlets 5.9.0
typing_extensions 4.4.0
typing-inspect 0.8.0
urllib3 1.26.13
usort 1.0.5
virtualenv 20.19.0
wandb 0.13.11
wcwidth 0.2.6
websocket-client 1.4.2
Werkzeug 2.2.3
wrapt 1.14.1
yarl 1.8.2
zipp 3.12.1
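One thing that may be worth checking (a sketch, not a confirmed diagnosis): the list above contains fbgemm-gpu-cpu 0.3.2, and the error says 'fbgemm::block_bucketize_sparse_features' is registered for the CPU backend but not for CUDA. If fbgemm-gpu-cpu is a CPU-only build, that would be consistent with the failure. The script below only reports which fbgemm distributions are installed in the active environment. The helper names fbgemm_variants and installed_distribution_names are made up for this example, and the assumption that the CUDA-enabled build is published as fbgemm-gpu should be checked against the FBGEMM install notes.

```python
from importlib import metadata


def fbgemm_variants(installed_names):
    """Return the fbgemm distributions found among the given package names.

    Hypothetical helper: names are lowercased and underscores are turned
    into hyphens so that "fbgemm_gpu" and "fbgemm-gpu" compare equal.
    """
    # Assumption: these are the two distribution names of interest.
    wanted = {"fbgemm-gpu", "fbgemm-gpu-cpu"}
    normalized = {name.lower().replace("_", "-") for name in installed_names}
    return sorted(wanted & normalized)


def installed_distribution_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            names.append(name)
    return names


if __name__ == "__main__":
    found = fbgemm_variants(installed_distribution_names())
    print("fbgemm distributions found:", found if found else "none")
    if found == ["fbgemm-gpu-cpu"]:
        # Matches the pip list in this report: only the "-cpu" package is present.
        print("Only fbgemm-gpu-cpu is installed; CUDA operators may be missing.")
```

Running it with the same interpreter as the training command (python3.10 from tml_venv) keeps the result comparable with the pip list above.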
Thank you very much for any help with this configuration problem.