omegaconf error (with ddp_spawn): Unsupported interpolation type hydra
I think omegaconf fails during ddp_spawn training when interpolating strings like ${hydra:xxxxxxx}
The simplest way to reproduce such an error on my machine is as follows:
- pull the repo
- run
python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv
The command-line output and error trace follow (I left out the parts that seemed unimportant to me, marked by ########):
│ 18 │ test_loss │ MeanMetric │ 0 │
│ 19 │ val_acc_best │ MaxMetric │ 0 │
└────┴──────────────┴────────────────────┴────────┘
Trainable params: 68.0 K
Non-trainable params: 0
Total params: 68.0 K
Total estimated model params size (MB): 0
[2023-01-04 16:34:17,037][src.utils.utils][ERROR] -
Traceback (most recent call last):
File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/utils/utils.py", line 38, in wrap
metric_dict, object_dict = task_func(cfg=cfg)
######## (This part is about multiprocessing)
File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
######## (This part is about omegaconf)
File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
full_key: trainer.default_root_dir
object_type=dict
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Output dir: /nvme/louzekun/playground/lightning-hydra-template-1.5.0/logs/train/runs/2023-01-04_16-34-10
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Closing loggers...
Error executing job with overrides: ['trainer=ddp', 'trainer.max_epochs=5', 'logger=csv']
Traceback (most recent call last):
File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/train.py", line 122, in main
metric_dict, _ = train(cfg)
######## (This part repeated the same errors as above)
File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
full_key: trainer.default_root_dir
object_type=dict
In configs/trainer/default.yaml one can find trainer.default_root_dir=${paths.output_dir}, and configs/paths/default.yaml in turn sets output_dir: ${hydra:runtime.output_dir}
The UnsupportedInterpolationType error is raised at omegaconf/base.py:L702 because no resolver named 'hydra' is registered at that point.
It seems that ${hydra:runtime.xxxxxx} works fine before and after training (otherwise the pl.Trainer could not be instantiated, and there would be no logs like [2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10), but fails during ddp_spawn training (note the "Process 0 terminated with the following error" in the trace).
To verify my guess, after cfg was created and before train(cfg) was called, I deleted the 'hydra' resolver from OmegaConf with OmegaConf.clear_resolver("hydra") and registered a new resolver named hydra, implemented as a class with a __call__ method wrapping HydraConfig.get(). The exact same error occurred.
My Python package versions:
# Name Version Build Channel
hydra-colorlog 1.2.0 pypi_0 pypi
hydra-core 1.3.1 pypi_0 pypi
hydra-optuna-sweeper 1.2.0 pypi_0 pypi
pytorch 1.12.1 py3.10_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-cluster 1.6.0 py310_torch_1.12.0_cu113 pyg
pytorch-lightning 1.8.3 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytorch-scatter 2.0.9 py310_torch_1.12.0_cu113 pyg
pytorch-sparse 0.6.15 py310_torch_1.12.0_cu113 pyg
torchaudio 0.12.1 py310_cu113 pytorch
torchmetrics 0.11.0 pyhd8ed1ab_0 conda-forge
torchvision 0.13.1 py310_cu113 pytorch
My GPUs are 8xA100-SXM4-80GB
My GPU driver version:
NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4
So is this my own mistake, or are there remedies? Thank you!
The problem is now partially solved by eagerly resolving the vars in the DictConfig:

from omegaconf import DictConfig

def fix_DictConfig(cfg: DictConfig):
    """fix all vars in the cfg config
    this is an in-place operation"""
    keys = list(cfg.keys())
    for k in keys:
        if type(cfg[k]) is DictConfig:
            fix_DictConfig(cfg[k])
        else:
            # reading resolves the interpolation; writing back stores the
            # plain value, so no resolver is needed in the spawned workers
            setattr(cfg, k, getattr(cfg, k))
Think I am hitting the same. @nqhq-lou: Thanks so much for posting the fix above!
I just added fix_DictConfig right after the train() call and it seems to work.
When you say "partial", what is missing? Thanks so much for the help.
Great to hear that my solution was able to help!
I think the problem is due to a conflict between variable synchronization and the hydra resolver (or something else): as you can see, the omegaconf interpolation problems occur once the ddp_spawn strategy kicks in. So I guess a "complete" solution should fix the underlying interpolation error, rather than fixing the parameters by brute force. If we want to keep using this interpolation feature, such as changing parameters on the fly via the hydra resolver, the current partial solution won't work anymore and we'll have to go back and find a complete solution.
I have encountered the same problem, and your solution works well. However, I am confused that the problem only appeared after I had been using the template normally for some time; running ddp strategies for long periods never caused it before. I am puzzled about the cause of this bug.
I'm using the latest versions and this is still an issue for me, but the solution from @nqhq-lou works for me! Thanks :-)
❯ poetry show
aiohttp 3.8.4 Async http client/server framework (asyncio)
aiosignal 1.3.1 aiosignal: a list of registered asynchronous callbacks
antlr4-python3-runtime 4.9.3 ANTLR 4.9.3 runtime for Python 3.7
appdirs 1.4.4 A small Python module for determining appropriate platform-specific dirs, e.g. a "user data...
arrow 1.2.3 Better dates & times for Python
async-timeout 4.0.2 Timeout context manager for asyncio programs
attrs 22.2.0 Classes Without Boilerplate
boto3 1.26.70 The AWS SDK for Python
botocore 1.29.70 Low-level, data-driven core of boto 3.
bravado 11.0.3 Library for accessing Swagger-enabled API's
bravado-core 5.17.1 Library for adding Swagger support to clients and servers
certifi 2022.12.7 Python package for providing Mozilla's CA Bundle.
cfgv 3.3.1 Validate configuration and produce human readable error messages.
charset-normalizer 3.0.1 The Real First Universal Charset Detector. Open, modern and actively maintained alternative...
click 8.1.3 Composable command line interface toolkit
colorlog 6.7.0 Add colours to the output of Python's logging module.
distlib 0.3.6 Distribution utilities
docker-pycreds 0.4.0 Python bindings for the docker credentials store API
exceptiongroup 1.1.0 Backport of PEP 654 (exception groups)
filelock 3.9.0 A platform independent file lock.
fqdn 1.5.1 Validates fully-qualified domain names against RFC 1123, so that they are acceptable to mod...
frozenlist 1.3.3 A list-like structure which implements collections.abc.MutableSequence
fsspec 2023.1.0 File-system specification
future 0.18.3 Clean single-source support for Python 3 and 2
gitdb 4.0.10 Git Object Database
gitpython 3.1.30 GitPython is a python library used to interact with Git repositories
huggingface-hub 0.12.0 Client library to download and publish models, datasets and other repos on the huggingface....
hydra-colorlog 1.2.0 Enables colorlog for Hydra apps
hydra-core 1.3.1 A framework for elegantly configuring complex applications
identify 2.5.18 File identification library for Python
idna 3.4 Internationalized Domain Names in Applications (IDNA)
iniconfig 2.0.0 brain-dead simple config-ini parsing
isoduration 20.11.0 Operations with ISO 8601 durations
jmespath 1.0.1 JSON Matching Expressions
joblib 1.2.0 Lightweight pipelining with Python functions
jsonpointer 2.3 Identify specific nodes in a JSON document (RFC 6901)
jsonref 1.1.0 jsonref is a library for automatic dereferencing of JSON Reference objects for Python.
jsonschema 4.17.3 An implementation of JSON Schema validation for Python
lightning-utilities 0.6.0.post0 PyTorch Lightning Sample project.
markdown-it-py 2.1.0 Python port of markdown-it. Markdown parsing, done right!
mdurl 0.1.2 Markdown URL utilities
monotonic 1.6 An implementation of time.monotonic() for Python 2 & < 3.3
msgpack 1.0.4 MessagePack serializer
multidict 6.0.4 multidict implementation
neptune-client 0.16.17 Neptune Client
nodeenv 1.7.0 Node.js virtual environment builder
numpy 1.24.2 Fundamental package for array computing in Python
oauthlib 3.2.2 A generic, spec-compliant, thorough implementation of the OAuth request-signing logic
omegaconf 2.3.0 A flexible configuration library
packaging 23.0 Core utilities for Python packages
pandas 1.5.3 Powerful data structures for data analysis, time series, and statistics
pathtools 0.1.2 File system general utilities
pillow 9.4.0 Python Imaging Library (Fork)
platformdirs 3.0.0 A small Python package for determining appropriate platform-specific dirs, e.g. a "user dat...
pluggy 1.0.0 plugin and hook calling mechanisms for python
pre-commit 3.0.4 A framework for managing and maintaining multi-language pre-commit hooks.
protobuf 3.20.3 Protocol Buffers
psutil 5.9.4 Cross-platform lib for process and system monitoring in Python.
pyarrow 11.0.0 Python library for Apache Arrow
pygments 2.14.0 Pygments is a syntax highlighting package written in Python.
pyjwt 2.6.0 JSON Web Token implementation in Python
pyrootutils 1.0.4 Simple package for easy project root setup
pyrsistent 0.19.3 Persistent/Functional/Immutable data structures
pytest 7.2.1 pytest: simple powerful testing with Python
python-dateutil 2.8.2 Extensions to the standard Python datetime module
python-dotenv 0.21.1 Read key-value pairs from a .env file and set them as environment variables
pytorch-lightning 1.9.1 PyTorch Lightning is the lightweight PyTorch wrapper for ML researchers. Scale your models....
pytorch-ranger 0.1.1 Ranger - a synergistic optimizer using RAdam (Rectified Adam) and LookAhead in one codebase
pytz 2022.7.1 World timezone definitions, modern and historical
pyyaml 6.0 YAML parser and emitter for Python
regex 2022.10.31 Alternative regular expression module, to replace re.
requests 2.28.2 Python HTTP for Humans.
requests-oauthlib 1.3.1 OAuthlib authentication support for Requests.
rfc3339-validator 0.1.4 A pure python RFC3339 validator
rfc3987 1.3.8 Parsing and validation of URIs (RFC 3986) and IRIs (RFC 3987)
rich 13.3.1 Render rich text, tables, progress bars, syntax highlighting, markdown and more to the term...
s3transfer 0.6.0 An Amazon S3 Transfer Manager
scikit-learn 1.2.1 A set of python modules for machine learning and data mining
scipy 1.9.3 Fundamental algorithms for scientific computing in Python
sentry-sdk 1.15.0 Python client for Sentry (https://sentry.io)
setproctitle 1.3.2 A Python module to customize the process title
setuptools 67.2.0 Easily download, build, install, upgrade, and uninstall Python packages
simplejson 3.18.3 Simple, fast, extensible JSON encoder/decoder for Python
six 1.16.0 Python 2 and 3 compatibility utilities
smmap 5.0.0 A pure Python implementation of a sliding window memory map manager
swagger-spec-validator 3.0.3 Validation of Swagger specifications
tensorboardx 2.6 TensorBoardX lets you watch Tensors Flow without Tensorflow
threadpoolctl 3.1.0 threadpoolctl
tokenizers 0.13.2 Fast and Customizable Tokenizers
tomli 2.0.1 A lil' TOML parser
torch 1.13.1 Tensors and Dynamic neural networks in Python with strong GPU acceleration
torch-optimizer 0.3.0 pytorch-optimizer
torchmetrics 0.11.1 PyTorch native Metrics
torchvision 0.14.1 image and video datasets and models for torch deep learning
tqdm 4.64.1 Fast, Extensible Progress Meter
transformers 4.26.0.dev0 ../transformers State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
typing-extensions 4.4.0 Backported and Experimental Type Hints for Python 3.7+
uri-template 1.2.0 RFC 6570 URI Template Processor
urllib3 1.26.14 HTTP library with thread-safe connection pooling, file post, and more.
virtualenv 20.19.0 Virtual Python Environment builder
wandb 0.13.10 A CLI and library for interacting with the Weights and Biases API.
webcolors 1.12 A library for working with color names and color values formats defined by HTML and CSS.
websocket-client 1.5.1 WebSocket client for Python with low level API options
yarl 1.8.2 Yet another URL library
I fixed this bug with the help of @nqhq-lou, thanks! If anyone else meets this bug, add a call to fix_DictConfig() inside the train function in src/train.py, around line 58.
Thanks for the bug fix. I use an old version (v1.4.0) which has the same issue in multi-device mode. The issue was resolved by adding the fix_DictConfig call to src/tasks/train_task.py:
log.info("Instantiating callbacks...")
fix_DictConfig(cfg)
callbacks: List[Callback] = utils.instantiate_callbacks(cfg.get("callbacks"))
What directions should we look into for a complete solution? Should an issue be created under hydra or omegaconf?