[Environment Variable][1/N] Use thread-safe env variable API in c10
This PR is the first in a series of attempts to wrap the thread-unsafe getenv and setenv functions inside an RW mutex.
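Roughly, the intended shape is the following (an illustrative sketch assuming C++17; the names and signatures here are not necessarily the exact c10 API that lands): reads take a shared lock and writes take an exclusive lock around the underlying libc calls.

```cpp
#include <cstdlib>
#include <optional>
#include <shared_mutex>
#include <string>

namespace c10 {
namespace utils {

// Single process-wide lock protecting the environment block.
inline std::shared_mutex& env_mutex() {
  static std::shared_mutex mutex;
  return mutex;
}

// Readers take a shared lock, so concurrent lookups do not block each other.
inline std::optional<std::string> get_env(const char* name) {
  std::shared_lock<std::shared_mutex> lock(env_mutex());
  const char* value = std::getenv(name);
  if (value == nullptr) {
    return std::nullopt;
  }
  return std::string(value);
}

// Writers take an exclusive lock, serializing against all readers and writers.
inline void set_env(const char* name, const char* value) {
  std::unique_lock<std::shared_mutex> lock(env_mutex());
#ifdef _WIN32
  _putenv_s(name, value);
#else
  setenv(name, value, /*overwrite=*/1);
#endif
}

} // namespace utils
} // namespace c10
```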
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/119449
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:x: 9 New Failures
As of commit 6531b5572fe14163a6a45226e0f0bd99d0933761 with merge base 6e10f7d8c188769a2727750557da8042130abe93:
NEW FAILURES - The following jobs have failed:
- macos-arm64-binary-conda / conda-py3_10-cpu-build (gh)
  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1727753357803/work/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-conda / conda-py3_11-cpu-build (gh)
  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1727753357619/work/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-conda / conda-py3_12-cpu-build (gh)
  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1727753359718/work/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-conda / conda-py3_9-cpu-build (gh)
  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1727753352884/work/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build (gh)
  /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-wheel / wheel-py3_10-cpu-build (gh)
  /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-wheel / wheel-py3_11-cpu-build (gh)
  /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-wheel / wheel-py3_12-cpu-build (gh)
  /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
- macos-arm64-binary-wheel / wheel-py3_9-cpu-build (gh)
  /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Distributions.mm:88:20: error: use of undeclared identifier 'MPSDataTypeBFloat16'; did you mean 'MPSDataTypeFloat16'?
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.
> I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.
Yes, but the current implementation would do nothing to avoid it, as the putenv call from CPython would not be serialized against getenv calls made from libtorch.
> I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.
A specific reproducing example may be hard to create. Also note that Torch may be used in a purely multi-threaded C++ environment.
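For illustration, the shape of the race being discussed is roughly the following (a hypothetical sketch; the environment variable name is made up):

```cpp
// Hypothetical repro shape: a reader thread calling std::getenv concurrently
// with a writer thread calling setenv. POSIX documents this as a data race
// unless the calls are externally synchronized.
#include <cstdlib>
#include <thread>

int main() {
  std::thread writer([] {
    for (int i = 0; i < 100000; ++i) {
      setenv("TORCH_DEMO_FLAG", (i % 2) ? "1" : "0", /*overwrite=*/1);  // POSIX
    }
  });
  std::thread reader([] {
    for (int i = 0; i < 100000; ++i) {
      const char* v = std::getenv("TORCH_DEMO_FLAG");  // races with setenv above
      (void)v;
    }
  });
  writer.join();
  reader.join();
  return 0;
}
```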
@malfet @albanD set_env was added.
Can you please write a better description of what you are trying to achieve here? I haven't read the standard, but being familiar with the implementation details, I don't see why getenv on its own is not thread-safe. It is probably unsafe when there are concurrent putenv calls, but you are not adding that one to your env utils.
[Edit] To quote https://en.cppreference.com/w/cpp/utility/program/getenv, it looks like this is part of the standard:
> This function is thread-safe (calling it from multiple threads does not introduce a data race) as long as no other function modifies the host environment. In particular, the POSIX functions setenv(), unsetenv(), and putenv() would introduce a data race if called without synchronization.
[Edit2] Is this about suppressing some clang-tidy warnings or about solving a real problem? If the latter, then the solution is to provide thread-safe extern "C" implementations of those methods in the code, which would take over the libc ones (all libc functions are weak symbols by design, so that an app can choose to override them) and provide thread safety for getenv calls from, say, oneDNN, libcuda, etc.
The purpose is to suppress clang-tidy warnings and to solve possible future problems by providing thread-safe getenv and setenv wrappers. Overriding the libc implementations may not be a good idea because we have to support MSVC and other non-GNU platforms, so I prefer an explicit c10 API.
@pytorchbot label "topic: not user facing"
@pytorchbot label ciflow/trunk
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.
@pytorchmergebot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
@pytorchbot revert -m "your using TORCH_INTERNAL_ASSERT incorrectly" -c "ghfirst"
We're getting the following error from our compiler
xplat/caffe2/c10/util/env.cpp:37:9: error: implicit conversion turns string literal into bool: 'const char[32]' to 'bool' [-Werror,-Wstring-conversion]
"setenv failed for environment \"", name, "\", the error is: ", err);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
xplat/caffe2/c10/util/Exception.h:406:31: note: expanded from macro 'TORCH_INTERNAL_ASSERT'
if (C10_UNLIKELY_OR_CONST(!(cond)))
Fixing the assert to start with a bool should fix things.
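Concretely, something along these lines should do it (a sketch; `err` and the surrounding names in env.cpp are assumed): TORCH_INTERNAL_ASSERT takes the boolean condition as its first argument, and everything after it is the message.

```cpp
// Before (rejected by -Wstring-conversion): the string literal is implicitly
// converted to bool and used as the assert condition.
// TORCH_INTERNAL_ASSERT(
//     "setenv failed for environment \"", name, "\", the error is: ", err);

// After: lead with an actual boolean check. Here `err` is assumed to be the
// return code of the underlying setenv call.
TORCH_INTERNAL_ASSERT(
    err == 0,
    "setenv failed for environment \"",
    name,
    "\", the error is: ",
    err);
```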
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@cyyever your PR has been successfully reverted.
@PaliC Thank you for pointing it out, fixed.
@pytorchmergebot merge -i
Merge started
Your change will be merged while ignoring the following 19 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.11-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 1, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge), Lint / lintrunner-noclang / linux-job, trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-stable), linux-aarch64-binary-manywheel / manywheel-py3_10-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_11-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cpu-aarch64-build / build, linux-binary-conda / conda-py3_9-cuda11_8-build / build, linux-binary-conda / conda-py3_11-cuda11_8-build / build, linux-binary-conda / conda-py3_12-cuda11_8-build / build, linux-binary-conda / conda-py3_10-cuda11_8-build / build, linux-binary-conda / conda-py3_8-cuda11_8-build / build
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
@pytorchbot revert -m "Broke gcc9 builds" -c ignoredsignal
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@cyyever your PR has been successfully reverted.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased th_safe_env onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout th_safe_env && git pull --rebase)
Though I might be wrong, there were just a lot of unrelated failures; sorry about the revert.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
Merge failed
Reason: 1 jobs have failed, first few of them are: windows-binary-conda / conda-py3_12-cpu-build
Details for Dev Infra team
Raised by workflow job
@pytorchmergebot merge -i
Merge started
Your change will be merged while ignoring the following 6 checks: linux-aarch64-binary-manywheel / manywheel-py3_10-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_11-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cpu-aarch64-build / build, linux-binary-conda / conda-py3_12-cuda11_8-build / build
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
@pytorchbot revert -m "Broken internal signals, @albanD please help get this sorted :)" -c nosignal