
[Environment Variable][1/N] Use thread-safe env variable API in c10

Open cyyever opened this pull request 1 year ago • 41 comments

This PR is the beginning of attempts to wrap the thread-unsafe getenv and setenv functions inside an RW mutex.

cyyever avatar Feb 08 '24 10:02 cyyever
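A minimal sketch of the idea, assuming hypothetical wrapper names (c10::utils::get_env / c10::utils::set_env), a std::shared_mutex as the RW lock, and a _WIN32 branch for MSVC; the actual API in the PR may differ:

    #include <cstdlib>
    #include <mutex>
    #include <optional>
    #include <shared_mutex>
    #include <string>

    namespace c10::utils {

    // One process-wide RW mutex serializing all environment access that
    // goes through these wrappers (sketch only).
    inline std::shared_mutex& env_mutex() {
      static std::shared_mutex mutex;
      return mutex;
    }

    // Readers take a shared lock: concurrent get_env calls do not block
    // each other, only writers are exclusive.
    inline std::optional<std::string> get_env(const char* name) {
      std::shared_lock lock(env_mutex());
      const char* value = std::getenv(name);
      if (value == nullptr) {
        return std::nullopt;
      }
      // Copy while still holding the lock, so a concurrent set_env
      // cannot invalidate the storage being read.
      return std::string(value);
    }

    // Writers take an exclusive lock before mutating the environment.
    inline bool set_env(const char* name, const char* value) {
      std::unique_lock lock(env_mutex());
    #ifdef _WIN32
      return _putenv_s(name, value) == 0;
    #else
      return setenv(name, value, /*overwrite=*/1) == 0;
    #endif
    }

    } // namespace c10::utils

As the discussion below points out, this only protects callers that go through the wrappers; getenv/putenv calls made directly by CPython or third-party libraries still bypass the lock.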

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/119449

Note: Links to docs will display an error until the docs builds have been completed.

:x: 9 New Failures

As of commit 6531b5572fe14163a6a45226e0f0bd99d0933761 with merge base 6e10f7d8c188769a2727750557da8042130abe93:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Feb 08 '24 10:02 pytorch-bot[bot]

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

albanD avatar Feb 08 '24 16:02 albanD

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

Yes, but the current implementation would do nothing to avoid it, as a putenv call from CPython would not be serialized against getenv calls made from libtorch.

malfet avatar Feb 08 '24 16:02 malfet
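A minimal sketch of the race being described, assuming a POSIX platform: one thread mutates the environment the way CPython's os.environ does (via setenv/putenv) while another reads it the way libtorch does (via getenv), with no common lock:

    #include <cstdlib>
    #include <thread>

    int main() {
      // Writer: roughly what assigning to os.environ does in CPython.
      std::thread writer([] {
        for (int i = 0; i < 100000; ++i) {
          setenv("TORCH_DEMO_VAR", i % 2 ? "1" : "0", /*overwrite=*/1);
        }
      });
      // Reader: roughly what a getenv lookup inside libtorch does.
      std::thread reader([] {
        for (int i = 0; i < 100000; ++i) {
          // Data race: setenv may reallocate the environment or free an
          // entry's storage while getenv is walking it.
          const char* v = std::getenv("TORCH_DEMO_VAR");
          (void)v;
        }
      });
      writer.join();
      reader.join();
      return 0;
    }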

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

A specific reproducing example may be hard to create. Also note that Torch may be used in a pure multi-threaded C++ environment.

cyyever avatar Feb 09 '24 01:02 cyyever

@malfet @albanD set_env was added.

cyyever avatar Feb 09 '24 02:02 cyyever

Can you please write a better description of what you are trying to achieve here? I haven't read the standard, but being familiar with the implementation details, I don't see why getenv on its own is not thread-safe.

It is probably unsafe when there are concurrent putenv calls, but you are not adding that one to your env utils.

[Edit] To quote https://en.cppreference.com/w/cpp/utility/program/getenv, it looks like this is part of the standard:

This function is thread-safe (calling it from multiple threads does not introduce a data race) as long as no other function modifies the host environment. In particular, the POSIX functions setenv(), unsetenv(), and putenv() would introduce a data race if called without synchronization.

[Edit2] Is this about suppressing some clang-tidy warnings or solving a real problem? If the latter, then the solution is to provide a thread-safe extern "C" implementation of those methods in the code that would take over the libc ones (all libc functions are weak symbols by design, so that an app can choose to override them) and provide thread safety for getenv calls from, say, oneDNN, libcuda, etc.
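A rough sketch of the override approach being suggested, assuming a glibc/ELF platform where a strong definition in the application replaces libc's weak getenv; only the read side is shown, a real version would need the same treatment for setenv/putenv, and (as the reply below notes) none of this works on MSVC:

    #include <cstring>
    #include <shared_mutex>

    // The process environment, as specified by POSIX.
    extern "C" char** environ;

    namespace {
    std::shared_mutex g_env_mutex;
    }

    // Strong definition that takes over libc's weak getenv, so even
    // getenv calls from oneDNN, libcuda, etc. go through the lock.
    extern "C" char* getenv(const char* name) noexcept {
      std::shared_lock lock(g_env_mutex);
      const std::size_t len = std::strlen(name);
      for (char** ep = environ; ep != nullptr && *ep != nullptr; ++ep) {
        if (std::strncmp(*ep, name, len) == 0 && (*ep)[len] == '=') {
          return *ep + len + 1; // points just past "NAME="
        }
      }
      return nullptr;
    }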

The purpose is to suppress clang-tidy warnings and solve possible future problems by providing thread-safe getenv and setenv wrappers. Overriding the libc implementations may not be a good idea because we have to support MSVC and other non-GNU platforms, so I prefer to use an explicit c10 API.

cyyever avatar Feb 09 '24 09:02 cyyever
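For comparison, a call site using an explicit wrapper API (names as in the sketch above, all hypothetical) stays portable because the locking hides behind an ordinary function call instead of symbol interposition:

    #include <optional>
    #include <string>

    // Declaration of the wrapper sketched earlier (hypothetical header:
    // c10/util/env.h).
    namespace c10::utils {
    std::optional<std::string> get_env(const char* name);
    }

    // Hypothetical call site, replacing a raw std::getenv.
    bool verbose_logging_enabled() {
      std::optional<std::string> value = c10::utils::get_env("TORCH_VERBOSE");
      return value.has_value() && *value == "1";
    }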

@pytorchbot label "topic: not user facing"

cyyever avatar Feb 09 '24 15:02 cyyever

@pytorchbot label ciflow/trunk

cyyever avatar Feb 09 '24 15:02 cyyever

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions[bot] avatar Apr 09 '24 19:04 github-actions[bot]

@pytorchmergebot merge

cyyever avatar Apr 16 '24 04:04 cyyever

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 16 '24 04:04 pytorchmergebot

@pytorchbot revert -m "you're using TORCH_INTERNAL_ASSERT incorrectly" -c "ghfirst"

We're getting the following error from our compiler:

xplat/caffe2/c10/util/env.cpp:37:9: error: implicit conversion turns string literal into bool: 'const char[32]' to 'bool' [-Werror,-Wstring-conversion]
        "setenv failed for environment \"", name, "\", the error is: ", err);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
xplat/caffe2/c10/util/Exception.h:406:31: note: expanded from macro 'TORCH_INTERNAL_ASSERT'
  if (C10_UNLIKELY_OR_CONST(!(cond))) 

Fixing the assert to start with a bool condition should fix things.

PaliC avatar Apr 17 '24 23:04 PaliC
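A sketch of the fix being described: TORCH_INTERNAL_ASSERT takes the condition as its first argument, so the call needs to lead with a bool rather than a string literal (assuming err holds setenv's return value, which is 0 on success):

    // Before: the string literal was implicitly converted to the condition.
    // TORCH_INTERNAL_ASSERT(
    //     "setenv failed for environment \"", name, "\", the error is: ", err);

    // After: an actual boolean condition first, then the message parts.
    TORCH_INTERNAL_ASSERT(
        err == 0,
        "setenv failed for environment \"", name, "\", the error is: ", err);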

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Apr 17 '24 23:04 pytorchmergebot

@cyyever your PR has been successfully reverted.

pytorchmergebot avatar Apr 17 '24 23:04 pytorchmergebot

@PaliC Thank you for pointing it out, fixed.

cyyever avatar Apr 18 '24 00:04 cyyever

@pytorchmergebot merge -i

cyyever avatar Apr 18 '24 13:04 cyyever

Merge started

Your change will be merged while ignoring the following 19 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.11-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 1, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge), Lint / lintrunner-noclang / linux-job, trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-stable), linux-aarch64-binary-manywheel / manywheel-py3_10-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_11-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cpu-aarch64-build / build, linux-binary-conda / conda-py3_9-cuda11_8-build / build, linux-binary-conda / conda-py3_11-cuda11_8-build / build, linux-binary-conda / conda-py3_12-cuda11_8-build / build, linux-binary-conda / conda-py3_10-cuda11_8-build / build, linux-binary-conda / conda-py3_8-cuda11_8-build / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 18 '24 13:04 pytorchmergebot

@pytorchbot revert -m "Broke gcc9 builds" -c ignoredsignal

malfet avatar Apr 18 '24 18:04 malfet

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Apr 18 '24 18:04 pytorchmergebot

@cyyever your PR has been successfully reverted.

pytorchmergebot avatar Apr 18 '24 18:04 pytorchmergebot

@pytorchbot rebase

malfet avatar Apr 18 '24 18:04 malfet

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Successfully rebased th_safe_env onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout th_safe_env && git pull --rebase)

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Though I might be wrong, it was just a lot of unrelated failures; sorry about the revert.

malfet avatar Apr 18 '24 19:04 malfet

@pytorchbot merge

malfet avatar Apr 18 '24 19:04 malfet

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Merge failed

Reason: 1 job has failed: windows-binary-conda / conda-py3_12-cpu-build

Details for Dev Infra team: Raised by workflow job

pytorchmergebot avatar Apr 18 '24 20:04 pytorchmergebot

@pytorchmergebot merge -i

cyyever avatar Apr 19 '24 13:04 cyyever

@pytorchbot revert -m "Broken internal signals, @albanD please help get this sorted :)" -c nosignal

jeanschmidt avatar Apr 22 '24 14:04 jeanschmidt