
[Environment Variable][1/N] Use thread-safe env variable API in c10

Open cyyever opened this pull request 1 year ago • 41 comments

This PR is the beginning of attempts to wrap the thread-unsafe getenv and setenv functions inside an RW mutex.

cyyever avatar Feb 08 '24 10:02 cyyever
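A minimal sketch of the idea, assuming hypothetical wrapper names (c10::utils::get_env / c10::utils::set_env), a std::shared_mutex as the RW lock, and a _WIN32 branch for MSVC; the actual API in the PR may differ:

    #include <cstdlib>
    #include <mutex>
    #include <optional>
    #include <shared_mutex>
    #include <string>

    namespace c10::utils {

    // One process-wide RW mutex serializing all environment access that
    // goes through these wrappers (sketch only).
    inline std::shared_mutex& env_mutex() {
      static std::shared_mutex mutex;
      return mutex;
    }

    // Readers take a shared lock: concurrent get_env calls do not block
    // each other, only writers are exclusive.
    inline std::optional<std::string> get_env(const char* name) {
      std::shared_lock lock(env_mutex());
      const char* value = std::getenv(name);
      if (value == nullptr) {
        return std::nullopt;
      }
      // Copy while still holding the lock, so a concurrent set_env
      // cannot invalidate the storage being read.
      return std::string(value);
    }

    // Writers take an exclusive lock before mutating the environment.
    inline bool set_env(const char* name, const char* value) {
      std::unique_lock lock(env_mutex());
    #ifdef _WIN32
      return _putenv_s(name, value) == 0;
    #else
      return setenv(name, value, /*overwrite=*/1) == 0;
    #endif
    }

    } // namespace c10::utils

As the discussion below points out, this only protects callers that go through the wrappers; getenv/putenv calls made directly by CPython or third-party libraries still bypass the lock.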

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/119449

Note: Links to docs will display an error until the docs builds have been completed.

:x: 9 New Failures

As of commit 6531b5572fe14163a6a45226e0f0bd99d0933761 with merge base 6e10f7d8c188769a2727750557da8042130abe93:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Feb 08 '24 10:02 pytorch-bot[bot]

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

albanD avatar Feb 08 '24 16:02 albanD

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

Yes, but the current implementation would do nothing to avoid it, as a putenv call from CPython would not be serialized against getenv calls made from libtorch.

malfet avatar Feb 08 '24 16:02 malfet
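A minimal sketch of the race being described, assuming a POSIX platform: one thread mutates the environment the way CPython's os.environ does (via setenv/putenv) while another reads it the way libtorch does (via getenv), with no common lock:

    #include <cstdlib>
    #include <thread>

    int main() {
      // Writer: roughly what assigning to os.environ does in CPython.
      std::thread writer([] {
        for (int i = 0; i < 100000; ++i) {
          setenv("TORCH_DEMO_VAR", i % 2 ? "1" : "0", /*overwrite=*/1);
        }
      });
      // Reader: roughly what a getenv lookup inside libtorch does.
      std::thread reader([] {
        for (int i = 0; i < 100000; ++i) {
          // Data race: setenv may reallocate the environment or free an
          // entry's storage while getenv is walking it.
          const char* v = std::getenv("TORCH_DEMO_VAR");
          (void)v;
        }
      });
      writer.join();
      reader.join();
      return 0;
    }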

I would expect that a user setting something on os.environ in Python would trigger such a race? That was my assumption here.

A specific reproducing example may be hard to create. Also note that Torch may be used in a pure multi-threaded C++ environment.

cyyever avatar Feb 09 '24 01:02 cyyever

@malfet @albanD set_env was added.

cyyever avatar Feb 09 '24 02:02 cyyever

Can you please write a better description of what you are trying to achieve here? I haven't read the standard, but being familiar with the implementation details, I don't see why getenv on its own is not thread-safe.

It is probably unsafe when there are concurrent putenv calls, but you are not adding that one to your env utils.

[Edit] To quote https://en.cppreference.com/w/cpp/utility/program/getenv, it looks like this is part of the standard:

This function is thread-safe (calling it from multiple threads does not introduce a data race) as long as no other function modifies the host environment. In particular, the POSIX functions setenv(), unsetenv(), and putenv() would introduce a data race if called without synchronization.

[Edit2] Is this about suppressing some clang-tidy warnings or solving a real problem? If the latter, then the solution is to provide a thread-safe extern "C" implementation of those methods in the code that would take over the libc ones (all libc functions are weak symbols by design, so that an app can choose to override them) and provide thread safety for getenv calls from, say, oneDNN, libcuda, etc.
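A rough sketch of the override approach being suggested, assuming a glibc/ELF platform where a strong definition in the application replaces libc's weak getenv; only the read side is shown, a real version would need the same treatment for setenv/putenv, and (as the reply below notes) none of this works on MSVC:

    #include <cstring>
    #include <shared_mutex>

    // The process environment, as specified by POSIX.
    extern "C" char** environ;

    namespace {
    std::shared_mutex g_env_mutex;
    }

    // Strong definition that takes over libc's weak getenv, so even
    // getenv calls from oneDNN, libcuda, etc. go through the lock.
    extern "C" char* getenv(const char* name) noexcept {
      std::shared_lock lock(g_env_mutex);
      const std::size_t len = std::strlen(name);
      for (char** ep = environ; ep != nullptr && *ep != nullptr; ++ep) {
        if (std::strncmp(*ep, name, len) == 0 && (*ep)[len] == '=') {
          return *ep + len + 1; // points just past "NAME="
        }
      }
      return nullptr;
    }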

The purpose is to suppress clang-tidy warnings and solve possible future problems by providing thread-safe getenv and setenv wrappers. Overriding the libc implementations may not be a good idea because we have to support MSVC and other non-GNU platforms, so I prefer to use an explicit c10 API.

cyyever avatar Feb 09 '24 09:02 cyyever
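For comparison, a call site using an explicit wrapper API (names as in the sketch above, all hypothetical) stays portable because the locking hides behind an ordinary function call instead of symbol interposition:

    #include <optional>
    #include <string>

    // Declaration of the wrapper sketched earlier (hypothetical header:
    // c10/util/env.h).
    namespace c10::utils {
    std::optional<std::string> get_env(const char* name);
    }

    // Hypothetical call site, replacing a raw std::getenv.
    bool verbose_logging_enabled() {
      std::optional<std::string> value = c10::utils::get_env("TORCH_VERBOSE");
      return value.has_value() && *value == "1";
    }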

@pytorchbot label "topic: not user facing"

cyyever avatar Feb 09 '24 15:02 cyyever

@pytorchbot label ciflow/trunk

cyyever avatar Feb 09 '24 15:02 cyyever

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions[bot] avatar Apr 09 '24 19:04 github-actions[bot]

@pytorchmergebot merge

cyyever avatar Apr 16 '24 04:04 cyyever

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 16 '24 04:04 pytorchmergebot

@pytorchbot revert -m "you're using TORCH_INTERNAL_ASSERT incorrectly" -c "ghfirst"

We're getting the following error from our compiler:

xplat/caffe2/c10/util/env.cpp:37:9: error: implicit conversion turns string literal into bool: 'const char[32]' to 'bool' [-Werror,-Wstring-conversion]
        "setenv failed for environment \"", name, "\", the error is: ", err);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
xplat/caffe2/c10/util/Exception.h:406:31: note: expanded from macro 'TORCH_INTERNAL_ASSERT'
  if (C10_UNLIKELY_OR_CONST(!(cond))) 

Fixing the assert to start with a bool condition should fix things.

PaliC avatar Apr 17 '24 23:04 PaliC
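A sketch of the fix being described: TORCH_INTERNAL_ASSERT takes the condition as its first argument, so the call needs to lead with a bool rather than a string literal (assuming err holds setenv's return value, which is 0 on success):

    // Before: the string literal was implicitly converted to the condition.
    // TORCH_INTERNAL_ASSERT(
    //     "setenv failed for environment \"", name, "\", the error is: ", err);

    // After: an actual boolean condition first, then the message parts.
    TORCH_INTERNAL_ASSERT(
        err == 0,
        "setenv failed for environment \"", name, "\", the error is: ", err);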

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Apr 17 '24 23:04 pytorchmergebot

@cyyever your PR has been successfully reverted.

pytorchmergebot avatar Apr 17 '24 23:04 pytorchmergebot

@PaliC Thank you for pointing it out, fixed.

cyyever avatar Apr 18 '24 00:04 cyyever

@pytorchmergebot merge -i

cyyever avatar Apr 18 '24 13:04 cyyever

Merge started

Your change will be merged while ignoring the following 19 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.11-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 1, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge), Lint / lintrunner-noclang / linux-job, trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-stable), linux-aarch64-binary-manywheel / manywheel-py3_10-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_11-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cpu-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cpu-aarch64-build / build, linux-binary-conda / conda-py3_9-cuda11_8-build / build, linux-binary-conda / conda-py3_11-cuda11_8-build / build, linux-binary-conda / conda-py3_12-cuda11_8-build / build, linux-binary-conda / conda-py3_10-cuda11_8-build / build, linux-binary-conda / conda-py3_8-cuda11_8-build / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 18 '24 13:04 pytorchmergebot

@pytorchbot revert -m "Broke gcc9 builds" -c ignoredsignal

malfet avatar Apr 18 '24 18:04 malfet

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Apr 18 '24 18:04 pytorchmergebot

@cyyever your PR has been successfully reverted.

pytorchmergebot avatar Apr 18 '24 18:04 pytorchmergebot

@pytorchbot rebase

malfet avatar Apr 18 '24 18:04 malfet

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Successfully rebased th_safe_env onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout th_safe_env && git pull --rebase)

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Though I might be wrong, it was just a lot of unrelated failures; sorry about the revert.

malfet avatar Apr 18 '24 19:04 malfet

@pytorchbot merge

malfet avatar Apr 18 '24 19:04 malfet

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Apr 18 '24 19:04 pytorchmergebot

Merge failed

Reason: 1 job has failed: windows-binary-conda / conda-py3_12-cpu-build

Details for Dev Infra team: Raised by workflow job

pytorchmergebot avatar Apr 18 '24 20:04 pytorchmergebot

@pytorchmergebot merge -i

cyyever avatar Apr 19 '24 13:04 cyyever

@pytorchbot revert -m "Broken internal signals, @albanD please help get this sorted :)" -c nosignal

jeanschmidt avatar Apr 22 '24 14:04 jeanschmidt