Mempool use_on_oom order
Reorder the OOM mitigation steps so that optional mempools (those created with use_on_oom=True) are reused before the more expensive step of releasing cached blocks.
Additionally, make sure mempools are removed from use_on_oom_pools upon deletion. Output of the new test before the fix:
======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
    method(*args, **kwargs)
  File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
    c = torch.randn(20 * nelem_1mb, device="cuda")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)
Output of the new test after the fix:
----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
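The two changes described above (reuse opt-in pools before releasing cached blocks, and deregister a pool when it is deleted) can be illustrated with a small pure-Python toy model. This is only a sketch under stated assumptions: `ToyPool`, `ToyAllocator`, `register`, `delete_pool`, and the byte counters are hypothetical names invented for illustration, and the real logic lives in c10/cuda/CUDACachingAllocator.cpp and has more steps than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Hypothetical stand-in for a mempool; it only tracks free bytes."""
    name: str
    free_bytes: int


@dataclass
class ToyAllocator:
    """Toy model of the fallback order; not the real CUDACachingAllocator."""
    device_free: int                 # bytes the "driver" can still hand out
    cached_bytes: int = 0            # bytes held in the allocator's own cache
    use_on_oom_pools: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def register(self, pool):
        # Assumption: pools opt in to being used as an OOM fallback.
        self.use_on_oom_pools.append(pool)

    def delete_pool(self, pool):
        # A deleted pool must also leave the fallback list, otherwise a
        # later OOM would touch a pool that no longer exists.
        self.use_on_oom_pools = [p for p in self.use_on_oom_pools if p is not pool]

    def allocate(self, size):
        # Step 1: ordinary allocation from the device.
        if self.device_free >= size:
            self.device_free -= size
            self.log.append("device")
            return True
        # Step 2: cheap fallback, reuse an opt-in pool.
        for pool in self.use_on_oom_pools:
            if pool.free_bytes >= size:
                pool.free_bytes -= size
                self.log.append(f"pool:{pool.name}")
                return True
        # Step 3: expensive fallback, release cached blocks and retry.
        if self.cached_bytes:
            self.device_free += self.cached_bytes
            self.cached_bytes = 0
            self.log.append("release_cached")
            if self.device_free >= size:
                self.device_free -= size
                self.log.append("device")
                return True
        return False


alloc = ToyAllocator(device_free=10, cached_bytes=50)
pool = ToyPool("spare", free_bytes=30)
alloc.register(pool)

first = alloc.allocate(20)    # served from the opt-in pool; cache untouched
alloc.delete_pool(pool)       # pool is deregistered on deletion
second = alloc.allocate(20)   # no pool left, so cached blocks are released
```

In this model the first request never touches the cache because the pool satisfies it, and the second request can only succeed by releasing cached blocks because the deleted pool is no longer consulted.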
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/169699
No failures as of commit 425d93cc78ba06f08f601cdbf49102652b1c8701 with merge base a4b91a3164bed39d8e7934c21fd10e97ac831603.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@dsjohns2 has imported this pull request. If you are a Meta employee, you can view this in D88523791.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team.
Advanced Debugging: check the merge workflow status here.
@pytorchbot revert -m "Failing internal test due to shadow variables. Will reland with fix."
❌ 🤖 pytorchbot command failed:
@pytorchbot revert: error: the following arguments are required: -c/--classification
usage: @pytorchbot revert -m MESSAGE -c {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}
Try @pytorchbot --help for more info.
@pytorchbot revert -m="Failing internal test due to shadow variables. Will reland with fix." -c=nosignal
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@dsjohns2 your PR has been successfully reverted.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)
Merge failed
Reason: 1 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased dsjohns2/mempool_use_on_oom_order onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dsjohns2/mempool_use_on_oom_order && git pull --rebase)
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).