Mempool use_on_oom order

Open dsjohns2 opened this issue 2 weeks ago • 27 comments

Reorder the OOM mitigation steps so that optional mempools (those created with use_on_oom=True) are reused before the expensive step of releasing cached blocks.
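To make the intended order concrete, here is a minimal toy model in plain Python. It is not the CUDACachingAllocator code; the class `ToyAllocator`, its fields, and the three-step structure are assumptions made for illustration. It only shows the idea stated above: when an ordinary allocation fails, try an opted-in pool first, and release cached blocks only if that also fails.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAllocator:
    """Hypothetical toy model; not the real caching allocator."""
    device_free: int                                # bytes the "device" can still hand out
    cached: int = 0                                 # unused bytes held in the default cache
    oom_pools: dict = field(default_factory=dict)   # pool name -> free bytes (opted in)
    log: list = field(default_factory=list)         # which step served each request

    def malloc(self, size: int) -> bool:
        # Step 1: ordinary allocation from the device.
        if self.device_free >= size:
            self.device_free -= size
            self.log.append("device")
            return True
        # Step 2 (cheap): reuse free space in a pool that opted in to OOM use.
        for name, free in self.oom_pools.items():
            if free >= size:
                self.oom_pools[name] = free - size
                self.log.append(f"pool:{name}")
                return True
        # Step 3 (expensive): give cached blocks back to the device, then retry.
        if self.cached:
            self.device_free += self.cached
            self.cached = 0
            self.log.append("release_cached")
            if self.device_free >= size:
                self.device_free -= size
                self.log.append("device")
                return True
        return False


alloc = ToyAllocator(device_free=10, cached=50, oom_pools={"spare": 40})
alloc.malloc(30)
print(alloc.log)     # ['pool:spare']
print(alloc.cached)  # 50, the cache was left untouched
```

In this sketch the request is served from the opted-in pool, so the cache is never flushed. With steps 2 and 3 swapped, the same request would first release all 50 cached bytes.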

Additionally, make sure mempools are removed from use_on_oom_pools upon deletion. New test before fix:

======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
    method(*args, **kwargs)
  File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
    c = torch.randn(20 * nelem_1mb, device="cuda")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch. 
To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)

New test after fix:

----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
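The failing assert in the log above (`use_count > 0`) is consistent with a deleted pool still being listed in use_on_oom_pools. The following plain-Python sketch illustrates that reading; `ToyPoolRegistry` and its methods are hypothetical names, and the `unregister` flag exists only to contrast the behavior before and after the fix. It does not reproduce the actual allocator code.

```python
class ToyPoolRegistry:
    """Hypothetical registry of pools; illustration only."""

    def __init__(self):
        self.use_counts = {}           # pool id -> use count
        self.use_on_oom_pools = set()  # ids of pools opted in to OOM use

    def create(self, pool_id, use_on_oom=False):
        self.use_counts[pool_id] = 1
        if use_on_oom:
            self.use_on_oom_pools.add(pool_id)

    def delete(self, pool_id, unregister=True):
        self.use_counts[pool_id] -= 1
        if unregister:
            # The fix in this sketch: forget the pool on deletion.
            self.use_on_oom_pools.discard(pool_id)

    def pools_for_oom(self):
        # Mirrors the spirit of the internal assert: every listed pool must be live.
        for pool_id in self.use_on_oom_pools:
            assert self.use_counts[pool_id] > 0, "stale pool in use_on_oom_pools"
        return sorted(self.use_on_oom_pools)


fixed = ToyPoolRegistry()
fixed.create(1, use_on_oom=True)
fixed.delete(1)
print(fixed.pools_for_oom())  # []

buggy = ToyPoolRegistry()
buggy.create(2, use_on_oom=True)
buggy.delete(2, unregister=False)  # stale entry stays behind
try:
    buggy.pools_for_oom()
except AssertionError as exc:
    print("assert failed:", exc)
```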

dsjohns2 avatar Dec 05 '25 19:12 dsjohns2

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169699

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 425d93cc78ba06f08f601cdbf49102652b1c8701 with merge base a4b91a3164bed39d8e7934c21fd10e97ac831603: 💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Dec 05 '25 19:12 pytorch-bot[bot]

@pytorchbot label "topic: not user facing"

dsjohns2 avatar Dec 05 '25 19:12 dsjohns2

@dsjohns2 has imported this pull request. If you are a Meta employee, you can view this in D88523791.

meta-codesync[bot] avatar Dec 05 '25 22:12 meta-codesync[bot]

@pytorchbot merge

ngimel avatar Dec 06 '25 00:12 ngimel

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar Dec 06 '25 00:12 pytorchmergebot

@pytorchbot revert -m "Failing internal test due to shadow variables. Will reland with fix."

dsjohns2 avatar Dec 08 '25 18:12 dsjohns2

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Try @pytorchbot --help for more info.

pytorch-bot[bot] avatar Dec 08 '25 18:12 pytorch-bot[bot]

@pytorchbot revert -m="Failing internal test due to shadow variables. Will reland with fix." -c=nosignal

dsjohns2 avatar Dec 08 '25 18:12 dsjohns2

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Dec 08 '25 18:12 pytorchmergebot

@dsjohns2 your PR has been successfully reverted.

pytorchmergebot avatar Dec 08 '25 18:12 pytorchmergebot

@pytorchbot merge

dsjohns2 avatar Dec 08 '25 18:12 dsjohns2

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Dec 08 '25 18:12 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar Dec 08 '25 19:12 pytorchmergebot

@pytorchbot merge

dsjohns2 avatar Dec 08 '25 19:12 dsjohns2

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Dec 08 '25 19:12 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)

pytorchmergebot avatar Dec 08 '25 19:12 pytorchmergebot

@pytorchbot merge -i

ngimel avatar Dec 08 '25 19:12 ngimel

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)

pytorchmergebot avatar Dec 08 '25 20:12 pytorchmergebot

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Failing merge rule: Core Maintainers

pytorchmergebot avatar Dec 08 '25 20:12 pytorchmergebot

@pytorchbot merge -r

dsjohns2 avatar Dec 09 '25 18:12 dsjohns2

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Dec 09 '25 18:12 pytorchmergebot

Successfully rebased dsjohns2/mempool_use_on_oom_order onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dsjohns2/mempool_use_on_oom_order && git pull --rebase)

pytorchmergebot avatar Dec 09 '25 18:12 pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Dec 09 '25 18:12 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

pytorchmergebot avatar Dec 09 '25 18:12 pytorchmergebot

@pytorchbot merge

dsjohns2 avatar Dec 10 '25 02:12 dsjohns2

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Dec 10 '25 02:12 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

pytorchmergebot avatar Dec 10 '25 02:12 pytorchmergebot

@pytorchbot merge

dsjohns2 avatar Dec 10 '25 22:12 dsjohns2

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Dec 10 '25 22:12 pytorchmergebot