Handle GPU resource management in ctest
Currently the integration tests running on real GPUs are frequently failing due to running out of GPU memory (https://source.cloud.google.com/results/invocations/5a6a9122-9c10-4a92-a4c5-d8fad601b58a/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log):
RuntimeError: Error invoking function: /tmpfs/src/github/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import
This is just a case of trying to run too many things that use GPU memory at once (`CTEST_PARALLEL_LEVEL=$(nproc)`). We should come up with a way to avoid this. https://cmake.org/cmake/help/latest/manual/ctest.1.html#resource-allocation details some really nice robust handling. We could also run large tests at a lower level of parallelism than small tests (and run them in two batches). Related to https://github.com/google/iree/issues/5121, which would plumb through timeouts.
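To make the resource-allocation idea concrete, here is a minimal sketch of generating the resource spec file that ctest reads via `--resource-spec-file`. The resource name `gpus`, the GPU id, and the slot count are assumptions for illustration. Treating a slot as a rough unit of GPU memory is also an assumption: ctest only uses slots for scheduling and does not enforce memory limits. Each GPU test would additionally need a `RESOURCE_GROUPS` test property (for example `gpus:1`, or a larger number for big models) so that ctest knows how many slots it consumes.

```python
import json


def make_resource_spec(gpu_ids, slots_per_gpu):
    """Build a ctest resource spec that models each GPU as a pool of slots.

    gpu_ids and slots_per_gpu are hypothetical inputs; pick values that
    match the machine the tests run on.
    """
    return {
        "version": {"major": 1, "minor": 0},
        "local": [
            {
                "gpus": [
                    {"id": str(gpu_id), "slots": slots_per_gpu}
                    for gpu_id in gpu_ids
                ]
            }
        ],
    }


def write_resource_spec(path, gpu_ids, slots_per_gpu):
    """Write the spec as JSON so it can be passed to ctest."""
    with open(path, "w") as f:
        json.dump(make_resource_spec(gpu_ids, slots_per_gpu), f, indent=2)
```

With a spec like this, ctest can keep `CTEST_PARALLEL_LEVEL` high for CPU-only tests while holding back GPU tests whose combined slot requests would exceed what the spec declares.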
For now I'm just going to drop the parallelism level on this build.
Interesting. The log showed that we are failing on ResNet, which is a large model. I wonder whether it's also because we have too many buffer allocations there, some of which we can probably optimize away.
The specific model failing varies. Entirely possible that we're also using more GPU memory than we should be, but this is probably not the best way to surface that ;-P
Somewhat related to this: https://github.com/google/iree/issues/5152
Phoenix also pointed out that there's no reason to be running tests without `driver=vulkan` in this build, which should greatly reduce the number of tests we need to run
165/1414 Test #903: integrations/tensorflow/e2e/keras/applications/large_cifar10_tests__applications__iree_vulkan__model__ResNet50 ............***Failed 57.49 sec
https://source.cloud.google.com/results/invocations/bb36cd88-2151-4e8a-b25d-22e5ad591eaa/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log
:cry:
But of course ctest doesn't support filtering by the intersection of labels :roll_eyes: Luckily the first label we want to filter on is also in the test's name.
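The workaround described here can be sketched as a small helper that builds a ctest command line combining a name regex (`-R`) with a label regex (`-L`). A test has to match both to run, which approximates an intersection when one of the labels also appears in the test name. The regex values in the usage note are placeholders, not the actual labels used in this build.

```python
def ctest_filter_args(name_regex, label_regex, parallel_level):
    """Build a ctest invocation that filters by name AND by label.

    -R selects tests whose name matches name_regex, -L selects tests with a
    label matching label_regex; ctest runs only tests that satisfy both.
    """
    return [
        "ctest",
        "--output-on-failure",
        "-R", name_regex,
        "-L", label_regex,
        "--parallel", str(parallel_level),
    ]
```

For example, `ctest_filter_args("vulkan", "some-gpu-label", 2)` (both regexes hypothetical) produces an argument list that can be passed to `subprocess.run` from the build directory.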
I feel the underlying issue is the same as #5268, which is fixed now. Can we revert #5163 and try to see if it works? Would be good to close this if everything is okay.
Well I ended up going even further with #5166, disabling parallelism entirely. I think we should try rolling that back. Not sure if we still would want to close this, since proper resource management would maybe be nice, but :shrug:
I'm hitting this again in https://github.com/iree-org/iree/pull/9905 where I get timeouts on buffer mapping tests. Dropping the parallelism to 2 makes testing take 4 minutes instead of 3 and gets rid of the timeouts.
Unassigning myself from issues that I'm not actively working on