Handle GPU resource management in ctest

Open GMNGeoffrey opened this issue 3 years ago • 9 comments

Currently the integration tests running on real GPUs are frequently failing due to running out of GPU memory (https://source.cloud.google.com/results/invocations/5a6a9122-9c10-4a92-a4c5-d8fad601b58a/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log):

RuntimeError: Error invoking function: /tmpfs/src/github/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import

This is just a case of trying to run too many things that use GPU memory at once (CTEST_PARALLEL_LEVEL=$(nproc)). We should come up with a way to avoid this. https://cmake.org/cmake/help/latest/manual/ctest.1.html#resource-allocation details some really nice robust handling. We could also run large tests at a lower level of parallelism than small tests (and run them in two batches). Related to https://github.com/google/iree/issues/5121, which would plumb through timeouts.

For now I'm just going to drop the parallelism level on this build.
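The ctest resource-allocation feature linked above is driven by a JSON resource spec file. A minimal sketch of generating one follows, assuming a single GPU whose memory is modeled as 8 abstract "slots". The resource name `gpus`, the slot count, and the helper name `make_resource_spec` are illustrative assumptions, not something this build currently defines.

```python
import json


def make_resource_spec(gpu_slots):
    """Build a ctest resource spec (format version 1.0) from {gpu_id: slots}.

    The slot counts are an arbitrary budget chosen by whoever writes the
    spec; ctest only uses them for scheduling and does not measure or
    enforce real GPU memory use.
    """
    gpus = [
        {"id": str(gpu_id), "slots": int(slots)}
        for gpu_id, slots in sorted(gpu_slots.items())
    ]
    return {
        "version": {"major": 1, "minor": 0},
        "local": [{"gpus": gpus}],
    }


if __name__ == "__main__":
    # Hypothetical budget: one GPU with id "0" and 8 slots.
    print(json.dumps(make_resource_spec({"0": 8}), indent=2))
```

With a file like this, a test could declare its share through the `RESOURCE_GROUPS` test property (for example `gpus:4` for a large model and `gpus:1` for a small one), and ctest would be run with `--resource-spec-file` so that it only schedules tests together when their declared slots fit. The per-test slot values would have to be picked by hand.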

GMNGeoffrey avatar Mar 18 '21 19:03 GMNGeoffrey

Interesting. The log shows that we are failing on ResNet, which is a large model. I'm wondering whether part of the cause is that we make too many buffer allocations there; we can probably optimize some of them away.

antiagainst avatar Mar 18 '21 19:03 antiagainst

The specific model failing varies. Entirely possible that we're also using more GPU memory than we should be, but this is probably not the best way to surface that ;-P

GMNGeoffrey avatar Mar 18 '21 19:03 GMNGeoffrey

Somewhat related to this: https://github.com/google/iree/issues/5152

antiagainst avatar Mar 18 '21 19:03 antiagainst

Phoenix also pointed out that there's no reason to be running tests without driver=vulkan in this build, which should greatly reduce the number of tests we need to run.

GMNGeoffrey avatar Mar 18 '21 22:03 GMNGeoffrey

165/1414 Test #903: integrations/tensorflow/e2e/keras/applications/large_cifar10_tests__applications__iree_vulkan__model__ResNet50 ............***Failed 57.49 sec

https://source.cloud.google.com/results/invocations/bb36cd88-2151-4e8a-b25d-22e5ad591eaa/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log

:cry:

GMNGeoffrey avatar Mar 18 '21 23:03 GMNGeoffrey

But of course ctest doesn't support filtering by the intersection of labels :roll_eyes: Luckily the first label we want to filter on is also in the test's name.

GMNGeoffrey avatar Mar 18 '21 23:03 GMNGeoffrey

I feel the underlying issue is the same as #5268, which is fixed now. Can we revert #5163 and try to see if it works? Would be good to close this if everything is okay.

antiagainst avatar Apr 14 '21 21:04 antiagainst

Well I ended up going even further with #5166, disabling parallelism entirely. I think we should try rolling that back. Not sure if we still would want to close this, since proper resource management would maybe be nice, but :shrug:

GMNGeoffrey avatar Apr 14 '21 21:04 GMNGeoffrey

I'm hitting this again in https://github.com/iree-org/iree/pull/9905 where I get timeouts on buffer mapping tests. Dropping the parallelism to 2 makes testing take 4 minutes instead of 3 and gets rid of the timeouts.

GMNGeoffrey avatar Jul 27 '22 17:07 GMNGeoffrey

Unassigning myself from issues that I'm not actively working on

GMNGeoffrey avatar Apr 19 '23 22:04 GMNGeoffrey