
Strange behaviour when using conditional dependency

Open paulmelis opened this issue 7 months ago • 5 comments

I have a test with an optional dependency (see below). The Blender_CompileShaders test forces a one-time compilation of the NVIDIA shaders before rendering, which is needed whenever the NVIDIA driver has changed. That compilation can take quite some time, and I don't want it to pollute the actual render results of Blender_RIOW. I do want to keep track of the precompile time, though, hence the separate test that gets logged.

@rfm.simple_test
class Blender_CompileShaders(rfm.RunOnlyRegressionTest):

    descr = 'Force Blender CUDA shader compilation'

    valid_systems = ['snellius:gpu_a100', 'snellius:gpu_h100']
    ...

class BlenderTestBase(rfm.RunOnlyRegressionTest):

    descr = 'Blender %s render benchmark' % BLENDER_VERSION

    valid_systems = [
        'snellius:rome', 'snellius:genoa', 'snellius:fat',
        'snellius:gpu_a100', 'snellius:gpu_h100',
        'snellius:himem_4tb', 'snellius:himem_8tb'
    ]
 
    ...

def dep_gpu_only(src, dst):
    print(src, dst, dst[0].startswith('gpu_'))
    return dst[0].startswith('gpu_')

@rfm.simple_test
class Blender_RIOW(BlenderTestBase):

    descr = 'Blender render benchmark'

    @run_after('init')
    def inject_dependencies(self):
        self.depends_on('Blender_CompileShaders', how=dep_gpu_only)

    ....

The funky thing here is that the Blender_RIOW test is run on all of our nodes, including non-GPU ones, while the Blender_CompileShaders dependency only makes sense on GPU nodes. Hence the valid_systems = ['snellius:gpu_a100', 'snellius:gpu_h100'] in that class.

However, this seems to trip up ReFrame somewhat. When I run the test on a GPU node all is well, and I can see the dep_gpu_only() call being made and returning True:

snellius paulm@int4 08:59 ~/reframe-surf$ reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:gpu_a100 -r -n 'Blender_CompileShaders' -n 'Blender_RIOW'
[ReFrame Setup]
  version:           4.6.1
  command:           '/sw/arch/RHEL8/EB_production/2023/software/ReFrame/4.6.1/bin/reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:gpu_a100 -r -n Blender_CompileShaders -n Blender_RIOW'
  launched by:       paulm@int4
  working directory: '/gpfs/home4/paulm/reframe-surf'
  settings files:    '<builtin>', 'settings_files/settings.py'
  check search path: (R) '/gpfs/home4/paulm/reframe-surf/production_tests'
  stage directory:   '/scratch-shared/paulm/reframe_output/staging/2024-07-16_08-59-27'
  output directory:  '/home/paulm/.reframe/production/output/2024-07-16_08-59-27'
  log files:         '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'
('gpu_a100', 'eb-foss') ('gpu_a100', 'eb-foss') True
('gpu_a100', 'eb-foss') ('gpu_a100', 'eb-foss') True
[==========] Running 2 check(s)
[==========] Started on Tue Jul 16 08:59:42 2024+0200

[----------] start processing checks
[ RUN      ] Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100+eb-foss
[       OK ] (1/2) Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100+eb-foss
P: kernel_loading: 0.45999999999999996 s (r:0, l:None, u:None)
[ RUN      ] Blender_RIOW /214f6d42 @snellius:gpu_a100+eb-foss
[       OK ] (2/2) Blender_RIOW /214f6d42 @snellius:gpu_a100+eb-foss
P: render: 6.28 s (r:0, l:None, u:None)
P: max_error: 0.00784314 unitless (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Tue Jul 16 09:01:00 2024+0200

===============================================================================================================================================================================
PERFORMANCE REPORT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100:eb-foss]
  num_tasks_per_node: 1
  num_gpus_per_node: 4
  num_cpus_per_task: 72
  num_tasks: 1
  performance:
    - kernel_loading: 0.45999999999999996 s (r: 0 s l: -inf% u: +inf%)
[Blender_RIOW /214f6d42 @snellius:gpu_a100:eb-foss]
  num_tasks_per_node: 1
  num_gpus_per_node: 4
  num_cpus_per_task: 72
  num_tasks: 1
  performance:
    - render: 6.28 s (r: 0 s l: -inf% u: +inf%)
    - max_error: 0.00784314 unitless (r: 0 unitless l: -inf% u: +inf%)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'
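
The two tuple lines near the top of that output come from the print() inside dep_gpu_only(). As a minimal standalone sketch, here is the same predicate without the debug print, applied to the (partition, environment) tuple seen above. The genoa tuple is a hypothetical input added only for illustration; it does not appear in any dep_gpu_only() output in this report.

```python
def dep_gpu_only(src, dst):
    # src and dst are (partition, environment) tuples for the
    # dependent test case and the target test case respectively.
    return dst[0].startswith('gpu_')


gpu_case = ('gpu_a100', 'eb-foss')  # as printed in the GPU run above
cpu_case = ('genoa', 'eb-foss')     # hypothetical input, for illustration

# GPU source, GPU target: a dependency is requested.
print(dep_gpu_only(gpu_case, gpu_case))

# CPU source, GPU target: also requested, because only dst is inspected.
print(dep_gpu_only(cpu_case, gpu_case))

# CPU source, CPU target: no dependency.
print(dep_gpu_only(cpu_case, cpu_case))
```

Note that the predicate looks only at the target tuple, so its result says nothing about the partition of the test case that declares the dependency.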

But when I run it on a non-GPU node I get warnings related to dependency resolution, dep_gpu_only() never gets called, and two tests are (incorrectly) skipped:

snellius paulm@int4 09:01 ~/reframe-surf$ reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:genoa -r -n 'Blender_CompileShaders' -n 'Blender_RIOW'
[ReFrame Setup]
  version:           4.6.1
  command:           '/sw/arch/RHEL8/EB_production/2023/software/ReFrame/4.6.1/bin/reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:genoa -r -n Blender_CompileShaders -n Blender_RIOW'
  launched by:       paulm@int4
  working directory: '/gpfs/home4/paulm/reframe-surf'
  settings files:    '<builtin>', 'settings_files/settings.py'
  check search path: (R) '/gpfs/home4/paulm/reframe-surf/production_tests'
  stage directory:   '/scratch-shared/paulm/reframe_output/staging/2024-07-16_09-02-09'
  output directory:  '/home/paulm/.reframe/production/output/2024-07-16_09-02-09'
  log files:         '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'

WARNING: could not resolve dependency: ('Blender_RIOW', 'snellius:genoa', 'eb-foss') -> 'Blender_CompileShaders'
WARNING: could not resolve dependency: ('Blender_HoleInTheRoof', 'snellius:genoa', 'eb-foss') -> 'Blender_CompileShaders'
WARNING: skipping all dependent test cases
  - ('Blender_RIOW', 'snellius:genoa', 'eb-foss')
  - ('Blender_HoleInTheRoof', 'snellius:genoa', 'eb-foss')

[==========] Running 0 check(s)
[==========] Started on Tue Jul 16 09:02:27 2024+0200

[----------] start processing checks
[----------] all spawned checks have finished

[  PASSED  ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Tue Jul 16 09:02:27 2024+0200

Log file(s) saved in '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'

Now I can understand that Blender_CompileShaders gets filtered out due to its valid_systems not including the system I'm running the test on. But why would this cause the self.depends_on() in Blender_RIOW to not call dep_gpu_only() at all? Shouldn't it evaluate that function first, and only when the dependency is needed check if it can be found?
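
One ordering that would be consistent with both runs is "look up the target's test cases first, and only then apply the how function to each pair". The toy model below sketches that ordering. It is not ReFrame's code, and resolve(), cases_by_test and the calls list are made-up names for illustration only.

```python
calls = []  # records every invocation of the predicate


def dep_gpu_only(src, dst):
    calls.append((src, dst))
    return dst[0].startswith('gpu_')


def resolve(src_case, target_name, cases_by_test, how):
    """Toy resolver: find the target's cases first, then filter with how."""
    targets = cases_by_test.get(target_name, [])
    if not targets:
        # Nothing to pair the source case with, so how() is never called.
        # This stands in for the 'could not resolve dependency' warning.
        return None
    return [t for t in targets if how(src_case, t)]


# GPU run: Blender_CompileShaders has a case on the selected partition.
gpu_run = {'Blender_CompileShaders': [('gpu_a100', 'eb-foss')]}
# genoa run: the test was filtered out by valid_systems, so no cases exist.
genoa_run = {}

gpu_deps = resolve(('gpu_a100', 'eb-foss'), 'Blender_CompileShaders',
                   gpu_run, dep_gpu_only)
calls_after_gpu = len(calls)

genoa_deps = resolve(('genoa', 'eb-foss'), 'Blender_CompileShaders',
                     genoa_run, dep_gpu_only)
calls_after_genoa = len(calls)

print(gpu_deps, calls_after_gpu)
print(genoa_deps, calls_after_genoa)
```

In this model the predicate is called once in the GPU run and not at all in the genoa run, which matches the missing debug output. Whether ReFrame actually resolves dependencies in this order is exactly the open question here.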

It is also interesting that the output lists a second test, Blender_HoleInTheRoof. That test is indeed defined, but I did not ask for it with -n on the command line.

This is with ReFrame 4.6.1.

Edit: some wording

paulmelis, Jul 16 '24 07:07