
scaling issues due to prolog tagging api

Open rvencu opened this issue 3 years ago • 3 comments

We ran into a scaling issue with the tagging in the prolog script.

I understand the prolog runs at every step, and when many nodes are involved the job fails with timeouts.

We need to find another place to do the tagging. I understand the comment is job-related, but some other tags could be applied just once when the instances are created, either because of the min value in the configuration or when they are created by Slurm.

I am looking at places where this could be done.

Maybe it can be done on the head node instead, in PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
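To make the idea concrete: PrologSlurmctld runs once per job on the head node, so the per-node fan-out would collapse into a single tagging pass per job. A minimal sketch of that pass, under assumptions: the hostlist expansion below mimics what `scontrol show hostnames "$SLURM_JOB_NODELIST"` does (handling only one simple `[a-b]` range), the node-to-instance-id mapping is a placeholder, and the `aws` call is echoed rather than executed so the script runs standalone.

```shell
#!/bin/bash
# Sketch of a PrologSlurmctld-style tagging pass. Slurm would export
# SLURM_JOB_NODELIST / SLURM_JOB_USER / SLURM_JOBID to the real prolog;
# they are stubbed here so the script is self-contained.
SLURM_JOB_NODELIST="${SLURM_JOB_NODELIST:-compute-dy-c5-[1-3]}"
SLURM_JOB_USER="${SLURM_JOB_USER:-alice}"
SLURM_JOBID="${SLURM_JOBID:-42}"

# Expand a simple bracketed hostlist (a toy stand-in for
# `scontrol show hostnames`; real hostlists can be more complex).
expand_hostlist() {
    local list="$1"
    if [[ "$list" =~ ^(.*)\[([0-9]+)-([0-9]+)\]$ ]]; then
        local prefix="${BASH_REMATCH[1]}" first="${BASH_REMATCH[2]}" last="${BASH_REMATCH[3]}"
        local i
        for ((i=first; i<=last; i++)); do echo "${prefix}${i}"; done
    else
        echo "$list"
    fi
}

# Map node names to instance IDs (placeholder; a real script might keep a
# cluster inventory or call EC2 DescribeInstances), then tag all nodes of
# the job with ONE create-tags call instead of n-per-node calls.
ids=()
while read -r node; do
    ids+=("i-${node}")   # placeholder instance id, not a real EC2 id
done < <(expand_hostlist "$SLURM_JOB_NODELIST")

echo aws ec2 create-tags --resources "${ids[@]}" \
    --tags "Key=aws-parallelcluster-username,Value=${SLURM_JOB_USER}" \
           "Key=aws-parallelcluster-jobid,Value=${SLURM_JOBID}"
```

This is one API call per job rather than n² calls per step; whether the controller tolerates the extra work per job is a separate question.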

rvencu avatar Jul 22 '22 20:07 rvencu

Some comments on this topic from the Slurm support team:

Considering the nature of this command, in that it needs to run in parallel but async from the other prologs/epilogs, I think a SPANK plugin would fit better than a PrEp plugin and avoid the need to write any non-trivial code.

For instance, this is a popular plugin to use lua with SPANK:

https://github.com/stanford-rc/slurm-spank-lua

I think the slurm_spank_init_post_opt() is likely the function to call the tagging command.

rvencu avatar Jul 22 '22 22:07 rvencu

Looking more closely, I notice the loop in the prolog script. The prolog script runs on every compute node at every step execution, and:

  • RPC calls to the head node (with scontrol) are discouraged
  • tagging all nodes from every node makes the problem quadratic (n²)

I think we can still keep this in the prolog: each node finds its own instance ID with curl and tags itself with a single call, so there are only n calls to the tagging API.

Not as good as async tagging, but much better anyway, I think.

rvencu avatar Jul 23 '22 11:07 rvencu

I moved the tagging from the prolog script to PrologSlurmctld, and any job larger than 30 nodes crashes.

Then I tried this approach inside prolog.sh:

host=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 create-tags --region "${cfn_region}" --resources "${host}" --tags "Key=aws-parallelcluster-username,Value=${SLURM_JOB_USER}" "Key=aws-parallelcluster-jobid,Value=${SLURM_JOBID}" "Key=aws-parallelcluster-partition,Value=${SLURM_JOB_PARTITION}"

This works for 40 nodes; I will test with larger jobs too. But I could not find a way to transport the comments yet.
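One possible route for the comment (an untested assumption, not something confirmed in this thread): the per-node prolog does not receive the job comment as an environment variable, but `scontrol show job "$SLURM_JOBID"` prints a `Comment=` field that a head-node script could parse, avoiding the discouraged compute-node RPCs. A sketch of just the parsing step, run against canned output so it is standalone; the here-doc stands in for real `scontrol show job` output, and the simple `grep` would break on comments containing spaces.

```shell
#!/bin/bash
# Sketch: extract the job comment from `scontrol show job`-style output so
# it could be passed to `aws ec2 create-tags`. sample_output is a stub for
# the real scontrol call, kept here so the script runs standalone.
sample_output() {
    cat <<'EOF'
JobId=42 JobName=train
   UserId=alice(1000) GroupId=alice(1000)
   Comment=experiment-7 Priority=100
EOF
}

# Pull the Comment= value out of the key=value dump. Note: this naive
# pattern stops at the first space, so space-containing comments need a
# more careful parse.
comment=$(sample_output | grep -o 'Comment=[^ ]*' | cut -d= -f2)
echo "$comment"
```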

rvencu avatar Jul 24 '22 11:07 rvencu