batch-shipyard
v3.9.1 nodeprep fails with ERROR - Intel MPI not found
Problem Description
Pool node fails to execute the start task shipyard_nodeprep.sh with the following error:
2020-05-19T19:16:52,483089149+0000 - ERROR - Intel MPI not found
Batch Shipyard Version
3.9.1
Steps to Reproduce
Submit a job. The pool attempts to resize but fails:
Start task failed
FailureExitCode: The task exited with an exit code representing a failure
Expected Results
Job runs
Actual Results
The shipyard_nodeprep.sh startup script appears to be looking in the wrong location for mpivars.sh. Per the script (lines 1597-1603):
    # check for intel mpi
    if [ -f /opt/intel/compilers_and_libraries/linux/mpi/bin64/mpivars.sh ]; then
        log INFO "Intel MPI found"
    else
        log ERROR "Intel MPI not found"
        exit 1
    fi
I SSHed into the node that was created and found mpivars.sh, but it is in a different location:
# find /opt/intel/ -name mpivars.sh
/opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpivars.sh
/opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh
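For illustration, a more tolerant check could try the unversioned path first and then fall back to the versioned directories shown in the find output above. This is only a sketch, not the actual fix in Batch Shipyard: the function name find_mpivars is hypothetical, and it assumes bash and a sort that supports -V (GNU coreutils) on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of shipyard_nodeprep.sh): print the path to
# mpivars.sh under an Intel install root, or return 1 if none is found.
find_mpivars() {
    local root="${1:-/opt/intel}"
    # Path that the nodeprep script currently checks.
    local legacy="${root}/compilers_and_libraries/linux/mpi/bin64/mpivars.sh"
    if [ -f "$legacy" ]; then
        printf '%s\n' "$legacy"
        return 0
    fi
    local candidate newest=""
    # Versioned layout observed on the 7.7 node; keep the highest version.
    for candidate in "${root}"/compilers_and_libraries_*/linux/mpi/intel64/bin/mpivars.sh; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$candidate" ] || continue
        newest="$(printf '%s\n%s\n' "$newest" "$candidate" | sort -V | tail -n 1)"
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
        return 0
    fi
    return 1
}
```

A caller could then branch on the return code, for example running find_mpivars and logging "Intel MPI found" on success or "Intel MPI not found" on failure, instead of testing a single fixed path.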
Redacted Configuration
pool.yaml
pool_specification:
  id: ampe-docker-native
  vm_configuration:
    platform_image:
      offer: CentOS-HPC
      publisher: OpenLogic
      sku: '7.7'
      native: true
  vm_count:
    dedicated: 0
    low_priority: 0
  vm_size: STANDARD_HC44rs
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: active_tasks
      maximum_vm_count:
        dedicated: 2
        low_priority: 2
      maximum_vm_increment_per_evaluation:
        dedicated: -1
        low_priority: -1
  # inter_node_communication_enabled: true
  ssh:
    username: shipyard
jobs.yaml
job_specifications:
- id: ampe-docker-shipyard-j5
  tasks:
  - docker_image: stvdwtt/ampe:azure_test
    command: /home/builduser/AMPE/build/source/ampe2d /home/builduser/AMPE/examples/Dendrite2D/dendrite.input
Additional Comments
FYI: changing the platform_image sku to 7.6 allows nodeprep to complete.
Thanks. Most likely this is an Intel MPI location change in 7.7+ images.