launcher for stampede2
I'm porting my applications from TACC's Stampede to the Stampede2 system. I'm using launcher 3.0.1 and getting these errors on stderr:
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
and stdout seems to indicate a problem talking to the task server:
------------- SUMMARY ---------------
Number of hosts: 1
Working directory: /scratch/01114/vdj/vdj/job-59884011666018791-242ac11c-0001-007-igblast_test
Processes per host: 3
Total processes: 3
Total jobs: 3
Scheduling method: dynamic
-------------------------------------
Launcher: Starting parallel tasks...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
I tried using the system module instead, which seems to be a more recent version, and that is working better; the jobs are running now. I'm still getting a couple of errors, but I'm not sure if they're affecting anything:
/opt/apps/launcher/launcher-3.1/paramrun: line 171: [: -eq: unary operator expected
/opt/apps/launcher/launcher-3.1/paramrun: line 211: [: -eq: unary operator expected
The first issue is related to a change in netcat, which was first noticed on LS5 and is now the case on S2 as well. I believe the current master branch has this resolved.
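For background (my assumption about the mechanism, not stated above): ncat from the nmap project parses `-d` as `--delay <time>`, while classic netcat used `-d` to mean "don't read from stdin", so a hostname following `-d` lands in ncat's delay parser, which matches the quoted "Invalid -d delay" error. A quick probe for which variant is installed:

```shell
# Probe the installed nc flavor: ncat (nmap) identifies itself in its
# version banner ("Ncat: Version ..."), classic/OpenBSD netcat does not.
if nc --version 2>&1 | grep -qi '^ncat'; then
  echo "ncat: -d expects a numeric delay"
else
  echo "classic netcat: -d means do not read stdin"
fi
```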
For the second error, I'd suggest submitting a TACC ticket. I'm not at TACC anymore and don't currently have access to the systems to diagnose.
Those last two errors are from if statements that expect a variable called LAUNCHER_BIND to be non-null. They look harmless, but it also wouldn't be hard to rewrite them more defensively.
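A minimal sketch of the defensive rewrite (the variable name comes from the errors above; the exact paramrun lines are assumed, not copied from the script):

```shell
# If paramrun tests the variable unquoted, e.g.
#
#   if [ $LAUNCHER_BIND -eq 1 ]; then ...
#
# then an unset LAUNCHER_BIND collapses the test to `[ -eq 1 ]`, which is
# exactly bash's "[: -eq: unary operator expected" error. Quoting the
# expansion and supplying a default makes the unset case fall through:

unset LAUNCHER_BIND
if [ "${LAUNCHER_BIND:-0}" -eq 1 ]; then
  echo "binding enabled"
else
  echo "binding disabled"   # unset variable now handled without an error
fi
```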
Should the environment variables be set up differently for Stampede2? Supposedly each node has 68 cores.
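For context, here is roughly how the variables are set in my Slurm script (a minimal sketch; the job name, paths, and job-file name are placeholders, not from my actual submission):

```shell
#!/bin/bash
#SBATCH -J igblast_test        # job name (placeholder)
#SBATCH -N 1                   # one node, matching the summary above
#SBATCH -n 1
#SBATCH -p normal              # KNL queue on Stampede2
#SBATCH -t 01:00:00

module load launcher

# Directory containing the job file, and the file listing one task per line
export LAUNCHER_WORKDIR=$SCRATCH/igblast_test      # placeholder path
export LAUNCHER_JOB_FILE=$LAUNCHER_WORKDIR/paramlist
# Number of tasks to run simultaneously on each node
export LAUNCHER_PPN=8

$LAUNCHER_DIR/paramrun
```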
Normally I define LAUNCHER_PPN to be the number of processes to run simultaneously on a node, but I'm seeing weird behavior. I run with LAUNCHER_PPN=8, connect to the node, and run top, and it shows each igblastn process using about 50% CPU. Here is a snapshot:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
237141 vdj 20 0 415464 68828 15604 S 57.8 0.1 1:40.66 igblastn
237079 vdj 20 0 415616 71296 16008 S 56.9 0.1 1:53.81 igblastn
237156 vdj 20 0 415584 68968 15928 S 56.9 0.1 1:23.45 igblastn
237125 vdj 20 0 415452 68636 15644 S 56.6 0.1 1:47.74 igblastn
237109 vdj 20 0 415516 74856 15808 S 55.9 0.1 1:50.45 igblastn
237033 vdj 20 0 415584 71752 15876 S 51.6 0.1 2:27.99 igblastn
237298 vdj 20 0 415572 64628 15556 S 51.6 0.1 0:14.28 igblastn
Now if I set LAUNCHER_PPN=40, then I have 40 igblastn processes, but they are only using 10% CPU each?! It's like they are throttled: the CPU% is exactly 5x less, the same multiple that I increased LAUNCHER_PPN by.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
180938 vdj 20 0 415532 73020 15668 S 10.9 0.1 1:16.09 igblastn
180847 vdj 20 0 415488 59732 15808 S 10.6 0.1 1:16.03 igblastn
180861 vdj 20 0 415584 72248 15812 S 10.6 0.1 1:15.88 igblastn
180866 vdj 20 0 415540 67872 15796 S 10.6 0.1 1:15.58 igblastn
180899 vdj 20 0 415520 59044 15808 S 10.6 0.1 1:15.82 igblastn
180903 vdj 20 0 415692 67168 15884 S 10.6 0.1 1:15.58 igblastn
180912 vdj 20 0 415648 67900 15808 S 10.6 0.1 1:16.68 igblastn
It shouldn't be an I/O thing, because the files the igblastn processes read and write are small, ~3MB input and ~40MB output. If I run a single igblastn, it uses 400% CPU, i.e. 8x faster than with LAUNCHER_PPN=8.
That looks suspiciously like an igblastn-specific thing. Are you manually setting -num_threads? It looks like igblast uses 4 threads by default, which explains why a single igblastn is using 400% CPU. It would also explain the 5x drop: LAUNCHER_PPN=40 launches 40 x 4 = 160 threads versus 32 threads at LAUNCHER_PPN=8, exactly five times as many threads contending for the same cores.
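One way to test this (a sketch; the query and output file names are placeholders, not from the actual job): pin each task to a single thread in the launcher job file, then raise LAUNCHER_PPN toward the core count.

```shell
# Hypothetical launcher job file (one task per line). -num_threads is a
# standard BLAST+ option; forcing it to 1 makes the total thread count
# equal LAUNCHER_PPN, so PPN can approach the node's core count without
# oversubscribing.
igblastn -num_threads 1 -query sample_01.fasta -out sample_01.igblast.out
igblastn -num_threads 1 -query sample_02.fasta -out sample_02.igblast.out
igblastn -num_threads 1 -query sample_03.fasta -out sample_03.igblast.out
```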
On Stampede2, the normal queue has Intel Xeon Phi (KNL) processors with 68 cores. The skx-normal queue has Skylake nodes with 48 cores. Maybe setting LAUNCHER_BIND=1 on the Xeon Phi nodes will help. Launcher isn't throttling the CPU, but depending on how the tasks are distributed on the processor, it could be exposing bottlenecks in memory or something. Do you see the same thing on the Skylake nodes?
I tried on the Skylake nodes and it works as expected: with 8 parallel processes, each is using 400% CPU. So the issue does seem specific to the KNL nodes.
I also tried LAUNCHER_BIND=1 on the KNL nodes, but it produces errors and igblastn isn't even run.