Can't run on multiple localities with MPI

Open stevenrbrandt opened this issue 5 years ago • 4 comments

Run command

mpirun -np 4 /usr/local/build/bin/physl --dump-counters=py-csv.txt --dump-newick-tree=py-tree.txt --dump-dot=py-graph.txt --performance --print=result.py --hpx:run-hpx-main --hpx:thread=1 cannon.physl

Contents of cannon.physl

define$53$0(cannon$53$0, size$53$11, block(define$54$4(array1$54$4, random_d$54$13(list$54$22(size$54$23, size$54$29), find_here$54$36(), num_localities$54$49())), define$55$4(array2$55$4, random_d$55$13(list$55$22(size$55$23, size$55$29), find_here$55$36(), num_localities$55$49())), define$56$4(v1$56$4, cannon_product_d$56$9(array1$56$26, array2$56$34)), define$57$4(v2$57$4, dot_d$57$9(array1$57$15, array2$57$23)), all$58$11(__eq$58$15(v1$58$15, v2$58$21))))
cannon(120)

Generated from

def cannon(size):
    array1 = random_d([size, size], find_here(), num_localities())
    array2 = random_d([size, size], find_here(), num_localities())
    v1 = cannon_product_d(array1, array2)
    v2 = dot_d(array1, array2)
    return all(v1 == v2)

Output:

physl: exception caught:
the given component id does not belong to a local object: HPX(bad_parameter)
physl: exception caught:
the given component id does not belong to a local object: HPX(bad_parameter)
physl: exception caught:
the given component id does not belong to a local object: HPX(bad_parameter)
physl: exception caught:
the given component id does not belong to a local object: HPX(bad_parameter)
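
For reference, the check that `cannon` performs (a Cannon-style block product compared against an ordinary dot product) can be sketched on a single process with NumPy. This is only an illustration and not the Phylanx implementation: `cannon_product` and its `grid` parameter are hypothetical names, and the 2 x 2 grid is an assumption standing in for the 4 localities. Because the block-wise summation order differs from `a @ b`, the sketch compares with `np.allclose` rather than exact equality.

```python
import numpy as np


def cannon_product(a, b, grid):
    """Single-process simulation of Cannon's algorithm on a grid x grid block layout.

    Hypothetical helper for illustration; not part of Phylanx.
    """
    n = a.shape[0]
    assert a.shape == (n, n) and b.shape == (n, n), "square matrices expected"
    assert n % grid == 0, "matrix size must be divisible by the grid size"
    bs = n // grid

    def split(m):
        # blocks[i][j] is the block in block-row i, block-column j
        return [[m[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
                 for j in range(grid)] for i in range(grid)]

    A, B = split(a), split(b)

    # Initial skew: shift block-row i of A left by i, block-column j of B up by j.
    A = [[A[i][(j + i) % grid] for j in range(grid)] for i in range(grid)]
    B = [[B[(i + j) % grid][j] for j in range(grid)] for i in range(grid)]

    C = [[np.zeros((bs, bs)) for _ in range(grid)] for _ in range(grid)]
    for _ in range(grid):
        # Each grid position multiplies the pair of blocks it currently holds.
        for i in range(grid):
            for j in range(grid):
                C[i][j] += A[i][j] @ B[i][j]
        # Shift A one block to the left and B one block up.
        A = [[A[i][(j + 1) % grid] for j in range(grid)] for i in range(grid)]
        B = [[B[(i + 1) % grid][j] for j in range(grid)] for i in range(grid)]

    return np.block(C)


rng = np.random.default_rng(0)
a = rng.random((120, 120))
b = rng.random((120, 120))

v1 = cannon_product(a, b, grid=2)  # assumption: 4 localities as a 2 x 2 grid
v2 = a @ b

# Both results should have the same shape and agree up to rounding.
print(v1.shape == v2.shape, np.allclose(v1, v2))
```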

stevenrbrandt • Oct 06 '20 18:10

@stevenrbrandt Are you able to run it using srun instead of mpirun?

NanmiaoWu • Oct 06 '20 19:10

I tested on the latest stable HPX, blaze, blaze_tensor, and phylanx. I created a file named test_c.physl, which contains:

define$53$0(cannon$53$0, size$53$11, block(define$54$4(array1$54$4, random_d$54$13(list$54$22(size$54$23, size$54$29), find_here$54$36(), num_localities$54$49())), define$55$4(array2$55$4, random_d$55$13(list$55$22(size$55$23, size$55$29), find_here$55$36(), num_localities$55$49())), define$56$4(v1$56$4, cannon_product_d$56$9(array1$56$26, array2$56$34)), define$57$4(v2$57$4, dot_d$57$9(array1$57$15, array2$57$23)), all$58$11(__eq$58$15(v1$58$15, v2$58$21))))
cannon(120)

Tested on qbc; the error output is:

[nanmiao@qbc2 bin]$ srun -N 1 -n 4 /home/nanmiao/dev/src/phylanx/build/bin/physl --dump-counters=py-csv.txt --dump-newick-tree=py-tree.txt --dump-dot=py-graph.txt --performance --print=result.py --hpx:run-hpx-main --hpx:thread=1 /home/nanmiao/dev/src/phylanx/examples/algorithms/als/test_c.physl 
physl: exception caught:
test_c.physl(58, 15): __eq:: cannot broadcast a matrix into a differently sized matrix: HPX(bad_parameter)
physl: exception caught:
test_c.physl(58, 15): __eq:: cannot broadcast a matrix into a differently sized matrix: HPX(bad_parameter)
physl: exception caught:
test_c.physl(58, 15): __eq:: cannot broadcast a matrix into a differently sized matrix: HPX(bad_parameter)
physl: exception caught:
test_c.physl(58, 15): __eq:: cannot broadcast a matrix into a differently sized matrix: HPX(bad_parameter)

NanmiaoWu • Oct 06 '20 21:10

@NanmiaoWu No, I'm running inside a Singularity image.

stevenrbrandt • Oct 07 '20 00:10

@stevenrbrandt, this could have been caused by a problem in HPX. Could you please try https://github.com/STEllAR-GROUP/hpx/pull/5004 to see if this fixes your issue?

After applying the patch, I see the same error as @NanmiaoWu, caused by dot_d returning a badly-sized array. See https://github.com/STEllAR-GROUP/phylanx/issues/1284 for the corresponding ticket.
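
As a rough illustration of this failure mode (not Phylanx code), the NumPy sketch below compares two results only after checking that their shapes agree, so a badly sized operand is reported directly instead of surfacing as a broadcast error inside the comparison. The helper name `checked_equal` and the `(30, 120)` shape are hypothetical; the thread does not state what shape `dot_d` actually returned.

```python
import numpy as np


def checked_equal(v1, v2):
    """Exact element-wise comparison that reports a shape mismatch explicitly.

    Hypothetical helper for illustration; not part of Phylanx or HPX.
    """
    if v1.shape != v2.shape:
        raise ValueError(f"shape mismatch: {v1.shape} vs {v2.shape}")
    return bool(np.all(v1 == v2))


full = np.ones((120, 120))
# Assumption: a result that covers only part of the full matrix.
partial = np.ones((30, 120))

print(checked_equal(full, full))

try:
    checked_equal(full, partial)
except ValueError as err:
    # The mismatch is reported with both shapes, which points at the producer
    # of the wrongly sized array rather than at the comparison itself.
    print("comparison rejected:", err)
```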

hkaiser • Oct 08 '20 22:10