Assertion error when running on different machines
Issue description
Whenever I run Pilgrim to trace an MPI program on a local machine, I have no issue. However, once I try to run it on another machine, I get the following error:
src/pilgrim_mpi_objects.c:172: create_request_id: Assertion 'entry == NULL' failed.
Steps to reproduce
I'm using MPICH 4.0.2 and the latest version of Pilgrim. I have two nodes available, localnode and remotenode.
mpirun -np N --host localnode,remotenode -LD_PRELOAD <path to libpilgrim.so> <my executable> yields the aforementioned error as soon as N is greater than 1. If I remove the remote node, I can make N as large as I want.
Possible fix
The line mentioned in the error is in the following function:
int create_request_id(MPI_Request *req, bool from_universal_pool, int func_id, int src_or_dst, int tag, int comm) {
    if(req==NULL || *req == MPI_REQUEST_NULL)
        return invalid_request_id;
    RequestHash *entry = request_hash_entry(req);
    assert(entry == NULL); // <- this one
I've removed this assertion, and so far I haven't seen anything unusual. I don't know whether the assertion is important.
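Judging only from the excerpt above, the assertion appears to check that the request handle is not already present in the request table when a new id is created. Rather than deleting the check outright, it may be more informative to log when it would have fired. The following is a minimal sketch of that idea, not Pilgrim's code: `Slot`, `lookup`, `track_request`, and `untrack_request` are hypothetical names, the fixed-size array stands in for whatever hash table Pilgrim actually uses, and a plain integer stands in for the `MPI_Request` handle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64  /* arbitrary capacity for this sketch */

/* Hypothetical stand-in for one tracked request; not Pilgrim's RequestHash. */
typedef struct {
    uintptr_t key;  /* stands in for the request handle value */
    int id;         /* id assigned when the request was tracked */
    int used;       /* 1 while the slot holds a live entry */
} Slot;

static Slot table[TABLE_SIZE];
static int next_id = 0;

/* Linear search; returns the live entry for key, or NULL if none. */
static Slot *lookup(uintptr_t key) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].used && table[i].key == key) return &table[i];
    }
    return NULL;
}

/* Tracks key and returns its new id, or -1 if the table is full.
 * Instead of asserting that no entry exists, it reports the stale
 * entry through *was_stale and on stderr, then replaces its id. */
static int track_request(uintptr_t key, int *was_stale) {
    assert(was_stale != NULL);
    Slot *entry = lookup(key);
    *was_stale = (entry != NULL);
    if (entry != NULL) {
        fprintf(stderr, "handle %p already tracked as id %d; replacing\n",
                (void *)key, entry->id);
        entry->id = next_id++;
        return entry->id;
    }
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used) {
            table[i].key = key;
            table[i].id = next_id++;
            table[i].used = 1;
            return table[i].id;
        }
    }
    return -1;
}

/* Removes key from the table, as would happen when a request completes. */
static void untrack_request(uintptr_t key) {
    Slot *entry = lookup(key);
    if (entry != NULL) entry->used = 0;
}
```

Calling `track_request` twice with the same key and no `untrack_request` in between sets `*was_stale` to 1 on the second call. In a real trace, that would suggest a handle value was reused before its earlier entry was removed, which is one possible explanation for the failed assertion, though not a confirmed one.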
Hi @khatharsis42, can you try tracing different applications to see if you get the same error? Is it possible to share your code so I can debug it?
I'm using Pilgrim to trace a few mini-apps, and I've seen this particular bug when tracing AMG and LULESH once I use enough MPI processes: there is no problem with 8, but the bug appears with 27. Interestingly, I've had no issue with Kripke.
Thanks. I'll test AMG and get back to you.