core icon indicating copy to clipboard operation
core copied to clipboard

Breaking changes with PCU

Open flagdanger opened this issue 1 year ago • 50 comments

Breaking changes with PCU

  • Removed pcu.cc and all uses of it
  • Changed functions that optionally took a PCUObj to take one

flagdanger avatar Jun 05 '24 15:06 flagdanger

@cwsmith This draft pull request is our starting point for having breaking changes of core with the new PCU as discussed in Pull request #388 . The biggest thing that I don't like is the way parama group works because it still requires doing the comm switch operations.

However, my concern is that breaking that particular code will be a much more substantial thing that we want to force ourselves or the users deal with.

We are certainly open to suggestions on if there is a way out of that comm switch without totally blowing things up.

jacobmerson avatar Jun 06 '24 05:06 jacobmerson

@jacobmerson @flagdanger Can we meet next week to briefly discuss? I see the call to switch in test/ptnParma.cc but would like to get the bigger picture view of the issue before digging into this more. I'll send an email.

cwsmith avatar Jun 07 '24 19:06 cwsmith

There is a PCU usage example in the pumi users guide we should update: https://scorec.rpi.edu/pumi/PUMI.pdf (see section 12.2 page 72) The source for that users guide is here: https://github.com/SCOREC/pumi-docs/tree/master/userGuide

A pumi intro doc is here: https://github.com/SCOREC/pumi-intro we use this for the finite element programming doc.

A maintainers guide for pumi is here: https://github.com/SCOREC/core/wiki/Maintainer's-Guide it describes the process to create a release.

cwsmith avatar Jun 10 '24 16:06 cwsmith

@cwsmith Thanks for the review! This is a bit of a painful one as we had to touch so many files. Flynn is currently working on running the DeltaWing case. And once we have that sorted we will update this pull request.

@flagdanger please take a look at Cameron's review. Once you have a chance to look it over we can discuss the best approach for the areas that require more than simple changes.

jacobmerson avatar Jun 26 '24 19:06 jacobmerson

@jacobmerson @cwsmith The DeltaWing case is running, and I'm fixing the small changes, Cameron, you've mentioned now.

flagdanger avatar Jun 26 '24 20:06 flagdanger

@cwsmith Do you have a model to use to partition the deltaWing_500kMetric.smb mesh, so we can run your adaptPumi case on multiple processors?

flagdanger avatar Jul 01 '24 14:07 flagdanger

@flagdanger Try passing '.null' as the argument; we don't have the CAD model for this mesh.

cwsmith avatar Jul 01 '24 14:07 cwsmith

@cwsmith @jacobmerson Made changes based on Cameron's comments, I believe this is ready to be looked at again.

flagdanger avatar Jul 22 '24 18:07 flagdanger

/runtests

cwsmith avatar Jul 25 '24 17:07 cwsmith

/runtests

cwsmith avatar Jul 26 '24 22:07 cwsmith

Test Result: failure (details)

github-actions[bot] avatar Jul 26 '24 22:07 github-actions[bot]

@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:

https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405

cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow  -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)':
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1108 |     m->init(getSerendipity(), PCUObj);
      |                               ^~~~~~
      |                               |
      |                               pcu::PCU
In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1:
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1110 |     m->init(getLagrange(order), PCUObj);
      |                                 ^~~~~~
      |                                 |
      |                                 pcu::PCU
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~

The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:

https://github.com/SCOREC/core/blob/master/.github/workflows/simmetrix_enabled_pr_comment_trigger_self_hosted.yml#L34-L59

cwsmith avatar Jul 27 '24 15:07 cwsmith

@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:

https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405

cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow  -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)':
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1108 |     m->init(getSerendipity(), PCUObj);
      |                               ^~~~~~
      |                               |
      |                               pcu::PCU
In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1:
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1110 |     m->init(getLagrange(order), PCUObj);
      |                                 ^~~~~~
      |                                 |
      |                                 pcu::PCU
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~

The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:

https://github.com/SCOREC/core/blob/master/.github/workflows/simmetrix_enabled_pr_comment_trigger_self_hosted.yml#L34-L59

Yes I'll look now, thanks.

flagdanger avatar Jul 27 '24 17:07 flagdanger

@cwsmith Fixed the things you mentioned, and can run tests again.

flagdanger avatar Jul 31 '24 20:07 flagdanger

/runtests

cwsmith avatar Aug 01 '24 18:08 cwsmith

@cwsmith Can you run again, it failed with this error: This request was automatically failed because there were no enabled runners online to process the request for more than 1 days

flagdanger avatar Aug 05 '24 14:08 flagdanger

/runtests

cwsmith avatar Aug 05 '24 15:08 cwsmith

Test Result: failure (details)

github-actions[bot] avatar Aug 05 '24 15:08 github-actions[bot]

@flagdanger Thanks for checking in on this.

It looks like there build errors in the phasta code:

/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc: In function 'apf::Mesh2* ph::loadMesh(gmi_model*&, const char*, pcu::PCU*)':
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc:169:27: error: too few arguments to function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU*)'
  169 |     mesh = apf::createMesh(sim_mesh);

https://github.com/SCOREC/core/actions/runs/10251693152/job/28360310000#step:3:793

Please let me know if you need help setting up a build that reproduces this. I can meet on webex etc.

cwsmith avatar Aug 05 '24 15:08 cwsmith

@cwsmith Thank you, could you run them again? And yes, if this fails again relating to simmetrix, then I will want to be able to reproduce these errors on my machine.

flagdanger avatar Aug 06 '24 15:08 flagdanger

/runtests

cwsmith avatar Aug 06 '24 15:08 cwsmith

Test Result: failure (details)

github-actions[bot] avatar Aug 06 '24 16:08 github-actions[bot]

~~It looks like the current errors are promoted warnings that are unrelated to this PR~~:

https://github.com/SCOREC/core/actions/runs/10269910071/job/28416451838#step:3:805

In member function 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)',
    inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:42:30,
    inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:39:8,
    inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:111:23:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:41:9: error: array subscript 'ph::PhastaSharing[0]' is partly outside array bounds of 'unsigned char [16]' [-Werror=array-bounds]
   41 |     if (isDG)
      |         ^~~~
In constructor 'ph::PhastaSharing::PhastaSharing(apf::Mesh*)',
    inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:100:22:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:23:39: note: object of size 16 allocated by 'operator new'
   23 |     helperN = new apf::NormalSharing(m);
      |                                       ^

~~I'll see if I can resolve these quickly in develop.~~ The build is successful on develop but fails on this branch.

@flagdanger Would you please take a look at the changes to phasta/phLinks.cc? I can help debug if needed.

(ins)smithc11@asics: /space/smithc11/pumiDev/core (pcu-object-breaking)$ git diff develop phasta/phLinks.cc
diff --git a/phasta/phLinks.cc b/phasta/phLinks.cc
index 683bf2c1..73a2a0ff 100644
--- a/phasta/phLinks.cc
+++ b/phasta/phLinks.cc
@@ -1,4 +1,3 @@
-#include <PCU.h>
 #include "phLinks.h"
 #include "phAdjacent.h"
 #include <apf.h>
@@ -54,7 +53,7 @@ struct PhastaSharing : public apf::Sharing {
     if ( ! mesh->hasMatching())
       return;
     /* filter out matches which are on the same part as the global master */
-    int self = PCU_Comm_Self();
+    int self = mesh->getPCU()->Self();
     size_t i = 0;
     for (size_t j = 0; j < copies.getSize(); ++j)
       if (copies[j].peer != self)
@@ -98,40 +97,39 @@ struct PhastaSharing : public apf::Sharing {
 
 void getLinks(apf::Mesh* m, int dim, Links& links, BCs& bcs)
 {
-  PhastaSharing* shr = new PhastaSharing(m);
-  PCU_Comm_Begin();
+  PhastaSharing shr(m);
+  m->getPCU()->Begin();
   apf::MeshIterator* it = m->begin(dim);
   apf::MeshEntity* v;
   while ((v = m->iterate(it))) {
     apf::ModelEntity* me = m->toModel(v);
-    shr->isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
+    shr.isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
 /* the alignment is such that the owner part's
    array follows the order of its vertex iterator
    traversal. The owner dictates the order to the
    other part by sending remote copies */
-    if ( ! shr->isOwned(v))
+    if ( ! shr.isOwned(v))
       continue;
     apf::CopyArray remotes;
-    shr->getCopies(v, remotes);
+    shr.getCopies(v, remotes);
     for (size_t i = 0; i < remotes.getSize(); ++i) {
       /* in matching we may accumulate multiple occurrences
          of the same master in the outgoing links array
          to a part that contains multiple copies of it. */
       links[LinkKey(1, remotes[i].peer)].push_back(v);
-      PCU_COMM_PACK(remotes[i].peer, remotes[i].entity);
+      m->getPCU()->Pack(remotes[i].peer, remotes[i].entity);
     }
   }
   m->end(it);
-  PCU_Comm_Send();
-  while (PCU_Comm_Listen()) {
-    int peer = PCU_Comm_Sender();
-    while (!PCU_Comm_Unpacked()) {
+  m->getPCU()->Send();
+  while (m->getPCU()->Listen()) {
+    int peer = m->getPCU()->Sender();
+    while (!m->getPCU()->Unpacked()) {
       apf::MeshEntity* v;
-      PCU_COMM_UNPACK(v);
+      m->getPCU()->Unpack(v);
       links[LinkKey(0, peer)].push_back(v);
     }
   }
-  delete shr;
 }
 
 /* encode the local links into a big array of integers
@@ -218,7 +216,7 @@ static apf::MeshEntity* getOtherElem(apf::Mesh* m, apf::MeshEntity* elem,
     return 0;
   apf::Matches matches;
   m->getMatches(face, matches);
-  int self = PCU_Comm_Self();
+  int self = m->getPCU()->Self();
   for (size_t i = 0; i < matches.getSize(); ++i)
     if (matches[i].peer == self)
       return m->getUpward(matches[i].entity, 0);

cwsmith avatar Aug 06 '24 16:08 cwsmith

@cwsmith Okay I believe it is fixed.

flagdanger avatar Aug 07 '24 16:08 flagdanger

/runtests

cwsmith avatar Aug 07 '24 17:08 cwsmith

Test Result: failure (details)

github-actions[bot] avatar Aug 07 '24 17:08 github-actions[bot]

@cwsmith Okay can you run again, I'm not getting these errors on my machine.

flagdanger avatar Aug 07 '24 19:08 flagdanger

/runtests

cwsmith avatar Aug 07 '24 19:08 cwsmith

Test Result: failure (details)

github-actions[bot] avatar Aug 07 '24 19:08 github-actions[bot]

@cwsmith Could you run tests again?

flagdanger avatar Aug 08 '24 21:08 flagdanger