opm-simulators

Use serial ZOLTAN load balancer by default.

Open blattms opened this issue 1 year ago • 11 comments

We have experienced rather poor partitioning when using the parallel version of ZOLTAN. We are not sure what causes this or whether we can fix it by using different defaults or different underlying partitioners.

For the time being we simply change the default. This will increase the time spent load balancing the grid, but the rest of the simulator might actually be faster and compensate for this.

blattms avatar Feb 19 '24 09:02 blattms

jenkins build this please

blattms avatar Feb 19 '24 09:02 blattms

This is a welcome change until we figure out how to make good partitions in parallel.

alfbr avatar Feb 19 '24 09:02 alfbr

the error for SPE1CASE1 looks a bit suspicious

=== Executing comparison for files if these exists in reference folder ===
	 - SPE1CASE1_It0_Pr0.h5
dataset: </GLOBAL_CELL_INDEX/P0> and </GLOBAL_CELL_INDEX/P0>
105 differences found
dataset: </PRESSURE/P0> and </PRESSURE/P0>
105 differences found
dataset: </topologies/topo/elements/connectivity/P0> and </topologies/topo/elements/connectivity/P0>
828 differences found

I need to find out what this means. Topology should definitely not change here. This seems to be damaris output, though. Might actually point to an issue there.

@akva2 How can I see the full output of ACTIONX_M1? The interesting part is cut off because of all the time steps.

blattms avatar Feb 19 '24 10:02 blattms

That's the damaris test, so the topology arrays change since the partitioning changes. I'll provide an update for the files once this is good to go. Alternatively, you can switch the test to use parallel partitioning.
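As a toy illustration of why per-rank arrays can differ even though the grid is unchanged, the sketch below assigns the same eight cells to two ranks with two hypothetical partitioners and compares the order in which a rank-ordered writer would emit the global cell indices. The function names and the rank-ordered layout are assumptions made for illustration only; they are not taken from Damaris or opm-simulators.

```python
def partition(cells, owner):
    """Group global cell indices by owning rank, in ascending rank order."""
    ranks = sorted(set(owner.values()))
    return [[c for c in cells if owner[c] == r] for r in ranks]


def written_order(parts):
    """Concatenate the per-rank lists, as a rank-ordered writer might (assumed layout)."""
    return [c for part in parts for c in part]


cells = list(range(8))

# Two hypothetical partitioners assigning the same cells to 2 ranks.
owner_a = {c: c % 2 for c in cells}               # interleaved ownership
owner_b = {c: 0 if c < 4 else 1 for c in cells}   # contiguous blocks

order_a = written_order(partition(cells, owner_a))
order_b = written_order(partition(cells, owner_b))

# Element-wise comparison, similar in spirit to a dataset diff.
differences = sum(a != b for a, b in zip(order_a, order_b))

print(order_a, order_b, differences)
```

Both orderings contain exactly the same cells, so an element-wise diff reports differences without the underlying grid having changed.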

I don't know where the test output threshold is configured, so I've sent you the output on Slack.

akva2 avatar Feb 19 '24 10:02 akva2

Thanks. The change in ACTIONX_M1 is a bit concerning. PBUB for restart is different for one cell:

329: Keyword: PBUB, origin Restart, sequence 15
329: Global index (zero based)   = 519
329: Grid coordinate             = (4, 8, 5)
329: (first value, second value) = (190.788, 187.465)
329: 
329: Program threw an exception: [/var/lib/jenkins/workspace/opm-simulators-PR-builder/deps/opm-common/test_util/EclRegressionTest.cpp:215] Deviations exceed tolerances.
329: The absolute deviation is 3.32342529296875, and the tolerance limit is 0.02.
329: The relative deviation is 0.017419448128747354, and the tolerance limit is 0.01.
329: Comparing '/.../results/parallel/flow+actionx_m1/ACTIONX_M1' to '/.../results/parallel/flow+actionx_m1/mpi/ACTIONX_M1'.
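The reported deviations can be roughly reproduced from the two printed values. The sketch below assumes the relative deviation is the absolute deviation divided by the larger magnitude, and that the comparison fails only when both tolerances are exceeded. Both rules are assumptions that are consistent with the numbers in the log, not a copy of EclRegressionTest.cpp. Because the log prints rounded values, the results match only to a few digits.

```python
def deviations(first, second):
    """Return (absolute, relative) deviation between two values.

    Assumption: the relative deviation is scaled by the larger magnitude.
    """
    abs_dev = abs(first - second)
    scale = max(abs(first), abs(second))
    rel_dev = abs_dev / scale if scale > 0 else 0.0
    return abs_dev, rel_dev


def exceeds_tolerances(first, second, abs_tol, rel_tol):
    """Assumed rule: fail only when both tolerances are exceeded."""
    abs_dev, rel_dev = deviations(first, second)
    return abs_dev > abs_tol and rel_dev > rel_tol


# Values and tolerance limits as printed in the log above.
abs_dev, rel_dev = deviations(190.788, 187.465)
failed = exceeds_tolerances(190.788, 187.465, abs_tol=0.02, rel_tol=0.01)

print(abs_dev, rel_dev, failed)
```

With these inputs the absolute deviation is about 3.323 and the relative deviation about 0.0174, both above their limits of 0.02 and 0.01.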

I might need to run this on my system to see what exactly changes. Maybe the actions are triggered at different times now.

blattms avatar Feb 19 '24 12:02 blattms

I have no real clue. The reported difference is for restart of report step 15 it seems.

That is a bit strange because, when comparing parallel simulations of master and this branch, the first and last time steps during report step 3 are slightly different (40 seconds) for case ACTIONX_M1. Actions during that report step are triggered in the same time steps; the steps are just 40 seconds later.

blattms avatar Feb 19 '24 16:02 blattms

@bska Do you have a hint where the difference for the restart output of the last report step might come from?

blattms avatar Feb 21 '24 08:02 blattms

Do you have a hint where the difference for the restart output of the last report step might come from?

I'm afraid I don't. This test has been unstable/sensitive for a long time. I reduced its maximum timestep size in commit a2fa381 (PR #4749), but I guess that just hid the problem instead of actually solving it. Add to that that the way we compute PBUB is also somewhat unstable, and we have a recipe for problems which are hard to track down. In this case I'd be tempted to just remove the PBPD property from RPTRST which will remove the PBUB and PDEW restart file arrays.

bska avatar Feb 22 '24 08:02 bska

In this case I'd be tempted to just remove the PBPD property from RPTRST which will remove the PBUB and PDEW restart file arrays.

Shall we do that then? @tskille, as the original case is from you: would removing it be OK with you?

blattms avatar Feb 23 '24 16:02 blattms

I'm rerunning the build check here, mostly to recreate the detailed failure descriptions on the CI system.

bska avatar Mar 07 '24 10:03 bska

jenkins build this please

bska avatar Mar 07 '24 10:03 bska