
Add python 3.11 support

milljm opened this pull request 1 year ago • 23 comments

Add Python 3.11 to our array of python versions to support.

Closes #26560

milljm avatar Jan 16 '24 13:01 milljm

Job Documentation on 1aa97f5 wanted to post the following:

View the site here

This comment will be updated on new commits.

moosebuild avatar Jan 17 '24 03:01 moosebuild

Job Coverage on 1aa97f5 wanted to post the following:

Framework coverage

Coverage did not change

Modules coverage

Coverage did not change

Full coverage reports

Reports

This comment will be updated on new commits.

moosebuild avatar Jan 17 '24 04:01 moosebuild

Job Apptainer build GCC min on 5433882 : invalidated by @milljm

Error during download/extract/detection of FBLASLAPACK

moosebuild avatar Jan 17 '24 12:01 moosebuild

There are a lot of fixes needed to make Python 3.11 work...

milljm avatar Jan 17 '24 13:01 milljm

On Linux, there does not appear to be a solution for Python 3.11 and HDF5=1.12.x=mpi_mpich* and VTK

# this solve fails with an unsatisfiable conflict
conda create -n testing python=3.11 vtk hdf5=1.12.1=mpi_mpich_*

If we bump HDF5 to 1.14, there is a solution, but historically that kind of bump takes a lot of work.

Edit: Gross. Moving to HDF5 1.14 requires moving to MPICH 4.1.2, which we know causes issues.

milljm avatar Jan 17 '24 14:01 milljm

I'm curious whether you run into this failure. It has always failed for me with my system 3.11 python

misc/signal_handler.test_signal: Working Directory: /home/lindad/projects/PR_user_training/test/tests/misc/signal_handler
misc/signal_handler.test_signal: Running command: /home/lindad/projects/PR_user_training/test/moose_test-opt -i simple_transient_diffusion_scaled.i --error --error-override --no-gdb-backtrace
misc/signal_handler.test_signal: Python exception encountered:
misc/signal_handler.test_signal: 
misc/signal_handler.test_signal: Traceback (most recent call last):
misc/signal_handler.test_signal:   File "/home/lindad/projects/PR_user_training/python/TestHarness/schedulers/RunParallel.py", line 54, in run
misc/signal_handler.test_signal:     job.run()
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/schedulers/Job.py", line 231, in run
misc/signal_handler.test_signal:     self.__tester.run(self.timer, self.options)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/Tester.py", line 420, in run
misc/signal_handler.test_signal:     self.runCommand(timer, options)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/SignalTester.py", line 70, in runCommand
misc/signal_handler.test_signal:     self.send_signal(self.process.pid)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/SignalTester.py", line 43, in send_signal
misc/signal_handler.test_signal:     out_dupe.seek(0)
misc/signal_handler.test_signal:   File "/usr/lib/python3.11/tempfile.py", line 947, in seek
misc/signal_handler.test_signal:     return self._file.seek(*args)
misc/signal_handler.test_signal:            ^^^^^^^^^^^^^^^^^^^^^^
misc/signal_handler.test_signal: ValueError: seek of closed file
misc/signal_handler.test_signal: 
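The traceback comes from seeking a shallow copy of the tester's captured-stdout tempfile after the shared underlying file object has been closed. A minimal sketch of one possible workaround, not necessarily the fix this PR adopted (the filename suffix and variable names are illustrative): reopen the capture file by name, so the duplicate handle is fully independent of the original.

```python
import tempfile

# Hypothetical stand-in for the tester's captured-stdout file
outfile = tempfile.NamedTemporaryFile(mode="w+", suffix=".out")
outfile.write("binary output so far\n")
outfile.flush()

# copy.copy(self.outfile) shares the wrapped file object with the original,
# so the copy can be closed out from under you. Reopening by name instead
# yields an independent handle whose seek position cannot disturb the
# original's (on POSIX the file remains openable by name while it is open).
with open(outfile.name) as dupe:
    dupe.seek(0)
    output = dupe.read()

print(output == "binary output so far\n")  # → True
outfile.close()
```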

lindsayad avatar Jan 18 '24 18:01 lindsayad

I'm curious whether you run into this failure. It has always failed for me with my system 3.11 python

Yes, this PR certainly does run into this... https://civet.inl.gov/job/2004305/ Just one of many issues I will have to tackle!

milljm avatar Jan 18 '24 18:01 milljm

I'm glad it's not just my environment 😅

lindsayad avatar Jan 18 '24 18:01 lindsayad

Let’s dive into this tomorrow

On Tue, Jan 23, 2024 at 4:40 PM Patrick Behne @.***> wrote:

@.**** commented on this pull request.

In python/TestHarness/testers/SignalTester.py https://github.com/idaholab/moose/pull/26566#discussion_r1464122325:

@@ -39,15 +39,16 @@ def send_signal(self,pid):

         #first, make a true duplicate of the stdout file so we don't mess with the seek on the actual file
         out_dupe = copy.copy(self.outfile)
         #go to the beginning of the file and see if its actually started running the binary
         out_dupe.seek(0)
         output = out_dupe.read()
         if not out_dupe.closed:

pid_example.txt https://github.com/idaholab/moose/files/14030841/pid_example.txt

I have simplified the logic into the attached script. The 'test' is running echo test ; sleep 1. When the script is run with line 37 commented and line 38 uncommented (i.e., time.sleep(5)), os.kill is still able to send a signal to the pid even though the process should have already terminated (the timer gives a wait time > 5 s). However, when the script is run with line 37 uncommented and line 38 commented (i.e., os.waitpid), os.kill is not able to send the signal because the process has already terminated. HOWEVER, the timer reports a wait time on the order of 1 s, so it seems reasonable that using time.sleep(5) without waitpid would give the subprocess PLENTY of time to terminate. I am not sure how to explain this behavior. Are you, @permcody?
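One plausible explanation for the asymmetry, offered here as a note rather than anything confirmed in the thread: on POSIX, an exited child that nobody has wait()ed on lingers in the process table as a zombie, and kill(2) succeeds on a zombie; only after waitpid reaps it does the PID entry disappear. A minimal sketch of the experiment under that assumption (timings shortened):

```python
import os
import subprocess
import time

# Spawn a short-lived child, mimicking the "echo test ; sleep" experiment
proc = subprocess.Popen(["/bin/sh", "-c", "echo test; sleep 0.2"],
                        stdout=subprocess.DEVNULL)
time.sleep(1)  # the child has exited by now, but nothing has reaped it

# Signal 0 probes for existence only: it succeeds here because the exited,
# unreaped child still occupies a process-table slot as a zombie.
os.kill(proc.pid, 0)

os.waitpid(proc.pid, 0)   # reap the zombie; the PID entry is released
try:
    os.kill(proc.pid, 0)  # now the PID is gone (barring PID reuse)
    still_signalable = True
except ProcessLookupError:
    still_signalable = False

print(still_signalable)  # → False
```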


permcody avatar Jan 24 '24 01:01 permcody

That was fun. I rebased against my origin, forgetting that I had included a merge from Casey's origin in the mix (his PETSc update), also forgetting that Casey's branch was merged into Next.

So... redid it all, rebasing against upstream/next properly.

milljm avatar Jan 29 '24 14:01 milljm

Another git adventure brought to you by milljm.

Thank you @cticenhour for helping me sort it out!

milljm avatar Jan 29 '24 17:01 milljm

Another git adventure brought to you by milljm.

Thank you @cticenhour for helping me sort it out!

lizard-hehe

loganharbour avatar Jan 29 '24 17:01 loganharbour

Job Precheck on 1aa97f5 wanted to post the following:

The following file(s) are changed:

conda/libmesh/meta.yaml

The libmesh conda configuration has changed; ping @idaholab/moose-dev-ops

moosebuild avatar Jan 31 '24 18:01 moosebuild

This is almost ready for 'Next' scrutiny... I just need some help with the exodiff. @lindsayad, would you like to take a stab at it?

milljm avatar Feb 14 '24 19:02 milljm

wow didn't expect there to be so many!

lindsayad avatar Feb 15 '24 17:02 lindsayad

These failures are crazy. It's like MOOSE meshing has changed with this PR, but only on ARM, even though the PR doesn't touch PETSc/libMesh/MOOSE code. Tests are even failing in serial, so it's not the mpich bump either.

lindsayad avatar Feb 15 '24 19:02 lindsayad

I just looked at the gold/partial_circle_rad_in.e from the meshgenerators/circular_correction_generator.partial_curve_rad failure, and it's definitely not a robust test. The mesh is being generated via XYDelaunay, and the trouble with "generate a Delaunay triangulation of this boundary" in regression tests is that typically the result is not unique.

Looking at the bottom right corner of the mesh, where the exodiff complains, we can see nodes at (0,-0.8), (0,-0.9), (0,-1), (-0.1,-1), (-0.2,-1). The three bottom right nodes are going to form one triangle ... and then the remaining part of those nodes' convex hull is a perfectly symmetrical trapezoid. In the gold solution, the trapezoid is divided by the diagonal connecting (-0.2,-1) to (0,-0.9), and that's a Delaunay mesh, but based on the exodiff the failing solution divides that trapezoid by the diagonal connecting (-0.1,-1) to (0,-0.8), and technically that's still a success because that's also a Delaunay mesh. Yay symmetry.

I'm not sure what the best solution is here. When I created my own XYDelaunay tests I tried to add asymmetries to the domain to prevent any ambiguity from showing up in them ... but Civet is saying that two of my own tests have failed too!?
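The symmetry argument above can be checked numerically: the four trapezoid vertices quoted in that comment are in fact concyclic, which is exactly the degenerate case where both diagonals satisfy the empty-circumcircle criterion and the tie-break is left to the mesher. A quick sketch (coordinates taken from the comment; the helper is a standard circumcenter formula, not MOOSE code):

```python
import math

def circumcircle(a, b, c):
    """Center and radius of the circle through three non-collinear points."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

# Trapezoid vertices left over after the bottom-right corner triangle
A, B, C, D = (0.0, -0.8), (0.0, -0.9), (-0.1, -1.0), (-0.2, -1.0)

# Circle through A, B, D; test whether C lies on it too
(ux, uy), r = circumcircle(A, B, D)
on_circle = abs(math.hypot(C[0] - ux, C[1] - uy) - r) < 1e-9
print(on_circle)  # → True: all four points are concyclic
```

Since all four points lie on one circle, swapping the diagonal from (-0.2,-1)–(0,-0.9) to (-0.1,-1)–(0,-0.8) still produces a valid Delaunay mesh, as the comment argues.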

roystgnr avatar Feb 15 '24 21:02 roystgnr

What the hell. I just looked at the diffs on this PR. I stand by my explanation of why those tests might get knocked over by a feather, but where is the feather? There should at least be something changing the executables, the underlying malloc/free, something that could get us iterating over elements in a different order.

roystgnr avatar Feb 15 '24 21:02 roystgnr

That's what I'm saying. The failures on this PR are insane

lindsayad avatar Feb 15 '24 21:02 lindsayad

example stack traces from the parallel hangs:

process 0
    frame #14: 0x000000010196cfd4 libmpifort.12.dylib`mpi_allreduce_ + 144
    frame #15: 0x000000010af695b8 libpetsc.3.20.dylib`mumps_propinfo_ + 72
    frame #16: 0x000000010ae96b5c libpetsc.3.20.dylib`dmumps_solve_driver_ + 15164
    frame #17: 0x000000010aef86f4 libpetsc.3.20.dylib`dmumps_ + 1748
    frame #18: 0x000000010aefda2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
    frame #19: 0x000000010aef683c libpetsc.3.20.dylib`dmumps_c + 3552
    frame #20: 0x000000010a5e3ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
    frame #21: 0x000000010a60f3c4 libpetsc.3.20.dylib`MatSolve + 292
    frame #22: 0x000000010a9b5f6c libpetsc.3.20.dylib`PCApply_LU + 84
    frame #23: 0x000000010a99f96c libpetsc.3.20.dylib`PCApply + 204
    frame #24: 0x000000010a9a15ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
    frame #25: 0x000000010aba0af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
    frame #26: 0x000000010ab48aac libpetsc.3.20.dylib`KSPSolve_Private + 1056
    frame #27: 0x000000010ab48648 libpetsc.3.20.dylib`KSPSolve + 16
    frame #28: 0x000000010abda684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
    frame #29: 0x000000010ac18ff8 libpetsc.3.20.dylib`SNESSolve + 1372
process 1
    frame #14: 0x0000000104cb0fd4 libmpifort.12.dylib`mpi_allreduce_ + 144
    frame #15: 0x000000010e2b98b0 libpetsc.3.20.dylib`mumps_sol_rhsmapinfo_ + 160
    frame #16: 0x000000010e1debdc libpetsc.3.20.dylib`dmumps_solve_driver_ + 31676
    frame #17: 0x000000010e23c6f4 libpetsc.3.20.dylib`dmumps_ + 1748
    frame #18: 0x000000010e241a2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
    frame #19: 0x000000010e23a83c libpetsc.3.20.dylib`dmumps_c + 3552
    frame #20: 0x000000010d927ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
    frame #21: 0x000000010d9533c4 libpetsc.3.20.dylib`MatSolve + 292
    frame #22: 0x000000010dcf9f6c libpetsc.3.20.dylib`PCApply_LU + 84
    frame #23: 0x000000010dce396c libpetsc.3.20.dylib`PCApply + 204
    frame #24: 0x000000010dce55ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
    frame #25: 0x000000010dee4af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
    frame #26: 0x000000010de8caac libpetsc.3.20.dylib`KSPSolve_Private + 1056
    frame #27: 0x000000010de8c648 libpetsc.3.20.dylib`KSPSolve + 16
    frame #28: 0x000000010df1e684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
    frame #29: 0x000000010df5cff8 libpetsc.3.20.dylib`SNESSolve + 1372
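
For reference, backtraces like these can be captured from a hung MPI job by attaching a debugger to each rank; the PIDs below are placeholders, and `moose_test-opt` is just an example binary name:

```shell
# Attach lldb to each hung rank, print every thread's backtrace, then
# detach without killing the process. Find the rank PIDs first, e.g.:
#   pgrep moose_test-opt
lldb --batch -p <pid0> -o 'bt all' -o 'detach'
lldb --batch -p <pid1> -o 'bt all' -o 'detach'
```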

lindsayad avatar Feb 15 '24 21:02 lindsayad

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine. But I still get the same exodiff that Civet is reporting.

EDIT: Just saying this so you don't spin your wheels too much on the hangs. Unless you want to! I have INL's IM staff looking into them.

milljm avatar Feb 15 '24 21:02 milljm

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine.

It sounds like most recently on your personal machine you do get the hang though right? That was with your newest mpich stack?

lindsayad avatar Feb 16 '24 00:02 lindsayad

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine.

It sounds like most recently on your personal machine you do get the hang though right? That was with your newest mpich stack?

Yeah, crazy! This is the first time I've seen it hang on non-INL equipment. Simple hello-world examples run to completion just fine.

milljm avatar Feb 16 '24 14:02 milljm

Closing in favor of #26839, which includes (will include) all the Python 3.11 fixes.

milljm avatar May 01 '24 14:05 milljm