reduce memory usage of nucleation model implementation in raccoon

Open BoZeng1997 opened this issue 1 year ago • 30 comments

The two nucleation models for phase-field fracture consume a lot of memory. The cause may be how the material object is coded, how the model is set up at the input-deck level, or both.

source code

https://github.com/BoZeng1997/raccoon/blob/c24df81ba4ef97f1b3490821daa631d961e3e68d/src/materials/KLRNucleationMicroForce.C https://github.com/BoZeng1997/raccoon/blob/c24df81ba4ef97f1b3490821daa631d961e3e68d/include/materials/KLRNucleationMicroForce.h

how the model is implemented

https://github.com/BoZeng1997/raccoon/tree/c24df81ba4ef97f1b3490821daa631d961e3e68d/tutorials/surfing_boundary_problem The current implementation is certainly not the best approach. It requires dispx, dispy, and dispz to be transferred to the subapp, which then computes the stress tensor invariants I1 and J2. One way to improve it a little is to compute I1 and J2 in the mainapp and transfer them to the subapp. I would like to know whether there is an even better way to improve it.

BoZeng1997 avatar Jul 05 '23 17:07 BoZeng1997

Please take a look at it when you have time @permcody . Thanks.

BoZeng1997 avatar Jul 05 '23 17:07 BoZeng1997

I am checking my derivative size setting. The old 70GB-per-CPU cases were run with size=900, so it is no wonder they were so memory consuming. I just found out that the minimum size needed to run the same problem differs between machines. Does this make sense? Or is it a sign of a bug in my code or an inappropriate compilation setting?

I am running the same problem (same mesh, input deck, number of CPUs, MOOSE version ...) on a workstation and on the Duke cluster. Both use a mamba environment. On the workstation, --with-derivative-size=150 runs the problem fine. On the cluster, size=300 reported:

We caught a MetaPhysicL error in while performing element or face loops. This is potentially due to AD not having a sufficiently large derivative container size. To increase the AD container size, you can run configure in the MOOSE root directory with the '--with-derivative-size=<n>' option and then recompile. Other causes of MetaPhysicL logic errors include evaluating functions where they are not defined or differentiable like sqrt (which gets called for vector norm functions) or log with arguments <= 0

Any comments? @permcody @recuero

BoZeng1997 avatar Aug 29 '23 17:08 BoZeng1997
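
For readers wondering why the derivative size matters so much for memory, here is a minimal sketch. It assumes each AD scalar carries a fixed-capacity block of N derivative values plus N matching indices, which is what the `SemiDynamicSparseNumberArray<double, unsigned long, NWrapper<150ul>>` type in the traces further down suggests. The types `MockDerivatives` and `MockADReal` and the helpers `bytesPerADReal` and `mebibytes` are hypothetical stand-ins, not MOOSE or MetaPhysicL classes, and the real layout may differ.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a fixed-capacity sparse derivative container:
// N slots for derivative values, N slots for the matching DoF indices,
// plus a counter for how many slots are currently in use.
template <std::size_t N>
struct MockDerivatives
{
  std::array<double, N> values;
  std::array<std::size_t, N> indices;
  std::size_t used = 0;
};

// Hypothetical stand-in for an AD scalar: one value plus its derivatives.
template <std::size_t N>
struct MockADReal
{
  double value = 0.0;
  MockDerivatives<N> derivatives;
};

// Bytes occupied by a single mock AD scalar with capacity N.
template <std::size_t N>
constexpr std::size_t bytesPerADReal()
{
  return sizeof(MockADReal<N>);
}

// Rough storage, in MiB, for `count` mock AD scalars with capacity N.
template <std::size_t N>
constexpr double mebibytes(std::size_t count)
{
  return static_cast<double>(count) * bytesPerADReal<N>() / (1024.0 * 1024.0);
}
```

Under this assumed layout the full capacity is reserved whether or not the slots are used, so storage grows linearly with N: on a typical 64-bit platform roughly 2.4 kB per scalar for N = 150 versus roughly 14 kB for N = 900.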

That's a bit surprising to me. For both cases (150 and 300), did you configure MOOSE and then compile your application?

recuero avatar Aug 29 '23 17:08 recuero

yes. ./configure --with-derivative-size=<n> in moose/scripts/ then compile.

BoZeng1997 avatar Aug 29 '23 17:08 BoZeng1997

You have std::sqrts in your models for AD objects. You could protect against a derivative divide by zero by adding a positive epsilon (see https://github.com/idaholab/moose/blob/ee15815834405de6cc5ccccd988d42a38c0dac6c/modules/contact/src/constraints/ComputeFrictionalForceLMMechanicalContact.C#L223). That might help since it's mentioned in the message itself.

recuero avatar Aug 29 '23 19:08 recuero

Thanks for the advice, but I am not sure I understand it. So the goal is to protect against a possible division by zero that comes from std::sqrt terms, not in the part of the code we can see (there is no explicit division by sqrt() in the code) but somewhere during the computation, is that right? And I should add this small epsilon to all std::sqrt(not_a_number) terms to implement the protection? Also, in the situation I just mentioned, I was not using my new Material code. The issue existed before I started using my new material object. I am now testing whether the same issue occurs with MOOSE-only test files. I will post the result when it is available.

BoZeng1997 avatar Aug 29 '23 20:08 BoZeng1997

The issue is in the derivative of sqrt(ADReal(0)), which is ~ 1/sqrt(0). It may be that a similar issue is found in other parts of the code, not necessarily yours. You could run the model through the debugger and find out what's triggering that MetaPhysicL error.

recuero avatar Aug 29 '23 20:08 recuero
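
To make the point about the derivative of sqrt at zero concrete, here is a minimal sketch using a hand-rolled forward-mode dual number. The `Dual` struct and the functions `sqrtDual` and `safeSqrtDual` are hypothetical illustrations, not MOOSE's `ADReal`, and the default epsilon is an arbitrary choice made for this example.

```cpp
#include <cmath>

// Minimal forward-mode dual number with a single derivative slot.
// Hypothetical illustration only; this is not MOOSE's ADReal.
struct Dual
{
  double value;
  double deriv;
};

// sqrt with the chain rule: d/dx sqrt(u) = u' / (2 sqrt(u)).
// At u = 0 the denominator vanishes, so the derivative is not finite.
inline Dual sqrtDual(const Dual & u)
{
  const double root = std::sqrt(u.value);
  return {root, u.deriv / (2.0 * root)};
}

// Same operation with a small positive epsilon added to the argument,
// which keeps the derivative finite at zero. The default is arbitrary.
inline Dual safeSqrtDual(const Dual & u, double eps = 1e-12)
{
  return sqrtDual({u.value + eps, u.deriv});
}
```

With `Dual x{0.0, 1.0}`, the derivative from `sqrtDual(x)` is infinite under IEEE arithmetic even though the value itself is a perfectly valid 0, while `safeSqrtDual(x)` returns a large but finite derivative. This is why no explicit division needs to appear in the material code for the problem to occur.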

It is weird. On the Duke cluster, opt hits the above error but dbg runs fine. What can I learn from the different behaviors? The dbg executable was compiled in the same environment as opt.

BoZeng1997 avatar Aug 31 '23 20:08 BoZeng1997

That behavior seems a bit odd to me... @lindsayad

recuero avatar Aug 31 '23 20:08 recuero

> You have std::sqrts in your models for AD objects. You could protect against a derivative divide by zero by adding a positive epsilon (see https://github.com/idaholab/moose/blob/ee15815834405de6cc5ccccd988d42a38c0dac6c/modules/contact/src/constraints/ComputeFrictionalForceLMMechanicalContact.C#L223). That might help since it's mentioned in the message itself.

I have already applied a treatment to the object that is passed to std::sqrt: https://github.com/BoZeng1997/raccoon/blob/f583027e2e500111e5f264841fe45353aa630ac4/src/materials/KLRNucleationMicroForce.C#L94-L99 In this case, do I still need to add a small epsilon inside sqrt()?

BoZeng1997 avatar Aug 31 '23 21:08 BoZeng1997

I would run your input with valgrind to make sure there are no uninitialized values

lindsayad avatar Aug 31 '23 21:08 lindsayad

> I would run your input with valgrind to make sure there are no uninitialized values

I will post the input and mesh very soon. It is not the example listed at the beginning of this issue.

BoZeng1997 avatar Aug 31 '23 21:08 BoZeng1997

You should do that, not me 😄

lindsayad avatar Aug 31 '23 21:08 lindsayad

I'm optimistic about valgrind telling us something useful.

hugary1995 avatar Aug 31 '23 21:08 hugary1995

> You should do that, not me 😄

Oops, sorry, I misunderstood.

BoZeng1997 avatar Aug 31 '23 21:08 BoZeng1997

I think this is the valgrind msg related to uninitialized value(s). It was printed before the moose executable printed the ad derivative size error.

==2055451== Invalid read of size 8
==2055451==    at 0x421A4D7: f_ca4ea86d12991e15 (in /hpc/group/dolbowlab/bz75/annular/fracture/fullsolve/nuc/.jitcache/ca4ea86d12991e15.so)
==2055451==    by 0x8A49D3C: ADFParser::Eval(MetaPhysicL::DualNumber<double, MetaPhysicL::SemiDynamicSparseNumberArray<double, unsigned long, MetaPhysicL::NWrapper<150ul> >, true> const*) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8AC2A46: FunctionParserUtils<true>::evaluate(std::shared_ptr<ADFParser>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x831405C: ParsedMaterialHelper<true>::computeQpProperties() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x82F9668: Material::computeProperties() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8215A33: FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7768D6D: NonlinearThread::onElement(libMesh::Elem const*) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x77842BF: ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator()(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, bool) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7BD77BE: void libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeResidualThread>(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, ComputeResidualThread&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7C7CB18: NonlinearSystemBase::computeResidualInternal(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7C7E37A: NonlinearSystemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8287F1E: FEProblemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==  Address 0x1409acf8 is 8 bytes before a block of size 32 alloc'd
==2055451==    at 0x4C38913: operator new(unsigned long) (vg_replace_malloc.c:472)
==2055451==    by 0x552563B: void std::vector<unsigned long, std::allocator<unsigned long> >::_M_realloc_insert<unsigned long>(__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, unsigned long&&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/modules/phase_field/lib/libphase_field-opt.so.0.0.0)
==2055451==    by 0x859EB61: MooseMesh::nodeToActiveSemilocalElemMap() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x777DFAE: BoundaryNodeIntegrityCheckThread::BoundaryNodeIntegrityCheckThread(FEProblemBase&, TheWarehouse::QueryCache<> const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x82B9B2C: FEProblemBase::initialSetup() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x887F5BA: Transient::init() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8C20E5E: MooseApp::executeExecutioner() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8C29041: MooseApp::run() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x10B135: main (in /hpc/group/dolbowlab/bz75/projects/raccoon/raccoon-opt)
==2055451== 

How should I look for the cause of this uninitialized value? Is it an AD variable on the boundary?

BoZeng1997 avatar Aug 31 '23 23:08 BoZeng1997

Judging by the back trace it seems the issue is coming from a parsed material: Can you double check your input? Maybe uninitialized values, as Alex pointed out? Or divide by zero,...

recuero avatar Sep 01 '23 14:09 recuero

@dschwen do you think this is a false positive in the JIT code?

lindsayad avatar Sep 01 '23 14:09 lindsayad

> Judging by the back trace it seems the issue is coming from a parsed material: Can you double check your input? Maybe uninitialized values, as Alex pointed out? Or divide by zero,...

I am trying constant or linear material properties to see if that clears the issue. Can you explain what an uninitialized value in the input deck is? I thought for all quantities in the input deck, when we create them in the input deck, the initial value must be provided to complete the definition.

BoZeng1997 avatar Sep 01 '23 14:09 BoZeng1997

> I thought for all quantities in the input deck, when we create them in the input deck, the initial value must be provided to complete the definition.

I thought so too. Just suggested that you double check in case you see an issue.

recuero avatar Sep 01 '23 14:09 recuero

Can you do that valgrind check with a dbg executable? JIT compilation keeps the function sources in that case and we could check exactly what's going on here.

dschwen avatar Sep 01 '23 16:09 dschwen

> Can you do that valgrind check with a dbg executable? JIT compilation keeps the function sources in that case and we could check exactly what's going on here.

But running the dbg executable does not trigger the error. Assertion `_dynamic_n <= N' failed. appears only when running in opt.

BoZeng1997 avatar Sep 01 '23 16:09 BoZeng1997

That suggests that there is some kind of non-deterministic error. Valgrind will catch this if that's the case regardless of the method you run with. Also how do you know that is the assertion you're triggering? I thought that you were just getting a general MetaPhysicL exception, the cause of which was unknown?

lindsayad avatar Sep 01 '23 17:09 lindsayad

> That suggests that there is some kind of non-deterministic error. Valgrind will catch this if that's the case regardless of the method you run with.

valgrind --leak-check=full --track-origins=yes --show-leak-kinds=all on the dbg executable did not catch any memory error. Here is the summary:

==2575156== HEAP SUMMARY:
==2575156==     in use at exit: 0 bytes in 0 blocks
==2575156==   total heap usage: 3,767 allocs, 3,767 frees, 3,093,909 bytes allocated
==2575156== 
==2575156== All heap blocks were freed -- no leaks are possible
==2575156== 
==2575156== For lists of detected and suppressed errors, rerun with: -s
==2575156== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

> Also how do you know that is the assertion you're triggering? I thought that you were just getting a general MetaPhysicL exception, the cause of which was unknown?

Sorry, I missed this part of the error msg. I only posted the part after ***ERROR***. The complete error msg is here:

Assertion `_dynamic_n <= N' failed.
/hpc/group/dolbowlab/bz75/moose-compilers/mambaforge3/envs/moose/libmesh/include/metaphysicl/dynamic_std_array_wrapper.h, line 74, compiled Jun 18 2023 at 15:36:32

*** ERROR ***
We caught a MetaPhysicL error in while performing element or face loops. This is potentially due to AD not having a sufficiently large derivative container size. To increase the AD container size, you can run configure in the MOOSE root directory with the '--with-derivative-size=<n>' option and then recompile. Other causes of MetaPhysicL logic errors include evaluating functions where they are not defined or differentiable like sqrt (which gets called for vector norm functions) or log with arguments <= 0

Is it normal that libmesh was compiled on Jun 18? I updated mamba this Monday.

BoZeng1997 avatar Sep 01 '23 18:09 BoZeng1997
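
The assertion text reads like a capacity check on a container with a compile-time capacity N and a runtime size `_dynamic_n`. Below is a minimal sketch of that kind of check. The class `BoundedArray` and the helper `fits` are hypothetical and only loosely modeled on the quoted message; they are not the MetaPhysicL implementation.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical fixed-capacity array with a runtime size, loosely modeled on
// the check quoted in the assertion message (_dynamic_n <= N).
template <typename T, std::size_t N>
class BoundedArray
{
public:
  // Set the number of active entries; refuse to exceed the capacity N.
  void resize(std::size_t n)
  {
    if (n > N)
      throw std::logic_error("requested size exceeds compile-time capacity N");
    _dynamic_n = n;
  }

  std::size_t size() const { return _dynamic_n; }
  static constexpr std::size_t capacity() { return N; }

  T & operator[](std::size_t i) { return _data[i]; }

private:
  std::array<T, N> _data{};
  std::size_t _dynamic_n = 0;
};

// Returns true if n derivative entries fit into a container of capacity N.
template <std::size_t N>
bool fits(std::size_t n)
{
  BoundedArray<double, N> a;
  try
  {
    a.resize(n);
  }
  catch (const std::logic_error &)
  {
    return false;
  }
  return true;
}
```

In this sketch the capacity is baked in at compile time, so any code compiled against one N and handed data sized for a larger N would trip the same check.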

I thought the issue was an uninitialized access ...

dschwen avatar Sep 01 '23 18:09 dschwen

@BoZeng1997 what method were you running with when you got the valgrind error?

lindsayad avatar Sep 02 '23 05:09 lindsayad

> I thought the issue was an uninitialized access ...

I got an invalid read error msg when running the opt executable with valgrind. I am not sure if that means uninitialized values.

> what method were you running with when you got the valgrind error?

opt only.

BoZeng1997 avatar Sep 02 '23 06:09 BoZeng1997

Well the next thing I would try is gdb with ‘catch throw’ and see what you can learn when the metaphysicl exception is thrown. It would be good to get a stack trace

lindsayad avatar Sep 02 '23 14:09 lindsayad

This is what I can get with gdb+opt.

Time Step 1, time = 49.5, dt = 0.5
Assertion `_dynamic_n <= N' failed.
/hpc/group/dolbowlab/bz75/moose-compilers/mambaforge3/envs/moose/libmesh/include/metaphysicl/dynamic_std_array_wrapper.h, line 74, compiled Jun 18 2023 at 15:36:32

Thread 1 "raccoon-opt" hit Catchpoint 2 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x555557b2efa0, tinfo=0x7ffff7dab038 <typeinfo for MetaPhysicL::LogicError>, 
    dest=0x7ffff7e047c0 <MetaPhysicL::LogicError::~LogicError()>)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc:80
80      /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-228.el8.x86_64
(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x555557b2efa0, tinfo=0x7ffff7dab038 <typeinfo for MetaPhysicL::LogicError>, dest=0x7ffff7e047c0 <MetaPhysicL::LogicError::~LogicError()>)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffff7e03778 in f_ca4ea86d12991e15.cold () from .jitcache/ca4ea86d12991e15.so
#2  0x00007ffff5709d3d in ADFParser::Eval(MetaPhysicL::DualNumber<double, MetaPhysicL::SemiDynamicSparseNumberArray<double, unsigned long, MetaPhysicL::NWrapper<150ul> >, true> const*) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#3  0x00007ffff5782a47 in FunctionParserUtils<true>::evaluate(std::shared_ptr<ADFParser>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#4  0x00007ffff4fd405d in ParsedMaterialHelper<true>::computeQpProperties() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#5  0x00007ffff4fb9669 in Material::computeProperties() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#6  0x00007ffff4ed5a34 in FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#7  0x00007ffff4428d6e in NonlinearThread::onElement(libMesh::Elem const*) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#8  0x00007ffff44442c0 in ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator()(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, bool) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#9  0x00007ffff48977bf in void libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeResidualThread>(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, ComputeResidualThread&) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#10 0x00007ffff493cb19 in NonlinearSystemBase::computeResidualInternal(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#11 0x00007ffff493e37b in NonlinearSystemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#12 0x00007ffff4f47f1f in FEProblemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#13 0x00007ffff4e8dc36 in FEProblemBase::computeResidualInternal(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&, std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#14 0x00007ffff4e8d5fe in FEProblemBase::computeResidualL2Norm() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#15 0x00007ffff5551691 in FixedPointSolve::solve() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#16 0x00007ffff4cd985e in TimeStepper::step() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#17 0x00007ffff553d6ee in Transient::takeStep(double) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#18 0x00007ffff553a577 in Transient::execute() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#19 0x00007ffff58e0e47 in MooseApp::executeExecutioner() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#20 0x00007ffff58e9042 in MooseApp::run() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#21 0x0000555555557136 in main ()

opt on the cluster with AD derivative size 150 runs after I cleaned the .jitcache/ folder. I think this small issue is solved. What is stored in .jitcache/? When this folder is not cleaned, it sometimes causes my other simulations to always have zero residual.

BoZeng1997 avatar Sep 08 '23 20:09 BoZeng1997

Oh I forgot about this ... if you change your derivative size configuration there are problems with the .jitcache, and the current solution is to do what you did: blow away the .jitcache directory before running. I know that @dschwen is aware of this and I could have sworn we have an issue for it, but I'm struggling to find it at the moment. Sorry for the trouble!

lindsayad avatar Sep 09 '23 00:09 lindsayad