Exodus Can't Write Large Meshes

Open friedmud opened this issue 6 years ago • 11 comments

Seen this several times now... Exodus simply can't handle meshes over about 200M elements:

MooseMesh::prepare()
 Mesh Information:
  elem_dimensions()={3}
  spatial_dimension()=3
  n_nodes()=259279152
    n_local_nodes()=259279152
  n_elem()=253736832
    n_local_elem()=253736832
    n_active_elem()=253736832
  n_subdomains()=11
  n_partitions()=1
  n_processors()=1
  n_threads()=1
  processor_id()=0
Error writing element blocks.
Stack frames: 13
0: libMesh::print_trace(std::ostream&)
1: libMesh::MacroFunctions::report_error(char const*, int, char const*, char const*)
2: libMesh::ExodusII_IO_Helper::write_elements(libMesh::MeshBase const&, bool)
3: libMesh::ExodusII_IO::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
4: MeshOnlyAction::act()
5: Action::timedAct()
6: ActionWarehouse::executeActionsWithAction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
7: ActionWarehouse::executeAllActions()
8: MooseApp::runInputFile()
9: MooseApp::run()
10: /home/gastdr/projects/lemhi_new/moose/test/moose_test-opt() [0x402297]
11: __libc_start_main
12: /home/gastdr/projects/lemhi_new/moose/test/moose_test-opt() [0x40250c]
[0] ../src/mesh/exodusII_io_helper.C, line 1417, compiled Mar  6 2019 at 09:36:38
Error closing Exodus file.

No idea what the problem is - but it's a serious bummer.

Of course, I'm not trying to RUN with these meshes... they are just intermediaries before being split. The workaround for now is to use XDR instead... but writing XDR meshes that large is prohibitively slow... so there aren't really any good solutions!

friedmud avatar Mar 11 '19 17:03 friedmud

Forgive my ignorance here, but I thought Nemesis was the parallel version of Exodus that is meant for large meshes?

pbauman avatar Mar 11 '19 17:03 pbauman

It is - but all of our mesh generation routines are serial only. These Exodus files happen at a certain step in our mesh generation process... then we take them and split them into Nemesis.

It is technically possible to generate serially on each processor - then directly split (without the Exodus intermediary). But the mesh generation itself can take several hours... and I might not know how many different numbers of processors I want to run on - so it's useful to have the Exodus file output so if I need a new splitting I can easily do that.

friedmud avatar Mar 11 '19 17:03 friedmud

One thing that might be relatively easy to try is building libmesh with HDF5 support. Then Exodus will write files in the NetCDF4 format, which should be better at handling larger filesizes.

If that doesn't work... there may be a bug in the writer itself that is only exposed by really large meshes :cry:

jwpeterson avatar Mar 11 '19 18:03 jwpeterson

I'm pretty sure the problem is on our end. Two main things:

  1. We should be passing 64bit flags to Exodus when the library is configured with dof_id_type = 8bytes: https://gsjaardema.github.io/seacas/html/index.html#int64

  2. There are TONS of ints running around in exodusII_io_helper! That's not going to help at all! All of those should be turned into dof_id_type....
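
For illustration only, here is a minimal sketch of the kind of guard point 2 is getting at. `checked_int` and `connectivity_entries` are hypothetical helpers, not libMesh or Exodus functions, and the node counts per element are assumed, since the thread doesn't say which element type the failing mesh used. The element count from the log fits in an int, but a product such as elements times nodes-per-element may not:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a libMesh or Exodus function): narrow a 64-bit
// count to int, throwing instead of silently wrapping around.
inline int checked_int(std::uint64_t value, const char * what)
{
  assert(what != nullptr);
  const auto int_max =
    static_cast<std::uint64_t>(std::numeric_limits<int>::max());
  if (value > int_max)
    throw std::overflow_error(std::string(what) + " does not fit in int");
  return static_cast<int>(value);
}

// Number of entries in a connectivity array, computed in 64 bits so the
// product itself does not overflow for realistic mesh sizes.
inline std::uint64_t connectivity_entries(std::uint64_t n_elem,
                                          std::uint64_t nodes_per_elem)
{
  return n_elem * nodes_per_elem;
}
```

With the 253,736,832 elements from the log and a 32-bit int, 8 nodes per element gives 2,029,894,656 entries, which only just fits, while 27 nodes per element would make `checked_int` throw.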

friedmud avatar Mar 19 '19 21:03 friedmud

Our Exodus files are written in either 64bit-offset mode or NetCDF4 (run ncdump -k to see the type). But you shouldn’t be running into int limits unless you have 2 billion nodes... in your last email it was 200 million. I agree that we should update the Exodus writer at some point... we have a much older version (5.22) at the moment so I don’t think it has the 64-bit API that you linked.
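
As a rough illustration of what distinguishes these formats on disk, the sketch below classifies a file by its leading bytes. It is not how ncdump is implemented; `classify_magic` and `classify_file` are made-up names, and it assumes the HDF5 signature sits at offset 0, which is the usual case but not guaranteed. It also can't tell the netCDF-4 and netCDF-4 classic model variants apart, which `ncdump -k` can.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Classify a netCDF-family file from its first bytes (illustrative only).
inline std::string classify_magic(const unsigned char * bytes, std::size_t n)
{
  assert(bytes != nullptr || n == 0);

  // HDF5 signature; netCDF-4 files are HDF5 files underneath.
  static const unsigned char hdf5_sig[8] =
    {0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
  if (n >= 8 && std::memcmp(bytes, hdf5_sig, 8) == 0)
    return "HDF5-based (netCDF-4)";

  // Classic-family files start with "CDF" followed by a version byte.
  if (n >= 4 && bytes[0] == 'C' && bytes[1] == 'D' && bytes[2] == 'F')
  {
    switch (bytes[3])
    {
      case 1: return "classic";
      case 2: return "64-bit offset";
      case 5: return "CDF-5 (64-bit data)";
      default: break;
    }
  }
  return "unknown";
}

// Read up to 8 bytes from the file at `path` and classify them.
// A file that cannot be opened yields "unknown".
inline std::string classify_file(const std::string & path)
{
  std::ifstream in(path, std::ios::binary);
  unsigned char buf[8] = {};
  in.read(reinterpret_cast<char *>(buf), sizeof buf);
  return classify_magic(buf, static_cast<std::size_t>(in.gcount()));
}
```

Calling `classify_file` on an Exodus output file would then give a quick hint about whether the 64-bit offset limits or the HDF5-backed format apply to it.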

jwpeterson avatar Mar 19 '19 22:03 jwpeterson

Hmmm - not exactly. Check out the info here:

https://gsjaardema.github.io/seacas/html/exodus_formats.html

In 64-bit offset mode it runs into trouble at 134M elements: when writing the connectivity.

BTW: I tried to pass:

EX_ALL_INT64_DB | EX_ALL_INT64_API

as flags - and it didn't complain during the writing - but it segfaults on the reading. I guess that even if our Exodus API supported those flags... then we would have to change our reading routines to use 64bit arrays instead of 32bit...

friedmud avatar Mar 19 '19 23:03 friedmud

Hmm, OK, these numbers do make sense given the 4 GiB (2^32 bytes) limit for any single dataset in the file. The 134M number is specific to HEX8s; it gets worse (39.7M elements max) if you are using HEX27s. So, the limiting factor for the "Large Model (64-bit offset)" file format is never going to be numeric_limits<int>::max() (typically 2^31 - 1)... it's going to be the numbers above.

If you write in the "Netcdf-4 Non-Classic" format (which is now our default if HDF5 is available), then numeric_limits<int>::max() is going to be the limiting factor, but the current implementation should still allow you to have up to about 2^31 nodes and 2^31 elements. Storing just the connectivity for that many HEX8s would require 2^31 * 8 * 4 bytes = 64 GiB!
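
The arithmetic behind those figures can be sketched as follows. This takes the per-dataset limit as exactly 2^32 bytes and each connectivity entry as a 4-byte integer, both simplifying assumptions; the function and constant names are made up for this example.

```cpp
#include <cstdint>

// Assumed per-dataset size limit in 64-bit offset mode: 4 GiB.
constexpr std::uint64_t kDatasetLimitBytes = 1ULL << 32;
// Assumed size of one connectivity entry (a 32-bit node id).
constexpr std::uint64_t kBytesPerEntry = 4;

// Largest number of elements whose connectivity fits in one dataset.
constexpr std::uint64_t max_elems_per_dataset(std::uint64_t nodes_per_elem)
{
  return kDatasetLimitBytes / (kBytesPerEntry * nodes_per_elem);
}

// Bytes needed to store the connectivity of n_elem elements.
constexpr std::uint64_t connectivity_bytes(std::uint64_t n_elem,
                                           std::uint64_t nodes_per_elem)
{
  return n_elem * nodes_per_elem * kBytesPerEntry;
}

static_assert(max_elems_per_dataset(8) == 134217728, "HEX8: about 134M");
static_assert(max_elems_per_dataset(27) == 39768215, "HEX27: about 39.7M");
static_assert(connectivity_bytes(1ULL << 31, 8) == (64ULL << 30),
              "2^31 HEX8s need 64 GiB of connectivity");
```

The checks are done at compile time, so the block either compiles (the numbers agree) or fails with the corresponding message.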

jwpeterson avatar Mar 20 '19 16:03 jwpeterson

I'm wondering if there have been developments regarding issue #2065.

I'm running OpenMC simulations and it has been crashing when requesting libMesh to write an ExodusII/Nemesis output.

I'm using Ubuntu 22.04.4 LTS with a compiled version of OpenMC 0.15.0 with libMesh. I’ve tried pointing OpenMC to the libMesh from MOOSE, and also building libMesh from scratch. Both cases give the same error.

Below is the output from OpenMC built with MOOSE’s libMesh while running this notebook:

(...)
       99/1    0.23078    0.23149 +/- 0.00072
      100/1    0.23166    0.23150 +/- 0.00072
 Creating state point statepoint.100.h5...
 Writing file: tally_1.100.e for unstructured mesh 1
libMesh terminating:
Error creating ExodusII/Nemesis mesh file.
Stack frames: 16
0: libMesh::print_trace(std::ostream&)
1: libMesh::MacroFunctions::report_error(char const*, int, char const*, char const*, std::ostream&)
2: libMesh::ExodusII_IO_Helper::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
3: libMesh::ExodusII_IO::write_nodal_data_common(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool)
4: libMesh::ExodusII_IO::write_nodal_data_discontinuous(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<double, std::allocator<double> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
5: libMesh::ExodusII_IO::write_discontinuous_exodusII(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, libMesh::EquationSystems const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*)
6: openmc::LibMesh::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
7: openmc::write_unstructured_mesh_results()
8: openmc_statepoint_write
9: openmc::finalize_batch()
10: openmc_next_batch
11: openmc_run
12: main
13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0b2d6b9d90]
14: __libc_start_main
15: openmc(+0xce35) [0x56249e8b9e35]
[0] ../src/mesh/exodusII_io_helper.C, line 2185, compiled Sep 10 2024 at 09:42:37

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I’d appreciate any help. Thanks Luiz

luiz-bn avatar Sep 17 '24 00:09 luiz-bn

Hello @luiz-bn,

How large is the mesh you are trying to write? Do you have libmesh compiled with HDF5 support?

I'm not sure if the error message you are reporting,

Error creating ExodusII/Nemesis mesh file.

is the same as the original error reported on this Issue, which occurred while writing the element blocks, not while just creating the file.

jwpeterson avatar Sep 17 '24 18:09 jwpeterson

Hi @jwpeterson, Thanks for getting back to me and apologies for the delayed reply. I've been investigating the issue further, and it's likely that OpenMC hasn't been updated to work with the latest versions of libMesh. Some OpenMC users have reported the same error as me here https://openmc.discourse.group/t/error-when-writing-libmesh-exodus-file/4912.

luiz-bn avatar Dec 10 '24 04:12 luiz-bn

If they're running with a recent libMesh when hitting map_find() error: key “255” not found in file …/src/mesh/exodusII_io_helper.C on line 544, that's the line with auto & maps_for_dim = libmesh_map_find(conversion_map, this->num_dim);, and here the key is this->num_dim, i.e. the dimensionality we've either autodetected or been requested to write. But ... there's no way we'll ever autodetect that our mesh is 255-dimensional ... could someone have manually called MeshBase::set_spatial_dimension(d) with a d large enough to truncate??
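
To see how a key of 255 could arise, here is a tiny sketch. It assumes the dimension passes through an 8-bit unsigned type at some point, which is a guess consistent with the "truncate" remark above and not something confirmed in this thread; `store_dimension` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical stand-in for storing a requested dimension in 8 bits.
// Conversion to an unsigned type keeps only the low 8 bits of the value.
constexpr std::uint8_t store_dimension(long long requested)
{
  return static_cast<std::uint8_t>(requested);
}

static_assert(store_dimension(3) == 3, "ordinary values are preserved");
static_assert(store_dimension(-1) == 255, "-1 wraps to 255");
static_assert(store_dimension(511) == 255, "511 truncates to 255");
```

Under that assumption, an uninitialized or sentinel value such as -1 would be as plausible a source of the 255 as an oversized argument.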

It would be interesting if you could get in there with a debugger, and catch calls to ExodusII_IO_Helper::write_as_dimension() and MeshBase::set_spatial_dimension() to see if someone's doing that wrong, and also break at around 2279 to see which branch we end up using there.

Either way I'd say open a new issue; this seems to be unrelated to the 32-bit vs 64-bit writes. (unless you really are trying to write hundreds of millions of elements?)

roystgnr avatar Dec 10 '24 14:12 roystgnr