HDF5 issue with parallel execution on clusters

Open peyvanahmad opened this issue 2 years ago • 3 comments

Hello,

I am trying to run the Mach 3 step test problem on a cluster using MPI. The program raises an HDF5-related error when the mesh file or solution file is being written to the "out" folder. Sometimes, when I change the number of cores, the simulation runs and the solution files are written just fine, but some time into the simulation the HDF5 error is raised again. Here is a sample of the error that I get:

ERROR: ERROR: LoadError: LoadError: HDF5.API.H5Error: Error getting attribute name libhdf5 Stacktrace: [1] H5Aget_name: Invalid arguments to routine/Inappropriate type not an attribute Stacktrace: [1] macro expansion @ ~/.julia/packages/HDF5/wWr4z/src/api/HDF5.API.H5Error: Error getting attribute name libhdf5 Stacktrace:error.jl:18 [inlined] [2] [1] h5a_get_name(H5Aget_name: Invalid arguments to routine/Inappropriate typeattr_id:: not an attribute

The operating system on the cluster is Red Hat. I am using openmpi_4.0.0_gcc, and the file system is GPFS.

peyvanahmad avatar Apr 03 '22 02:04 peyvanahmad

Thanks for reaching out, @peyvanahmad! Could you let me know a little more about the circumstances of when the error occurs:

  • Does it happen only when running in parallel or also when running in serial?
  • Does it happen reproducibly, i.e., does it always fail with a given setup, or is it a sporadic occurrence?
  • How many ranks do you use and how far into the simulation does it fail (i.e., after how many time steps)?
  • Do you use the cluster's HDF5 library (by setting JULIA_HDF5_PATH) or the one auto-installed as a JLL package by Julia?
  • If you use the cluster's HDF5 library: Do you try to write to the files in parallel or do you use Trixi's current serialized I/O approach?
  • Can you reproduce the error with a different MPI library on your cluster, i.e., something other than openmpi_4.0.0_gcc?

Finally, it would be helpful if you could post the full error message, or at least the full stacktrace to figure out in which routine the error occurs.

sloede avatar Apr 03 '22 03:04 sloede
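
A note for readers hitting the same problem: the error text quoted in the opening post looks like output from several MPI ranks written to the same stream at once, which makes the stacktrace hard to read. One way to get the full stacktrace requested above is to have each rank write its own error to a separate file. The following is only a minimal sketch using Julia's standard library. The helper name `with_rank_log`, the log file naming, and the elixir path in the usage comment are hypothetical, and reading the rank from `OMPI_COMM_WORLD_RANK` assumes the job is launched with Open MPI.

```julia
# Hypothetical helper (not part of Trixi.jl): run `f` and, if it throws,
# write the full error message and stacktrace to a per-rank file, then rethrow.
function with_rank_log(f; logdir = "out")
    # Open MPI exports OMPI_COMM_WORLD_RANK to every launched process.
    # Fall back to "0" for serial runs or other launchers (assumption).
    rank = get(ENV, "OMPI_COMM_WORLD_RANK", "0")
    try
        return f()
    catch err
        mkpath(logdir)
        logfile = joinpath(logdir, "error_rank$(rank).log")
        open(logfile, "w") do io
            # Print the exception together with the backtrace of the failure
            showerror(io, err, catch_backtrace())
            println(io)
        end
        rethrow()
    end
end

# Usage sketch, assuming Trixi.jl is installed and "path/to/elixir.jl"
# stands in for the elixir actually being run:
#
#   using Trixi
#   with_rank_log() do
#       trixi_include("path/to/elixir.jl")
#   end
```

Because each rank writes to its own file, the messages cannot interleave, and the file from a single failing rank can be attached to the issue as is.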

It happens when running in parallel, and it happens sporadically: sometimes the simulation runs through to the end in parallel, and sometimes the HDF5 error occurs. When I run on a single computational node, the error appears less frequently. I used 32 ranks on a single computational node (I have attached the complete error file here). The error happened at time step 88,000. I use the HDF5 package auto-installed by Julia. I tried running the code with a different MPI library, but openmpi_4.0.0_gcc is the one that is most robust.

-- Ahmad Peyvan, Ph.D. Candidate, Department of Mechanical and Industrial Engineering, University of Illinois at Chicago (UIC)

peyvanahmad avatar Apr 05 '22 14:04 peyvanahmad

> I have attached the complete error file here

Unfortunately, I cannot find anything. Can you please try again to add it to the GitHub issue directly on the website? Also, it would be great if you could include the elixir you're using and the exact Julia command you used to start it.

sloede avatar Apr 06 '22 12:04 sloede