SST queue limit default
As observed in Issue #2601, the SST queue limit default of unlimited queueing is dangerous when the network or the reader cannot keep up with the writer's data production rate: queued timesteps accumulate until the writer runs out of memory. While no specific number is obviously a better default across the board, some solution should be considered. E.g., if the queue limit is left at the default, a more "reasonable" count-based limit could be guessed from the size of the first timestep on rank 0 and distributed to the other ranks during normal EndStep() communication. (Or based on the second timestep, given that the first might often be unique?)
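For context, the queue limit is already controllable explicitly. A runtime config sketch along these lines caps the queue instead of relying on the unlimited default (parameter names follow the SST engine documentation; the io name and the values are purely illustrative):

```xml
<!-- Illustrative ADIOS2 runtime config: cap the SST writer-side queue.
     "output" is a placeholder io name; check QueueLimit/QueueFullPolicy
     against the SST docs for your ADIOS2 version. -->
<adios-config>
  <io name="output">
    <engine type="SST">
      <parameter key="QueueLimit" value="3"/>
      <parameter key="QueueFullPolicy" value="Block"/>
    </engine>
  </io>
</adios-config>
```

The open question here is only what should happen when the user does *not* set anything.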
Maybe there is no need to overthink this? I can imagine that a warning printed to stderr when one rank exceeds some predefined memory limit would already help a lot in identifying the reason for a crash.
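The warning idea could be as small as this sketch (all names invented for illustration, not ADIOS2 internals): track the bytes currently held in the send queue per rank and print a one-shot stderr warning when a soft limit is crossed, so an eventual OOM kill at least leaves a diagnosable trace.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: accumulate queued-timestep bytes on the writer
// and warn (once) on stderr when a predefined soft limit is exceeded.
struct QueueWatch
{
    std::size_t queuedBytes = 0;
    std::size_t softLimitBytes;
    bool warned = false;

    explicit QueueWatch(std::size_t limit) : softLimitBytes(limit) {}

    // Call when a timestep of stepBytes is queued for sending.
    // Returns true iff this call triggered the one-shot warning.
    bool OnQueueStep(int rank, std::size_t stepBytes)
    {
        queuedBytes += stepBytes;
        if (!warned && queuedBytes > softLimitBytes)
        {
            warned = true;
            std::fprintf(stderr,
                         "SST WARNING: rank %d holds %zu bytes of unsent "
                         "timesteps (soft limit %zu); the reader may be "
                         "too slow to keep up\n",
                         rank, queuedBytes, softLimitBytes);
            return true;
        }
        return false;
    }
};
```

A counterpart hook when the reader releases a timestep would decrement `queuedBytes` again; the warning costs nothing in the common case and needs no hard decision about blocking or discarding.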
I think the question would be, what's that limit? What would have worked for you? Do you know how many timesteps got queued before you ran out of memory? One huge timestep isn't necessarily a problem, as long as there's memory for it...
That's probably a non-trivial question, yeah. If you want to go with this solution, what makes sense depends strongly on the job setup and on the compute hardware. On Summit, each compute core has access to ~12GB of main memory (which is already a lot; the cluster where we observed the issue has 1.4GB per core, and that has already been a problem in other contexts). So maybe take the "currently usual" memory per core (whatever that would be) and divide it by 2 as a safeguard (and also because I/O should not use more than half the system resources)? Otherwise, exposing this as a further engine parameter might make sense, but since this issue is about default configurations, a sensible default should still be found.
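The arithmetic above can be sketched as a default-guessing helper (hypothetical, not existing ADIOS2 code): take half the per-core memory as the I/O budget and divide by the size of one observed timestep (e.g. the second one, measured on rank 0 and distributed during EndStep() communication) to get a count-based queue limit.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical heuristic: derive a count-based SST queue limit from
// the memory available per core and the size of one observed timestep.
std::size_t GuessQueueLimit(std::size_t memPerCoreBytes,
                            std::size_t timestepBytes)
{
    // I/O should not claim more than half the system resources.
    const std::size_t budget = memPerCoreBytes / 2;
    if (timestepBytes == 0)
    {
        return 1; // degenerate case: nothing measured yet
    }
    // One huge timestep isn't necessarily a problem, so always
    // allow at least one queued step even if it exceeds the budget.
    return std::max<std::size_t>(1, budget / timestepBytes);
}
```

With Summit's ~12GB per core this gives a 6GB budget, so 100MB timesteps would yield a limit of 60; on the 1.4GB-per-core cluster the same timesteps would cap the queue at 7.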