vt
Create LDMS stream publish for phase data
https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html#how-to-make-a-data-connector
You'll need to include the following headers: ldms.h, ldmsd_stream.h, and util.h. For example, in C++ code, add the following:
#include <ldms/ldms.h>
#include <ldms/ldmsd_stream.h>
#include <ovis_util/util.h>
You'll also need the object:
ldms_t* ldms
The function you'll need to add for publishing messages is:
ldmsd_stream_publish( (*ldms), <NAME_OF_SCHEMA>, <TYPE_OF_MSG>, <MSG_OBJECT>)
So, for example, if someone wanted to send Kokkos data as a JSON to their database, the function would look like this:
ldmsd_stream_publish( (*ldms), "kokkos-perf-data", LDMSD_STREAM_JSON,
Title should read LDMS?
Also, vt's internal diagnostics seem to be perfect for feeding out to LDMS.
The URL for the LDMS documentation has recently been updated to: https://ovis-hpc.readthedocs.io/en/latest/ldms/ldms-streams.html#how-to-make-a-data-connector
I'm getting this compile-time error when I just include the three files listed above:
root@b86898199925:/build/vt# ninja
[1/2] Building CXX object examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
FAILED: examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
/usr/bin/ccache /usr/lib/ccache/g++ -DJSON_USE_IMPLICIT_CONVERSIONS=1 -I/vt/lib/CLI -I/vt/lib/json/include -I/vt/lib/brotli/c/include -I/vt/lib/libfort/lib -I/build/vt/release -I/vt/src -I/build/vt/lib/checkpoint/src -I/vt/lib/checkpoint/src -isystem /vt/lib/fmt/include -isystem /vt/lib/EngFormat-Cpp/include -O3 -DNDEBUG -fdiagnostics-color=always -Wall -pedantic -Wshadow -Wno-unknown-pragmas -Wsign-compare -ftemplate-backtrace-limit=100 -Werror -std=c++17 -MD -MT examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -MF examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o.d -o examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -c /vt/examples/hello_world/hello_world.cc
In file included from /usr/local/include/ldms/ldmsd_stream.h:6,
from /vt/examples/hello_world/hello_world.cc:47:
/usr/local/include/ldms/ldms_xprt.h:401:26: error: declaration of 'void (* ldms_xprt::app_ctxt_free_fn)(void*)' changes meaning of 'app_ctxt_free_fn' [-fpermissive]
401 | app_ctxt_free_fn app_ctxt_free_fn;
| ^~~~~~~~~~~~~~~~
In file included from /vt/examples/hello_world/hello_world.cc:46:
/usr/local/include/ldms/ldms.h:649:16: note: 'app_ctxt_free_fn' declared here as 'typedef void (* app_ctxt_free_fn)(void*)'
649 | typedef void (*app_ctxt_free_fn)(void *ctxt);
| ^~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
This is the script I used to install LDMS in the container:
https://github.com/DARMA-tasking/vt/blob/2183-create-ldma-stream-publish-for-phase-data/ci/deps/ldms.sh
Can you please try to run the following and see if the issue still occurs?
cd ovis
./autogen.sh
./packaging/make-all-top.sh
I've never encountered this kind of error. I usually use "make-all-top.sh" to build LDMS; this script automatically configures LDMS with the common flags that our team uses (the build ends up under .../ovis/LDMS_install).
In the meantime, I'm going to reach out to others who are more experienced with this kind of error.
UPDATE: What version of LDMS is being installed and what is the output of g++ --version of the container?
I was able to successfully build that LDMS and test it with vt (locally). Next I'll try to do the same within our Docker containers.
I get the same error with version 4.3.11 (or older). The issue is no longer present when building from the OVIS-4 branch source code.
@JacobDomagala Thank you for catching this and letting me know. The LDMS team and I will look into it. Feel free to reach out if you come across any more issues!
@Snell1224 @vsurjadidjaja
Here is a screenshot of the form the data will take from our current JSON statistics file. This data will be incrementally submitted phase-by-phase as the data is computed. A phase is roughly equivalent to a timestep in an application. After a phase runs, the load balancer might be run depending on the configuration. Thus, we always have pre-LB statistics and we might have a migration count and post-LB statistics depending on whether it ran or not.
So after each phase completes, we will submit this:
{
"id": 4, // A unique phase ID
"ts": 40.0, // The timestamp
"migration count": 1, // number of migrations [optional]
"pre-LB": {
"Object_comm": { },
"Object_load_modeled": { },
"Object_load_raw": { },
"Rank_comm": { },
"Rank_load_modeled": { },
"Rank_load_raw": { }
},
"post-LB": { // [optional]
// Same as pre-LB
}
}
Each one of the keys (Object_comm, Object_load_modeled, ...) in pre- and post-LB will include the following statistics:
"avg": 7190.222222222223, // mean
"car": 9.0, // cardinality
"imb": 0.2739522808752626, // imbalance (max/avg-1)
"kur": -1.7815486080524885, // kurtosis
"max": 9160.0, // maximum
"min": 5880.0, // minimum
"npr": 9.0,
"skw": 0.515228148796637, // skewness
"std": 1310.1329733467003, // standard deviation
"sum": 64712.0, // sum
"var": 1716448.4078502655 // variance
For the stream publish key, I propose "vtLBStats".
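To make the proposed layout concrete, here is a minimal sketch of how one phase's message could be assembled as a string before it is handed to the stream publish call. Everything named here (Stats, statsToJson, makePhaseMessage) is hypothetical and not part of vt or LDMS. It only fills in a few of the statistics and a single pre-LB group, and it builds the JSON by hand, so a real implementation would more likely reuse the JSON library vt already ships.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical holder for a few of the per-group statistics listed above.
struct Stats {
  double avg;
  double max;
  double min;
};

// Serialize one statistics group, e.g. "Rank_load_raw": {...}.
// Imbalance is computed as max/avg - 1, matching the "imb" field above.
std::string statsToJson(std::string const& name, Stats const& s) {
  std::ostringstream os;
  os << std::setprecision(17);
  double const imb = s.avg != 0.0 ? s.max / s.avg - 1.0 : 0.0;
  os << '"' << name << "\":{\"avg\":" << s.avg << ",\"imb\":" << imb
     << ",\"max\":" << s.max << ",\"min\":" << s.min << '}';
  return os.str();
}

// Build the message for one phase. A negative migration count means the
// load balancer did not run, so the optional key is left out.
std::string makePhaseMessage(
  unsigned long id, double ts, Stats const& pre_rank_load, long migrations = -1
) {
  std::ostringstream os;
  os << std::setprecision(17);
  os << "{\"id\":" << id << ",\"ts\":" << ts;
  if (migrations >= 0) {
    os << ",\"migration count\":" << migrations;
  }
  os << ",\"pre-LB\":{" << statsToJson("Rank_load_raw", pre_rank_load) << "}}";
  return os.str();
}
```

The resulting string would be what gets passed as the message object under the "vtLBStats" stream name. The post-LB block would be added the same way when the load balancer ran.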
@Snell1224 @vsurjadidjaja I'm a little confused as to how I should convert the output of gettime() to be consistent with what you need.
We use epoch time for analyzing streams data so we send this in the JSON message. As for when to record/get the time, that's more of a preference thing. I'm not too familiar with VT but if you don't need to monitor the start/duration/end time of each phase, then getting the time whenever you send the JSON message will work.
The example below shows what we do for Darshan and how we collect the end time of an I/O event (again this is just a preference):
static inline struct timespec abs_timespec(void)
{
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return(tp);
}
struct timespec tspec_start, tspec_end;
tspec_start = abs_timespec();
// IO stuff happening here
tspec_end = abs_timespec();
// Do other stuff and send message
unsigned long micro_s = tspec_end.tv_nsec / 1000; // nanoseconds to whole microseconds
sprintf(jb11,"{.....,\"timestamp\":%lu.%.6lu}]}", ....., tspec_end.tv_sec, micro_s);
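For the vt side, the timestamp formatting from the example above could be isolated into a small helper, sketched below. The names formatEpoch and epochNow are hypothetical. The sketch assumes a POSIX system where clock_gettime with CLOCK_REALTIME is available, and it takes the time at the moment of the call instead of converting the value returned by gettime().

```cpp
#include <cstdio>
#include <string>
#include <time.h>

// Format a timespec as "<seconds>.<microseconds>", with the microseconds
// zero-padded to six digits so the result reads as a decimal epoch time.
std::string formatEpoch(timespec const& ts) {
  char buf[64];
  std::snprintf(
    buf, sizeof(buf), "%lld.%06ld",
    static_cast<long long>(ts.tv_sec), ts.tv_nsec / 1000
  );
  return std::string(buf);
}

// Read the wall clock (epoch time) and format it for the JSON message.
std::string epochNow() {
  timespec ts;
  clock_gettime(CLOCK_REALTIME, &ts);
  return formatEpoch(ts);
}
```

Calling epochNow() right before the message is published would give the "time of send" behavior described above. Recording a separate start and end per phase would just mean calling it twice.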