
"GPFS Daemon" code not available - Goal : Adding Lustre FS to LLview

Open Matth-L opened this issue 1 month ago • 2 comments

Hello,

Now that I understand how to implement metrics, I was wondering how to make Lustre work with LLview.

I can see that some code written in Perl is called in configs/server/workflows/actions.inp; however, this folder is not in the repository. I was wondering if this was because the file contains private information or for some other reason. I'll likely need to write a new script to collect Lustre data. If successful, I plan to create a PR. Having an existing template, i.e. the GPFS one, would be extremely helpful as a reference.

I was wondering if it would be possible to get access to this file, or perhaps a minimal/stripped-down version. This would help me understand what LLview expects from a filesystem, so I can replicate and adapt the structure for Lustre.

Matth-L, Nov 24 '25 09:11

Hi Matthias, the issue is not so much private information, but rather that we have a very convoluted setup to get the information from GPFS. It involves appending the I/O information to separate files, which are then tailed into smaller ones and parsed in parallel, so that the LLview workflow stays within 1 min. This setup is very specific to us and would probably not work anywhere else. I wanted to try a more generic workflow, but I would need some time for this (and maybe some changes on the admins' side too, and they also don't have much time at the moment).

It'd be very useful to have a plugin for Lustre, as I know many HPC centres use it. How is the data obtained from it? Is it possible to run queries on it? If so, you may consider using what is currently called the prometheus plugin, which is really more of a general REST API plugin. In that case, you can either try to generalise it further or use it as a basis for a new plugin for Lustre.

On the other hand, if your setup also works with exporting the information into files, you can check the files.py plugin. We use it to read the error files that our healthchecker writes in the epilogue of the jobs (when there are system errors), but I tried to implement the plugin in a generic way, to read and parse information from files given regexes.

If you think none of this would be a good starting point, maybe you can give me more information on how Lustre works, so I can think about whether anything we have here may be helpful.
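To make the file-based route a bit more concrete, here is a minimal sketch of parsing exported metric lines with a regex. It is not the files.py plugin and does not use LLview's configuration format; the line layout (`<epoch> <fsname> read=<bytes> write=<bytes>`), the regex, and the function name `parse_lines` are all hypothetical, chosen only to illustrate the idea.

```python
import re

# Hypothetical line layout: "<epoch> <fsname> read=<bytes> write=<bytes>"
LINE_RE = re.compile(
    r"^(?P<ts>\d+)\s+(?P<fs>\S+)\s+read=(?P<read>\d+)\s+write=(?P<write>\d+)$"
)


def parse_lines(lines):
    """Yield one dict per line that matches LINE_RE; skip everything else."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, malformed entries
        yield {
            "ts": int(match["ts"]),
            "fs": match["fs"],
            "read": int(match["read"]),
            "write": int(match["write"]),
        }


# Illustrative input, not real measurements.
sample = [
    "# exported by a hypothetical collector",
    "1700000000 scratch read=1024 write=2048",
    "garbage line",
    "1700000060 scratch read=4096 write=0",
]
records = list(parse_lines(sample))
```

In a real setup the input would come from the exported file (for example `open(path)`), and the regex would have to match whatever format the collector actually writes.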

filipesmg, Nov 24 '25 18:11

Hello @filipesmg,

Thanks for the answer. I didn't know that the GPFS script was kind of JSC-related.

Indeed, I saw that plugins exist to get Lustre metrics from Prometheus, e.g. lustre_exporter. I believe that using this plugin and following the same workflow I described for adding a CPU model should allow us to generate the Lustre data.
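For reference, a rough sketch of what querying Prometheus for such a metric involves, independent of LLview's prometheus plugin. It uses the standard Prometheus HTTP API endpoint `/api/v1/query` and its JSON response shape; the metric name `lustre_read_bytes_total`, the `target` label, the server URL, and the sample payload are assumptions made up for illustration and would need to be checked against what the exporter really exposes.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Return (labels, value) pairs from an instant-vector query response."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical server and metric name, for illustration only.
url = build_query_url("http://localhost:9090/", "rate(lustre_read_bytes_total[1m])")

# Canned response in the documented API shape (values are made up).
sample_payload = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"target": "scratch-OST0000"}, "value": [1700000000, "1048576"]},
            {"metric": {"target": "scratch-OST0001"}, "value": [1700000000, "524288.5"]},
        ],
    },
})
series = parse_instant_vector(sample_payload)
```

Against a live server, the payload would come from an HTTP GET on `url` (for example with `urllib.request.urlopen`); the parsing step stays the same.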

It is also possible to directly query the Lustre FS by using commands like so:

Client activity can be monitored to troubleshoot issues. Typical client operation statistics are in the stats file, which only includes non-zero parameters:
lctl get_param llite.*.stats
• Client read-write extent statistics may help troubleshoot read-write extents for the file system or a process:
lctl get_param llite.FSNAME-*.extents_stats
lctl get_param llite.FSNAME-*.extents_stats_per_process
• To look up statistics for I/O requests to a disk:
lctl get_param obdfilter.FSNAME-*.brw_stats
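If the direct route were chosen, the output of the client stats command above would need to be parsed. Below is a minimal sketch of such a parser. The assumed layout (a `<parameter>=` header followed by `name count samples [unit] min max sum` lines) and the sample text are illustrative and should be verified against the output of the Lustre version in use; `parse_llite_stats` is a hypothetical helper, not part of LLview or Lustre.

```python
# Illustrative text in the assumed layout; not real measurements.
SAMPLE = """\
llite.scratch-ffff88003b6d4000.stats=
snapshot_time             1700000000.123456 secs.usecs
read_bytes                1213 samples [bytes] 1 4194304 2541584384
write_bytes               512 samples [bytes] 4096 1048576 268435456
open                      42 samples [reqs]
"""


def parse_llite_stats(text):
    """Parse stats text into {parameter: {counter: {...}}}."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("="):  # header such as "llite.<instance>.stats="
            current = line[:-1]
            result[current] = {}
            continue
        fields = line.split()
        # Counter lines look like: name count "samples" [unit] [min max sum]
        if current is None or len(fields) < 4 or fields[2] != "samples":
            continue  # e.g. the snapshot_time line
        entry = {"samples": int(fields[1]), "unit": fields[3].strip("[]")}
        if len(fields) >= 7:
            entry.update(min=int(fields[4]), max=int(fields[5]), sum=int(fields[6]))
        result[current][fields[0]] = entry
    return result


stats = parse_llite_stats(SAMPLE)
```

On a Lustre client the text could be captured with something like `subprocess.run(["lctl", "get_param", "llite.*.stats"], capture_output=True, text=True).stdout`, assuming `lctl` is on the PATH and the user is allowed to read these parameters.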

I think that the Prometheus route would be simpler, more readable, and would avoid creating another Python file to extract data, but I might be underestimating the task at hand.

I'll check it out when I get the chance and keep you updated.

Since this issue is no longer relevant, feel free to close it. Thanks again.

Matth-L, Nov 25 '25 08:11