Extension to mpi programs

Open incardon opened this issue 3 years ago • 16 comments

  • [] I have added an entry to docs/changelog.md

Summary of changes

This diff adds support for working with MPI programs in gdbgui.

The main changes are:

  • An mpi_rank element is added to the debug session, along with a way to look up a gdb session by rank.
  • run_gdb_command is extended with a process id that defaults to -1. In an MPI debug session, -1 means run the command on all sessions. In a non-MPI session the value is ignored and the command runs on the single underlying gdb session.
  • An additional HTML page on the server gives the client access to the MPI server information.
  • Process information is attached to every gdb response. In a non-MPI session the process information is -1.
  • Some functions in Actions now also take the process information of the message being processed. Because the parameter defaults to -1, existing code that calls those functions in the old form should not break.

inferior_program_paused has been changed to handle processor information.

change_process_on_focus and refresh_state_for_gdb_pause have been added to handle a change of the focused processor and to update the GUI information.

connect_to_gdbserver_mpi has been added to handle the connection to the MPI gdb servers.

Several global variables have been added to initial_store_data to store information for each processor.

A processor status bar is shown when an MPI program is being debugged.

gdbgui/src/js/Breakpoints.tsx has been fixed so that breakpoints placed in templated code can be removed.

Throughout the code there are changes to handle process information where needed.
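
The -1 convention for run_gdb_command described above can be illustrated with a minimal sketch. This is not the code from the diff: DebugSession, its write method, and the list-based session lookup are hypothetical stand-ins, and the check for a non-MPI session (a single session whose rank is -1) is an assumption based on the description.

```python
from typing import List

ALL_PROCESSES = -1  # default from the description: run on all sessions


class DebugSession:
    """Hypothetical stand-in for one gdb session (not gdbgui's real class)."""

    def __init__(self, mpi_rank: int) -> None:
        self.mpi_rank = mpi_rank
        self.sent: List[str] = []

    def write(self, cmd: str) -> None:
        # A real session would forward the command to gdb; here we only record it.
        self.sent.append(cmd)


def run_gdb_command(
    sessions: List[DebugSession], cmd: str, process: int = ALL_PROCESSES
) -> List[int]:
    """Send cmd to the selected sessions and return the ranks that received it."""
    if len(sessions) == 1 and sessions[0].mpi_rank == -1:
        # Assumed non-MPI case: the process argument is ignored.
        targets = sessions
    elif process == ALL_PROCESSES:
        # MPI case with the default: broadcast to every rank.
        targets = sessions
    else:
        # MPI case with an explicit rank: only the matching session.
        targets = [s for s in sessions if s.mpi_rank == process]
    for session in targets:
        session.write(cmd)
    return [s.mpi_rank for s in targets]
```

Callers written before the change keep working because they never pass the process argument, which is the backward-compatibility point made in the summary.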

Test plan

Two tests have been added. The first covers the JSX side: it tests the server together with the gdbgui TypeScript code using Puppeteer and headless Chrome, emulating click events and keyboard typing. The test goes through a full debug session.

The second test is for the Python server: it sends commands directly to the server to go through a debug session. Tested by running

# command(s) to exercise these changes

The best reference for the required machine setup is

.github/workflows/tests.yml

The rest is:

Launch an MPI program with the script ./gdbgui-mpi/launch_mpi_debugger 6 ./my_program -arg1 -arg2

where 6 is the number of processes. Then launch gdbgui, select "connect to MPI gdbservers", and use *:6000 as the input text for the server.

incardon avatar Aug 24 '20 06:08 incardon

I know this is a big commit. I will start by pulling from the current master. But it would be nice to have a point of discussion in case there is interest in merging this into gdbgui itself.

incardon avatar Aug 24 '20 06:08 incardon

OK, I started to merge and hit the first problem when I tried to use the server ...

    File ".../gdbgui/gdbgui/server/sessionmanager.py", line 9, in <module>
        from pygdbmi.IoManager import IoManager
    ModuleNotFoundError: No module named 'pygdbmi.IoManager'

Where do I get IoManager? I do not see it in the pygdbmi repo.

incardon avatar Aug 24 '20 19:08 incardon

It is in the add_reader branch of pygdbmi ... and it is failing in your CI.

Also, I am not able to get this last refactor working.

The server dies with:

    File "/home/i-bird/Desktop/MOSAIC/OpenFPM_project/gdbgui/gdbgui/cli.py", line 254, in main
        gdbgui.server.server.run_server(
    AttributeError: module 'gdbgui.server' has no attribute 'server'

Not sure how it is supposed to work.

OK, fixed by adding import gdbgui.server.server at line 25 of cli.py.

incardon avatar Aug 24 '20 19:08 incardon

Hi, thanks for the PR! Very impressive.

I just landed a whole bunch of changes to gdbgui recently. Unfortunately those changes have caused conflicts with this PR. Most of the code that was changed was on the backend. The frontend code was not affected very much other than the switch from .js/.jsx to .ts/.tsx. I just merged the pygdbmi add-reader branch into master and published a new version of pygdbmi (0.10.0.0) which is required by the latest version of gdbgui (0.14.0.0).

I looked at your screenshot in the related issue. The only thing I have to suggest is to move the processor selection to the right pane.

I am looking forward to seeing the merge conflicts resolved and giving it a try. I am curious -- is there a particular project or company these changes are already being used in, or did you make this just in case you will need it in the future?

cs01 avatar Aug 25 '20 04:08 cs01

We are a research group specialized in parallel simulations, and I am one of the HPC software developers. For a long time I have been searching for a small project that could be adapted to become an open-source parallel debugger for the HPC community, and this one fulfills all the requirements. In a few days it was possible to create something working.

incardon avatar Aug 25 '20 07:08 incardon

We are a research group specialized in parallel simulations, and I am one of the HPC software developers. For a long time I have been searching for a small project that could be adapted to become an open-source parallel debugger for the HPC community, and this one fulfills all the requirements. In a few days it was possible to create something working.

Do you have a website I could check out?

cs01 avatar Aug 25 '20 07:08 cs01

http://mosaic.mpi-cbg.de/

And for the simulation library:

http://openfpm.mpi-cbg.de

incardon avatar Aug 25 '20 07:08 incardon

How do I cancel a workflow that got stuck?

incardon avatar Aug 29 '20 14:08 incardon

Can you move this to "draft" until you are ready for review, then request my review when it's ready?

cs01 avatar Aug 29 '20 15:08 cs01

The GitHub instructions say that I should be able to see a Cancel Check Suite button ... but I do not see it ... is it related to permissions?

incardon avatar Aug 29 '20 16:08 incardon

The GitHub instructions say that I should be able to see a Cancel Check Suite button ... but I do not see it ... is it related to permissions?

Maybe? It's at the top right. I imagine it will stop after some hardcoded limit like 24 hrs or something. I'll cancel it for you now.

cs01 avatar Aug 29 '20 17:08 cs01

Thanks....

incardon avatar Aug 29 '20 18:08 incardon

Hmm, test.yml is no longer triggered ... it seems this GitHub CI is a bit buggy.

Anyway, I think it is good enough for a second round of review.

Let's start with some notes. More or less I have made all the requested changes: the moves to TypeScript and pty, and the changes for the refactored server.

What has not been done is what you said you would do yourself.

Plus a few things I would like to discuss.

The only thing I have to suggest is to move the processor selection to the right pane.

This part I would like to discuss. The processor buttons are quite fundamental, both as controls and for the information they give. The position at the top is also what parallel debuggers like Allinea DDT chose. So I would like you to reconsider, or possibly offer both positioning options.

I think the gdbgui-mpi folder might fit better in examples/mpi. There are already examples in the examples folder.

I think I have to explain what this folder contains:

print_nodes is a program that is fundamental to making MPI debugging work. When we use launch_mpi_debugger, a pre-run of this program collects the name and processor rank of all the nodes and writes them to a file, nodes_name. This file is then read by the server to determine the set of servers it has to connect to. This makes it possible to debug programs that actually run on multiple machines. Because it is an MPI program itself I used it as an example, but it is fundamental for functioning.

main.cpp is the source code of this small program.

compile.sh: a small script to compile print_nodes

launch_mpi_debugger is a wrapper to launch an MPI program with gdbserver. If you normally launch the program with mpirun -np 6 ./my_program -option1 -option2, you launch it in debug mode with launch_mpi_debugger 6 ./my_program -option1 -option2

launch_gdb_server is another script, used by launch_mpi_debugger, to open a different port for each process.

init.py and main.py have been removed. You launch exactly as before: python -m gdbgui
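
To make the nodes_name step above more concrete, here is a rough sketch of how a server could turn such a file into a list of gdbserver endpoints. The file format (one "hostname rank" pair per line) and the port scheme (base port 6000 plus the rank) are assumptions made for illustration only. The thread does not specify either, and parse_nodes is a hypothetical helper, not a function from the diff.

```python
from typing import Dict, Tuple


def parse_nodes(text: str, base_port: int = 6000) -> Dict[int, Tuple[str, int]]:
    """Map each MPI rank to an assumed (host, port) gdbserver endpoint.

    Assumes one "hostname rank" pair per line and one port per rank,
    counted up from base_port. Both are illustrative guesses.
    """
    endpoints: Dict[int, Tuple[str, int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        host, rank_str = line.split()
        rank = int(rank_str)
        endpoints[rank] = (host, base_port + rank)
    return endpoints


# Example with made-up node names: three ranks spread over two machines.
sample = "node01 0\nnode01 1\nnode02 2\n"
for rank, (host, port) in sorted(parse_nodes(sample).items()):
    print(f"rank {rank}: {host}:{port}")
```

Keying the result by rank rather than by host keeps the lookup simple when several ranks share one machine.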

////////////////////////// Some notes while converting to pty ///////////////////////////////////////////////////////////////////////

The first problem I found was the message ^connected\r\n, which is sent once the client connects to a gdbserver after target remote.

I found very weird behaviour from GDB in a pty that I am not able to explain. I was logging raw_output in IoManager in pygdbmi to understand which messages are processed. I was also trying to manually create gdb sessions and attach them to ptys opened from bash, in order to reproduce the behaviour by hand, and this nonsensical behaviour came out.

In a standard gdbgui with one gdb session connecting to a gdbserver, the message ^connected\r\n is not sent at all. In an MPI gdb session, the first session is opened in the standard way and the other MPI sessions are opened with

    gdb_command = request.args.get("gdb_command", app.config["gdb_command"])
    mi_version = request.args.get("mi_version", "mi2")
    manager.add_new_debug_session(
        gdb_command=gdb_command, mi_version=mi_version, client_id=request.sid
    )

The result is that the first session does not receive the connected message, while the other sessions do. Note that the first session is opened differently from the other MPI sessions: a connect event opens the session, followed by an emit to the client to notify it that the session is open, and then a reply that runs the --list-feature and --list-feature-target commands on the opened GDB session. For the MPI sessions there is no such thing: the sessions are opened and the request.sid is appended to them.

Trying to reproduce this manually with gdb, I get yet another behaviour:

I start a program with gdbserver and open three bash shells, one of them running gdb with the mi2 interpreter, and I use -ex to connect to the pty (like gdbgui does). In this case, regardless of whether I launch --list-features ... or not, the ^connected messages do not appear in the ptys (the connected bash shells). They appear instead in the main gdb console. Even weirder, in the pty all messages about reading symbols are dropped.

As a final note, because the message on my gdb (GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31) is ^connected\r\n (a very Windows-style newline), pygdbmi fails to parse it: it is parsed as type: output, message is None, and payload is ^connected\r\n.

Because the behaviour of the connected message turned out to be a complete mess, I decided not to rely on it fully. If I get a message such as a stop before connected, a connected Action is triggered before the stop message is processed. I wasted a lot of time trying to understand the behaviour of this connected message when gdb is connected to a pty. I gave up trying to understand it, because I get inconsistent behaviours and I failed to reproduce them manually. In any case, the approach I used appears to solve all the problems above.
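
The workaround described above, where a connected action is triggered first if another message arrives before connected, could look roughly like the following. This is a sketch of the idea only, written in Python for illustration. ConnectionTracker and its string event names are hypothetical, and treating every non-connected message as a trigger (rather than only stop messages) is an assumption.

```python
from typing import List, Set


class ConnectionTracker:
    """Guarantee that each rank sees a 'connected' event before anything else."""

    def __init__(self) -> None:
        self._connected: Set[int] = set()

    def events_for(self, rank: int, message: str) -> List[str]:
        """Return the events to dispatch for one incoming message from a rank."""
        events: List[str] = []
        if rank not in self._connected:
            # First message from this rank: emit 'connected', whether gdb
            # actually sent it or not.
            self._connected.add(rank)
            events.append("connected")
        if message != "connected":
            # A late or duplicate 'connected' is absorbed; anything else passes through.
            events.append(message)
        return events


tracker = ConnectionTracker()
print(tracker.events_for(0, "stopped"))    # stop arrives first on rank 0
print(tracker.events_for(0, "connected"))  # late 'connected' on rank 0
print(tracker.events_for(1, "connected"))  # normal order on rank 1
```

With this shape, the rest of the code never has to care whether gdb delivered ^connected for a given session.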

incardon avatar Aug 29 '20 20:08 incardon

OK, now it is ready.

incardon avatar Aug 30 '20 07:08 incardon

Thanks again for putting this together. This is a big diff so it will probably take me a few passes to review, and will be spread out over time since this is done in my spare time with no pay.

Can you update the description? The template was not filled out.

cs01 avatar Sep 05 '20 20:09 cs01

Sure, I understand. Let me know

incardon avatar Sep 06 '20 13:09 incardon