flang icon indicating copy to clipboard operation
flang copied to clipboard

Multiple occurrences of a local symbol while debugging OpenMP code

Open pawosm-arm opened this issue 3 years ago • 9 comments

It took a while, but the problem revealed itself eventually. Consider following piece of a Fortran code:

program hello_f
  use omp_lib
  integer:: x = 40
  integer:: tid, nthreads, tmp_x

  !$omp parallel private(tid, tmp_x) firstprivate(x) shared(nthreads) num_threads(4)
  tid = omp_get_thread_num()
  x = x + tid
  !$omp barrier
  if (tid == 0) then
     nthreads = omp_get_num_threads()
     write (*,*) 'There are', nthreads, 'threads'
  end if
  !$omp barrier
  tmp_x = x
  if (tid == 2) then
      write (*,*) 'post barrier tag'
  end if
  !$omp barrier
  !$omp end parallel
end program hello_f

When started in gdb with breakpoint at line 16 (if (tid == 2) then) one can list local variables as such:

(gdb) info locals
omp_sched_static = 1
omp_sched_auto = 4
omp_proc_bind_true = 1
omp_lock_hint_none = 0
omp_lock_hint_contended = 2
omp_lock_hint_nonspeculative = 4
omp_lock_kind = 4
omp_lock_hint_kind = 4
omp_lock_hint_uncontended = 1
x = <optimized out>
omp_integer_kind = 4
omp_logical_kind = 4
omp_proc_bind_kind = 4
omp_sched_dynamic = 2
omp_proc_bind_false = 0
tid = 3
tmp_x = 43
omp_sched_kind = 4
omp_real_kind = 4
omp_proc_bind_master = 2
omp_proc_bind_spread = 4
x = 43
(gdb)

WARNING! The actual output will vary depending on the gdb version! Use the latest one, as older gdb version does not show the symptoms of the problem at this stage.

The problem in the list above is that the x variable is listed twice!:

x = <optimized out>
...
x = 43

If you're unlucky and the <optimized out> value is listed before the known value of x, the gdb's print command cannot display the value of x. This makes the debugger unreliable.

~~A reversion of the patch introduced by PR #895 solves the problem: the x variable appears only once in the locals list on all of the gdb versions.~~

~~One more observation I've made: The patch in PR #895 introduces a test case. I was expecting it to fail as I reverted the change in lldebug.cpp. It wasn't the case: with the change in lldebug.cpp reverted this test case is still passing which makes it harder to work on a fix that does not break the expected behavior.~~

pawosm-arm avatar Mar 29 '21 10:03 pawosm-arm

Hello @SouraVX, are you able to take a closer look at this?

pawosm-arm avatar Apr 06 '21 16:04 pawosm-arm

Hello @pawosm-arm thanks for the issue creation and the test case :) apologies for delayed response, somehow the initial notification of the issue skipped my radar :(. I will take a look at this and get back to you soon.

SouraVX avatar Apr 06 '21 16:04 SouraVX

@pawosm-arm thanks! for the comment regarding not observing the test case failure after revert :) That's very unfortunate :( . The patch was meant to fix pretty important problem(as described in detail in the PR itself).

This PR belongs to the times when there was no testing infra for debug info in flang, so a critical check missed up while upstreaming :( I had a quick look on the test case, here's updated test this should fail after you revert.

!RUN: %flang -g -S -emit-llvm %s -o - | FileCheck %s

!Ensure that there is no redundant LexicalBlock created with scope
!pointing to Subprogram and Local variable is pointing to Subprogram scope.
!CHECK: [[SCOPE_NODE:[0-9]+]] = distinct !DISubprogram(name: "sub", {{.*}}, line: [[LINE_NODE:[0-9]+]]
!CHECK: !DILocalVariable(name: "foo_arg", scope: ![[SCOPE_NODE]], file: !3, line: [[LINE_NODE]], type: !8)
!CHECK-NOT: !DILexicalBlock(scope: ![[SCOPE_NODE]], {{.*}}, line: [[LINE_NODE]]

!Ensure that there is a LexicalBlock created for the BLOCK statement and
!the local variable `foo_block` has correct scope information i.e
!pointing to LexicalBlock.
!CHECK-DAG: !DILocalVariable(name: "foo_block", scope: ![[BLOCK_NODE:[0-9]+]]
!CHECK-DAG: ![[BLOCK_NODE]] = !DILexicalBlock(scope: ![[SCOPE_NODE]], {{.*}}, line: 19

SUBROUTINE sub(foo_arg)
      integer,value :: foo_arg
      integer :: foo_local
      foo_local = arg_foo
      BLOCK      !line number: 19
             integer :: foo_block
             foo_block = 4
      END BLOCK
END SUBROUTINE

I'll be focusing on OpenMP part of it now and get back to you.

SouraVX avatar Apr 06 '21 19:04 SouraVX

A reversion of the patch introduced by PR #895 solves the problem: the x variable appears only once in the locals list on all of the gdb versions.

I tried reverting the change but to my dismay, above specific problem doesn't shows up. Yes, This a a BUG that there are multiple x visible. But reverting the thing doesn't helped at my end. Compiling step: flang pr1019.f90 -fopenmp -g

Logs using trunk-gdb

(gdb) b 16
(gdb) r
There are            4 threads
[Switching to Thread 0x7ffff4813c80 (LWP 10237)]

Thread 4 "a.out" hit Breakpoint 1, __nv_MAIN__PARALLEL_F1L6_1 (__nv_MAIN__PARALLEL_F1L6_1Arg0=<optimized out>, __nv_MAIN__PARALLEL_F1L6_1Arg1=<optimized out>, __nv_MAIN__PARALLEL_F1L6_1Arg2=1) at pr1019.f90:16
16        if (tid == 2) then
(gdb) i locals
x = 43
tid = 3
tmp_x = 43
x = <optimized out>
(gdb) p x
$1 = 43

Would you be able share the binary for better repro ?

SouraVX avatar Apr 07 '21 11:04 SouraVX

There is an inherent bug (which seems to be pre-existing) here, and which seems to be unrelated to the removal of DW_TAG_lexical_block. There are 2 DIEs created for the firstprivate variable "x" as children of the DW_TAG_subprogram DIE corresponding to the outlined parallel region. (wherein only 1 should be created). The existence of both the DIEs (for "x") seems to predate PR#895, though one of these DIEs was nested inside the DW_TAG_lexical_block DIE before PR#895.

jinisusan avatar Apr 07 '21 16:04 jinisusan

Hi @SouraVX and @jinisusan. I've made a mistake and I am very sorry for the confusion it caused. Despite being aware that the bug manifests itself only on the very latest versions of gdb, by mistake I've been alternately using system provided gdb and the latest one I've built myself. At least this is the only way I can reasonable think of what happened here. Examining various commits on the debugging epic, I've accidentally hit this one, and since I've encountered the not-failing-after-revert test case problem, I didn't check twice if this pull request is actually the root cause of this problem. Indeed the PR in question isn't causing the problem that seems to be inherent to flang, hence I've removed the 'regression' word from the title.

pawosm-arm avatar Apr 07 '21 18:04 pawosm-arm

Hi @pawosm-arm, thanks a bunch for confirming.

jinisusan avatar Apr 08 '21 02:04 jinisusan

Have done some analysis and came with a potential fix also, but during internal tests this regresses existing things. Extra Variable DIE within the parallel region DIE is causing GDB to show 2 values of the variable x:

0x00000040:   DW_TAG_subprogram **(Parallel region)**
                DW_AT_low_pc    (0x0000000000201f10)
                DW_AT_high_pc   (0x00000000002020f0)
                DW_AT_frame_base        (DW_OP_reg7 RSP)
                DW_AT_GNU_all_call_sites        (true)
                DW_AT_name      ("__nv_MAIN__PARALLEL_F1L6_1")
                DW_AT_decl_file ("/home/sourabh/work/dwarf/OpenMP/pr1019.f90")
                DW_AT_decl_line (6)

0x00000055:     DW_TAG_variable   **(debugger showing optimized out value)**
                  DW_AT_name    ("x")
                  DW_AT_type    (0x00000039 "integer")
                  DW_AT_decl_file       ("/home/sourabh/work/dwarf/OpenMP/pr1019.f90")
                  DW_AT_decl_line       (3)
                  DW_AT_linkage_name    (".STATICS1")

0x00000091:     DW_TAG_variable   **(debugger showing actual value)**
                  DW_AT_location        (DW_OP_fbreg +32)
                  DW_AT_name    ("x")
                  DW_AT_type    (0x00000039 "integer")

Removal of this extra DIE is fine for non-trivial cases like above, but this is resulting problems for cases of multiple variables debug-info getting removed.

May pick this up later, when I'll be solely working on OpenMP debug-info. Thanks!

SouraVX avatar Apr 16 '21 16:04 SouraVX

Do these two instances of the same variable always differ with DW_AT_location vs. DW_AT_linkage_name?

pawosm-arm avatar Apr 16 '21 17:04 pawosm-arm