Decompiler: variable-length array causes return addresses of called functions to appear in the output

Open LukeSerne opened this issue 2 years ago • 1 comments

Describe the bug When decompiling a function that uses a variable-length array, the decompiler gets confused about the stack layout. Return addresses that are pushed to the stack (on x86 at least) are not recognised as such and still appear in the decompiler output.

To Reproduce Steps to reproduce the behaviour:

Download the attached do_hashing.xml (renamed to do_hashing.xml.txt to bypass GitHub's file extension blacklist)
Load it into decomp_dbg using restore /path/to/do_hashing.xml
Load the function: load function do_hashing
Decompile the function: decompile
Print the decompiled C code: print C
Observe that, before every function call, the output contains an assignment like *(undefined8 *)(uStack_60 + lVar1) = 0x101442;.

Expected behaviour I expected the decompiler to hide the pushed return addresses and to represent the variable-length array at uStack_60 + lVar1 + 8 as an array.

Attachments do_hashing.xml

Environment (please complete the following information):

OS: Linux
Ghidra Version: 10.4
Ghidra Origin: official GitHub distro

Additional context This function comes from the vuln binary belonging to the "Secure Password Storage" challenge of GlacierCTF 2023.

It seems the variable-length array used in this function is what trips up the stack analysis and causes these decompilation artefacts. IDA Free's decompiler adds a fake function call to alloca to model this. Perhaps Ghidra could implement a similar solution. Printing a variable-length array like byte uStack_60[lVar1 + 8]; would also work.
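
For readers without the binary, here is a minimal sketch of the kind of C function that tends to produce this pattern. The challenge's source is not part of this report, so everything here is an assumption for illustration: the name vla_sketch, the salt value, and the checksum that stands in for the real hashing step are all hypothetical. The only point is the run-time-sized buffer, which forces the compiler to adjust the stack pointer dynamically.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a global salt; the real value is unknown. */
static const uint64_t salt = 0x0123456789abcdefULL;

/*
 * Copies `len` bytes of `input` into a variable-length array, appends the
 * 8-byte salt, and returns a byte-sum checksum in place of a real hash.
 */
uint64_t vla_sketch(const void *input, size_t len)
{
    assert(input != NULL);

    /* The VLA: its size is only known at run time, so the compiler
       moves the stack pointer by a computed amount instead of
       reserving a fixed slot in the frame. */
    unsigned char buf[len + 8];

    memset(buf, 0, len + 8);
    memcpy(buf, input, len);
    /* memcpy avoids an unaligned 8-byte store at buf + len. */
    memcpy(buf + len, &salt, sizeof salt);

    uint64_t sum = 0;
    for (size_t i = 0; i < len + 8; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Every call made after the array is allocated pushes its return address just below the dynamically placed buffer, which is presumably where the stray assignments in the decompiler output come from.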

Dec 02 '23 16:12 LukeSerne

What is the status of this issue? I noticed it does not have an assignee, and it does not have any labels.

I noticed I forgot to attach the binary to my original post. The binary has been pushed to a GitHub repo (direct link to the binary), unfortunately without the source.

Additionally, here's a link to dogbolt showing how Ghidra performs compared to Hex-Rays / IDA (scroll down to the do_hashing function).

Here is a snippet from the first lines of the do_hashing function in Ghidra and Hex-Rays, as shown by dogbolt. Here, the pollution of Ghidra's output is clearly visible (notice the "fake" assignments to (&uStack_60)[uVar3 * -2] before every function call).

Ghidra 11.3.1:

void do_hashing(void *param_1,size_t param_2,uchar *param_3,byte param_4)

{
  size_t sVar1;
  uchar *puVar2;
  void *__src;
  uchar *d;
  ulong uVar3;
  long in_FS_OFFSET;
  undefined8 uStack_60;
  uchar auStack_58 [4];
  byte local_54;
  uchar *local_50;
  size_t local_48;
  void *local_40;
  long local_30;
  uchar *local_28;
  long local_20;
  
  local_20 = *(long *)(in_FS_OFFSET + 0x28);
  local_30 = param_2 + 7;
  uVar3 = (param_2 + 0x17) / 0x10;
  local_28 = auStack_58 + uVar3 * -0x10;
  local_54 = param_4;
  local_50 = param_3;
  local_48 = param_2;
  local_40 = param_1;
  (&uStack_60)[uVar3 * -2] = 0x1013a8;
  memset(auStack_58 + uVar3 * -0x10,0,param_2 + 8);
  puVar2 = local_28;
  __src = local_40;
  sVar1 = local_48;
  (&uStack_60)[uVar3 * -2] = 0x1013bf;
  memcpy(puVar2,__src,sVar1);
  d = local_28;
  puVar2 = local_50;
  *(undefined8 *)(local_28 + local_48) = salt;

Hex-Rays 9.1.0.250226:

unsigned __int64 __fastcall do_hashing(void *a1, size_t a2, __int64 a3, unsigned __int8 a4)
{
  void *v4; // rsp
  _BYTE v6[4]; // [rsp+0h] [rbp-50h] BYREF
  unsigned __int8 v7; // [rsp+4h] [rbp-4Ch]
  __int64 v8; // [rsp+8h] [rbp-48h]
  size_t n; // [rsp+10h] [rbp-40h]
  void *src; // [rsp+18h] [rbp-38h]
  size_t v11; // [rsp+28h] [rbp-28h]
  void *s; // [rsp+30h] [rbp-20h]
  unsigned __int64 v13; // [rsp+38h] [rbp-18h]

  src = a1;
  n = a2;
  v8 = a3;
  v7 = a4;
  v13 = __readfsqword(0x28u);
  v11 = a2 + 7;
  v4 = alloca(16 * ((a2 + 23) / 0x10));
  s = v6;
  memset(v6, 0, a2 + 8);
  memcpy(s, src, n);
  *(_QWORD *)((char *)s + n) = salt;
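
Both outputs contain the same size computation: uVar3 = (param_2 + 0x17) / 0x10 in Ghidra and 16 * ((a2 + 23) / 0x10) in Hex-Rays. This rounds the requested a2 + 8 bytes up to a multiple of 16. The sketch below spells out that arithmetic and the byte offset implied by the (&uStack_60)[uVar3 * -2] index. The function names are hypothetical, and the 8-byte element size assumes undefined8 is 8 bytes wide.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes reserved on the stack for a buffer of n + 8 bytes, following the
   expression in the Hex-Rays output: 16 * ((n + 23) / 16). */
size_t vla_stack_bytes(size_t n)
{
    return 16 * ((n + 23) / 16);
}

/* Byte offset applied to &uStack_60 by the index uVar3 * -2: each element
   is assumed to be 8 bytes, so the offset is -16 * uVar3. */
long return_slot_byte_offset(size_t n)
{
    size_t units = (n + 0x17) / 0x10;   /* uVar3 in the Ghidra output */
    return -(long)(units * 2 * 8);
}
```

For the same n, the two functions return values of equal magnitude. The fake assignment therefore lands at uStack_60 moved down by exactly the allocated size, which, assuming Ghidra's names reflect stack offsets, is 8 bytes below the relocated auStack_58 buffer and is consistent with a pushed return address.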

May 11 '25 12:05 LukeSerne