ghidra Segment based addresses need to be interpreted, understood and utilised in further analysis phases correctly.

Is your feature request related to a problem? Please describe. The interpretation of segmented addresses.

I have determined that although Ghidra does understand segment:offset style of addresses (like that used in protected-mode Windows 1990's) for disassembling, it struggles with the later analysis and turning it into decent C.

There seems to be the following kinds of addresses in this category: A. Where just the offset is given; B. Passed as function parameters; C. References to sub-structures.

Currently there are the following issues:

Numeric values are being misinterpreted as an offset for an address in the data segment (DS is being applied when it shouldn't - default segment usage is not shown in the ListingPanel and presumably not propagated to further analysis phases);
In a function parameter where the address is of thesegment:offset form, an offset to a sub-structure within said parameter is being misunderstood and comes out as either CONCAT2(xxx,yyy+offset) or even worse CONCAT13(xxx,CONCAT12(yyy,zzz+offset));
Passing the full stack address (including the SS) to another function looks like this CString::~CString((CString *)CONCAT13(uVar6,CONCAT12(uVar4,&local_1e)),2) instead of CString::~CString(&local_1e,2) or better still local_1e.~CString(2);
Arrays circum to a similar fate in that ones declared as global (on the DS) look like:

               uVar7 = *(undefined2 *)((int)g_apszProfileStrings_2c1e + 2 + nIdx * 4);
               uVar8 = *(undefined2 *)((int)g_apszProfileStrings_2c1e + nIdx * 4);

instead of psz = g_apszProfileStrings_2c1e[nIdx]; 5. I'm sure there's more, but all stem from the same fundamental issue of not interpreting the segment portion of addresses correctly.

Describe the solution you'd like Correct interpretation of segmented addresses so that all the later great features of Ghidra would be utilised.

Describe alternatives you've considered Setting the DS to 0 on the individual assembly instruction DOES eliminate issue 1. Turning pointers into a complex union (see #4469) helps a little with issue 2 and 4 when array is part of a structure whose full address has been passed to another function call. Others I haven't found anything decent yet!

Additional context I've been using a modified x86-16.cspec file together with my fork of Ghidra that I sync up regularly!

Apr 07 '23 09:04 Wall-AF

Here's a better example from where a structure member (in this case a seg:offpointer) is required from within a parent structure (pointed at from the functions first parameter) where at offset 0x46 a pointer is defined as DgnVocabularyList*32:

   FUN_10e8_05c2((DgnVocabularyList *)CONCAT22(param_1._2_2_,(int)param_1 + 0x46),
                 PROFILEINDEX_Systems_Shared_Vocabularies,1 | PROFSECT_IsDirectory);

As you can see the first parameter (above) the segment portion of the address associated with the member at offset 0x46 is the same as the segment of param_1 (bytes 3 & 4, or param_1._2_2_) and the offset is the 0x46th byte of the parent structure, but Ghidra isn't capable of understanding this and consequently misses the glaringly obvious fact it should be finding and using the datatype at that objects offset.

Apr 07 '23 12:04 Wall-AF

Another problem with the x86-16 sleigh is that it implicitly assumes (in currentCS and rel16) that the segment register is a multiple of 0x1000 when resolving references of all kinds - including jumps and calls which plays havoc with the decompiler and even autoanalysis. Also see related issue #2892.

Some likely paths from currentCS and rel16 to CALL/JMP and the MOV from #2892 are shown below

currentCS: CS is protectedMode=0 & CS { tmp:4 = (inst_next >> 4) & 0xf000; CS = tmp:2; export CS; }
:CALL rm16	    is $(LONGMODE_OFF) & vexMode=0 & addrsize=0 & opsize=0 & byte=0xff & currentCS; rm16 & reg_opcode=2 ...	{ local dest:4 = segment(currentCS,rm16); push22(&:2 inst_next); call [dest]; }
:JMP rm16	is vexMode=0 & addrsize=0 & opsize=0 & byte=0xff & currentCS; rm16 & reg_opcode=4 ...	{ target:4 = segment(currentCS,rm16); goto [target]; }

rel16: reloc is simm16      [ reloc=((inst_next >> 16) << 16) | ((inst_next + simm16) & 0xFFFF); ] { export *[ram]:$(SIZE) reloc; }
:CALL rel16     is $(LONGMODE_OFF) & vexMode=0 & addrsize=0 & opsize=0 & byte=0xe8; rel16     { push22(&:2 inst_next); call rel16; }
:JMP rel16      is vexMode=0 & opsize=0 & byte=0xe9; rel16                      { goto rel16; }

seg16: currentCS: is segover=1 & currentCS	{ export currentCS; }
Mem16: seg16^addr16	is seg16; addr16							 	   { tmp:$(SIZE) = segment(seg16,addr16); export tmp; }
Mem: Mem16 is addrsize=0 & Mem16             { export Mem16; }
rm16: "word ptr" Mem   is Mem              { export *:2 Mem; }
:MOV rm16,Sreg      is vexMode=0 & byte=0x8c; rm16 & Sreg ...               { rm16 = Sreg; }

Apr 11 '23 07:04 sad-dev

Another problem with the x86-16 sleigh ...

@sad-dev not sure that's exactly the same now, as I think the loaders have changed. In my case (I'm using protected-mode) my code segments are multiples of 8 and calls seem to have the correct addresses filled in (presumably by the loader).

What gets me is that almost all my pointers are effectively 32-bit (as load/store operations always use a segment register, either implicitly or explicitly) but in x86-16.cspec the pointer size is 16 bits. (Should it be 32???) Also, looking through the decompiler code there seems to be support for near pointers, which to me seems to allude that the pointer size should be 32 bits and, where pointers are given as 16-bit, the hiword should be obtained from the appropriate segment register/memory/stack address!

Longs are another issue!!!

Apr 11 '23 09:04 Wall-AF

I have not touched it in a while, but from memory the failure pattern I ran into was something like this - both 1000:F000 and 1F00:0000 correspond to the same address but can resolve the CALL operation very differently:

1000:F000      e8 00 10       CALL ??
1000:F003      ...

1F00:0000      e8 00 10       CALL ??
1F00:0003      ...

With a segment of 1000, this resolves to a call to 1000:0003 as 0xf003 + 0x1000 becomes 3 With a segment of 1F00, this resolves to a call to 1F00:1003.

It is possible that subsequent improvements have fixed this - I will double check when I can, although my guess is that no, the sleigh is still used to resolve the references.

Apr 11 '23 10:04 sad-dev

IS this similar, or different (but maybe related) to an issue I am facing while working on disassembling a game that runs on Sony's PS2/Emotion Engine, in that even though the struct is defined, and applied correctly to an address, it still shows up in decompiled functions as (var_name.first_field + [offset in hex])? (where offset points to the offset of a valid member of the struct being utilized, accessed or written to)

(perhaps a silly question, perhaps I am doing something wrong if not a related (or the same) issue?

Apr 24 '23 07:04 travelsonic

I have not touched it in a while, but from memory the failure pattern I ran into was something like this - both 1000:F000 and 1F00:0000 correspond to the same address but can resolve the CALL operation very differently:
1000:F000      e8 00 10       CALL ??
1000:F003      ...

1F00:0000      e8 00 10       CALL ??
1F00:0003      ...
With a segment of 1000, this resolves to a call to 1000:0003 as 0xf003 + 0x1000 becomes 3 With a segment of 1F00, this resolves to a call to 1F00:1003.

It is possible that subsequent improvements have fixed this - I will double check when I can, although my guess is that no, the sleigh is still used to resolve the references.

I believe what you're describing relates to "real mode" where multiple combinations of seg:off represent the same physical address.

May 29 '23 10:05 Wall-AF

IS this similar, or different (but maybe related) to an issue I am facing while working on disassembling a game that runs on Sony's PS2/Emotion Engine, in that even though the struct is defined, and applied correctly to an address, it still shows up in decompiled functions as (var_name.first_field + [offset in hex])? (where offset points to the offset of a valid member of the struct being utilized, accessed or written to)

(perhaps a silly question, perhaps I am doing something wrong if not a related (or the same) issue?

Looks similar.

May 29 '23 10:05 Wall-AF

Having seen yet another example of this (and engaging brain) pnCurrNext = (IDX_ONE_BASED *)((int)this + 0x1a); from

   11e0:178e 8b 56 08     MOV        DX,word ptr SS:[BP + this+0x2]
   11e0:1791 8b 46 06     MOV        AX,word ptr SS:[BP + this]
   11e0:1794 05 1a 00     ADD        AX,0x1a

and *pCurrNext = pCurr$1->nNext_0x10; from

   11e0:17b8 c4 5e f8     LES        BX,SS:[BP + pCurr$1]
   11e0:17bb 26 8b 5f 10  MOV        BX,word ptr ES:[BX + 0x10]
   11e0:17bf 8e c2        MOV        ES,DX
   11e0:17c1 93           XCHG       AX,BX
   11e0:17c2 26 89 07     MOV        word ptr ES:[BX],AX

I believe it's something to do with mechanism that Ghidra uses to take the address of a structure's member when using a sized version of a pointer to that structure.

For example: if that member is a simple int and something is assigned to it, the decompiler produces this->nNext_0x1a = nIdx; from

   11e0:16dc c4 5e 06     LES        BX,SS:[BP + this]
   11e0:16df 26 89 4f 1a  MOV        word ptr ES:[BX + 0x1a],CX

which is correct.

May 29 '23 11:05 Wall-AF

For example: if that member is a simple int and something is assigned to it, the decompiler produces this->nNext_0x1a = nIdx; from
   11e0:16dc c4 5e 06     LES        BX,SS:[BP + this]
   11e0:16df 26 89 4f 1a  MOV        word ptr ES:[BX + 0x1a],CX
which is correct.

To get Ghidra to produce the C output above, I modified this https://github.com/NationalSecurityAgency/ghidra/blob/3e9419d4d793dad0619e51769474be61d6968ef9/Ghidra/Features/Decompiler/src/decompile/cpp/varnode.cc#L1932 to

    if (op2->code() != CPUI_SUBPIECE) {
      if ((Varnode*)0 == op2->getIn(0)) return false;
      op2 = op2->getIn(0)->getDef();
      if (op2->code() != CPUI_SUBPIECE) return false;
    }

and https://github.com/NationalSecurityAgency/ghidra/blob/3e9419d4d793dad0619e51769474be61d6968ef9/Ghidra/Features/Decompiler/src/decompile/cpp/ruleaction.cc#L8536 to

    PcodeOp* op2 = vn2->getDef();
    if (op2->code() != CPUI_SUBPIECE) {
      data.opInsertInput(op, op2->getIn(1), 1);
      data.opSetOpcode(op, op2->code());
    }
    else {
      data.opSetOpcode(op,CPUI_COPY);
    }

May 29 '23 16:05 Wall-AF

@caheckman is this something in your domain???

Jun 14 '23 13:06 Wall-AF

Here is another example of Ghidra's inability to correctly comprehend "sized pointers"[^1] when it comes to working out the consequences of referencing components within them. The C output is:

int __cdecl16far FUN_1168_09ec(uint numElems,DgnSAData1168_0b3c_0x20_t *param_2,int param_3,int *param_4)
{
   Dgn1170_03c2_0x14_t::LPpfnFUN_1168_1262 *pDVar1;
   int iVar2;
   int iVar3;
   uint i;
   uint nSum;
   long nMultSum;
   undefined2 seg;
   
   nMultSum = 0;
   nSum = 0;
   for (i = 0; i < numElems; i += 1) {
      seg = *(undefined2 *)
             ((int)((DgnSAData1168_0b3c_0x20_t *)param_2)->field0_0x0 + i * 4 + 2);
      iVar2 = *(int *)(*(int *)(param_2->field0_0x0 + i) + 4);
      if ((param_3 == 0) || (iVar2 != 0)) {
         iVar2 = (*(int *)(*(int *)(param_2->field0_0x0 + i) + 2) + iVar2 / 2) / iVar2 + 1;
         pDVar1 = &param_2->field0_0x0[i]->_vtable->field4_0x10;
         iVar3 = (*pDVar1)(param_2->field0_0x0[i]);
         nSum += iVar2;
         nMultSum += (ulong)(uint)(iVar3 * iVar2);
      }
   }
   *param_4 = nSum;
   return (int)((long)(nMultSum + (ulong)(nSum >> 1)) / (long)(ulong)nSum);
}

The lines of interest are: iVar2 = *(int *)(*(int *)(param_2->field0_0x0 + i) + 4); and iVar2 = (*(int *)(*(int *)(param_2->field0_0x0 + i) + 2) + iVar2 / 2) / iVar2 + 1; where field field0_0x0 (of param_2) is defined as Dgn1170_03c2_0x14_t *32[8] and the fields of Dgn1170_03c2_0x14_t at offsets 2and4 are int types, (offset 0 is a pointer to a vtable - using the default 16-bit size). These definitions cannot easily have their pointer declarations altered as doing so creates incorrectly sized structures (and offsets within) and issues with calling functions when changing a pointer size. [^1]: The only way to get pointers of the correct size when the architecture supports multiple (two in this case) sizes.

Oct 07 '23 14:10 Wall-AF

Is there anyone able to look at this?

Oct 07 '23 14:10 Wall-AF

Can someone please look at this. I have examples of this everywhere and its making REing even more difficult! Thanks.

Jan 13 '24 10:01 Wall-AF

I'm getting output like

            iVar4 = FindElementIdx((DgnEAClient *)
                                   CONCAT22(*(undefined2 *)
                                             ((int)&((DgnEAVocabData_0x6_t *)
                                                    g_vocabElems_8951.m_pEaData)[local_20].
                                                    pVoc_0x0 + 2),
                                            (DgnEAClient *)
                                            (*(int *)&((DgnEAVocabData_0x6_t *)
                                                       g_vocabElems_8951.m_pEaData + local_20)
                                                      ->pVoc_0x0 + 10)),pClient);

instead of iVar4 = FindElementIdx(&g_vocabElems_8951.m_pEaData[local_20].pVoc_0x0->eaClients_0xa, pClient);

Jan 13 '24 11:01 Wall-AF

ghidra ghidra copied to clipboard

Segment based addresses need to be interpreted, understood and utilised in further analysis phases correctly.

ghidra
ghidra copied to clipboard