ghidra
ghidra copied to clipboard
Segment based addresses need to be interpreted, understood and utilised in further analysis phases correctly.
Is your feature request related to a problem? Please describe. The interpretation of segmented addresses.
I have determined that although Ghidra does understand segment:offset style of addresses (like that used in protected-mode Windows 1990's) for disassembling, it struggles with the later analysis and turning it into decent C.
There seems to be the following kinds of addresses in this category:
A. Where just the offset is given;
B. Passed as function parameters;
C. References to sub-structures.
Currently there are the following issues:
- Numeric values are being misinterpreted as an
offsetfor an address in the data segment (DSis being applied when it shouldn't - default segment usage is not shown in theListingPanel and presumably not propagated to further analysis phases); - In a function parameter where the address is of the
segment:offsetform, anoffsetto a sub-structure within said parameter is being misunderstood and comes out as eitherCONCAT2(xxx,yyy+offset)or even worseCONCAT13(xxx,CONCAT12(yyy,zzz+offset)); - Passing the full stack address (including the
SS) to another function looks like thisCString::~CString((CString *)CONCAT13(uVar6,CONCAT12(uVar4,&local_1e)),2)instead ofCString::~CString(&local_1e,2)or better stilllocal_1e.~CString(2); - Arrays circum to a similar fate in that ones declared as global (on the
DS) look like:
uVar7 = *(undefined2 *)((int)g_apszProfileStrings_2c1e + 2 + nIdx * 4);
uVar8 = *(undefined2 *)((int)g_apszProfileStrings_2c1e + nIdx * 4);
instead of psz = g_apszProfileStrings_2c1e[nIdx];
5. I'm sure there's more, but all stem from the same fundamental issue of not interpreting the segment portion of addresses correctly.
Describe the solution you'd like Correct interpretation of segmented addresses so that all the later great features of Ghidra would be utilised.
Describe alternatives you've considered
Setting the DS to 0 on the individual assembly instruction DOES eliminate issue 1.
Turning pointers into a complex union (see #4469) helps a little with issue 2 and 4 when array is part of a structure whose full address has been passed to another function call.
Others I haven't found anything decent yet!
Additional context
I've been using a modified x86-16.cspec file together with my fork of Ghidra that I sync up regularly!
Here's a better example from where a structure member (in this case a seg:offpointer) is required from within a parent structure (pointed at from the functions first parameter) where at offset 0x46 a pointer is defined as DgnVocabularyList*32:
FUN_10e8_05c2((DgnVocabularyList *)CONCAT22(param_1._2_2_,(int)param_1 + 0x46),
PROFILEINDEX_Systems_Shared_Vocabularies,1 | PROFSECT_IsDirectory);
As you can see the first parameter (above) the segment portion of the address associated with the member at offset 0x46 is the same as the segment of param_1 (bytes 3 & 4, or param_1._2_2_) and the offset is the 0x46th byte of the parent structure, but Ghidra isn't capable of understanding this and consequently misses the glaringly obvious fact it should be finding and using the datatype at that objects offset.
Another problem with the x86-16 sleigh is that it implicitly assumes (in currentCS and rel16) that the segment register is a multiple of 0x1000 when resolving references of all kinds - including jumps and calls which plays havoc with the decompiler and even autoanalysis. Also see related issue #2892.
Some likely paths from currentCS and rel16 to CALL/JMP and the MOV from #2892 are shown below
currentCS: CS is protectedMode=0 & CS { tmp:4 = (inst_next >> 4) & 0xf000; CS = tmp:2; export CS; }
:CALL rm16 is $(LONGMODE_OFF) & vexMode=0 & addrsize=0 & opsize=0 & byte=0xff & currentCS; rm16 & reg_opcode=2 ... { local dest:4 = segment(currentCS,rm16); push22(&:2 inst_next); call [dest]; }
:JMP rm16 is vexMode=0 & addrsize=0 & opsize=0 & byte=0xff & currentCS; rm16 & reg_opcode=4 ... { target:4 = segment(currentCS,rm16); goto [target]; }
rel16: reloc is simm16 [ reloc=((inst_next >> 16) << 16) | ((inst_next + simm16) & 0xFFFF); ] { export *[ram]:$(SIZE) reloc; }
:CALL rel16 is $(LONGMODE_OFF) & vexMode=0 & addrsize=0 & opsize=0 & byte=0xe8; rel16 { push22(&:2 inst_next); call rel16; }
:JMP rel16 is vexMode=0 & opsize=0 & byte=0xe9; rel16 { goto rel16; }
seg16: currentCS: is segover=1 & currentCS { export currentCS; }
Mem16: seg16^addr16 is seg16; addr16 { tmp:$(SIZE) = segment(seg16,addr16); export tmp; }
Mem: Mem16 is addrsize=0 & Mem16 { export Mem16; }
rm16: "word ptr" Mem is Mem { export *:2 Mem; }
:MOV rm16,Sreg is vexMode=0 & byte=0x8c; rm16 & Sreg ... { rm16 = Sreg; }
Another problem with the x86-16 sleigh ...
@sad-dev not sure that's exactly the same now, as I think the loaders have changed. In my case (I'm using protected-mode) my code segments are multiples of 8 and calls seem to have the correct addresses filled in (presumably by the loader).
What gets me is that almost all my pointers are effectively 32-bit (as load/store operations always use a segment register, either implicitly or explicitly) but in x86-16.cspec the pointer size is 16 bits. (Should it be 32???) Also, looking through the decompiler code there seems to be support for near pointers, which to me seems to allude that the pointer size should be 32 bits and, where pointers are given as 16-bit, the hiword should be obtained from the appropriate segment register/memory/stack address!
Longs are another issue!!!
I have not touched it in a while, but from memory the failure pattern I ran into was something like this - both 1000:F000 and 1F00:0000 correspond to the same address but can resolve the CALL operation very differently:
1000:F000 e8 00 10 CALL ??
1000:F003 ...
1F00:0000 e8 00 10 CALL ??
1F00:0003 ...
With a segment of 1000, this resolves to a call to 1000:0003 as 0xf003 + 0x1000 becomes 3
With a segment of 1F00, this resolves to a call to 1F00:1003.
It is possible that subsequent improvements have fixed this - I will double check when I can, although my guess is that no, the sleigh is still used to resolve the references.
IS this similar, or different (but maybe related) to an issue I am facing while working on disassembling a game that runs on Sony's PS2/Emotion Engine, in that even though the struct is defined, and applied correctly to an address, it still shows up in decompiled functions as (var_name.first_field + [offset in hex])? (where offset points to the offset of a valid member of the struct being utilized, accessed or written to)
(perhaps a silly question, perhaps I am doing something wrong if not a related (or the same) issue?
I have not touched it in a while, but from memory the failure pattern I ran into was something like this - both
1000:F000and1F00:0000correspond to the same address but can resolve the CALL operation very differently:1000:F000 e8 00 10 CALL ?? 1000:F003 ... 1F00:0000 e8 00 10 CALL ?? 1F00:0003 ...With a segment of
1000, this resolves to a call to1000:0003as0xf003 + 0x1000becomes3With a segment of1F00, this resolves to a call to1F00:1003.It is possible that subsequent improvements have fixed this - I will double check when I can, although my guess is that no, the sleigh is still used to resolve the references.
I believe what you're describing relates to "real mode" where multiple combinations of seg:off represent the same physical address.
IS this similar, or different (but maybe related) to an issue I am facing while working on disassembling a game that runs on Sony's PS2/Emotion Engine, in that even though the struct is defined, and applied correctly to an address, it still shows up in decompiled functions as (var_name.first_field + [offset in hex])? (where offset points to the offset of a valid member of the struct being utilized, accessed or written to)
(perhaps a silly question, perhaps I am doing something wrong if not a related (or the same) issue?
Looks similar.
Having seen yet another example of this (and engaging brain)
pnCurrNext = (IDX_ONE_BASED *)((int)this + 0x1a);
from
11e0:178e 8b 56 08 MOV DX,word ptr SS:[BP + this+0x2]
11e0:1791 8b 46 06 MOV AX,word ptr SS:[BP + this]
11e0:1794 05 1a 00 ADD AX,0x1a
and
*pCurrNext = pCurr$1->nNext_0x10;
from
11e0:17b8 c4 5e f8 LES BX,SS:[BP + pCurr$1]
11e0:17bb 26 8b 5f 10 MOV BX,word ptr ES:[BX + 0x10]
11e0:17bf 8e c2 MOV ES,DX
11e0:17c1 93 XCHG AX,BX
11e0:17c2 26 89 07 MOV word ptr ES:[BX],AX
I believe it's something to do with mechanism that Ghidra uses to take the address of a structure's member when using a sized version of a pointer to that structure.
For example: if that member is a simple int and something is assigned to it, the decompiler produces
this->nNext_0x1a = nIdx;
from
11e0:16dc c4 5e 06 LES BX,SS:[BP + this]
11e0:16df 26 89 4f 1a MOV word ptr ES:[BX + 0x1a],CX
which is correct.
For example: if that member is a simple
intand something is assigned to it, the decompiler producesthis->nNext_0x1a = nIdx;from11e0:16dc c4 5e 06 LES BX,SS:[BP + this] 11e0:16df 26 89 4f 1a MOV word ptr ES:[BX + 0x1a],CXwhich is correct.
To get Ghidra to produce the C output above, I modified this
https://github.com/NationalSecurityAgency/ghidra/blob/3e9419d4d793dad0619e51769474be61d6968ef9/Ghidra/Features/Decompiler/src/decompile/cpp/varnode.cc#L1932
to
if (op2->code() != CPUI_SUBPIECE) {
if ((Varnode*)0 == op2->getIn(0)) return false;
op2 = op2->getIn(0)->getDef();
if (op2->code() != CPUI_SUBPIECE) return false;
}
and https://github.com/NationalSecurityAgency/ghidra/blob/3e9419d4d793dad0619e51769474be61d6968ef9/Ghidra/Features/Decompiler/src/decompile/cpp/ruleaction.cc#L8536 to
PcodeOp* op2 = vn2->getDef();
if (op2->code() != CPUI_SUBPIECE) {
data.opInsertInput(op, op2->getIn(1), 1);
data.opSetOpcode(op, op2->code());
}
else {
data.opSetOpcode(op,CPUI_COPY);
}
@caheckman is this something in your domain???
Here is another example of Ghidra's inability to correctly comprehend "sized pointers"[^1] when it comes to working out the consequences of referencing components within them.
The C output is:
int __cdecl16far FUN_1168_09ec(uint numElems,DgnSAData1168_0b3c_0x20_t *param_2,int param_3,int *param_4)
{
Dgn1170_03c2_0x14_t::LPpfnFUN_1168_1262 *pDVar1;
int iVar2;
int iVar3;
uint i;
uint nSum;
long nMultSum;
undefined2 seg;
nMultSum = 0;
nSum = 0;
for (i = 0; i < numElems; i += 1) {
seg = *(undefined2 *)
((int)((DgnSAData1168_0b3c_0x20_t *)param_2)->field0_0x0 + i * 4 + 2);
iVar2 = *(int *)(*(int *)(param_2->field0_0x0 + i) + 4);
if ((param_3 == 0) || (iVar2 != 0)) {
iVar2 = (*(int *)(*(int *)(param_2->field0_0x0 + i) + 2) + iVar2 / 2) / iVar2 + 1;
pDVar1 = ¶m_2->field0_0x0[i]->_vtable->field4_0x10;
iVar3 = (*pDVar1)(param_2->field0_0x0[i]);
nSum += iVar2;
nMultSum += (ulong)(uint)(iVar3 * iVar2);
}
}
*param_4 = nSum;
return (int)((long)(nMultSum + (ulong)(nSum >> 1)) / (long)(ulong)nSum);
}
The lines of interest are:
iVar2 = *(int *)(*(int *)(param_2->field0_0x0 + i) + 4);
and
iVar2 = (*(int *)(*(int *)(param_2->field0_0x0 + i) + 2) + iVar2 / 2) / iVar2 + 1;
where field field0_0x0 (of param_2) is defined as Dgn1170_03c2_0x14_t *32[8] and the fields of Dgn1170_03c2_0x14_t at offsets 2and4 are int types, (offset 0 is a pointer to a vtable - using the default 16-bit size).
These definitions cannot easily have their pointer declarations altered as doing so creates incorrectly sized structures (and offsets within) and issues with calling functions when changing a pointer size.
[^1]: The only way to get pointers of the correct size when the architecture supports multiple (two in this case) sizes.
Is there anyone able to look at this?
Can someone please look at this. I have examples of this everywhere and its making REing even more difficult! Thanks.
I'm getting output like
iVar4 = FindElementIdx((DgnEAClient *)
CONCAT22(*(undefined2 *)
((int)&((DgnEAVocabData_0x6_t *)
g_vocabElems_8951.m_pEaData)[local_20].
pVoc_0x0 + 2),
(DgnEAClient *)
(*(int *)&((DgnEAVocabData_0x6_t *)
g_vocabElems_8951.m_pEaData + local_20)
->pVoc_0x0 + 10)),pClient);
instead of
iVar4 = FindElementIdx(&g_vocabElems_8951.m_pEaData[local_20].pVoc_0x0->eaClients_0xa, pClient);