Function referenced data variable improvements (pointer sweep)
Binary referenced in ticket: happy star paints carefully.
During analysis we identify pointers to data in functions and create data variables. This step visits all MLIL instructions in a function, and as such we expect the pointers to be in a simple expression. This, however, poses an issue for some specific cases where the constant pointer is constructed piecemeal, or relative to some other value.
MLIL:
21 @ 14021499e rax_2 = [(rcx_3 + &__dos_header) + 0x8e6e60].q
22 @ 1402149a6 rcx_4 = [(rcx_3 + &__dos_header) + 0x8e79f0].q
MLIL (Show opcodes):
21 @ 14021499e (MLIL_SET_VAR.q rax_2 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e6e60))].q))
22 @ 1402149a6 (MLIL_SET_VAR.q rcx_4 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e79f0))].q))
HLIL:
14021499e int64_t rax_2 = (&data_1408e6e60)[rbx]
1402149a6 int64_t rcx_4 = *((rbx << 3) + 0x1408e79f0) // The LHS is folded in from another expr
Your next question might be: why did the data variable referenced at 14021499e get constructed? It has to do with the way pointer sweep operates: at 1408e6e60 there was a value pointing at a function, which pointer sweep uses as a strong indicator that the address (1408e6e60) holds a pointer.
1408e6e60 void* data_1408e6e60 = sub_1402145f0
At 0x1408e79f0 we are not so lucky:
And if we manually make a data variable here:
1408e79f0 int64_t data_1408e79f0 = 0x140ffa900
Our current pointer sweep is conservative in the sense that we track these referrers (1408e79f0) and wait until 0x140ffa900 is discovered; then, if 0x140ffa900 becomes a data variable, we backtrack and construct data variables at locations pointing to it, such as 1408e79f0. This, however, means that if 0x140ffa900 never gets identified as a data variable during pointer sweep, we will miss 1408e79f0 (assuming no data variable existed there prior to pointer sweep, obviously).
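To make the deferred behaviour concrete, here is a toy model of the referrer tracking described above. It is only a sketch: the class and method names (`DeferredPointerSweep`, `note_candidate`, `define`) are hypothetical and do not correspond to the actual pointer sweep implementation.

```python
from collections import defaultdict


class DeferredPointerSweep:
    """Toy model (hypothetical names) of conservative referrer tracking."""

    def __init__(self):
        # Addresses that currently have a data variable.
        self.data_vars = set()
        # Target address -> referrer addresses waiting for that target.
        self.pending = defaultdict(set)

    def note_candidate(self, referrer, target):
        """Record that `referrer` holds a value that looks like a pointer to `target`."""
        if target in self.data_vars:
            self.define(referrer)
        else:
            self.pending[target].add(referrer)

    def define(self, address):
        """Create a data variable at `address`, then backtrack to its waiting referrers."""
        worklist = [address]
        while worklist:
            addr = worklist.pop()
            if addr in self.data_vars:
                continue
            self.data_vars.add(addr)
            # Anything that was waiting on `addr` can now be defined too.
            worklist.extend(self.pending.pop(addr, ()))


sweep = DeferredPointerSweep()
sweep.note_candidate(0x1408e79f0, 0x140ffa900)
print(0x1408e79f0 in sweep.data_vars)  # False: the target is not a data variable yet
sweep.define(0x140ffa900)
print(0x1408e79f0 in sweep.data_vars)  # True: backtracking defined the referrer
```

If `define(0x140ffa900)` is never called, the referrer stays in `pending` forever, which is the missed case in this issue.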
So what can we do? We can really solve the issue in two ways: either by identifying the data variable during function analysis (likely by simplifying the expression when we check for data variable references), or by improving pointer sweep for non-relocatable binaries, likely through some "pointer table" sweep.
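As a rough illustration of the first option, the sketch below folds the constant terms of a nested add expression into a single candidate address. It uses stand-in node classes (`Var`, `Const`, `ConstPtr`, `Add`) rather than the real MLIL instruction objects, and it assumes `&__dos_header` is 0x140000000, which is what the HLIL addresses above imply. Treat it as a sketch of the idea, not as the analysis code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class Var:          # stand-in for MLIL_VAR
    name: str


@dataclass(frozen=True)
class Const:        # stand-in for MLIL_CONST
    value: int


@dataclass(frozen=True)
class ConstPtr:     # stand-in for MLIL_CONST_PTR
    value: int


@dataclass(frozen=True)
class Add:          # stand-in for MLIL_ADD
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Const, ConstPtr, Add]


def fold(expr: Expr) -> Tuple[int, bool]:
    """Return (sum of constant terms, whether a ConstPtr was among them)."""
    if isinstance(expr, Const):
        return expr.value, False
    if isinstance(expr, ConstPtr):
        return expr.value, True
    if isinstance(expr, Add):
        left_value, left_ptr = fold(expr.left)
        right_value, right_ptr = fold(expr.right)
        return (left_value + right_value) & MASK64, left_ptr or right_ptr
    # Variables contribute no constant part in this sketch.
    return 0, False


def candidate_address(expr: Expr) -> Optional[int]:
    """Constant base of the expression, or None if no constant pointer is involved."""
    value, saw_ptr = fold(expr)
    return value if saw_ptr else None


# Address expression of instruction 22, with the assumed image base.
load_address = Add(Add(Var("rcx_3"), ConstPtr(0x140000000)), Const(0x8e79f0))
print(hex(candidate_address(load_address)))  # 0x1408e79f0
```

A real implementation would also have to decide what to do with the variable part (here `rcx_3` is a scaled index, so the folded constant is the start of a table) and with operations other than add.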
There are a few other ways that might also improve the situation; I'm not exactly sure which is best.
I also see this as medium effort, because this work is unlikely to happen by itself and we would most likely bundle it with some other refactor to pointer sweep. But you could get away with just improving the data variable identification on the function analysis side.
Hi, thx for the ping. That is our estimation of the amount of work required to resolve the issue, not an ETA.
The issue is not currently on the next release's milestone, so it will not be resolved very quickly.
Any updates on this one?
No updates. When the issue is planned for a release, it will be added to a milestone. We recommend upvoting to increase visibility when we are doing release planning, or, if there is more context that increases the impact of the bug, feel free to provide additional info or examples.