NoteZ HookZz Internal

Prologue

I have been working on HookZz for a long time and have been meaning to take the time to write up what I have learned in this area.

The essence of InlineHook
Advanced InlineHook tricks
Engineering use of an Assembler and Disassembler

There are of course many other hooking tricks, such as hooks based on #PF (page faults) and HardwareBreakpoint; those are left for a later post.

Attention

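This article focuses on the overall approach and the key techniques. Some points come down to how much implementation code you are willing to write; they are only mentioned in passing here, and readers will need to look up the details themselves, for example using mmap/remap for CodePatch.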

1. The Essence of InlineHook

1.1. InlineHook Routing

Write your fake function
Build a trampoline that branches to your fake function
Patch the original function address with the trampoline

1.2. InlineHook Keypoint

Generate a short (assembly) trampoline
Allocate/Modify the executable memory
Call the original function (Origin Instructions Fix)

1.2.1. Generate a short (assembly) trampoline

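The essence is to construct an instruction sequence that can branch to the fake function.

On ARM/ARM64 this means b or br. For generality, br is chosen here as an IndirectBranch, since it can jump anywhere in the address space.

On X86/X86_64, where instruction length is variable, it is jmp Immediate32 or jmp Immediate with REX prefix, i.e. a direct jump.

Does this require a full Assembler?

You can use one, but it is not necessary. What you do need is a cleanly engineered mini Assembler. The following is the Assembler as used in HookZz.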

void CodeGen::LiteralLdrBranch(uint64_t address) {
  TurboAssembler *turbo_assembler_ = reinterpret_cast<TurboAssembler *>(this->assembler_);
#define _ turbo_assembler_->
  PseudoLabel address_ptr;
  _ Ldr(X(17), &address_ptr);
  _ br(X(17));
  _ PseudoBind(&address_ptr);
  _ EmitInt64(address);
}
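
To make the emitted bytes concrete, here is a minimal sketch of the same LDR-literal + BR sequence without any assembler class. BuildLiteralBranch is a hypothetical helper, not a HookZz function, and the sketch assumes a little-endian host so that memcpy produces the AArch64 byte order.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Emits `ldr x17, #8; br x17; .quad target` as 16 raw bytes.
// Hypothetical helper for illustration; assumes a little-endian host.
std::array<uint8_t, 16> BuildLiteralBranch(uint64_t target) {
  constexpr uint32_t kScratch = 17;  // x17, the same scratch register as above
  // LDR (literal, 64-bit): 0x58000000 | imm19 << 5 | Rt, with imm19 = byte offset / 4.
  // The literal sits 8 bytes after the LDR, so imm19 = 2.
  const uint32_t ldr = 0x58000000u | ((8u / 4u) << 5) | kScratch;
  // BR Xn: 0xD61F0000 | Rn << 5.
  const uint32_t br = 0xD61F0000u | (kScratch << 5);

  std::array<uint8_t, 16> code{};
  std::memcpy(code.data() + 0, &ldr, sizeof(ldr));
  std::memcpy(code.data() + 4, &br, sizeof(br));
  std::memcpy(code.data() + 8, &target, sizeof(target));  // the 64-bit literal
  return code;
}
```

The trampoline is position independent: the LDR offset is relative to the instruction itself, so the 16 bytes can be copied to any 4-byte-aligned address.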

1.2.2. Modify the memory attributes

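To install the trampoline, memory with rx attributes has to be patched, which inevitably means changing memory attributes.

On Android/Linux/macOS/Windows, rwx memory can be allocated, so it is enough to change the memory attributes and patch directly.

On iOS, rx memory can be neither allocated nor modified (look up the relevant papers for the reasons); what is described here applies to the JailBreak or DebugMode state.

There are two points to watch when patching.

Freeze all threads before patching
ClearCache after patching

The approach above changes the attributes to rwx before writing. Another approach is temporary_file + mmap with the MAP_FIXED flag, which is what frida uses. frida's darwin backend also has a remap-based implementation; readers can look up the details themselves.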
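
The attribute-change approach can be sketched for Linux/Android roughly as follows. PatchCode is a hypothetical helper, not HookZz code. It assumes the caller has already suspended all other threads, and it switches the pages to rw and back rather than to rwx, so it must not be used on the page it is itself executing from.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Overwrites `len` bytes at `target`, toggling page protection around the write.
// Hypothetical helper; assumes other threads are already frozen by the caller.
bool PatchCode(void *target, const void *bytes, size_t len,
               int restore_prot = PROT_READ | PROT_EXEC) {
  const uintptr_t page_size = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
  const uintptr_t addr = reinterpret_cast<uintptr_t>(target);
  const uintptr_t start = addr & ~(page_size - 1);
  const uintptr_t end = (addr + len + page_size - 1) & ~(page_size - 1);
  void *pages = reinterpret_cast<void *>(start);

  // 1. Make every page touched by the patch writable.
  if (mprotect(pages, end - start, PROT_READ | PROT_WRITE) != 0) return false;
  // 2. Write the new bytes.
  std::memcpy(target, bytes, len);
  // 3. Restore the original protection.
  if (mprotect(pages, end - start, restore_prot) != 0) return false;
  // 4. Flush the instruction cache for the patched range (the ClearCache step).
  char *begin = static_cast<char *>(target);
  __builtin___clear_cache(begin, begin + len);
  return true;
}
```

On x86 the cache flush is effectively a no-op, but on ARM/ARM64 skipping it can leave stale instructions in the instruction cache.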

1.2.3. Call the original function (Origin Instructions Fix)

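Because the first few instructions of the original function are overwritten by the trampoline, calling the original function requires relocating those instructions to newly allocated rx memory. That address then serves as origin_function and is what gets called from then on.

During relocation the instructions' addresses change, so IP-relative (ia32/x64) or PC-relative (arm/aarch64) instructions must be fixed up.

Is a Disassembler such as capstone needed to parse the instructions?

You can use one, but it is not necessary. This is discussed per architecture.

arm: there are generally said to be two execution states, ARM and Thumb, but Thumb is further split into Thumb1/Thumb2, with instruction lengths of 2 or 4 bytes. The length can be determined from marker bits in the instruction encoding (see the A32/T32/T16 chapters of the Armv8 Architecture Manual), so it is enough to mask-match the special instructions.
arm64: instruction length is fixed, so it is enough to mask-match the special instructions.
ia32 or x64: instruction length is variable, but the x86 Instruction Encoding format specifies a fixed position for every field, so every instruction can be represented by the structure below and its length can be determined by building an opcode decode map.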
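
As an illustration of the mask-matching idea for arm64 and Thumb, here is a minimal sketch. The names PcRelKind, ClassifyA64 and ThumbInstructionLength are hypothetical, not HookZz identifiers; the masks follow the fixed opcode bits of the A64 and Thumb encodings.

```cpp
#include <cstdint>

// PC-relative instruction classes that need fixing after relocation (A64).
enum class PcRelKind {
  kNone, kBranchImm, kAdr, kAdrp, kLdrLiteral, kCondBranch, kCompareBranch, kTestBranch
};

// Classifies an A64 instruction word by masking its fixed opcode bits.
PcRelKind ClassifyA64(uint32_t insn) {
  if ((insn & 0x7C000000u) == 0x14000000u) return PcRelKind::kBranchImm;      // B / BL
  if ((insn & 0x9F000000u) == 0x10000000u) return PcRelKind::kAdr;            // ADR
  if ((insn & 0x9F000000u) == 0x90000000u) return PcRelKind::kAdrp;           // ADRP
  if ((insn & 0x3B000000u) == 0x18000000u) return PcRelKind::kLdrLiteral;     // LDR/LDRSW/PRFM (literal)
  if ((insn & 0xFF000010u) == 0x54000000u) return PcRelKind::kCondBranch;     // B.cond
  if ((insn & 0x7E000000u) == 0x34000000u) return PcRelKind::kCompareBranch;  // CBZ / CBNZ
  if ((insn & 0x7E000000u) == 0x36000000u) return PcRelKind::kTestBranch;     // TBZ / TBNZ
  return PcRelKind::kNone;  // position independent, can be copied as-is
}

// A Thumb instruction is 4 bytes long when bits [15:11] of its first halfword
// are 0b11101, 0b11110 or 0b11111; otherwise it is 2 bytes long.
int ThumbInstructionLength(uint16_t first_halfword) {
  const unsigned top5 = first_halfword >> 11;
  return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}
```

A relocator would walk the overwritten instructions, copy the kNone ones unchanged, and rewrite the others against their new address.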

The following are the X86 instruction description and DecodeMap used in HookZz.

struct Instr {
  byte prefix;

  byte REX;

  union {
    byte opcode[3];
    struct {
      byte opcode1;
      byte opcode2;
      byte opcode3;
    };
  };

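  union {
    byte ModRM;
    struct {
      // Fields listed low bits first; assumes LSB-first bit-field allocation
      // (GCC/Clang/MSVC on little-endian targets).
      byte RM : 3;        // bits 2:0
      byte RegOpcode : 3; // bits 5:3
      byte Mod : 2;       // bits 7:6
    };
  };

  union {
    byte SIB;
    struct {
      byte base : 3;  // bits 2:0
      byte index : 3; // bits 5:3
      byte scale : 2; // bits 7:6
    };
  };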

  byte Displacement[4];
  int DisplacementOffset;

  byte Immediate[4];
  int ImmediateOffset;
};

OpcodeDecodeItem OpcodeDecodeTable[257] = {{0x00, 2, OpEn_MR, OpSz_8, ImmSz_0, _DecodeOpEn_MR},
                                           {0x01, 2, OpEn_MR, OpSz_16 | OpSz_32, ImmSz_0, _DecodeOpEn_MR},
                                           {0x02, 2, OpEn_RM, OpSz_8, ImmSz_0, _DecodeOpEn_RM},
                                           {0x03, 2, OpEn_RM, OpSz_16 | OpSz_32, ImmSz_0, _DecodeOpEn_RM},
                                           {0x04, 1, OpEn_I, OpSz_0, ImmSz_8, _DecodeOpEn_I},
                                           {0x05, 1, OpEn_I, OpSz_16 | OpSz_32, ImmSz_16 | ImmSz_32, _DecodeOpEn_I},
                                           ...
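
For the operand-encoding decoders, the part that drives instruction length is the ModRM byte. A minimal sketch of that step follows, using shifts and masks rather than bit-fields. ModRMInfo, DecodeModRM and ModRMTailLength are hypothetical names, and the sketch covers 32/64-bit addressing only, not 16-bit addressing.

```cpp
#include <cstdint>

struct ModRMInfo {
  uint8_t mod;        // bits 7:6
  uint8_t reg;        // bits 5:3
  uint8_t rm;         // bits 2:0
  bool has_sib;       // a SIB byte follows the ModRM byte
  int disp_size;      // displacement bytes after ModRM/SIB: 0, 1 or 4
  bool rip_relative;  // mod == 0 && rm == 5; RIP-relative only in 64-bit mode
};

// Decodes a ModRM byte. `next_byte` is the byte after it, used only when it is a SIB byte.
ModRMInfo DecodeModRM(uint8_t modrm, uint8_t next_byte) {
  ModRMInfo info{};
  info.mod = modrm >> 6;
  info.reg = (modrm >> 3) & 7;
  info.rm = modrm & 7;
  info.has_sib = (info.mod != 3 && info.rm == 4);
  if (info.mod == 1) {
    info.disp_size = 1;
  } else if (info.mod == 2) {
    info.disp_size = 4;
  } else if (info.mod == 0) {
    if (info.rm == 5) {
      info.disp_size = 4;  // disp32, RIP-relative on x64: must be fixed when relocated
      info.rip_relative = true;
    } else if (info.has_sib && (next_byte & 7) == 5) {
      info.disp_size = 4;  // SIB with base == 5 and mod == 0 means disp32 without base
    }
  }
  return info;
}

// Bytes occupied by ModRM + optional SIB + displacement.
int ModRMTailLength(const ModRMInfo &info) {
  return 1 + (info.has_sib ? 1 : 0) + info.disp_size;
}
```

The full instruction length is then prefixes + opcode bytes + this tail + the immediate size given by the decode table entry.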

2. Advanced InlineHook Tricks

2.1. A Shorter Trampoline

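On ARM/ARM64, if B_xxx can replace LDR + Br (AKA LiteralBranch), the result is a Single Instruction Trampoline. This also avoids clobbering a register, reduces the number of instructions that need fixing, and copes with short functions (stub functions). Likewise on IA32/X64, Jmp Immediate32 can replace Jmp Immediate with REX prefix.

ARM/ARM64 is used as the example here.

B_xxx obviously has a limited Branch Range: +-(1 << 25) bytes for an ARM (A32) B and +-(1 << 27) bytes for an ARM64 B. At runtime, the fake function's address is almost never within that distance of the target binary/library.

The approach therefore becomes Bxxx + LDR + Br (AKA LiteralBranch): add a FastForward Trampoline, which only requires finding a Code Cave within branch range to hold it. The overall flow can be expressed as follows.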

OriginFunction -> Trampoline(Bxxx) -> FastForwardTrampoline(LiteralBranch) -> FakeFunction
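
The first hop of this chain is a single B instruction. A minimal sketch of encoding it for ARM64, including the range check, might look like this. EncodeA64Branch is a hypothetical helper, not HookZz code. An A64 B carries a signed 26-bit word offset, so its reach is +-(1 << 27) bytes.

```cpp
#include <cstdint>

// Encodes `b <to>` placed at address `from` (A64).
// Returns false if the target is misaligned or outside the +-128 MiB branch range.
bool EncodeA64Branch(uint64_t from, uint64_t to, uint32_t *out_insn) {
  const int64_t offset = static_cast<int64_t>(to - from);
  constexpr int64_t kRange = int64_t{1} << 27;
  if ((offset & 3) != 0 || offset < -kRange || offset >= kRange) return false;
  // imm26 holds the offset in words (offset / 4), in two's complement.
  const uint32_t imm26 = static_cast<uint32_t>(offset >> 2) & 0x03FFFFFFu;
  *out_insn = 0x14000000u | imm26;
  return true;
}
```

When this returns false for the fake function's address, the code cave holding the FastForward Trampoline is what the B should target instead.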

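Take MachO as an example. Code Caves mostly exist because of alignment when the image is mapped into memory. Possible locations:

Between mach_header and the __text section
Between the __TEXT and __DATA segments
Function alignment padding

This is fairly simple to implement once you have the ProcessMemoryLayout. On Android/Linux, parse /proc/<pid>/maps; on Darwin, call vm_region_recurse_64 to iterate over all memory regions.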
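
For the Android/Linux side, a minimal sketch of parsing one line of /proc/<pid>/maps and filtering regions by branch range could look like the following. MemoryRegion, ParseMapsLine and WithinBranchRange are hypothetical names; only the leading address range and permission fields of each line are read.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

struct MemoryRegion {
  uint64_t start;
  uint64_t end;  // exclusive
  bool readable;
  bool writable;
  bool executable;
};

// Parses a line such as "7f0000000000-7f0000021000 r-xp 00000000 08:01 123 /lib/libfoo.so".
bool ParseMapsLine(const std::string &line, MemoryRegion *out) {
  uint64_t start = 0, end = 0;
  char perms[5] = {0};
  if (std::sscanf(line.c_str(), "%" SCNx64 "-%" SCNx64 " %4s", &start, &end, perms) != 3)
    return false;
  out->start = start;
  out->end = end;
  out->readable = perms[0] == 'r';
  out->writable = perms[1] == 'w';
  out->executable = perms[2] == 'x';
  return true;
}

// True if any part of the region lies within `range` bytes of `origin`.
bool WithinBranchRange(const MemoryRegion &region, uint64_t origin, uint64_t range) {
  const uint64_t low = origin > range ? origin - range : 0;
  const uint64_t high = origin + range;
  return region.start < high && region.end > low;
}
```

Executable regions that pass WithinBranchRange are the candidates to scan for padding large enough to hold the 16-byte FastForward Trampoline.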

2.2. Dynamic Binary Instrument With Closure-Trampoline-Bridge

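Everything so far operates at function level. How can instrumentation be done at instruction level, with access to and control over all registers?

The overall flow can be summarized as follows.

Instruction Address -> Trampoline -> Save Register State -> Instrument Handler -> Restore Register State -> Continue with the remaining instructions

So that the InstrumentHandler can carry the Saved Register State, and so that the modified register state is written back to the real registers during Restore Register State, HookZz builds a Closure Trampoline Bridge, with entry carried along as the package.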

#define _ turbo_assembler_.
  TurboAssembler turbo_assembler_;

  PseudoLabel ClosureTrampolineEntry;
  PseudoLabel ForwardCode_ClosureBridge;

  // ===
  _ Ldr(x16, &ClosureTrampolineEntry);
  _ Ldr(x17, &ForwardCode_ClosureBridge);
  _ br(x17);
  _ PseudoBind(&ClosureTrampolineEntry);
  _ EmitInt64((addr_t)entry);
  _ PseudoBind(&ForwardCode_ClosureBridge);
  _ EmitInt64((addr_t)get_closure_bridge());
  // ===

  AssemblyCode *code = AssemblyCode::FinalizeFromTurboAssember(reinterpret_cast<AssemblerBase *>(&turbo_assembler_));

  entry->address       = (void *)code->raw_instruction_start();
  entry->carry_data    = carry_data;
  entry->carry_handler = carry_handler;
  entry->size          = code->raw_instruction_size();
  return entry;

void *get_closure_bridge() {

  // if already initialized, just return.
  if (closure_bridge)
    return closure_bridge;

// check if enable the inline-assembly closure_bridge_template
#if ENABLE_CLOSURE_BRIDGE_TEMPLATE
  extern void closure_bridge_template();
  closure_bridge = closure_bridge_template;
// otherwise, use the Assembler build the closure_bridge
#else
#define _ turbo_assembler_.
#define MEM(reg, offset) MemOperand(reg, offset)
#define MEM_EXT(reg, offset, addrmode) MemOperand(reg, offset, addrmode)
  TurboAssembler turbo_assembler_;

  // save {q0-q7}
  _ sub(SP, SP, 8 * 16);
  _ stp(Q(6), Q(7), MEM(SP, 6 * 16));
  _ stp(Q(4), Q(5), MEM(SP, 4 * 16));
  _ stp(Q(2), Q(3), MEM(SP, 2 * 16));
  _ stp(Q(0), Q(1), MEM(SP, 0 * 16));
  // save {x1-x30}
  _ sub(SP, SP, 30 * 8);
  _ stp(X(29), X(30), MEM(SP, 28 * 8));
  _ stp(X(27), X(28), MEM(SP, 26 * 8));
  _ stp(X(25), X(26), MEM(SP, 24 * 8));
  _ stp(X(23), X(24), MEM(SP, 22 * 8));
  _ stp(X(21), X(22), MEM(SP, 20 * 8));
  _ stp(X(19), X(20), MEM(SP, 18 * 8));
  _ stp(X(17), X(18), MEM(SP, 16 * 8));
  _ stp(X(15), X(16), MEM(SP, 14 * 8));
  _ stp(X(13), X(14), MEM(SP, 12 * 8));
  _ stp(X(11), X(12), MEM(SP, 10 * 8));
  _ stp(X(9), X(10), MEM(SP, 8 * 8));
  _ stp(X(7), X(8), MEM(SP, 6 * 8));
  _ stp(X(5), X(6), MEM(SP, 4 * 8));
  _ stp(X(3), X(4), MEM(SP, 2 * 8));
  _ stp(X(1), X(2), MEM(SP, 0 * 8));

#if 1
  // save {x0}
  _ sub(SP, SP, 2 * 8);
  _ str(x0, MEM(SP, 8));
#else
// Ignore, refer: closure_bridge_template
#endif

  _ mov(x0, SP);
  _ mov(x1, TMP1);
  _ CallFunction(ExternalReference((void *)intercept_routing_common_bridge_handler));

  // restore x0
  _ ldr(X(0), MEM(SP, 8));
  _ add(SP, SP, 2 * 8);
  // restore {x1-x30}
  _ ldp(X(1), X(2), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(3), X(4), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(5), X(6), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(7), X(8), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(9), X(10), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(11), X(12), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(13), X(14), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(15), X(16), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(17), X(18), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(19), X(20), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(21), X(22), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(23), X(24), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(25), X(26), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(27), X(28), MEM_EXT(SP, 16, PostIndex));
  _ ldp(X(29), X(30), MEM_EXT(SP, 16, PostIndex));
  // restore {q0-q7}
  _ ldp(Q(0), Q(1), MEM_EXT(SP, 32, PostIndex));
  _ ldp(Q(2), Q(3), MEM_EXT(SP, 32, PostIndex));
  _ ldp(Q(4), Q(5), MEM_EXT(SP, 32, PostIndex));
  _ ldp(Q(6), Q(7), MEM_EXT(SP, 32, PostIndex));

  // _ brk(0); // for debug

  // branch to next hop, @modify by `xxx_routing_dispatch`
  _ br(x16);

  AssemblyCode *code = AssemblyCode::FinalizeFromTurboAssember(&turbo_assembler_);
  closure_bridge     = (void *)code->raw_instruction_start();

  DLOG("[*] Build the closure bridge at %p\n", closure_bridge);

#endif
  return (void *)closure_bridge;
}

Finally, dispatch happens in intercept_routing_common_bridge_handler.

// Every closure bridge branches here; common_bridge_handler then dispatches to the specific handler.
void intercept_routing_common_bridge_handler(RegisterContext *reg_ctx, ClosureTrampolineEntry *entry) {
  DLOG("[*] catch common bridge handler, carry data: %p, carry handler: %p\n",
       ((HookEntry *)entry->carry_data)->target_address, entry->carry_handler);
  USER_CODE_CALL UserCodeCall = (USER_CODE_CALL)entry->carry_handler;
  UserCodeCall(reg_ctx, entry);
  return;
}
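
The handler receives the stack pointer as its RegisterContext. A minimal sketch of a layout that would match the save sequence above (padding, x0, x1-x30, q0-q7, lowest address first) is shown below. This struct, the field types of ClosureTrampolineEntry, and the example handler are assumptions for illustration; the actual HookZz definitions may differ.

```cpp
#include <cstddef>
#include <cstdint>

struct alignas(16) QRegister {
  uint8_t bytes[16];
};

// Assumed layout of the saved state, as seen from SP after the bridge's pushes.
struct RegisterContext {
  uint64_t padding;  // unused slot at SP + 0, keeps SP 16-byte aligned
  uint64_t x[31];    // x0 at SP + 8, x1..x30 from SP + 16
  QRegister q[8];    // q0..q7 from SP + 256
};
static_assert(offsetof(RegisterContext, x) == 8, "x0 is stored at SP + 8");
static_assert(offsetof(RegisterContext, q) == 256, "q0 follows x30");
static_assert(sizeof(RegisterContext) == 2 * 8 + 30 * 8 + 8 * 16, "matches the SP adjustments");

// Field names follow the entry setup code above; the types are assumed.
struct ClosureTrampolineEntry {
  void *address;
  void *carry_data;
  void *carry_handler;
  size_t size;
};

using UserCodeCall = void (*)(RegisterContext *, ClosureTrampolineEntry *);

// Same dispatch shape as intercept_routing_common_bridge_handler.
void CommonBridgeHandler(RegisterContext *ctx, ClosureTrampolineEntry *entry) {
  auto handler = reinterpret_cast<UserCodeCall>(entry->carry_handler);
  handler(ctx, entry);
}

// Example user handler: rewrites the first argument register before execution resumes.
void ForceFirstArgToZero(RegisterContext *ctx, ClosureTrampolineEntry *) {
  ctx->x[0] = 0;
}
```

Because the bridge reloads every register from this block after the handler returns, any field the handler modifies becomes the live register value.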

2.3. Function Wrapper With Closure-Trampoline-Bridge

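How can a pre_call and a post_call be added to a function? The pre_call still goes through the trampoline, no question. For the post_call, the idea is to modify the lr register so that it points to the Closure Trampoline Bridge. The original lr must be saved, but where? Here ThreadLocal storage is used to build a pseudo call stack for each thread: lr is pushed on entry to pre_call and popped in post_call.

This requires every platform to implement the following interface.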

class OSThread {
public:
  typedef int LocalStorageKey;

  static int GetCurrentProcessId();

  static int GetCurrentThreadId();

  // Thread-local storage.
  static LocalStorageKey CreateThreadLocalKey();

  static void DeleteThreadLocalKey(LocalStorageKey key);

  static void *GetThreadLocal(LocalStorageKey key);

  static int GetThreadLocalInt(LocalStorageKey key);

  static void SetThreadLocal(LocalStorageKey key, void *value);

  static void SetThreadLocalInt(LocalStorageKey key, int value);

  static bool HasThreadLocal(LocalStorageKey key);

  static void *GetExistingThreadLocal(LocalStorageKey key);
};
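
The per-thread pseudo call stack itself can be sketched with C++ thread_local storage instead of the key-based interface above. OnPreCall, OnPostCall and g_saved_lr_stack are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// One shadow stack of saved return addresses per thread.
thread_local std::vector<uintptr_t> g_saved_lr_stack;

// pre_call: remember the real return address and return the value to put in lr,
// so that the hooked function "returns" into the post_call trampoline.
uintptr_t OnPreCall(uintptr_t original_lr, uintptr_t post_call_trampoline) {
  g_saved_lr_stack.push_back(original_lr);
  return post_call_trampoline;
}

// post_call: pop the real return address so control can go back to the caller.
// Must be paired with an earlier OnPreCall on the same thread.
uintptr_t OnPostCall() {
  const uintptr_t original_lr = g_saved_lr_stack.back();
  g_saved_lr_stack.pop_back();
  return original_lr;
}
```

A stack rather than a single slot is needed because hooked functions can nest or recurse on the same thread.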

2.4. Standardizing the Project Layout

│   ├── InstructionRelocation
│   │   ├── arm
│   │   ├── arm64
│   │   ├── ia32
│   │   ├── x64
│   │   └── x86
│   ├── InterceptRouting
│   │   ├── InterceptRouting.cpp
│   │   └── InterceptRouting.h
│   ├── InterceptRoutingPlugin
│   │   ├── DynamicBinaryInstrument
│   │   ├── FunctionInlineReplace
│   │   └── FunctionWrapper
│   ├── InterceptRoutingTrampoline
│   │   ├── arm
│   │   ├── arm64
│   │   └── x64

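InlineHook Summary

Because InlineHook modifies memory, it will be detected if the software performs an integrity check such as a crc. There are of course #PF-based and HardwareBreakpoint-based approaches that address this, but they come with limitations of their own.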

Feb 12 '19 13:02 jmpews

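On x86/x64 I used to just embed a full assembler to save effort. My habit is also to use push + ret rather than jmp, simply because I can't be bothered to work out short versus long jumps. On Windows, hardware breakpoints can be detected through the dr registers in the thread ctx. There are also hooks based on the exception mechanism, but ordinary security products can detect all of these specifically. I don't know what the situation is on iOS, or whether it has similar mechanisms.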

Feb 14 '19 13:02 ohroy

@rozbo Isn't push ret a bit longer? One extra byte? A few days ago I looked at an iOS CE that someone built on hardware breakpoints, and it was pretty good.

Feb 14 '19 15:02 jmpews

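frida seems to use hardware breakpoints as well; I just don't know whether there is a corresponding way to detect them. If there isn't, hooking this way would be impressively stealthy.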

Feb 15 '19 01:02 ohroy
