Overview 2: The MC recursive traversal disassembler

2014-04-03 | Dagger Team

The LLVM MC layer provides APIs for disassembling machine code into MC instructions. The MCDisassembler interface is fairly basic: it essentially decodes a single instruction from a given memory buffer. As a result, tools built on it tend to do only linear disassembly: starting at the address of each symbol (or of each section, if the object file is stripped), they disassemble instructions sequentially. This isn't ideal. Because linear disassembly ignores control flow, unreachable portions of the text sections (for instance data-in-code such as jump tables or constant pools) get disassembled when they shouldn't be. On ISAs with variable-length instruction encodings (such as X86), the problem compounds: misinterpreting one instruction in the byte stream makes it much more likely that the following instructions will be decoded starting in the middle of their encoding, leading to further misinterpretations down the road.
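To make the desynchronization problem concrete, here is a minimal sketch of a linear sweep over a made-up variable-length ISA. The opcode table, `decode_one`, and `linear_sweep` are all hypothetical names invented for this illustration; none of this is the MCDisassembler API. The byte at address 2 is data, but it happens to look like the opcode of a 3-byte instruction.

```python
# Toy variable-length ISA (hypothetical, not a real encoding):
# opcode -> (mnemonic, total instruction length in bytes)
TOY_ISA = {
    0x01: ("nop", 1),
    0x02: ("jmp", 2),  # jmp <absolute 8-bit target>
    0x05: ("ret", 1),
    0x06: ("mov", 3),  # mov <imm8>, <imm8>
}

def decode_one(code, addr):
    """Decode the single instruction at addr; return (mnemonic, length)."""
    mnemonic, length = TOY_ISA.get(code[addr], ("invalid", 1))
    if addr + length > len(code):
        return ("invalid", 1)  # truncated encoding
    return mnemonic, length

def linear_sweep(code, start=0):
    """Decode sequentially from start, ignoring control flow entirely."""
    addr, listing = start, []
    while addr < len(code):
        mnemonic, length = decode_one(code, addr)
        listing.append((addr, mnemonic))
        addr += length
    return listing

# 0: jmp 4    2-3: two data bytes    4: nop    5: ret
CODE = bytes([0x02, 0x04, 0x06, 0xAA, 0x01, 0x05])
listing = linear_sweep(CODE)
print(listing)
```

In this sketch the sweep decodes the data byte at address 2 as a `mov`, which swallows the real `nop` at address 4. The listing then resumes at address 5, so one bad decode has hidden a valid instruction.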

In real-world binaries, we don't always know in advance where and what to disassemble. Recursive traversal disassembly addresses this by starting at the entrypoints and following branches and calls, so that what gets disassembled matches the actual control flow of the program. This is implemented in the MCObjectDisassembler: given an ObjectFile representing a binary, it starts at the entrypoints (usually main or friends), follows branches to create MCBasicBlocks, and follows calls to create MCFunctions, which are in turn composed of MCBasicBlocks. It uses the MCDisassembler to disassemble MCInsts into the MCBasicBlocks. Information about how instructions relate to control flow is obtained from the MCInstrAnalysis helper class.
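The traversal described above can be sketched with a simple worklist over a made-up ISA. Everything here is an assumption made for illustration: the opcode table stands in for the decoder, its "kind" column stands in for the control-flow information that MCInstrAnalysis provides, and `recursive_traversal` is a hypothetical function, not how MCObjectDisassembler is actually written.

```python
# Toy ISA (hypothetical): opcode -> (mnemonic, length, control-flow kind)
TOY_ISA = {
    0x01: ("nop", 1, "fallthrough"),
    0x02: ("jmp", 2, "jump"),    # unconditional, absolute 8-bit target
    0x03: ("jz", 2, "cond"),     # conditional, absolute 8-bit target
    0x04: ("call", 2, "call"),   # absolute 8-bit target
    0x05: ("ret", 1, "return"),
    0x06: ("mov", 3, "fallthrough"),
}

def recursive_traversal(code, entry):
    """Follow control flow from entry; return (instructions, function starts)."""
    insts = {}           # address -> mnemonic, only for reachable code
    functions = {entry}  # the entrypoint plus every call target
    worklist = [entry]
    while worklist:
        addr = worklist.pop()
        # Decode straight-line code until control flow ends the path
        # or we reach an address that was already disassembled.
        while 0 <= addr < len(code) and addr not in insts:
            mnemonic, length, kind = TOY_ISA.get(code[addr], ("invalid", 1, "stop"))
            if kind == "stop" or addr + length > len(code):
                break  # undecodable bytes: give up on this path
            insts[addr] = mnemonic
            if kind in ("jump", "cond", "call"):
                target = code[addr + 1]
                worklist.append(target)
                if kind == "call":
                    functions.add(target)
            if kind in ("jump", "return"):
                break  # no fallthrough after these
            addr += length
    return insts, functions

# 0: call 6   2: jmp 5   4: data byte   5: ret   6: nop   7: ret
CODE = bytes([0x04, 0x06, 0x02, 0x05, 0x06, 0x05, 0x01, 0x05])
insts, functions = recursive_traversal(CODE, entry=0)
print(sorted(insts.items()), sorted(functions))
```

Because nothing branches to or falls through into address 4, the data byte there is never decoded. A real implementation would also split the discovered instructions into basic blocks at branch targets, which this sketch leaves out for brevity.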

The result of the recursive disassembly is an MCModule, progressively filled in by the MCObjectDisassembler. It is made of MCFunctions, themselves composed of MCBasicBlocks, each of which references an MCTextAtom holding the instructions it contains. MCTextAtoms are a special kind of MCAtom; MCAtoms represent atomic binary blobs found in the object file.
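As a rough mental model of this containment hierarchy, the relationships can be mirrored with a few small classes. This is only a sketch: the class names drop the MC prefix to make clear they are hypothetical stand-ins, and every field (address ranges, successor lists, function names) is an assumption rather than the layout of the real LLVM classes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Atom:
    """An atomic binary blob, identified here by an assumed address range."""
    begin: int
    end: int

@dataclass
class TextAtom(Atom):
    """An atom that holds instructions, as (address, mnemonic) pairs."""
    insts: List[Tuple[int, str]] = field(default_factory=list)

@dataclass
class BasicBlock:
    """References the text atom that holds its instructions."""
    text: TextAtom
    successors: List["BasicBlock"] = field(default_factory=list)

@dataclass
class Function:
    name: str
    blocks: List[BasicBlock] = field(default_factory=list)

@dataclass
class Module:
    atoms: List[Atom] = field(default_factory=list)
    functions: List[Function] = field(default_factory=list)

# Build a tiny module by hand: one function made of two basic blocks.
a0 = TextAtom(0, 4, [(0, "call"), (2, "jmp")])
a1 = TextAtom(5, 6, [(5, "ret")])
b0, b1 = BasicBlock(a0), BasicBlock(a1)
b0.successors.append(b1)
module = Module(atoms=[a0, a1], functions=[Function("main", [b0, b1])])

# Walk module -> functions -> basic blocks -> text atom -> instructions.
mnemonics = [m for f in module.functions for bb in f.blocks for _, m in bb.text.insts]
print(mnemonics)
```

The point of the indirection is that a basic block does not own its bytes. It refers to an atom, and the atoms together describe the contents of the object file.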