Overview 1: The DC layer

2014-04-03 | Dagger Team

A new LLVM library is introduced to provide DeCompilation to IR: DC. In much the same way the MC layer works, DC provides target-independent APIs, which are partially implemented by backends. Currently, DC is focused on translating LLVM MC (assembly and machine code level) to IR.
There are a few basic principles guiding the design of DC. First, do as much as possible statically: for instance, the dynamic binary translator runtime is the static translator, reading from memory rather than from binary object files, with a few changes to accommodate linker- and loader-related problems. Second, optimize for the common case: in our case, sound binaries produced by a compiler, rather than more exotic files that use packers, obfuscation, or any kind of anti-reverse-engineering. Still, keep corner cases in mind and provide fallbacks. Third, be as lazy as possible during translation, mostly to reduce its overhead.

There are a few key classes in DC.

DCInstrSema implements instruction-level semantics. It handles the translation from MCInsts to IR, driven by the tables produced by the SemanticsEmitter TableGen backend. It is also responsible for creating control flow, functions, and basic blocks in the generated IR Module, based on the translated control flow instructions. In short, this is the class that does local translation, from an MCFunction to an IR Function.
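
To make the idea of table-driven translation more concrete, here is a minimal, self-contained sketch. It does not use the DC or LLVM APIs: ToyInst, SemaOp, SemaStep, semanticsFor, and translateInst are hypothetical names, the table is written by hand rather than generated by TableGen, and the output is pseudo-IR text rather than real LLVM IR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical semantic operations; the real tables are generated by TableGen.
enum class SemaOp { GetReg, Add, PutReg };

struct SemaStep {
  SemaOp Op;
  unsigned Arg0; // GetReg/PutReg: operand index. Add: first value number.
  unsigned Arg1; // PutReg/Add: value number. Unused for GetReg.
};

// Toy stand-in for an MCInst: an opcode plus register operands.
struct ToyInst {
  unsigned Opcode;
  std::vector<std::string> Regs;
};

constexpr unsigned ToyADD = 0;

// Look up the semantics of an opcode in a (hand-written) table.
inline const std::vector<SemaStep> &semanticsFor(unsigned Opcode) {
  static const std::vector<std::vector<SemaStep>> Table = {
      // ToyADD: Regs[0] = Regs[0] + Regs[1]
      std::vector<SemaStep>{{SemaOp::GetReg, 0, 0},
                            {SemaOp::GetReg, 1, 0},
                            {SemaOp::Add, 0, 1},
                            {SemaOp::PutReg, 0, 2}},
  };
  assert(Opcode < Table.size() && "no semantics for this opcode");
  return Table[Opcode];
}

// Walk the table entry and emit one line of pseudo-IR per step.
inline std::vector<std::string> translateInst(const ToyInst &Inst) {
  std::vector<std::string> Out;
  unsigned NextValue = 0; // number of the next value to define
  for (const SemaStep &S : semanticsFor(Inst.Opcode)) {
    switch (S.Op) {
    case SemaOp::GetReg:
      Out.push_back("%" + std::to_string(NextValue++) + " = get " +
                    Inst.Regs.at(S.Arg0));
      break;
    case SemaOp::Add:
      Out.push_back("%" + std::to_string(NextValue++) + " = add %" +
                    std::to_string(S.Arg0) + ", %" + std::to_string(S.Arg1));
      break;
    case SemaOp::PutReg:
      Out.push_back("put " + Inst.Regs.at(S.Arg0) + ", %" +
                    std::to_string(S.Arg1));
      break;
    }
  }
  return Out;
}
```

Translating ToyInst{ToyADD, {"rax", "rbx"}} yields four lines: two register reads, one add, and one register write. The point of the design is that the translation loop stays generic, and only the table changes per target.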

DCRegisterSema is used to generate code related to register set accesses: generating the register set context struct type, using it to save and restore registers, creating local variables for individual registers, handling super-/sub-register semantics, and more. This is also where status register flags and condition codes are translated.
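
As a rough illustration of what "register set context" and "sub-register semantics" mean here, the sketch below models them in plain C++ instead of generated IR. ToyRegSet, readSubReg, writeSubReg, and setLow16OfR0 are hypothetical names for a made-up target with two 64-bit registers. The sketch assumes a sub-register write preserves the upper bits of its super-register, which is not true of every target or register width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register-set context; not the struct type DC generates.
struct ToyRegSet {
  uint64_t R0 = 0;
  uint64_t R1 = 0;
};

// Read the low `Bits` bits of a super-register value.
inline uint64_t readSubReg(uint64_t Super, unsigned Bits) {
  assert(Bits > 0 && Bits < 64 && "sub-register must be narrower than 64 bits");
  return Super & ((uint64_t{1} << Bits) - 1);
}

// Write the low `Bits` bits of a super-register, preserving the upper bits.
// Assumption: this models only the "preserve" case; some targets clear the
// upper bits for certain widths instead.
inline uint64_t writeSubReg(uint64_t Super, uint64_t Value, unsigned Bits) {
  assert(Bits > 0 && Bits < 64 && "sub-register must be narrower than 64 bits");
  const uint64_t Mask = (uint64_t{1} << Bits) - 1;
  return (Super & ~Mask) | (Value & Mask);
}

// Model the "load into a local, operate, store back" pattern.
inline void setLow16OfR0(ToyRegSet &Ctx, uint16_t Value) {
  uint64_t R0 = Ctx.R0;            // restore: context -> local variable
  R0 = writeSubReg(R0, Value, 16); // sub-register write on the local
  Ctx.R0 = R0;                     // save: local variable -> context
}
```

For example, with R0 holding 0x1122334455667788, calling setLow16OfR0(Ctx, 0xBEEF) leaves R0 at 0x112233445566BEEF and does not touch R1.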

DCTranslator drives the complete translation process, from disassembly to symbolization to IR generation to immediate optimizations.
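
The "be as lazy as possible" principle can be sketched at the translator level as translate-on-first-request with a cache keyed by address. This is only an illustration under assumptions: ToyTranslator, getFunctionAt, and numTranslations are hypothetical names, not the DCTranslator interface. The whole disassemble, symbolize, generate, optimize pipeline is collapsed into a single callback that returns a string.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical lazy translator: nothing is translated until it is asked for.
class ToyTranslator {
public:
  // Stand-in for the full pipeline; maps an address to its "translation".
  using TranslateFn = std::function<std::string(uint64_t)>;

  explicit ToyTranslator(TranslateFn Fn) : Translate(std::move(Fn)) {
    assert(Translate && "a translation callback is required");
  }

  // Return the translation for Addr, running the pipeline at most once.
  const std::string &getFunctionAt(uint64_t Addr) {
    auto It = Cache.find(Addr);
    if (It == Cache.end()) {
      ++NumTranslations;
      It = Cache.emplace(Addr, Translate(Addr)).first;
    }
    return It->second;
  }

  // How many times the pipeline actually ran.
  unsigned numTranslations() const { return NumTranslations; }

private:
  TranslateFn Translate;
  std::map<uint64_t, std::string> Cache;
  unsigned NumTranslations = 0;
};
```

Requesting the same address twice runs the callback once; requesting a second address runs it again. A dynamic translator built this way only pays for the code that is actually reached.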