It's been a few months since we last made a post, so it's about time for an update! We made lots of progress over the summer; more on that later. Since then, we've all been very busy, with grad school for some and new jobs for others, and haven't had as much time as we wanted to work on the project.
We've made quite a few commits on our private repo; every once in a while, we submit one of those as a public patch to LLVM, to be reviewed by the community. We've already submitted some of them, and most have since been committed (though 2 or 3 still need some work).
Basically, there are two parts to the work we did. The first is the MC-level additions. We built on top of some existing infrastructure and added a bunch of new stuff. The goal was to improve disassembly: traditional disassemblers, such as binutils' objdump or the LLVM MC disassembler API, are linear disassemblers, and as such aren't really fit for our project. They scan through the code sections sequentially, and at the first non-code byte (for instance, a jump table entry), everything goes wrong: the data gets decoded as if it were instructions, and the rest of the code can't be disassembled (or at least, not correctly).
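To make that failure mode concrete, here is a minimal Python sketch of a linear sweep. It uses a made-up four-opcode instruction set (the `OPCODES` table, `decode`, and `linear_sweep` are all hypothetical names for illustration, not x86 and not the LLVM MC API). A single data byte sits between two instructions, and the sweep happily decodes it as the start of a `PUSH`, swallowing the real instruction that follows.

```python
# Toy ISA, invented for this example: opcode byte -> (mnemonic, total length in bytes).
OPCODES = {0x01: ("NOP", 1), 0x02: ("PUSH", 2), 0x03: ("JMP", 2), 0x04: ("RET", 1)}


def decode(code, offset):
    """Decode one instruction at `offset`; return (mnemonic, operand, length) or None."""
    entry = OPCODES.get(code[offset])
    if entry is None or offset + entry[1] > len(code):
        return None
    mnemonic, length = entry
    operand = code[offset + 1] if length == 2 else None
    return mnemonic, operand, length


def linear_sweep(code):
    """Disassemble from the first byte to the last, never following branches."""
    listing = []
    offset = 0
    while offset < len(code):
        inst = decode(code, offset)
        if inst is None:
            listing.append((offset, "<bad>", None))
            offset += 1
            continue
        mnemonic, operand, length = inst
        listing.append((offset, mnemonic, operand))
        offset += length
    return listing


# Layout:  0: JMP 3   2: <data byte 0x02>   3: NOP   4: RET
CODE = bytes([0x03, 0x03, 0x02, 0x01, 0x04])

listing = linear_sweep(CODE)
for offset, mnemonic, operand in listing:
    print(offset, mnemonic, "" if operand is None else operand)
```

The sweep reports a bogus `PUSH` at offset 2 and never sees the `NOP` at offset 3, even though that `NOP` is the actual target of the jump.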
Recursive traversal disassemblers, such as IDA, more closely match the actual execution of the code, by following branches and disassembling at their targets. We added a recursive traversal disassembler to LLVM MC. We also added data structures to represent assembly-level (or more precisely, MC-level) Control Flow Graphs, based mostly around the MCObjectDisassembler, MCModule, and MCAtom classes. Also, the new APIs we wrote work at the object level, and not at the byte stream level. For now, we've been focusing on Mach-O and Linux/ELF X86, mostly because those are the most common machines we have. We also did some work on ARM (v7), though we'll see about that another time!
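As a rough illustration of the idea (and only that: this is a toy Python sketch, not how MCObjectDisassembler, MCModule, or MCAtom are implemented), here is a worklist-based recursive traversal over a made-up five-opcode instruction set, followed by a pass that groups the discovered instructions into basic blocks with successor edges. Every name in it (`OPCODES`, `decode`, `recursive_traversal`, `build_cfg`) is hypothetical.

```python
# Toy ISA, invented for this example: opcode byte -> (mnemonic, total length in bytes).
OPCODES = {0x01: ("NOP", 1), 0x02: ("PUSH", 2), 0x03: ("JMP", 2),
           0x04: ("RET", 1), 0x05: ("JZ", 2)}


def decode(code, offset):
    """Decode one instruction at `offset`; return (mnemonic, operand, length) or None."""
    entry = OPCODES.get(code[offset])
    if entry is None or offset + entry[1] > len(code):
        return None
    mnemonic, length = entry
    operand = code[offset + 1] if length == 2 else None
    return mnemonic, operand, length


def recursive_traversal(code, entry=0):
    """Follow control flow from `entry`; return decoded instructions and block leaders."""
    insts = {}          # offset -> (mnemonic, operand, length)
    leaders = {entry}   # offsets where a basic block starts
    worklist = [entry]
    while worklist:
        offset = worklist.pop()
        while offset < len(code) and offset not in insts:
            inst = decode(code, offset)
            if inst is None:
                break
            insts[offset] = inst
            mnemonic, operand, length = inst
            if mnemonic in ("JMP", "JZ"):
                targets = [operand]
                if mnemonic == "JZ":            # conditional: also fall through
                    targets.append(offset + length)
                leaders.update(targets)
                worklist.extend(targets)
                break
            if mnemonic == "RET":
                break
            offset += length
    return insts, leaders


def build_cfg(insts, leaders):
    """Group instructions into blocks: start offset -> (instruction offsets, successors)."""
    cfg = {}
    for start in sorted(l for l in leaders if l in insts):
        offset, body = start, []
        while True:
            mnemonic, operand, length = insts[offset]
            body.append(offset)
            nxt = offset + length
            if mnemonic == "JMP":
                succs = [operand]
            elif mnemonic == "JZ":
                succs = [operand, nxt]
            elif mnemonic == "RET":
                succs = []
            elif nxt in leaders or nxt not in insts:
                succs = [nxt] if nxt in insts else []
            else:
                offset = nxt
                continue
            break
        cfg[start] = (body, succs)
    return cfg


# Layout:  0: JZ 6   2: NOP   3: JMP 7   5: <data byte 0x02>   6: NOP   7: RET
CODE = bytes([0x05, 0x06, 0x01, 0x03, 0x07, 0x02, 0x01, 0x04])

insts, leaders = recursive_traversal(CODE)
cfg = build_cfg(insts, leaders)
for start, (body, succs) in cfg.items():
    print(f"block@{start}: insts={body} succs={succs}")
```

The data byte at offset 5 is never reached by any branch, so it is never decoded, and the block boundaries fall out of the branch targets. A real implementation also has to deal with indirect branches, calls, and overlapping code, which this sketch ignores.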
Back to our progress. All this is already contributed and part of LLVM MC. (You can start by looking at the Object disassembler.) A few people in the community have already shown interest and submitted patches; if you are one of them, thanks, and contributions are as always very welcome!
The rest of the work we did was on the translation from MC to IR. Now, our design changed quite a bit from the talk we gave last year at the LLVM Euro conference. At the end of the talk, we mentioned investigating direct translation from the SelectionDAG patterns found in the TableGen instruction descriptions to LLVM IR; we talked with lots of people involved in those parts of LLVM, and decided to switch to that approach. So, right now, we are able to do three things: extract those patterns using a TableGen backend; generate a table of the semantics of all instructions (this part is where the MIR code we talked about went); and generate IR, driven by that table and by the MC CFG we created from the input object file.
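To give a feel for what "driven by the table" means, here is a small Python sketch of a table-driven translator. It is an assumption-laden toy: the table format, the micro-op names (`get_reg`, `put_reg`, `add`), and the textual output are all invented for this example, and the output is a pseudo-IR, not real LLVM IR. It also handles a single straight-line block, whereas the real translation walks the MC CFG.

```python
# Hypothetical semantics table: mnemonic -> list of (micro-op, argument names).
# "opN" refers to the Nth operand of the instruction being translated;
# "$N" refers to the value produced by the Nth micro-op of the same instruction.
SEMANTICS = {
    "MOVri": [("put_reg", ["op0", "op1"])],
    "ADDrr": [("get_reg", ["op0"]),
              ("get_reg", ["op1"]),
              ("add", ["$0", "$1"]),
              ("put_reg", ["op0", "$2"])],
}

# Micro-ops that only have a side effect and produce no value.
NO_RESULT = {"put_reg"}


def translate(block):
    """Translate a list of (mnemonic, operands) into pseudo-IR lines using SEMANTICS."""
    lines = []
    next_id = 0
    for mnemonic, operands in block:
        results = []  # value name produced by each micro-op (None if it has no result)
        for micro_op, args in SEMANTICS[mnemonic]:
            resolved = []
            for arg in args:
                if arg.startswith("op"):
                    resolved.append(str(operands[int(arg[2:])]))
                else:  # "$N": result of an earlier micro-op
                    resolved.append(results[int(arg[1:])])
            text = f"{micro_op} {', '.join(resolved)}"
            if micro_op in NO_RESULT:
                results.append(None)
            else:
                name = f"%{next_id}"
                next_id += 1
                text = f"{name} = {text}"
                results.append(name)
            lines.append(text)
    return lines


BLOCK = [("MOVri", ["eax", 2]), ("MOVri", ["ebx", 3]), ("ADDrr", ["eax", "ebx"])]

ir = translate(BLOCK)
print("\n".join(ir))
```

The point of the design is that the translator itself knows nothing about individual instructions: supporting a new instruction means adding a table entry, and in our case that table is generated from the TableGen patterns rather than written by hand.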
This process works pretty well, and we tried to use it in a few different scenarios. An interesting one is dynamic binary translation. We ran the LLVM test-suite to compare execution time between natively compiled binaries (built with clang-3.4, -O1) run directly, and the same binaries run through our tool, which dynamically translates them to LLVM IR, then JITs and executes the result. The last time we checked, around 60% of all test-suite programs executed correctly under our translator; here are the results for the longest-running passing tests.
At this point we don't really have a schedule; whenever we feel a patch is ready to go, we submit it to the community. The goal is that, once we're done, our work becomes a full part of LLVM, where we and all contributors can continue to advance it!