This category involves doing original research on some topic related to security (it doesn't have to be purely binary analysis, but try not to stray too far from it). I've collected a list of projects here that I think would be cool (some are already in progress), but feel free to suggest your own project as well.
Background reading:
Neural machine translation works by learning a correspondence between two languages from a parallel corpus. Some past work has applied this to the same function compiled for two different architectures. But what if we did it with a function's source code and its compiled binary representation? The hope is that you may be able to learn to decompile binary code automatically. You can generate huge amounts of training data just by compiling a bunch of programs and using debug information to tell you the mapping between source lines and assembly instructions.
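To make the data-generation step concrete, here is a minimal Python sketch (assuming an ELF binary built with `-g` and the `pyelftools` package) that walks the DWARF line table and maps each source line to the instruction addresses it produced; pairing those addresses with the source text and a disassembler would give you the parallel corpus:

```python
# pair_lines.py -- sketch: map source lines to instruction addresses via DWARF.
# Assumes the binary was compiled with -g and pyelftools is installed.
from collections import defaultdict
from elftools.elf.elffile import ELFFile

def line_to_addrs(path):
    """Return {(filename, line): [addresses]} using the DWARF line table."""
    mapping = defaultdict(list)
    with open(path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            raise SystemExit("no DWARF info -- recompile with -g")
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lp = dwarf.line_program_for_CU(cu)
            if lp is None:
                continue
            for entry in lp.get_entries():
                st = entry.state
                if st is None or st.end_sequence:
                    continue
                # st.file is a 1-based index into the file table (DWARF <= 4).
                fname = lp["file_entry"][st.file - 1].name.decode()
                mapping[(fname, st.line)].append(st.address)
    return mapping

if __name__ == "__main__":
    import sys
    for (fname, line), addrs in sorted(line_to_addrs(sys.argv[1]).items()):
        # Next step (not shown): read the source line and disassemble these
        # addresses (e.g. with capstone) to emit (source, assembly) pairs.
        print(f"{fname}:{line} -> {[hex(a) for a in addrs]}")
```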
Background reading:
Using our LAVA bug-injection system, we built a small prototype of an automated CTF. It worked by automatically injecting bugs into two seed programs to create lots of variants that could then be exploited by competitors. However, it had many limitations:
For this project: pick one of these problems and extend AutoCTF to fix it, resulting in a more fun CTF (I have ideas about how to tackle many of these, so ask me for details). Once you've made your changes, you should try to get someone to solve some of your new challenges as well as some of the original AutoCTF challenges (don't tell them which are which). If you've done things right, they will hopefully find your challenges more fun than the original versions!
Modern image editing software, printers, and scanners all contain code that prevents them from working with images of currency from various countries. Steven Murdoch did some initial analysis of how software detects currency. One interesting finding is that, contrary to some reports, it's not actually using the so-called EURion constellation. Instead, it appears to be using a much more subtle watermark.
The goal of this project is to analyze the code of Photoshop both statically and dynamically to understand how it detects currency. On the static side, this means disassembling Photoshop, finding the place where it decides if an image is currency, and then reverse engineering the functions there to determine how it all works. On the dynamic side, you can use things like dynamic taint analysis to help locate the functions, step through what computations are being performed, and validate knowledge derived from static analysis. You can also use dynamic taint analysis to track the image data as it moves through Photoshop.
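As a rough illustration of the dynamic side, here is a hedged pypanda sketch that replays a hypothetical whole-system recording (named `ps_currency` here) of Photoshop opening a banknote image and uses PANDA's `file_taint` and `tainted_branch` plugins to report which branches depend on the image bytes; the recording name, guest setup, and plugin arguments are all assumptions you'd need to adapt:

```python
# taint_sketch.py -- hedged sketch: replay a whole-system PANDA recording of
# Photoshop opening a banknote image and report which branches depend on the
# image bytes. Assumes a recording named "ps_currency" already exists, that the
# Panda() arguments match the ones used when recording, and that PANDA's
# file_taint / tainted_branch plugins (plus Windows OS introspection) are built.
from pandare import Panda

panda = Panda(arch="x86_64", mem="4G",
              extra_args=["-pandalog", "ps_currency.plog"])

# Label every byte read from the image file with a positional taint label.
# (file_taint matches the filename it sees in guest file-open syscalls.)
panda.load_plugin("file_taint", args={"filename": "note.tif", "pos": True})

# Log every conditional branch whose condition depends on tainted data; the
# program counters it reports are candidates for the currency-detection code.
panda.load_plugin("tainted_branch")

panda.run_replay("ps_currency")
# Afterwards, read ps_currency.plog with PANDA's plog reader and feed the
# reported PCs back into your static disassembly of Photoshop.
```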
For super extra cool points, figure out how to extract the currency watermark and transfer it onto another document, or even a T-Shirt. This would let you create your own documents that can't be scanned or printed!
Birkan Erenler has done preliminary work on this project; you are advised to collaborate with him if you want to do this project.
DARPA spent millions on the Cyber Grand Challenge, which resulted in a large suite of programs with known vulnerabilities -- just what you'd want to figure out the strengths and weaknesses of different tools! Sadly, they decided to write these programs for their own custom operating system called DECREE, which limits our ability to evaluate standard tools using the CGC dataset.
Trail of Bits did lots of great work to port the CGC challenges to Windows, Linux, and OS X. I have also done some initial work to see what's required to run KLEE on these challenges; it's not easy, but it's possible to get it up and running for at least some of the programs.
In this project, you'd perform a thorough investigation of KLEE's performance on the CGC dataset. To do so, you'll extend the work mentioned above to more programs and then run KLEE on them. You should measure both the coverage KLEE achieves on each program and the number of bugs it finds.
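A sketch of what the measurement harness might look like, assuming each challenge has already been compiled to LLVM bitcode (the `challenges/*.bc` layout is made up) and that your KLEE version accepts these common flags:

```python
# run_klee.py -- sketch: run KLEE over a set of CGC challenges and collect
# coverage / bug counts. Assumes each challenge has already been built to LLVM
# bitcode (the challenges/*.bc layout is made up) and that your KLEE version
# accepts these common flags; adjust --max-time's units for newer releases.
import csv, glob, subprocess

MAX_TIME = "3600"   # seconds per challenge

rows = [("challenge", "bugs_found", "klee_stats")]
for bc in sorted(glob.glob("challenges/*.bc")):
    outdir = bc.replace(".bc", ".klee-out")
    subprocess.run(["klee", "--output-dir=" + outdir,
                    "--max-time=" + MAX_TIME,
                    "--only-output-states-covering-new", bc], check=False)

    # klee-stats reports instruction coverage (ICov) per run; KLEE also writes
    # one *.err file in the output directory for each distinct error it found.
    stats = subprocess.run(["klee-stats", "--print-all", outdir],
                           capture_output=True, text=True).stdout
    data_row = stats.splitlines()[-2] if stats else ""   # crude table parse
    bugs = len(glob.glob(outdir + "/*.err"))
    rows.append((bc, bugs, data_row))

with open("klee_results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```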
We know that some features of binaries make them easier or harder to fuzz. The goal of this project is to systematically evaluate how compiler optimization settings affect the coverage achieved and the number of bugs found in a program. Take the two major compilers (clang and gcc), compile a set of benchmark programs with each of them under various optimization settings, then run AFL on each binary for a fixed amount of time and compare the coverage achieved. You may want to include benchmark programs with known bugs that AFL has found in the past, so that you can measure bug-finding effectiveness rather than just coverage.
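A minimal sketch of the experiment loop, assuming AFL or AFL++ is installed and the benchmark builds from a single source file (`target.c`, `seeds/`, and the flag/field names are assumptions to adapt; swap `afl-clang-fast` for `afl-gcc` to cover the gcc side):

```python
# opt_vs_fuzz.py -- sketch: build one benchmark at several optimization levels,
# fuzz each for a fixed budget, and compare AFL's own coverage/crash counters.
# Assumptions: AFL++ (or classic AFL) is installed; the benchmark is a single
# C file named target.c with seed inputs in seeds/; -V is AFL++'s "stop after
# N seconds" flag -- with classic AFL, wrap afl-fuzz in timeout(1) instead.
import os, subprocess

SRC, SEEDS, BUDGET = "target.c", "seeds", "3600"

results = {}
for opt in ["-O0", "-O1", "-O2", "-O3", "-Os"]:
    binary, outdir = "target" + opt, "findings" + opt
    subprocess.run(["afl-clang-fast", opt, "-o", binary, SRC], check=True)
    subprocess.run(["afl-fuzz", "-i", SEEDS, "-o", outdir, "-V", BUDGET,
                    "--", "./" + binary, "@@"], check=False)

    # AFL++ puts stats under <outdir>/default/, classic AFL directly in <outdir>.
    stats_path = os.path.join(outdir, "default", "fuzzer_stats")
    if not os.path.exists(stats_path):
        stats_path = os.path.join(outdir, "fuzzer_stats")
    stats = dict(line.split(":", 1) for line in open(stats_path) if ":" in line)
    stats = {k.strip(): v.strip() for k, v in stats.items()}

    # Field names differ between versions (paths_total vs corpus_count,
    # unique_crashes vs saved_crashes); bitmap_cvg is the edge-coverage figure.
    results[opt] = {k: stats.get(k) for k in
                    ("bitmap_cvg", "paths_total", "corpus_count",
                     "unique_crashes", "saved_crashes")}

for opt, r in results.items():
    print(opt, r)
```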
Background reading:
Related to the above project, can we come up with new compiler passes that make fuzzing easier or harder? One easy thing that has been suggested is breaking up comparisons against large constants into a series of smaller comparisons that can be discovered by AFL incrementally.
Are there other transformations that would make the resulting program easier to fuzz? Suggest one or two such improvements, implement them in LLVM/clang, and then evaluate the effect those changes have on the "fuzzability" of the resulting program.
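The real implementation would be an LLVM IR pass in C++ (e.g., rewriting wide icmp instructions), but here is a toy Python illustration of why the comparison-splitting transformation helps a coverage-guided fuzzer:

```python
# split_cmp.py -- source-level illustration (not an LLVM pass) of the transform
# described above: a single 32-bit comparison against a magic value gives AFL
# no partial feedback, while byte-at-a-time comparisons create a new branch
# (and thus new coverage) for every byte the fuzzer gets right.
MAGIC = 0xDEADBEEF

def check_original(x: int) -> bool:
    # One branch: the fuzzer only learns anything when it guesses all 32 bits.
    return x == MAGIC

def check_split(x: int) -> bool:
    # Four branches: each correct byte flips a new edge in the coverage map,
    # so AFL keeps the inputs that make incremental progress.
    for shift in (24, 16, 8, 0):
        if ((x >> shift) & 0xFF) != ((MAGIC >> shift) & 0xFF):
            return False
    return True

assert check_original(0xDEADBEEF) and check_split(0xDEADBEEF)
assert not check_split(0xDEADBE00)
```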
Background reading:
Static analyzers have lots of false positives. Recently, a group of researchers developed a technique for guided fuzzing, which allows a fuzzer like AFL to be directed toward specific parts of the program. This gives you the opportunity to use a noisy static analyzer to find potential problems and then use fuzzing to try to confirm whether those problems really exist. For static analysis, I recommend something like the Clang Static Analyzer or even just lint; it's also possible to use Coverity Scan for free, but the project has to be open source (and this may violate the EULA; don't ask me, I'm not a lawyer).
The tricky part of this one is evaluation: if you just pick a random program, you may or may not find any real issues. I'd recommend either working with programs that have some known vulnerabilities or deliberately seeding some programs with vulnerabilities by hand. Then you can measure the base rate of bug-finding, the accuracy of the static analysis tool, and the combined efficacy of static analysis plus directed fuzzing.
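One possible glue step, sketched in Python: run the Clang Static Analyzer on a single-file target and convert its warnings into the file:line target list that a directed fuzzer like AFLGo expects (the invocation details and output format here are assumptions to verify against your toolchain):

```python
# sa_to_targets.py -- sketch: run the Clang Static Analyzer over a single-file
# target and turn its warnings into a target list for a directed fuzzer such
# as AFLGo (which reads "file:line" entries from a BBtargets.txt file).
# Assumes clang is installed; real projects will need scan-build or a
# compile_commands.json instead of a single-file invocation.
import re, subprocess, sys

src = sys.argv[1]  # e.g. "vuln.c"

# With text output, analyzer diagnostics go to stderr as file:line:col: warning: ...
proc = subprocess.run(
    ["clang", "--analyze", "-Xclang", "-analyzer-output=text", src],
    capture_output=True, text=True)

targets = set()
for m in re.finditer(r"^([^:\s]+\.c):(\d+):\d+: warning:", proc.stderr, re.M):
    targets.add(f"{m.group(1)}:{m.group(2)}")

with open("BBtargets.txt", "w") as f:
    f.write("\n".join(sorted(targets)) + "\n")
print(f"{len(targets)} candidate targets written; point your directed fuzzer at them")
```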
Background reading:
Embedded devices often ship with functionality that end users don't end up needing. For example, if you don't have any Bluetooth devices, you might want to turn off the Bluetooth radio on your car. So how can we disable hardware functionality if we have the ability to modify the software?
External events come into an embedded system by way of interrupts. Typically there is a well-defined top-level interrupt handler (e.g., on ARM with high vectors enabled this lives at address 0xFFFF0018) that queries an interrupt controller to find out which device requested an interrupt (usually this just means reading the interrupt number out of some memory-mapped I/O region), looks up the appropriate handler, and then executes it.
So if we can enumerate and identify the interrupts available in a piece of firmware, then we can modify the firmware so that it disables those interrupts. How do we locate interrupts that are available? One way is to just trigger a top-level interrupt and then fuzz the responses whenever the firmware tries to query a memory-mapped I/O address. By doing this repeatedly and tracing what code gets executed, you can build up a mapping between different fuzzed values and the corresponding handlers.
Concretely, you should start with a memory dump of a Raspberry Pi. Using PANDA, you can then run an emulated system, load that RAM dump into guest memory, set up the CPU registers, and trigger an interrupt. Evaluate this by comparing the interrupts you find against those documented in the Raspberry Pi's datasheets.
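A very rough pypanda sketch of the fuzzing loop described above; the machine model, RAM load address, vector address, and especially the unassigned-I/O callback name and signature are assumptions that need checking against your PANDA build and the Pi's memory map:

```python
# irq_fuzz.py -- rough sketch of the interrupt-enumeration idea with pypanda.
# Heavily hedged: the machine model, the RAM-dump load address, and the
# unassigned-I/O callback name/signature below are assumptions to verify
# against your PANDA version and the Raspberry Pi's memory map.
import itertools
from pandare import Panda

panda = Panda(arch="arm", mem="1G")          # machine model left generic here
fuzz_values = itertools.cycle(range(64))     # candidate interrupt numbers
handlers_seen = {}                           # fuzzed value -> set of block PCs
current = None

@panda.cb_after_machine_init
def setup(cpu):
    # Load the Raspberry Pi RAM dump and point the CPU at the IRQ vector
    # (0x18 with low vectors, 0xFFFF0018 with high vectors).
    with open("rpi_ram.bin", "rb") as f:
        panda.physical_memory_write(0x0, f.read())
    panda.arch.set_pc(cpu, 0x18)

@panda.cb_unassigned_io_read                  # assumed callback name/signature
def mmio_read(cpu, pc, addr, size, val):
    # Whenever the firmware polls the interrupt controller's MMIO registers,
    # feed it the next candidate interrupt number and remember which one.
    global current
    current = next(fuzz_values)
    val[0] = current
    return True                               # tell QEMU we supplied the value

@panda.cb_before_block_exec
def trace(cpu, tb):
    # Record which code runs for each fuzzed value; distinct clusters of PCs
    # correspond to distinct interrupt handlers.
    if current is not None:
        handlers_seen.setdefault(current, set()).add(tb.pc)

panda.run()
```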
Note: Hao Ke (now graduated) did initial work on this project in Spring 2017 and you are encouraged to build on it. His code is available on GitHub:
https://github.com/H4oK3/interrupt_analysis
Background reading:
We know that malware sometimes uses exploits to elevate its privileges once it's on a system. But how can we detect such exploits? Traditional dynamic malware analysis techniques have trouble here, because they generally rely on monitoring system calls or taking memory snapshots, neither of which captures enough information to detect previously unknown exploits.
With whole-system record and replay, however, we can capture a trace of malware executing in a VM and then replay that trace with instruction-level fidelity. This means we can check properties like Control Flow Integrity, which should allow generic detection of exploits. In addition, we have over 96,000 recordings of malware, so we can get a good measurement of how common privilege escalation is in a large malware corpus.
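As a starting point, here is a hedged pypanda sketch that replays a recording and does a coarse call/return matching check with the `callstack_instr` plugin (I'm assuming `on_ret` reports the entry point of the function being returned from; a real CFI check would track actual return addresses and indirect-branch targets):

```python
# cfi_replay.py -- sketch: replay a malware recording and do a coarse
# call/return matching check using PANDA's callstack_instr plugin.
# This only flags returns that don't unwind the shadow stack of observed
# calls, which is already enough to surface many ROP/stack-pivot exploits.
from pandare import Panda

panda = Panda(generic="i386")        # must match the config used when recording
panda.load_plugin("callstack_instr")

shadow = []          # stack of function entry points we saw being called
violations = []

@panda.ppp("callstack_instr", "on_call")
def on_call(cpu, func):
    shadow.append(func)

@panda.ppp("callstack_instr", "on_ret")
def on_ret(cpu, func):
    # On a well-behaved trace, returns unwind the calls in LIFO order.
    if not shadow or shadow.pop() != func:
        violations.append((panda.arch.get_pc(cpu), func))

panda.run_replay("malware_sample")   # name of one of the 96K recordings
print(f"{len(violations)} call/return mismatches; inspect these PCs first")
```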
So, the basic plan is this:
Mohammed Ashraf Ali has done preliminary work on this project; you are advised to collaborate with him if you want to do this project.
Computer science has historically not been great at making sure our results can be reproduced by others, much less actually trying to reproduce past work. This category of projects tries to remedy that: you will pick a paper on something related to binary analysis from one of the top 4 security conferences (IEEE Security and Privacy (Oakland), ACM CCS, USENIX Security, and NDSS) and try to reproduce its results.
What does it mean to reproduce a paper? We can think of a few different levels:
Naturally, (1) and (2) are easier to accomplish. However, (3) provides stronger assurance that the original research is correct (if you just use their software directly, you're not likely to notice bugs in it that affect the results; if you reimplement it, your mistakes are unlikely to overlap exactly with theirs, so you'll see a discrepancy in the results). Also, sometimes the original authors can't or won't make the software or data they used available, so (3) is the only option.
For the writeup, you should describe the steps you went through to try and reproduce the results. Document anything that wasn't obvious about the original description, unexpected stumbling blocks and how you overcame them, and finally what the results of the replication effort were.
Warning: reproducibility is something of a sensitive topic. If you decide to take on this project, be respectful of the original authors. Coming up with a different result doesn't necessarily mean that you're right and they're wrong. You will be expected to make attempts to correspond with the original authors in order to help reconcile your results with theirs.
In a survey paper, you read a lot of different papers and try to assemble them into a coherent set of themes that describe an entire subarea of research. Do not make the mistake of thinking this category is the "easy option": writing a good survey paper is a lot of work. You will generally need to survey around 100 papers to get a complete picture, and you'll have to actually understand the field and be able to meaningfully summarize the main ideas and contributions. However, a good survey can be really valuable to the community, because it provides a single entry point that newcomers can use to get a handle on the area, complete with references to the most important work.
Some possible topics: