Fuzzing in Parallel with AFL
Using the fuzzer AFL (American Fuzzy Lop) to identify vulnerabilities in a computer program
Introduction
AFL is a well-documented, user-friendly fuzzer originally developed by Michał Zalewski (aka lcamtuf) and initially released in late 2013. The tool has helped to discover hundreds of vulnerabilities in widely used software. It makes it easy to run fuzzing in parallel on multi-core systems, and it contains built-in features to further analyse the crashes identified during the fuzzing phase. In this article I describe my experience using AFL to fuzz an open-source XML parser found on GitHub. To test AFL's parallel feature, and to be able to fuzz the application for several days, I set up a quad-core virtual machine on Microsoft Azure. Towards the end, I include a short description of the exploitability assessment phase that usually follows fuzzing, citing some very useful tools.
What is fuzzing
Quoting Sutton’s book¹, we know that fuzzing
is typically an automated or semi-automated process that involves repeatedly manipulating and supplying data to target software for processing.
It is a powerful and well-known strategy for identifying security issues in software. It is able to identify bugs that can lead to remote code execution and privilege escalation. Fuzzing is not intended to find logical flaws but lower-level issues like buffer overflows or off-by-one errors, often found in low-level programs or libraries written in C/C++ where:
- code is compiled directly to machine language;
- there is no VM in the middle;
- it is easier to make security-related mistakes.
The programs that are usually targeted are network applications like HTTP and SSH servers and media conversion tools like ffmpeg and gstreamer. Another feature of fuzzing is that it can automatically generate inputs to run against your application, both to measure its code coverage and to exercise parts of the code that coverage analysis alone would not identify.
Next-gen fuzzers
Fuzzers prior to the advent of AFL fell into one of two categories: mutation-based fuzzers, which create new test cases by applying different kinds of mutations to existing data samples, and generation-based fuzzers, which create test cases from scratch by modelling the target input format. Of course, several fuzzers exist that combine techniques from both worlds.
The problem with these two fuzzing techniques is that random mutations generate test cases that are unlikely to reach certain paths inside the target code, leaving potential vulnerabilities out of reach. The approach used by AFL is to use an instrumentation-guided genetic algorithm to generate test cases that are likely to reach specific paths in the program. Instrumentation is the ability to observe a program's behaviour from the inside, by injecting specific code at compile time. AFL provides replacements for gcc and other compilers to do that (e.g. afl-gcc and afl-clang). The AFL official documentation² describes the abstract steps that the underlying algorithm performs. I will report them here:
1. Load user-supplied initial test cases into the queue;
2. Take the next input file from the queue;
3. Attempt to trim the test case to the smallest size that doesn’t alter the measured behaviour of the program;
4. Repeatedly mutate the file using a balanced and well-researched variety of traditional fuzzing strategies;
5. If any of the generated mutations resulted in a new state transition recorded by the instrumentation, add mutated output as a new entry in the queue;
6. Go to 2.
The instrumentation-guided approach of AFL requires the source code of the target program to be available, although experimental support for instrumenting black-box binaries exists (for more information refer to the paragraph "Instrumenting binary-only apps" of the official documentation²).
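For binary-only targets, the experimental QEMU mode has to be built once inside the AFL source tree and is then enabled with the -Q flag of afl-fuzz. A rough sketch, assuming an AFL 2.x source layout (check the official documentation² for the prerequisites):

    # inside the AFL source tree: build the QEMU-based instrumentation once
    cd qemu_mode && ./build_qemu_support.sh && cd ..

    # then fuzz an uninstrumented, closed-source binary with -Q
    afl-fuzz -Q -i in -o out -- /path/to/binary @@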
Fuzzing phases
Independently of the fuzzing approach or the target program, the process is made up of basic phases that always apply:
- Identify the target. In my case I selected a target that could be easily found and whose source code was available;
- Identify inputs. To some degree, all exploitable vulnerabilities are caused by user input that is not properly sanitized or validated by the program. There are several sources of potentially malicious input, and some of them are subtle: headers, filenames, environment variables, registry keys, and more;
- Generate fuzzed data. This is an automated process that generates inputs to use during the execution. AFL starts from a base folder containing one or more inputs and uses several techniques to obtain interesting test cases. These are then inserted into a queue used by the fuzzer, which is constantly updated with new data generated by the genetic algorithm;
- Execute fuzzed data. Tightly related to the previous point, this is the act of using the input to try to generate crashes in the target program. Again, this is an automated step;
- Monitor each execution. We must be able to identify the input responsible for each crash, and considering the huge number of executions per second (hundreds to thousands), the fuzzer must be smart enough to keep track of everything. Note that a hang is an interesting behaviour just like a crash;
- Assess exploitability. After the faults have been identified, the next step is to determine whether the discovered bugs can be turned into an exploit. This can be a manual (and tedious) process, or a semi-automated one.
Fuzzing in Parallel with AFL
AFL offers an interesting feature to fully exploit multi-core systems, or even to create a multi-system parallelization without much effort. Each copy of a fuzzer takes up one core, meaning that on an n-core system you can theoretically run n concurrent fuzzers with no performance hit. These concurrent fuzzers can either target n different binaries, or a single one. In the latter case, parallelizing a single job across different cores, we can exploit a feature of AFL that synchronizes a master fuzzer with n-1 secondary instances. For more information on parallel fuzzing with AFL you can refer to the AFL user guide³.
For this analysis I will focus on parallelizing across the cores of a single machine, using a 4-core processor. I am going to use a master instance (M) with 2 secondary instances (S), which will share a single output directory to synchronize their jobs. The difference between M and S is that the master instance will perform deterministic checks on the generated input data, whereas the secondary instances will only perform random tweaks of various kinds.
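In practice this maps to the -M and -S options of afl-fuzz³. A generic sketch with a placeholder target binary (the actual command I used for the XML parser is shown in the next sections):

    # master instance: also performs the deterministic stages
    afl-fuzz -i testcases/ -o sync_dir/ -M fuzzer01 -- /path/to/target @@

    # secondary instance(s): random tweaks only, sharing the same sync_dir
    afl-fuzz -i testcases/ -o sync_dir/ -S fuzzer02 -- /path/to/target @@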
Choosing the target
For this analysis I decided to pick an open-source project available on GitHub: a parallel XML parser written in C. The application accepts a file containing an XML document and outputs another file with an alternative representation of the input. Parsers are a perfect target for fuzzing since they may make assumptions about the input format, leading to bad, possibly exploitable faults. The SLOC count for the program is 1239.
Instrumenting the program
AFL makes instrumenting the program really easy. I just had to compile the program using afl-gcc instead of the classic gcc compiler. In this case the XML parser was shipped with a Makefile defining the compiler as CC = gcc. I just had to change it to CC = afl-gcc and compile the executable with make. The resulting executable is the one that AFL will feed data to and monitor for crashes.
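As a sketch, the same instrumented build can also be obtained without editing the Makefile, by overriding CC on the command line (assuming afl-gcc is on the PATH):

    # rebuild the target with the AFL compiler wrapper instead of gcc
    make clean
    make CC=afl-gcc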
Monitoring the execution
Prior to the execution I had to create two folders used by the fuzzers to sync their jobs: in and out. Into the in folder I placed the base test case used by the genetic algorithm to generate inputs: a single, valid XML file. The next step was to launch the three separate fuzzer instances and let them do their jobs.
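The command I used looked roughly like this (a sketch: the parser's exact arguments are an assumption here, and @@ is the placeholder that AFL replaces with the path of the current test case):

    afl-fuzz -i in -o out -M master -- ./XMLparser @@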
This was repeated three times (varying only -M to -S and the fuzzer name for the two secondary instances). I used tmux to run and monitor the three instances of afl-fuzz.
The beautiful, retro-style UI shows the execution status of each fuzzer. To understand what is going on, we will concentrate on the master fuzzer; I will describe the most important parts of the status screen, referring to the next figure.
- Process timing contains basic and intuitive information about the fuzzer's timings. The most recent findings can be paths, crashes and hangs. The last new path is the most recent test case that triggered a new execution pattern inside the program. It is a good idea to keep an eye on this metric to know when to stop: if no new paths are found for a while, a warning will appear in this section.
- In the overall results section a particularly important field is cycles done, which counts the number of queue passes completed so far. Like many others, this field is color-coded, and it turns green once the fuzzer has gone through several new cycles without finding any fault.
- Cycle progress keeps track of the current cycle, showing the ID of the test case the fuzzer is currently working on. In the out/fuzzerID/crashes/ folder, each file contains a reference to the test case ID that caused the crash.
- To know exactly what the fuzzer is doing right now, we can look at stage progress. The first line shows the current stage of test case generation, which can be one of several techniques used by the AFL algorithm to obtain interesting test cases (e.g. havoc, splice, trim). For more information about the different stages, check out the official docs².
- Findings in depth gives more insight into the findings. Particularly interesting are the total crashes and timeouts.
- The path geometry section shows the pending inputs that haven't been fuzzed yet. The pend favs are the favoured inputs that the fuzzer will try to get through within the current queue cycle (the non-favoured ones may have to wait more cycles). Imported counts the paths imported from the other fuzzers running in parallel.
- Stability indicates whether the program always behaves in the same way for the same input data; if it does, the value is 100%. As long as the value stays purple-colored, the fuzzing process is fine.
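To get a quick aggregate view of all the parallel instances without attaching to each status screen, AFL also ships with the afl-whatsup helper, which summarises the stats of every fuzzer found in the shared sync directory. A minimal sketch, assuming out is the sync directory used above:

    # summary of all fuzzers writing to the out/ sync directory
    afl-whatsup out/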
Crash triage
Before exploring the last phase of fuzzing, let’s recap what was found in the execution phase. The fuzzers ran for 6 days each, resulting in 675 unique crashes after 364 million executions. The next image shows the final stats of the fuzzing process.
AFL provides a tool (afl-plot) that automatically generates summary plots, given the final output folder of a fuzzer. Here you can see one of the plots generated, showing the distribution of the findings for the master fuzzer. Most of the crashes were found in the first hour; it then took several days to find the remaining 50 crashes.
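afl-plot needs gnuplot installed and takes a fuzzer's state directory plus a destination folder for the generated graphs. A usage sketch, assuming the master instance wrote its state to out/master (directory names are from my setup):

    # generate the progress graphs for the master fuzzer
    afl-plot out/master graphs/master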
Crash triage involves examining each crash discovered by a fuzzer to determine whether it is likely due to a vulnerability and, if so, what its root cause is. Reviewing each crash in detail can be very time-consuming, especially if the fuzzer has identified tens or hundreds of crashes; in my case AFL reported almost 700 unique crashes. Even though it goes beyond the scope of this article, I thought it would be a good idea to investigate a bit further the inputs that were found to cause crashes in XMLParser. There are several tools to either manually or automatically reproduce and investigate the errors (more on this can be found in Josiah's blog post⁴):
- gdb. The famous GNU debugger is very useful for analysing the crash scenario and looking for possible ways to manipulate memory addresses in order to (for example) attempt a buffer overflow on the target program. Particularly useful was the gef plug-in for gdb;
- exploitable. A great plug-in for gdb that automatically tries to determine whether a crash is likely to be exploitable. It provides a classification system to label each crash. Of course it does not guarantee exploitability, but it helps the human tester prioritize the analysis (a usage sketch follows this list);
- Crashwalk. This tool is useful for automating the classification of the crashes. It uses exploitable and creates a report with the exploitability status of each crash.
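As a sketch of how a single crash can be triaged with gdb and exploitable (the crash file name, the path to exploitable.py and the parser's arguments are assumptions; adapt them to your own output directory):

    # replay one crashing input under gdb and ask exploitable for a classification
    gdb -q \
        -ex 'source ~/exploitable/exploitable/exploitable.py' \
        -ex 'run out/master/crashes/id_000000' \
        -ex 'exploitable' \
        -ex 'quit' \
        ./XMLparser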
Another possible way to investigate the results of fuzzing is using AFL's built-in crash exploration mode. Let me use lcamtuf's description⁵ of this feature:
[…] you take a crashing test case and give it to afl-fuzz as a starting point for the automated run. The fuzzer then uses its usual feedback mechanisms and genetic algorithms to see how far it can get within the instrumented codebase while still keeping the program in the crashing state. Mutations that stop the crash from happening are thrown away; so are the ones that do not alter the execution path in any appreciable way.
The resulting corpus of different, yet correlated, crashes can be used to determine the degree of control that the tester has over the faulting address.
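Crash exploration mode is enabled with the -C flag of afl-fuzz. A minimal sketch, reusing one of the crashing inputs found earlier as the seed (directory names and the parser's arguments are assumptions):

    # seed the run with a single crashing test case
    mkdir crash_seed
    cp out/master/crashes/id_000000 crash_seed/

    # let AFL mutate around the crash while keeping the program in a crashing state
    afl-fuzz -C -i crash_seed -o crash_exploration -- ./XMLparser @@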
Conclusions
This article is a collection of things I learned while playing with fuzzing in a tool analysis that I performed for the course of ICT Risk Assessment and Management at the University of Pisa. The fuzzing experience was fun and interesting. In the process I got to study and experiment with several different security (and other) tools.
The chosen target program yielded a huge number of crashes, possibly due to its parallel nature and the fact that the project was developed by a single programmer, evidently without enough use of security-checking tools.
Initially the number of fuzzers was four, to fully exploit the 4 vCPUs provided by the server. However, I quickly noticed that, due to the parallel execution of the fuzzing target (XMLparser), the performance of the fuzzers was poor, dropping as low as a few tens of executions per second. Reducing the number of secondary fuzzers from 3 to 2 mitigated the issue and led to better performance.
AFL is a great tool for beginners (as I am myself) to jump into the world of fuzzing. I encourage you to try it out yourself as you will learn a lot in the process, and it’s also a lot of fun to crash things in general!
Notes
To perform the fuzzing analysis I used a server hosted on Microsoft Azure with the following specifications: Linux system (Ubuntu 18.04) with 4 vCPUs and 16 GiB of RAM. The version of AFL used was 2.57b, and the gdb version was 8.1.1. Before renting a server for fuzzing you have to be sure that the provider's terms of use allow you to do so. Most cloud providers, in fact, base their business model on the ability to share CPU time among different tenants. Since fuzzing easily takes up 100% of the CPU, you risk being blocked by the service provider if this is not permitted by the contract. As you can see from the following figure, the average CPU and disk usage is quite high. The number of disk write operations is proportional to the number of executions per second performed by the fuzzers.
Fuzzing on a machine using consumer SSD storage might not be a good idea, since the output directory is written to very often and this, in the long term, could damage the drive. A good alternative is to use tmpfs, which uses RAM instead of the disk, introducing, however, a data persistence issue. To monitor the CPU/disk stats on Linux systems you can use the tool iostat.
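A sketch of both ideas (the mount point and size are arbitrary choices; out is the fuzzers' sync directory used earlier):

    # keep the heavily written sync directory in RAM instead of on the SSD
    mkdir -p out
    sudo mount -t tmpfs -o size=512M tmpfs ./out

    # report CPU usage and extended per-device disk stats every 5 seconds
    iostat -xm 5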
[1] Michael Sutton, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional, 2007.
[2] AFL official documentation. https://afl-1.readthedocs.io/en/latest/
[3] AFL user guide for parallel fuzzing. https://afl-1.readthedocs.io/en/latest/user_guide.html#parallel-fuzzing
[4] Josiah Pierce. Introduction to triaging fuzzer-generated crashes. https://trustfoundry.net/introduction-to-triaging-fuzzer-generated-crashes/
[5] Michał Zalewski. afl-fuzz: crash exploration mode. https://lcamtuf.blogspot.com/2014/11/afl-fuzz-crash-exploration-mode.html