Who corrupted the data! Get a fast and precise answer with the taint
In vulnerability analysis a frequent question that needs answering is: “who corrupted this data?”.
Timeless Debugging and Analysis (TDnA) systems like REVEN can provide fast and accurate answers to this particular question. For example, the Memory History feature of REVEN lets the user see the entire list of accesses to a given memory location, and so find which access caused the corruption.
Using REVEN, we can get an even simpler way to find “who” corrupted some data: tainting the corrupted data backward will allow us to go back to the input buffer that caused the corruption, without ever having to manually look for memcopies or manipulations of the data in CPU registers. Using the taint is not only simpler and faster, it also leaves less room for error than the manual way as it is more systematic.
The taint is one of the favorite features of our users and allegedly the most powerful one in REVEN, with use cases such as instantaneously going back from a crash to the input file causing it. Its implementation represents several man-years of effort and relies heavily on the timeless nature of REVEN, both for allowing backward taint analysis and for important performance optimizations.
This article gives an overview of the taint feature in REVEN, and also provides some insight as to how it works under the hood.
Contents
- Taint basics
- Taint under the hood
- Validating the taint
- Building applications on the taint
- Conclusion
Taint basics
Taint analysis works by marking some inputs as tainted, and then propagating this information, by applying a propagation semantics to the instructions manipulating the tainted data. The output of the algorithm is a list of the instructions where the tainted state of some data changed, or where tainted data was used (useful for “slicing” a program according to some tainted data). In REVEN, the taint performs data flow propagation, the inputs can be x86_64 registers and ranges of memory, and the propagation direction can be either forward or backward.
Forward taint is answering the question: “in the future, what registers and memory locations have their value depend on the initial tainted data?”, while backward taint is more like answering the question “where does the value in this location come from in the past?”, with the end of the output list typically containing the origin of the input data, such as the input buffer or file that caused the corruption.
The remainder of this section provides some simple examples of taint propagation.
Forward register propagation
; Start of propagation
; input: rax tainted
mov rbx, rax
; output: rax and rbx tainted
; End of propagation
Since this instruction moves the contents of rax into rbx, the value of rbx is directly dependent on the value of rax.
Forward memory propagation
; Start of propagation
; input: rax tainted
add qword ptr [r8+0x10], rax
; rax, [ds:0xfffff80002802cd0 ; 8] tainted
mov ebx, dword ptr [r8+0x14]
; output: rbx[0; 4], rax, [ds:0xfffff80002802cd0 ; 8]
; End of propagation
For memory accesses, the dereferenced address is resolved to its runtime address, here ds:0xfffff80002802cd0.
The corresponding range of 8 bytes becomes tainted after the addition, which makes it depend on both its initial value and on rax.
Then, the second instruction taints ebx from the last 4 tainted bytes of the range [ds:0xfffff80002802cd0 ; 8]. Note that the taint in REVEN is able to track taint state at the byte level.
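As a rough illustration, the forward rules above can be modelled in a few lines of Python. This is a toy sketch and not REVEN's implementation: the helpers `loc`, `forward_mov` and `forward_add` are hypothetical, a location is modelled as a `(name, byte offset)` pair, and the rule for `add` is deliberately conservative (any tainted input byte taints the whole destination).

```python
# Toy forward taint at byte granularity. Hypothetical model, not REVEN's code.

def loc(name, size, start=0):
    """Return the set of byte locations name[start : start + size]."""
    return {(name, i) for i in range(start, start + size)}

def forward_mov(tainted, dst, src):
    """dst = src: each dst byte is tainted iff the matching src byte is."""
    dst_bytes, src_bytes = sorted(dst), sorted(src)
    result = set(tainted) - set(dst_bytes)      # dst is overwritten
    for d, s in zip(dst_bytes, src_bytes):
        if s in tainted:
            result.add(d)
    return result

def forward_add(tainted, dst, src):
    """dst = dst + src: carries mix bytes, so any tainted input taints all of dst."""
    if tainted & (dst | src):
        return set(tainted) | dst
    return set(tainted)

MEM = "0xfffff80002802cd0"                      # runtime address of [r8+0x10]

state = loc("rax", 8)                           # input: rax tainted
# add qword ptr [r8+0x10], rax
state = forward_add(state, loc(MEM, 8), loc("rax", 8))
# mov ebx, dword ptr [r8+0x14]  (zero-extension of the upper half of rbx is
# ignored here, since rbx was not tainted beforehand)
state = forward_mov(state, loc("rbx", 4), loc(MEM, 4, start=4))
```

At the end, `state` holds the 8 bytes of rax, the 8 bytes of the memory range and the low 4 bytes of rbx, which matches the output line of the listing above.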
Backward propagation
; End of propagation (start at the last instruction in backward)
; output: rbx, [ds:0xfffff80002802cd0 ; 8] tainted
add rbx, qword ptr [r8+0x10]
; rbx tainted
mov rax, rbx
; input: rax tainted
; Start of propagation
The first thing to notice in backward is that we start propagation from the final instruction, and go back up to the first.
Second, the propagation rules are a bit different from the forward ones. In the instruction mov rax, rbx, rax takes its value from rbx. That means that the value of rax after this instruction depends solely on the value of rbx, which is why, when propagating backward through it, rbx becomes tainted and rax loses the taint.
In the add rbx, qword ptr [r8+0x10] instruction, the value of rbx depends on both the value of rbx prior to the addition, and on the value at address ds:0xfffff80002802cd0, hence the final result.
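The backward rules can be sketched in the same spirit. Again this is only a toy model under simplifying assumptions: whole registers and memory cells are tracked instead of bytes, and `backward_mov` and `backward_add` are hypothetical names.

```python
# Toy backward taint on whole registers / memory cells. Hypothetical model.

def backward_mov(tainted, dst, src):
    """Undo `dst = src`: a dst that is tainted afterwards got its value from src."""
    if dst not in tainted:
        return set(tainted)
    return (set(tainted) - {dst}) | {src}

def backward_add(tainted, dst, src):
    """Undo `dst = dst + src`: a tainted dst comes from both the old dst and src."""
    if dst not in tainted:
        return set(tainted)
    return set(tainted) | {src}

MEM = "[ds:0xfffff80002802cd0 ; 8]"

state = {"rax"}                               # start: rax tainted after the last instruction
state = backward_mov(state, "rax", "rbx")     # mov rax, rbx
after_mov = set(state)                        # only rbx is tainted at this point
state = backward_add(state, "rbx", MEM)       # add rbx, qword ptr [r8+0x10]
```

Walking the two instructions from last to first leaves rbx and the memory range tainted, as in the listing above.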
OK, this gives us a basic understanding of what taint propagation is like. Now, how do we implement such an algorithm so that it can run on real x86 instruction traces of several billions of instructions? Let’s find out…
Taint under the hood
Let’s take a look at the big picture first. The taint algorithm can be split in 3 main steps, described below:
- Sequencing: Determine the next sequence of transitions (“transitions” generalizes the concept of “CPU instruction” and encompasses executed instructions, but also page faults and other interrupts or faults.) from the execution trace on which taint should be propagated. Most of the time, this is the sequence that starts after the end of the previous sequence, and that stops when the sequence roughly corresponds to a basic block. However in some cases there can be jumps in the trace due to memory history optimization (more on that in a later section), or the taint can stop if nothing is tainted anymore or if we reached the end of the requested range of transitions.
- Lifting: Generate LLVM instructions from the sequence of transitions. Each x86_64 instruction in the sequence is lifted to multiple LLVM instructions using the excellent open-source library remill (in the process, Tetrane contributed to remill the lifting of some instructions such as iretq or some variants of PADDUSB/PADDUSW). Lifted instructions are further optimized using LLVM’s optimizer, to improve the taint accuracy. Already seen sequences are fetched from the cache to avoid lifting and optimizing again, as those are CPU-intensive operations.
- Propagating: On each LLVM instruction produced from lifting the sequence of x86_64 instructions, apply our propagation algorithm. The algorithm maintains and modifies the state of what is tainted according to the propagation rules we defined.
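To make the three steps more concrete, here is a hypothetical skeleton of such a loop. It is a sketch under strong assumptions: the "trace" is a list of `mov dst, src` strings, `lift` merely parses them instead of producing LLVM, sequences are fixed-size chunks rather than basic blocks, and the cache is a plain `functools.lru_cache`. None of these names come from REVEN.

```python
# Hypothetical skeleton of the sequencing / lifting / propagating loop.
from functools import lru_cache

lift_calls = 0  # counts real (non-cached) lifts, for illustration only

@lru_cache(maxsize=None)
def lift(sequence):
    """Stand-in for lifting + optimizing: parse 'mov dst, src' into (dst, src)."""
    global lift_calls
    lift_calls += 1
    ops = []
    for text in sequence:
        dst, src = text[len("mov "):].split(", ")
        ops.append((dst, src))
    return tuple(ops)

def sequences(trace, size):
    """Sequencing: cut the trace into consecutive chunks."""
    for start in range(0, len(trace), size):
        yield tuple(trace[start:start + size])

def run_taint(trace, tainted, size=2):
    """Forward taint over the whole trace, one sequence at a time."""
    tainted = set(tainted)
    for sequence in sequences(trace, size):
        if not tainted:                      # nothing tainted anymore: stop
            break
        for dst, src in lift(sequence):      # served from the cache when already seen
            if src in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)
    return tainted

trace = ["mov rbx, rax", "mov rcx, rbx"] * 3
result = run_taint(trace, {"rax"})
```

Because the three sequences of this trace are identical, only the first one is actually lifted; the other two are cache hits.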
What does the propagation algorithm look like? Before answering this question, let’s take a detour through the LLVM optimization engine.
Exploiting LLVM’s optimization engine
During the lifting step, we also apply some optimizations to the produced LLVM. This is to improve the accuracy of the propagation. For a simple example of a case where optimization helps, consider the following sequence of instructions:
mov rbx, rax
xor rax, rbx
If these instructions are taken separately, then the propagation algorithm is forced to keep any taint on rax, because it cannot assert that rax == rbx. This results in over-taint, which can cause the taint to report spurious links between data that are actually unrelated.
By lifting with the optimizations enabled, the optimizer is able to recognize that the above is actually semantically equivalent to:
mov rax, 0
This optimized version allows the taint to successfully remove any taint on rax and avoid over-tainting.
The optimization engine of LLVM is very powerful (it powers the clang compiler) and is able to simplify very complex cases, from which the taint benefits.
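The effect can be reproduced on a toy model. The sketch below is not how LLVM or REVEN work internally: it assumes a tiny instruction set (`mov` and `xor` on registers), and it stands in for the optimizer by representing each value as an XOR of initial symbols, so that `x ^ x` cancels out. All function names are hypothetical.

```python
# Toy comparison: per-instruction taint vs. taint on simplified values.
# A value is modelled as a frozenset of initial symbols XORed together;
# symmetric difference makes x ^ x disappear. Hypothetical model.

def naive_taint(program, tainted):
    """Propagate instruction by instruction, with no knowledge of values."""
    tainted = set(tainted)
    for op, dst, src in program:
        src_tainted = src in tainted
        if op == "mov":
            tainted.discard(dst)
            if src_tainted:
                tainted.add(dst)
        elif op == "xor" and src_tainted:
            tainted.add(dst)                  # dst also keeps its own taint
    return tainted

def simplified_taint(program, tainted, registers):
    """Simplify the whole sequence first, then look at what each value depends on."""
    value = {r: frozenset([r + "_0"]) for r in registers}
    for op, dst, src in program:
        if op == "mov":
            value[dst] = value[src]
        elif op == "xor":
            value[dst] = value[dst] ^ value[src]
    initial = {r + "_0" for r in tainted}
    return {r for r in registers if value[r] & initial}

program = [("mov", "rbx", "rax"), ("xor", "rax", "rbx")]
naive = naive_taint(program, {"rax"})
precise = simplified_taint(program, {"rax"}, ["rax", "rbx"])
```

The naive pass keeps rax tainted, while the simplified one sees that rax ends up as a constant and only reports rbx.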
Propagation algorithm example
The propagation algorithm uses a classic LLVM instruction visitor.
Here is a (simplified) example of the propagation algorithm on an LLVM instruction:
/// Instruction of the form `x = v[i]`.
///
/// Either `i` is statically known, in which case the markers of `v[i]` are propagated to `x`,
/// or `i` is determined at runtime, in which case the markers of `forall j, v[j]` must be propagated to `x`.
void TaintPropagationVisitor::visitExtractElementInst(llvm::ExtractElementInst& inst)
{
    auto dest_size = get_type_size_bytes(layout_, inst.getVectorOperandType()->getElementType());
    auto index = inst.getIndexOperand();
    auto* maybe_const_index = llvm::dyn_cast<llvm::ConstantInt>(index);
    auto* vector_operand = inst.getVectorOperand();
    if (maybe_const_index) {
        const auto const_index = maybe_const_index->getValue().trunc(64).getZExtValue();
        // v[const_index] to inst
        taint_map().propagate_marker({vector_operand, const_index * dest_size, (const_index + 1) * dest_size}, &inst);
    } else {
        // all of v is propagated to inst
        const auto count = inst.getVectorOperandType()->getNumElements();
        for (std::size_t i = 0; i < count; ++i) {
            // v[i] to inst
            taint_map().propagate_marker({vector_operand, i * dest_size, (i + 1) * dest_size}, &inst);
        }
    }
}
In the example above, “markers” are used to indicate which piece of data is tainted. The taint in REVEN actually supports propagating an arbitrary number of markers (although the Axion GUI and the current API in reven2.preview.taint only support 2 markers), which makes it possible, for instance, to taint each byte of an input buffer with a different marker in order to determine what part of the tainted data comes from each byte of the input buffer after the propagation.
The example performs forward propagation through an LLVM vector instruction that extracts an element from a vector by indexing it. When the index is constant, we can use it in the taint to propagate only the markers of that precise element of the vector.
However, when the index is a runtime value, the best we can do is propagate the markers of each element in the vector to the produced value. If in reality there are stricter conditions on the index that restrict the values it can take at runtime, then this may produce over-taint. In practice though, this particular cause of over-tainting was not often observed.
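The same rule can be written as a short Python sketch with several markers. This is an illustrative model only: `extract_element_markers` is a hypothetical function, and the vector is represented as a list holding, for each element, the set of markers it carries.

```python
# Toy multi-marker version of the `x = v[i]` rule. Hypothetical model:
# `markers` holds, for each vector element, the set of markers it carries.

def extract_element_markers(markers, index=None):
    """Markers reaching x for x = v[index]; index=None means 'only known at runtime'."""
    if index is not None:
        return set(markers[index])            # precise: only the markers of v[index]
    result = set()
    for element in markers:                   # conservative: union over all v[j]
        result |= element
    return result

# One marker per input byte; elements 1 and 3 are untainted.
vector = [{"byte0"}, set(), {"byte1", "byte2"}, set()]

constant_index = extract_element_markers(vector, 1)
runtime_index = extract_element_markers(vector)
```

With the constant index 1 the result carries no marker at all, whereas an unknown index has to carry the markers of every element, which is where over-taint can appear.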
Exploiting Memory History as a performance optimization
Iterating over and lifting x86_64 instructions, and then optimizing and propagating on LLVM instructions, are slow, intensive operations. When only memory (no register) is tainted, we can actually do better in REVEN. Using the Memory History in the sequencing step of the taint algorithm, we can request the transition where the tainted memory will be accessed next, and skip lifting the entire portion of the trace up until the tainted memory is accessed. This sometimes makes it possible to skip billions of transitions, and is instrumental in the great performance of the taint.
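The sequencing decision described here can be sketched as follows, for the forward direction. The data layout is an assumption made for the example: `history` is a hypothetical dictionary mapping an address to the sorted list of transition ids that access it, which is only a stand-in for what the Memory History provides.

```python
# Toy "skip ahead" decision for forward taint. Hypothetical data layout:
# history maps an address to the sorted transition ids that access it.
import bisect

def next_access(history, address, current):
    """First transition strictly after `current` that accesses `address`, or None."""
    accesses = history.get(address, [])
    position = bisect.bisect_right(accesses, current)
    return accesses[position] if position < len(accesses) else None

def next_transition(history, tainted_memory, tainted_registers, current):
    """Step by one while registers are tainted, otherwise jump to the next access."""
    if tainted_registers:
        return current + 1                    # registers can change at any instruction
    candidates = [next_access(history, a, current) for a in tainted_memory]
    candidates = [c for c in candidates if c is not None]
    return min(candidates) if candidates else None   # None: taint can stop

history = {
    0x1000: [5, 2_000_000_000],
    0x2000: [40, 3_000_000_000],
}

jump = next_transition(history, {0x1000}, set(), 5)
```

Here the only tainted address is next accessed about two billion transitions later, so nothing in between needs to be lifted.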
Validating the taint
From the start, validating that the taint gives correct results was a priority of ours. To do so we worked on two avenues:
1. Testing the taint.
2. Designing the taint so that it can self-report inconsistent propagation events.
Testing the taint
For (1), we manually crafted more than 230 unit tests, such as the following:
BOOST_AUTO_TEST_CASE(test_by_inst_2)
{
    make_monomarker_taint_by_inst</* forward */ true>({amd64::rax},
        "xor rbx, rax\n" // rbx = rbx_0 ^ rax
        "xor rdx, rax\n" // rdx = rdx_0 ^ rax
        "mov rcx, rax\n" // rcx == rax
        "xor rbx, rcx\n" // rbx = rbx_0
        "mov rdx, rcx\n" // rdx == rax
        "sub rdx, rax\n" // rdx = 0
        "div rax",       // rax = (rdx:rax) / rax = rax / rax = 1
        {
            // expected taint state after each of the 7 instructions above
            {amd64::rax, amd64::rbx, amd64::sf, amd64::zf, amd64::pf},
            {amd64::rax, amd64::rbx, amd64::rdx, amd64::sf, amd64::zf, amd64::pf},
            {amd64::rax, amd64::rbx, amd64::rcx, amd64::rdx, amd64::sf, amd64::zf, amd64::pf},
            {amd64::rax, amd64::rcx, amd64::rdx},
            {amd64::rax, amd64::rcx, amd64::rdx},
            {amd64::rax, amd64::rcx},
            {amd64::rcx}
        });
}
In this example, we test the taint in the forward direction, with rax tainted initially, on a set of instructions starting with xor rbx, rax, and we compare the resulting taint state after each instruction to the expected taint state passed as the last parameter to make_monomarker_taint_by_inst.
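The idea of comparing the taint state after each instruction can be illustrated with a small Python analogue. It is not the test helper used in REVEN: `step` and `check_by_inst` are hypothetical, and the model only knows `mov` on whole registers.

```python
# Hypothetical Python analogue of a "by instruction" taint test.

def step(tainted, instruction):
    """Forward rule for `mov dst, src` on whole registers only."""
    op, dst, src = instruction
    if op != "mov":
        raise ValueError("this toy model only knows mov")
    src_tainted = src in tainted              # read src before overwriting dst
    result = set(tainted) - {dst}
    if src_tainted:
        result.add(dst)
    return result

def check_by_inst(initial, program, expected):
    """Index of the first instruction whose resulting state is wrong, or None."""
    state = set(initial)
    for index, (instruction, want) in enumerate(zip(program, expected)):
        state = step(state, instruction)
        if state != set(want):
            return index
    return None

program = [("mov", "rbx", "rax"), ("mov", "rax", "rcx"), ("mov", "rbx", "rax")]
expected = [{"rax", "rbx"}, {"rbx"}, set()]

first_mismatch = check_by_inst({"rax"}, program, expected)
```

Checking every intermediate state, rather than only the final one, pinpoints the instruction at which a propagation rule goes wrong.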
We also built integration tests, notably to validate inter-process tainting such as demonstrated in the “tokio-chat” article, where we use the taint to follow messages between a server and clients running locally.
Taint self report
For (2), we equipped the taint with the ability to report warnings to users in various situations, such as when encountering an instruction that we are not able to lift or propagate through just yet.
The warnings can be found in the Warnings tab of the Taint widget, and the ones that occur in the same sequence of transitions as a change to the taint will also be reported with a warning sign in the list of changes in the taint widget.
The list of warnings can also be accessed programmatically through the API for scripting purposes.
Building applications on the taint
The taint provided by REVEN offers an API that can be used as a building block for more advanced algorithms.
- By directly using the API, we are able to build a taint that follows tainted data at the resolution of each process:
# display taint result each time we change to a different process
process = None
table = ""
# iterate over all changes in tainted data
for change in taint.changes().all():
    new_process = change.transition.context_before().ossi.process()  # get current process
    if process is None or new_process.pid != process.pid:
        # we changed process
        table += table_line(["#{}".format(change.transition.id), new_process, read_tainted_memory(change)])
        process = new_process
display_table(
    title="Process changes for the forward taint of 'Hello!'",
    headers=["Transition", "Process", "Tainted memory"],
    html_lines=table,
)
See more in the notebook
- Taint the pointer resulting from an allocation in order to track uses of that pointer and its aliases, which is used to detect use-after-free (UAF) and buffer overflow (BoF) vulnerabilities.
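To give an idea of how the second application can work, here is a toy use-after-free detector. It does not use the REVEN API: the trace is a hypothetical list of `alloc`, `copy`, `free` and `deref` events on registers, and the "taint" is simply the set of locations currently holding the allocated pointer.

```python
# Toy use-after-free detection by tracking the aliases of an allocated pointer.
# The event format is hypothetical and only meant for illustration.

def find_use_after_free(events):
    """Return the indices of dereferences of the tracked pointer made after its free."""
    aliases = set()          # locations holding the (tainted) pointer
    freed = False
    findings = []
    for index, event in enumerate(events):
        kind = event[0]
        if kind == "alloc":              # ("alloc", dst): taint the returned pointer
            aliases = {event[1]}
            freed = False
        elif kind == "copy":             # ("copy", dst, src): taint follows the copy
            _, dst, src = event
            if src in aliases:
                aliases.add(dst)
            else:
                aliases.discard(dst)     # dst is overwritten with something else
        elif kind == "free":             # ("free", src)
            if event[1] in aliases:
                freed = True
        elif kind == "deref":            # ("deref", src)
            if freed and event[1] in aliases:
                findings.append(index)
    return findings

events = [
    ("alloc", "rax"),
    ("copy", "rbx", "rax"),   # rbx becomes an alias of the pointer
    ("deref", "rbx"),         # fine: not freed yet
    ("free", "rax"),
    ("copy", "rax", "rcx"),   # rax is overwritten, rbx still dangles
    ("deref", "rbx"),         # use after free
]
findings = find_use_after_free(events)
```

The interesting case is the last event: the register that was passed to `free` has been overwritten, but the alias in rbx is still tracked, so its dereference is reported.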
Conclusion
The taint in REVEN can be used both to directly analyze vulnerabilities faster and to implement higher level algorithms such as UAF or BoF vulnerability detection.
We look forward to delivering higher value tools to reverse engineers and developers with the taint and other exciting upcoming features in REVEN (who said “a call tree view”?)!
In particular, for the taint, there are at least two directions that we could pursue to bring the feature to the next level:
- Providing a graph of dependencies between tainted values. Currently the taint “flattens” dependencies by just retaining the current state of the taint during propagation. By additionally saving which data was responsible for tainting other data at each point, we could provide REVEN’s users with a directed acyclic graph (DAG) of the taint propagation. Such a DAG has useful applications such as visualizing at a glance how the data flows, and granting users the ability to “cut” unwanted edges in the graph, with the benefit of removing noise and focusing on the path of propagation that is of interest to them.
- Giving access to the lifted LLVM to users. This feature would allow users to build their own algorithms on top of the generated LLVM.
Thank you for reading this article! What would you like to do with a taint such as REVEN’s?