Who corrupted the data! Get a fast and precise answer with the taint


Feb 18, 2021
by Louis
Categories: REVEN -
Tags: REVEN - Reverse Engineering - Taint -




In vulnerability analysis a frequent question that needs answering is: “who corrupted this data?”.

Timeless Debugging and Analysis (TDnA) systems like REVEN can provide fast and accurate answers to this particular question. For example, the Memory History feature of REVEN allows to see the entire list of accesses to a given memory location, so that the user can find which access caused the corruption.

Memory History

Using REVEN, we can get an even simpler way to find “who” corrupted some data: tainting the corrupted data backward will allow us to go back to the input buffer that caused the corruption, without ever having to manually look for memcopies or manipulations of the data in CPU registers. Using the taint is not only simpler and faster, it also leaves less room for error than the manual way as it is more systematic.


The taint is one of the favorite feature of our users and alledgedly the most powerful one in REVEN, with use cases such as instantaneously going back from a crash to the input file causing it. Its implementation represents several man-years of effort and relies heavily on the timeless nature of REVEN, for allowing backward taint analysis as well as for important performance optimizations.

This article gives an overview of the taint feature in REVEN, and also provides some insight as to how it works under the hood.

Contents

Taint basics

Taint analysis works by marking some inputs as tainted, and then propagating this information, by applying a propagation semantics to the instructions manipulating the tainted data. The output of the algorithm is a list of the instructions where the tainted state of some data changed, or where tainted data was used (useful for “slicing” a program according to some tainted data). In REVEN, the taint performs data flow propagation, the inputs can be x86_64 registers and ranges of memory, and the propagation direction can be either forward or backward.

Forward taint is answering the question: “in the future, what registers and memory locations have their value depend on the initial tainted data?”, while backward taint is more like answering the question “where does the value in this location come from in the past?”, with the end of the output list typically containing the origin of the input data, such as the input buffer or file that caused the corruption.

The remainder of this section provides some simple examples of taint propagation.

Forward register propagation

; Start of propagation
; input: rax tainted
mov rbx, rax
; output: rax and rbx tainted
; End of propagation

Since this instruction moves the contents of rax into rbx, the value of rbx is directly dependent on the value of rax.

Forward memory propagation

; Start of propagation
; input: rax tainted
add qword ptr [r8+0x10], rax
; rax, [ds:0xfffff80002802cd0 ; 8] tainted
mov ebx, dword ptr [r8+0x14]
; output: rbx[0; 4], rax, [ds:0xfffff80002802cd0 ; 8]
; End of propagation

For memory accesses, the dereferenced address is resolved to its runtime address, here ds:0xfffff80002802cd0. The corresponding range of 8 bytes becomes tainted after the addition which makes it depend on both its initial value and on rax.

Then, the second instruction taints ebx from the 4 last tainted bytes of the range [ds:0xfffff80002802cd0 ; 8]. Note that the taint in REVEN is able to track taint state at the byte level.

Backward propagation

; End of propagation (start at the last instruction in backward)
; output: rbx, [ds:0xfffff80002802cd0 ; 8] tainted
add rbx, qword ptr [r8+0x10]
; rbx tainted
mov rax, rbx
; input: rax tainted
; Start of propagation

The first thing to notice in backward is that we start propagation from the final instruction, and go back up to the first.

Second, the propagation rules are a bit different as in forward, as in the instruction mov rax, rbx, rax takes its value from rbx. That means that the value of rax after propagating through this instruction depends solely on the value of rbx, which is why rbx becomes tainted and rax loses the taint.

In the add rbx, qword ptr [r8+0x10] instruction, the value of rbx depends on both the value of rbx prior to the addition, and on the value at address ds:0xfffff80002802cd0, hence the final result.

OK, this gives us a basic understanding of what taint propagation is like, now, how do we implement such an algorithm so that it can run on real x86 instruction traces of several billions of instructions? Let’s find out…

Taint under the hood

Let’s take a look at the big picture first. The taint algorithm can be split in 3 main steps, described below:

  1. Sequencing: Determine the next sequence of transitions (“transitions” generalizes the concept of “CPU instruction” and encompasses executed instructions, but also page faults and other interrupts or faults.) from the execution trace on which taint should be propagated. Most of the time, this is the sequence that starts after the end of the previous sequence, and that stops when the sequence roughly corresponds to a basic block. However in some cases there can be jumps in the trace due to memory history optimization (more on that in a later section), or the taint can stop if nothing is tainted anymore or if we reached the end of the requested range of transitions.
  2. Lifting: Generate LLVM instructions from the sequence of transitions. Each x86_64 instruction in the sequence is lifted to multiple LLVM instructions using the excellent open-source library remill (in the process, Tetrane contributed to remill the lifting of some instructions such as iretq or some variants of PADDUSB/PADDUSW). Lifted instructions are further optimized using LLVM’s optimizer, to improve the taint accuracy. Already seen sequences are fetched from the cache to avoid lifting and optimizing again, as those are cpu-intensive operations.
  3. Propagating: On each LLVM instruction produced from lifting the sequence of x86_64 instructions, apply our propagation algorithm. The algorithm maintains and modifies the state of what is tainted according to the propagation rules we defined.

What does the propagation algorithm look like? Before answering this question, let’s take a detour through the LLVM optimization engine.

Exploiting LLVM’s optimization engine

During the lifting step, we also apply some optimizations to the produced LLVM. This is to improve the accuracy of the propagation. For a simple example of a case where optimization helps, consider the following sequence of instructions:

mov rbx, rax
xor rax, rbx

If these instructions are taken separately, then the propagation algorithm is forced to keep any taint on rax, because it cannot assert that rax == rbx. This results in over-taint, which can cause the taint to report spurious links between data that are actually unrelated.

By lifting with the optimizations enabled, the optimizer is able to recognize that the above is actually semantically equivalent to:

mov rax, 0

This optimized version allows the taint to successfully remove any taint on rax and avoid over-tainting.

The optimization engine of LLVM is very powerful (it powers the clang compiler) and allows to simplify very complex cases, from which the taint benefits.

Propagation algorithm example

The propagation algorithm uses a classic LLVM instruction visitor.

Here is a (simplified) example of the propagation algorithm on an LLVM instruction:

/// Instruction of the form `x = v[i]`.
///
/// Either `i` is statically known, in which case the markers of `v[i]` are propagated to `x`,
/// or `i` is determined at runtime, in which case the markers of `forall j, v[j]` must be propagated to `x`.
void TaintPropagationVisitor::visitExtractElementInst(llvm::ExtractElementInst& inst)
{
	auto dest_size = get_type_size_bytes(layout_, inst.getVectorOperandType()->getElementType());
	auto index = inst.getIndexOperand();
	auto* maybe_const_index = llvm::dyn_cast<llvm::ConstantInt>(index);
	auto* vector_operand = inst.getVectorOperand();
	if (maybe_const_index) {
		const auto const_index = maybe_const_index->getValue().trunc(64).getZExtValue();
		// v[const_index] to inst
		taint_map().propagate_marker({vector_operand, const_index * dest_size, (const_index + 1) * dest_size}, &inst);
	} else {
		// all of v is propagated to inst
		const auto count = inst.getVectorOperandType()->getNumElements();
		for (std::size_t i = 0; i < count; ++i) {
			// v[i] to inst
			taint_map().propagate_marker({vector_operand, i * dest_size, (i + 1) * dest_size}, &inst);
		}
	}
}

In the example above, “markers” are used to indicate which piece of data is tainted. The taint in REVEN actually supports propagating an arbitrary number of markers (although the Axion GUI and the current API in reven2.preview.taint only supports 2 markers), which allows for instance to taint each byte of an input buffer with a different marker in order to determine what part of the tainted data comes from each byte of the input buffer after the propagation.

The example allows forward propagation in a LLVM vector instruction that extracts an element from a vector by indexing it. When the index is constant, we can use it in the taint to propagate only the markers of that precise element of the vector.

However, when the index is a runtime value, the best we can do is propagate the markers of each element in the vector to the produced value. If in reality there are stricter conditions on the index that restricts the value it can take at runtime, then this may produce over-taint. In practice though, this particular cause of over-tainting was not often observed.

Exploiting Memory History as a performance optimization

Iterating, lifting on x86_64 instructions, and then optimizing and propagating on LLVM instructions are slow, intensive operations. When only memory (no register) is tainted, we can actually do better in REVEN. Using the Memory History in the sequencing step of the taint algorithm, we can request the transition where the tainted memory will be accessed next, and skip lifting the entire portion of the trace up until the tainted memory is accessed. This sometimes allows to skip billions of transitions, and is instrumental in the great performance of the taint.

Validating the taint

From the start, validating that the taint gives correct results was a priority of ours. To do so we worked on two avenues:

  1. Testing the taint.
  2. Designing the taint so that it can self-report inconsistent propagation events.

Testing the taint

For (1), we manually crafted more than 230 unit tests, such as the following:

BOOST_AUTO_TEST_CASE(test_by_inst_2)
{
	make_monomarker_taint_by_inst</* forward */ true>({amd64::rax}, "xor rbx, rax\n" // rbx = rbx_0 ^ rax
	                                                                "xor rdx, rax\n" // rdx = rdx_0 ^ rax
	                                                                "mov rcx, rax\n" // rcx == rax
	                                                                "xor rbx, rcx\n" // rbx = rbx_0
	                                                                "mov rdx, rcx\n" // rdx == rax
	                                                                "sub rdx, rax\n" // rdx = 0
	                                                                "div rax", // rax = (rdx:rax) /rax = rax / rax = 1
	                              {
	                                  {amd64::rax, amd64::rbx, amd64::sf, amd64::zf, amd64::pf},
	                                  {amd64::rax, amd64::rbx, amd64::rdx, amd64::sf, amd64::zf, amd64::pf},
	                                  {amd64::rax, amd64::rbx, amd64::rcx, amd64::rdx, amd64::sf, amd64::zf, amd64::pf},
	                                  {amd64::rax, amd64::rcx, amd64::rdx},
	                                  {amd64::rax, amd64::rcx, amd64::rdx},
	                                  {amd64::rax, amd64::rcx},
	                                  {amd64::rcx}
	                              });

In this example, we test the taint in the forward direction, with rax tainted initially, on a set of instructions starting with xor rbx, rax, and we compare the resulting taint state after each instruction to the expected taint state passed as the last parameter to make_monomarker_taint_by_inst.

We also built integration tests, notably to validate inter-process tainting such as demonstrated in the “tokio-chat” article, where we use the taint to follow messages between a server and clients running locally.

Taint self report

For (2), we equipped the taint with the ability to report warnings to users in various situations, such as when encountering an instruction that we are not able to lift or propagate through just yet.

The warnings can be found in the Warnings tab of the Taint widget, and the ones that occur in the same sequence of transitions as a change to the taint will also be reported with a warning sign in the list of changes in the taint widget.

The list of warnings can also be accessed programatically through the API for scripting purposes.

Building applications on the taint

The taint provided by REVEN offers an API, that can be used as a building block for building more advanced algorithms.

  • By directly using the API, we are able to build a taint that follows tainted data at the resolution of each process:
    # display taint result each time we change to a different process
    process = None
    table = ""
    # iterate over all changes in tainted data
    for change in taint.changes().all():
      new_process = change.transition.context_before().ossi.process() # get current process
    
      if process is None or new_process.pid != process.pid: # we changed process
          table += table_line(["#{}".format(change.transition.id),
                               new_process, read_tainted_memory(change)])
          process = new_process
    
    display_table(title="Process changes for the forward taint of 'Hello!'",
                headers=["Transition", "Process", "Tainted memory"],
                html_lines=table)
    

    See more in the notebook

  • Taint the pointer resulting from an allocation in order to track uses of that pointer and its aliases, which is used to detect use-after-free (UAF) and buffer overflow (BoF) vulnerabilities.

Conclusion

The taint in REVEN can be used both to directly analyze vulnerabilities faster or to implement higher level algorithms such as UAF or BoF vulnerability detection.

We look forward to delivering higher value tools to reverse engineers and developers with the taint and other exciting upcoming features in REVEN (who said “a call tree view”?)!

In particular, for the taint, there are at least two directions that we could pursue to bring the feature to the next level:

  1. Providing a graph of dependencies between tainted values.

    Currently the taint “flattens” dependencies by just retaining the current state of the taint during propagation. By additionally saving which data was responsible for tainting other data at each point, we could provide REVEN’s users with a directed acyclic graph (DAG) of the taint propagation. Such a DAG has useful applications such as visualizing at a glance how the data flows, and granting users the ability to “cut” unwanted edges in the graph, with the benefit of removing noise and focusing on the path of propagation that is of interest to them.

  2. Giving access to the lifted LLVM to users.

    This feature would allow users to build their own algorithms on top of the generated LLVM.

Thank you for reading this article! What would you like to do with a taint such as REVEN’s?

Next post: Interprocess Use of Uninitialized Memory detection using REVEN
Previous post: REVEN OpenLab - Feb 18th, 2021