Overview

Introduction
PyDbg
PIDA
Entity Relationship Diagram
Future Development

Introduction

PaiMei is a reverse engineering framework consisting of multiple extensible components. The framework can essentially be thought of as a reverse engineer's Swiss Army knife and has already proven effective for a wide range of both static and dynamic tasks such as fuzzer assistance, code coverage tracking, data flow tracking and more. The framework breaks down into the following core components:

PyDbg: A pure Python win32 debugging abstraction class.
pGRAPH: A graph abstraction layer with separate classes for nodes, edges and clusters.
PIDA: Built on top of pGRAPH, PIDA aims to provide an abstract and persistent interface over binaries (DLLs and EXEs) with separate classes for representing functions, basic blocks and instructions. The end result is the creation of a portable file that when loaded allows you to arbitrarily navigate throughout the entire original binary.

A layer above the core components you will find the remainder of the PaiMei framework broken into the following over-arching components:

Utilities: A set of utilities for accomplishing various repetitive tasks.
Console: A pluggable wxPython GUI for quickly and efficiently rolling out your own sexy RE utilities.
Scripts: Individual scripts for accomplishing various tasks. One very important example is pida_dump.py, an IDA Python script that is run from IDA to generate .PIDA modules.

For a quick example of an advanced creation on top of the PaiMei framework see the PAIMEIpstalker Flash demo.

PyDbg

PyDbg exposes most of the expected debugger functionality and then some: hardware / software / memory breakpoints, process / module / thread enumeration and instrumentation, system DLL tracking, memory reading / writing and intelligent dereferencing, stack and SEH unwinding, exception and event handling, endian manipulation routines, memory snapshot and restore functionality, a disassembly engine (libdasm), and more. The abstracted interface allows for painless development of custom debugger scripts. Diving right into the thick of things, consider the following example snippet:

    from pydbg import *
    from pydbg.defines import *
    
    def handler_breakpoint (dbg):
        # ignore the first windows driven breakpoint.
        if dbg.first_breakpoint:
            return DBG_CONTINUE
    
        # the format arguments must be passed as a single tuple.
        print "ws2_32.recv() called from thread %d @%08x" % (dbg.dbg.dwThreadId, dbg.exception_address)
    
        return DBG_CONTINUE
    
    dbg = pydbg()
    
    # register a breakpoint handler function.
    dbg.set_callback(EXCEPTION_BREAKPOINT, handler_breakpoint)
    
    # XXXXX is a placeholder for the process ID of the target.
    dbg.attach(XXXXX)
    
    # resolve the address of ws2_32.recv() and set a breakpoint on it.
    recv = dbg.func_resolve("ws2_32", "recv")
    dbg.bp_set(recv)
    
    # enter the debug event loop on the instance, not on the class.
    dbg.debug_event_loop()

We attach to a target process, set a breakpoint on ws2_32.recv() and print a message every time the API is called. All in fewer than 15 lines of code. Not too shabby.

PIDA

The crux of the PIDA component is based on IDA / IDA Python, which is used to propagate all of the initial structural data. Once the initial analysis is complete the data can be serialized to and loaded from a zlib compressed file. This allows you to extract all the relevant attributes you are interested in from the IDA database and access them "on the fly" in whatever standalone application you are creating. The usage of a generic Python binary representation allows us to consider dropping the reliance on IDA in the future. To generate a PIDA module currently, run the pida_dump.py IDA Python script after IDA has completed auto-analysis on your target binary.
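The serialize / compress / load round trip described above can be sketched in a few lines of standalone Python. This is only an illustration, not PIDA's actual file format: the use of pickle as the serializer, the helper names dump_compressed() and load_compressed(), and the sample data are all assumptions made for the example.

```python
import os
import pickle
import tempfile
import zlib


def dump_compressed(path, obj):
    """Pickle obj, zlib compress the resulting bytes and write them to path."""
    with open(path, "wb") as fh:
        fh.write(zlib.compress(pickle.dumps(obj, protocol=2)))


def load_compressed(path):
    """Read path, decompress its contents and unpickle the stored object."""
    with open(path, "rb") as fh:
        return pickle.loads(zlib.decompress(fh.read()))


# hypothetical stand-in for the structural data extracted from a binary.
sample = {"name": "target", "functions": {0x00407950: ["push ebp", "mov ebp, esp"]}}

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
dump_compressed(path, sample)
restored = load_compressed(path)
```

Once the file has been written, the consumer only needs the file and a Python interpreter, which is the property that makes a portable module possible.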

The object structure is built on pGRAPH and can be thought of as a graph of graphs. Each component has its own relevant attributes that we won't enumerate here. A module is a graph containing functions as nodes with the edges between the nodes representing the intramodular calls. Each function is a graph as well as a node. The nodes of a function are the basic blocks that it consists of (including chunked blocks). Each basic block is a node that contains a list of instructions. Finally, individual instructions are not graph objects but rather a simple struct with various attributes.
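To make the "graph of graphs" idea concrete, here is a minimal standalone sketch of the nesting. It does not use pGRAPH or PIDA; the class names, the add_node() / add_edge() / add_instruction() helpers, the edges set and the ea attribute are assumptions made for illustration, while nodes, ea_start, ea_end, name, instructions and disasm mirror the attributes used in the examples in this document.

```python
class Graph(object):
    """Minimal graph: nodes keyed by id plus a set of (source, destination) edges."""
    def __init__(self):
        self.nodes = {}
        self.edges = set()

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, src_id, dst_id):
        self.edges.add((src_id, dst_id))


class Instruction(object):
    """Not a graph object, just a record with a few attributes."""
    def __init__(self, ea, disasm):
        self.ea = ea
        self.disasm = disasm


class BasicBlock(object):
    """A node of a function graph that holds instructions keyed by address."""
    def __init__(self, ea_start, ea_end):
        self.id = ea_start
        self.ea_start = ea_start
        self.ea_end = ea_end
        self.instructions = {}

    def add_instruction(self, ea, disasm):
        self.instructions[ea] = Instruction(ea, disasm)


class Function(Graph):
    """A graph of basic blocks that is also a node of the module graph."""
    def __init__(self, ea_start, name):
        Graph.__init__(self)
        self.id = ea_start
        self.ea_start = ea_start
        self.name = name


class Module(Graph):
    """A graph whose nodes are functions and whose edges are calls."""
    def __init__(self, name):
        Graph.__init__(self)
        self.name = name


# build a tiny two-function module by hand (all addresses are made up).
module = Module("example.dll")
main = Function(0x1000, "main")
helper = Function(0x2000, "helper")
module.add_node(main)
module.add_node(helper)
module.add_edge(main.id, helper.id)          # main calls helper

entry = BasicBlock(0x1000, 0x1003)
entry.add_instruction(0x1000, "push ebp")
entry.add_instruction(0x1001, "mov ebp, esp")
tail = BasicBlock(0x1003, 0x1008)
tail.add_instruction(0x1003, "call 0x2000")
main.add_node(entry)
main.add_node(tail)
main.add_edge(entry.id, tail.id)

# walk module -> function -> basic block -> instruction, in address order.
listing = []
for function in sorted(module.nodes.values(), key=lambda f: f.ea_start):
    for bb in sorted(function.nodes.values(), key=lambda b: b.ea_start):
        for ea in sorted(bb.instructions):
            listing.append("%08x %s" % (ea, bb.instructions[ea].disasm))
```

Because Function inherits from the graph class and also carries an id, the same object can be walked as a graph or stored as a node, which is all the "graph as well as a node" wording amounts to.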

At any point you can take advantage of the graph abstraction to create arbitrary down / up graphs, graph intersections, graph concatenations, etc. The generated graphs can be rendered in GML, GraphViz or uDraw format. Consider the following simple example that will step through every function, basic block and instruction within a module and produce various outputs along the way:

    import pida
    
    module = pida.load("some_file.pida")
    
    # render a function graph in GML format for the entire module.
    fh = open("graphs/functions.gml", "w+")
    fh.write(module.render_graph_gml())
    fh.close()
    
    # render a function graph in uDraw format for the entire module.
    fh = open("graphs/functions.udg", "w+")
    fh.write(module.render_graph_udraw())
    fh.close()
    
    # step through each function in the module:
    for function in module.nodes.values():
        # if we found the first function we are interested in
        if function.ea_start == 0x00407950:
            # step through each basic block in the function.
            for bb in function.nodes.values():
                print "\t%08x - %08x" % (bb.ea_start, bb.ea_end)
                # print each instruction in each basic block.
                for ins in bb.instructions.values():
                    print "\t\t%s" % ins.disasm
    
            # render a GML graph of this function.
            fh = open("graphs/function.gml", "w+")
            fh.write(function.render_graph_gml())
            fh.close()
    
            # render a GraphViz PNG of this function too.
            graph = function.render_graph_graphviz()
            graph.write_png("graphs/function.png", prog="dot")
    
        # if we found the second function we are interested in.
        if function.name == "some_routine":
            # render a uDraw format proximity graph.
            fh = open("graphs/proximity.udg", "w+")
            # look 3 levels up and 2 levels down
            prox_graph = module.graph_proximity(function.id, 3, 2)
            fh.write(prox_graph.render_graph_udraw())
            fh.close()

Consider another, more real-world example. You need to locate all functions within a binary that at some point open a file. You want to display all possible execution paths from the entry point of each of these functions to the API call responsible for opening the file. Finally, you want to display this data as a graph, per function. The task is easily accomplished:

    import re
    import pgraph
    
    # "module" is the PIDA module loaded in the previous example.
    
    # for each function in the module
    for function in module.functions.values():
        # create a downgraph from the current routine and locate the calls to [Open|Create]File[A|W]
        downgraph = module.graph_down(function.ea_start, -1)
        matches = [node for node in downgraph.nodes.values() if re.match(".*(create|open)file.*", node.name, re.I)]
        upgraph = pgraph.graph()
    
        # for each matching node create a temporary upgraph and add it to the parent upgraph.
        for node in matches:
            tmp_graph = module.graph_up(node.ea_start, -1)
            upgraph.graph_cat(tmp_graph)
    
        # write the intersection of the down graph from the current function and the upgraph from
        # the discovered nodes of interest to disk in GML format.
        downgraph.graph_intersect(upgraph)
    
        if len(downgraph.nodes):
            fh = open("%s.gml" % function.name, "w+")
            fh.write(downgraph.render_graph_gml())
            fh.close()
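The reasoning behind the example above is that a node lies on some path from a function to the file-opening API exactly when it is reachable from the function (down graph) and can itself reach the API (up graph). The standalone sketch below shows that idea on a plain dictionary of call edges. It is not pGRAPH's implementation; the reachable() and reverse() helpers and the function names in the sample call graph are made up for illustration.

```python
def reachable(start, edges):
    """Return the set of nodes reachable from start by following edges (start included)."""
    seen = set([start])
    stack = [start]
    while stack:
        node = stack.pop()
        for successor in edges.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def reverse(edges):
    """Flip every edge so that a walk follows callers instead of callees."""
    flipped = {}
    for src, dsts in edges.items():
        for dst in dsts:
            flipped.setdefault(dst, []).append(src)
    return flipped


# hypothetical call graph: caller -> list of callees.
calls = {
    "handler":     ["parse", "log_error"],
    "parse":       ["read_config"],
    "read_config": ["CreateFileA"],
    "log_error":   ["printf"],
    "unrelated":   ["CreateFileA"],
}

down = reachable("handler", calls)               # everything handler can reach
up = reachable("CreateFileA", reverse(calls))    # everything that can reach CreateFileA
on_path = down & up                              # nodes on a path between the two
```

Here log_error and printf drop out because they never reach CreateFileA, and unrelated drops out because handler never reaches it.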

Together, PIDA and PyDbg offer a powerful combination for building a variety of tools. Consider for example the ease of re-creating Process Stalker on top of this platform. Simply generate a PIDA module, load it in a PyDbg script, step through the functions / basic blocks within the module setting breakpoints along the way and finally register a breakpoint handler that logs the breakpoint hits to disk.
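A rough sketch of that recipe follows. To keep it runnable without a Windows target, it is written against duck-typed objects and exercised with small stand-ins: FakeDebugger and Node are hypothetical test doubles, and the two constants are defined locally with the Win32 values of the same names (in a real script they would come from pydbg.defines). The only debugger and module members it relies on are the ones already used in this document: set_callback(), bp_set(), first_breakpoint, exception_address, nodes and ea_start. It assumes the debugger is already attached and that the addresses in the module match the addresses in the running process.

```python
import io

# stand-ins for the names normally imported from pydbg.defines.
EXCEPTION_BREAKPOINT = 0x80000003
DBG_CONTINUE = 0x00010002


def stalk(dbg, module, log_file):
    """Set a breakpoint on every basic block and log each hit to log_file."""
    def handler_breakpoint(dbg):
        # ignore the first windows driven breakpoint.
        if dbg.first_breakpoint:
            return DBG_CONTINUE
        log_file.write("%08x\n" % dbg.exception_address)
        return DBG_CONTINUE

    dbg.set_callback(EXCEPTION_BREAKPOINT, handler_breakpoint)

    # step through the functions / basic blocks setting breakpoints along the way.
    count = 0
    for function in module.nodes.values():
        for bb in function.nodes.values():
            dbg.bp_set(bb.ea_start)
            count += 1
    return count


class FakeDebugger(object):
    """Hypothetical test double that records calls instead of debugging a process."""
    def __init__(self):
        self.callbacks = {}
        self.breakpoints = []
        self.first_breakpoint = False
        self.exception_address = 0

    def set_callback(self, code, handler):
        self.callbacks[code] = handler

    def bp_set(self, address):
        self.breakpoints.append(address)


class Node(object):
    """Hypothetical stand-in for a module, function or basic block."""
    def __init__(self, ea_start, nodes=None):
        self.ea_start = ea_start
        self.nodes = nodes or {}


# one function made of two basic blocks (addresses are made up).
blocks = {0x1000: Node(0x1000), 0x1010: Node(0x1010)}
module = Node(0, {0x1000: Node(0x1000, blocks)})

dbg = FakeDebugger()
log = io.StringIO()
total = stalk(dbg, module, log)

# simulate the debugger reporting a hit on the second basic block.
dbg.exception_address = 0x1010
result = dbg.callbacks[EXCEPTION_BREAKPOINT](dbg)
```

In a real script the module would come from pida.load(), the debugger would be a pydbg instance that has attached to the target, and the log would be a file opened on disk.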

Entity Relationship Diagram

The following entity relationship diagram should give you a good overall feel for how the framework is organized:

The specifics of each of the above components are detailed later in this document.

Future Development

There are a bunch of tools and utilities I want to build on this framework, and some that I have already built. One major need for improvement is in the memory consumption of loaded PIDA modules. Full analysis (vs. function or basic block only) PIDA modules consume a drastic amount of memory. The likely solution to this problem will be to create some form of "on demand" access to the underlying data as opposed to loading the entire data structure into memory from the get go. If anyone has a creative solution to this problem, please let me know.
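One possible shape for such "on demand" access, offered purely as a sketch and not as a planned design: store every function as its own compressed record, keep only a small index of offsets in memory, and inflate a record the first time it is requested. Everything here (write_store(), OnDemandModule, the record layout and the sample data) is hypothetical.

```python
import os
import pickle
import tempfile
import zlib


def write_store(path, functions):
    """Write each function as a separate compressed record; return {key: (offset, size)}."""
    index = {}
    with open(path, "wb") as fh:
        for key, function in functions.items():
            blob = zlib.compress(pickle.dumps(function, protocol=2))
            index[key] = (fh.tell(), len(blob))
            fh.write(blob)
    return index


class OnDemandModule(object):
    """Loads a function from disk only when it is first accessed."""
    def __init__(self, path, index):
        self.path = path
        self.index = index
        self.cache = {}

    def __getitem__(self, key):
        if key not in self.cache:
            offset, size = self.index[key]
            with open(self.path, "rb") as fh:
                fh.seek(offset)
                self.cache[key] = pickle.loads(zlib.decompress(fh.read(size)))
        return self.cache[key]


# hypothetical per-function data keyed by start address.
functions = {0x1000: ["push ebp", "ret"], 0x2000: ["xor eax, eax", "ret"]}

path = os.path.join(tempfile.mkdtemp(), "functions.bin")
module = OnDemandModule(path, write_store(path, functions))
first = module[0x2000]     # only this record is read and decompressed
```

The trade-off is extra disk reads on first access in exchange for a memory footprint that grows with the functions actually touched rather than with the size of the binary.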
