Revealing malware relationships with GraphDB: Part 1

In this post, we will learn how a graph database like Neo4j can help visualize malware relationships and extend those relationships to identify patterns between samples. Before we dig into Neo4j, let’s start with some fundamental graph terminology:

Nodes represent entities such as a human, car, laptop or phone.
Properties are attributes nodes can contain. A steering wheel or tires would be a property of the “car” node.
Labels are a way to group together nodes of a similar type. For example, a label of “FastFood” may include nodes such as “Taco Bell, McDonald’s, and Chipotle”.
Edges (also called relationships) represent the connection between two nodes. Relationships can also have their own properties.
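
To make these terms concrete, here is a minimal sketch of a property graph in plain Python. The class names, the Car/Person example and all property values are illustrative assumptions, not part of Neo4j. Each node carries a single label here for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph entity with one label and a dict of properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship from one node to another."""
    start: Node
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Two nodes with different labels (names and values are illustrative only).
car = Node("Car", {"make": "Honda", "color": "blue"})
owner = Node("Person", {"name": "Alice"})

# An edge connects the two nodes and can carry its own properties.
owns = Edge(owner, "OWNS", car, {"since": 2015})

# Prints: (Person)-[:OWNS]->(Car)
print(f"({owns.start.label})-[:{owns.rel_type}]->({owns.end.label})")
```

The printed arrow notation mirrors how Cypher writes a directed relationship between two labeled nodes.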

Getting started with Neo4j

Neo4j is a graph database known for its simplicity and easy-to-use interface. I find the structure of a graph database quite fascinating, as is the challenge of normalizing malware analysis data for each sample into a schema that works for a graph database. To get started, we first need to get a Neo4j instance running. The quickest way to do this is with Docker. Once you have Docker installed, you can pull down a Neo4j Docker image using the following command:

docker pull neo4j
Once you have the image downloaded to your system, you can start the container by running the command below:

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    neo4j

If all goes well, you should see some standard output in your console, including the line:

INFO  Remote interface available at http://localhost:7474/

If you navigate to this URL in your browser, you should be prompted to log in to the Neo4j Docker container using the default credentials “neo4j/neo4j”. After logging in and changing your password, you can begin exploring the interface. If you’re new to Neo4j, I would recommend digging into the “Learning about Neo4j” section, so you can get a handle on the syntax for searching and updating nodes or edges in the database.

Define the schema

In order to load the data into Neo4j, we need to build a schema that defines our nodes, edges and their properties. As with most databases, defining a standardized schema before inserting data is very important. Let’s start by taking a look at what a simple File node looks like below:

MERGE (n:File { md5 : {md5}, sha1 : {sha1}, sha256 : {sha256}, size : {size} }) RETURN n

As the statement above shows, we have a label of File with the properties md5, sha1, sha256 and size. A label is a way to group nodes of a similar type together. Notice we don’t have the name or path properties inside the File node. This is because any file can be renamed and moved to a different location, but the hashes and size will remain the same. Instead, we would create a separate node for each filename and path, since other malware samples may reuse the same name or file path, which creates a relationship between two different pieces of malware. However, a relationship based solely on a filename or path on disk is not a strong one unless the name or path is highly distinctive. I’ve outlined the other nodes, labels and their properties below in Neo4j’s Cypher syntax.

MERGE (n:FileType { type : {file_type} }) RETURN n
MERGE (n:Compiled { timestamp : {timestamp} }) RETURN n
MERGE (n:Library { name : {library} }) RETURN n
MERGE (n:Function { name : {function} }) RETURN n
MERGE (n:Detection { name : {detection} }) RETURN n
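
Since every statement above follows the same pattern, a small helper can generate them. The sketch below is one possible approach, not the code used for this post: merge_node is a hypothetical name, and the property values are placeholders. It emits parameters in the $name form because Neo4j 4.0 and later no longer accept the older {name} form shown above.

```python
def merge_node(label, props):
    """Return a parameterized MERGE statement and its parameter dict."""
    # Each property becomes "key: $key"; the values travel separately as
    # query parameters instead of being pasted into the query string.
    fields = ", ".join(f"{key}: ${key}" for key in props)
    return f"MERGE (n:{label} {{ {fields} }}) RETURN n", dict(props)


# Placeholder values, for illustration only.
query, params = merge_node("FileType", {"type": "PE32 executable"})
print(query)   # MERGE (n:FileType { type: $type }) RETURN n
print(params)  # {'type': 'PE32 executable'}
```

Keeping values in the parameter dict means that strings taken from a malware sample never become part of the query text itself.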

In addition to creating the various nodes, edges and properties, we also want to define the relationships these nodes can have with each other and the direction of those relationships. Let’s check out an example of creating a relationship between the File and FileType nodes using the Cypher query below:

MATCH (n:File { md5 : {md5}, sha1 : {sha1}, sha256 : {sha256}, size : {size} }), (f:FileType { type : {file_type} }) MERGE (n)-[:HAS_FILETYPE]->(f)

In this query, we first match on the File node (assigned to the variable n), then on the FileType node (assigned to f). Once the matches are collected, we establish a HAS_FILETYPE relationship between these two nodes. In Neo4j, every relationship has exactly one direction; a single relationship cannot point both ways. You also cannot have a relationship that points to another relationship. To work around this, we can use intermediary nodes to help link nodes together in more complex relationships (intermediary nodes will be covered in a future post). To better show how this will look, let’s view the final schema below in Neo4j:
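
The match-then-merge pattern described above can be generated for any pair of nodes. This is a rough sketch under assumptions: the function names and the n_/f_ parameter prefixes are hypothetical, chosen so that properties sharing a name on both nodes do not collide. It uses the $name parameter form accepted by current Neo4j versions.

```python
def match_pattern(var, label, props, prefix):
    """Build a node pattern such as (n:File { sha256: $n_sha256 })."""
    fields = ", ".join(f"{key}: ${prefix}{key}" for key in props)
    return f"({var}:{label} {{ {fields} }})"


def merge_relationship(start, rel_type, end):
    """Return a MATCH ... MERGE query linking start -> end, plus parameters.

    start and end are (label, properties) tuples.
    """
    (s_label, s_props), (e_label, e_props) = start, end
    query = (
        f"MATCH {match_pattern('n', s_label, s_props, 'n_')}, "
        f"{match_pattern('f', e_label, e_props, 'f_')} "
        f"MERGE (n)-[:{rel_type}]->(f)"
    )
    # Prefix the parameter names the same way the patterns do.
    params = {f"n_{key}": value for key, value in s_props.items()}
    params.update({f"f_{key}": value for key, value in e_props.items()})
    return query, params


# Placeholder property values, for illustration only.
query, params = merge_relationship(
    ("File", {"sha256": "<sha256>"}),
    "HAS_FILETYPE",
    ("FileType", {"type": "PE32 executable"}),
)
print(query)
```

The arrow in the MERGE clause fixes the direction: the relationship always runs from the first node to the second.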
Nodes, Edges and Relationships

Testing a sample set

For this post, I’m using WannaCry ransomware samples to better understand the relationships between these binaries at a static analysis level (not running the malware). To get started, we need to get the malware metadata into Neo4j. To extract the malware’s static attributes, I ran peframe against each sample and saved all the JSON outputs to a single directory. We can then use a little bit of Python to load the JSON data for each sample and generate Cypher queries that create our nodes, edges and relationships. Let’s take a look at a single node and all its relationships below:
Single WannaCry sample

We can see from the image above that our single WannaCry specimen imports six libraries, may use at least 83 functions from these six libraries and has four detections. The green nodes outline the executable’s FileType, which in this example is “PE32 executable (GUI) Intel 80386, for MS Windows”, and the gray node above shows its compile timestamp of “2017-05-04 22:34:46”.
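
The loading step mentioned earlier, reading each sample’s JSON and mapping it onto the schema, might look roughly like the sketch below. The flat key names (file_type, libraries, functions, detections and so on) are hypothetical and do not reflect peframe’s actual JSON layout, so the lookups would need adjusting to the real output. The function names are also assumptions.

```python
import json
from pathlib import Path


def load_samples(directory):
    """Parse every *.json file in a directory into a list of dicts."""
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(path.read_text()) for path in paths]


def sample_to_nodes(sample):
    """Turn one sample dict into (label, properties) pairs for the schema.

    The flat keys used here are hypothetical, not peframe's real layout.
    """
    nodes = [
        ("File", {key: sample[key] for key in ("md5", "sha1", "sha256", "size")}),
        ("FileType", {"type": sample["file_type"]}),
        ("Compiled", {"timestamp": sample["timestamp"]}),
    ]
    # One node per imported library, function and detection name.
    nodes += [("Library", {"name": name}) for name in sample.get("libraries", [])]
    nodes += [("Function", {"name": name}) for name in sample.get("functions", [])]
    nodes += [("Detection", {"name": name}) for name in sample.get("detections", [])]
    return nodes
```

Each (label, properties) pair can then be turned into a MERGE statement, and because MERGE only creates a node when no match exists, samples that share a library or detection end up pointing at the same node.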

Researching Relationships

Now that we have our single sample working, let’s go ahead and load some additional samples into the graph database to identify other potential relationships or patterns between the various WannaCry specimens.

Compile Times

One common attribute of PE files we can quickly pivot on is the compile timestamp. When working with malware from the same family, you may be able to see a trend in compile times as newer variants are built and used in the wild. We can see a small clustering of these groups below:

Compile time grouping
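
The same kind of grouping can be approximated outside the graph. The sketch below buckets samples by compile date, assuming timestamps in the “2017-05-04 22:34:46” format shown earlier. The sample names and all timestamps other than that one are made up for illustration, and the function name is an assumption.

```python
from collections import defaultdict
from datetime import datetime


def group_by_compile_date(samples):
    """Map each compile date to the sorted names of samples built that day."""
    groups = defaultdict(list)
    for name, timestamp in samples.items():
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        groups[day.isoformat()].append(name)
    # Sort by date so that trends over time are easy to read.
    return {day: sorted(names) for day, names in sorted(groups.items())}


# Illustrative data only; just the first timestamp comes from the post.
samples = {
    "sample_a": "2017-05-04 22:34:46",
    "sample_b": "2017-05-04 10:00:00",
    "sample_c": "2010-11-20 09:03:08",
}
print(group_by_compile_date(samples))
```

Keep in mind that compile timestamps in PE headers can be forged, so a cluster is a lead to investigate rather than proof of a shared build.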


Another attribute we can use to visualize malware relationships is “detections”, which peframe produces as part of its output. The graph below outlines the 19 malware samples and the four main detections (mutex, antibg, xor and packer).
Nodes with relationships to detections
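
What this graph shows can also be expressed as a simple inverted index from detection name to samples. The sketch below assumes each sample’s detections are already available as a list. The function names, sample names and detection assignments are placeholders, not data from the WannaCry set.

```python
from collections import defaultdict


def samples_by_detection(detections):
    """Invert {sample: [detection, ...]} into {detection: {sample, ...}}."""
    index = defaultdict(set)
    for sample, names in detections.items():
        for name in names:
            index[name].add(sample)
    return dict(index)


def shared_detections(detections, first, second):
    """Return the detection names two samples have in common."""
    return set(detections[first]) & set(detections[second])


# Placeholder data, for illustration only.
detections = {
    "sample_a": ["mutex", "xor", "packer"],
    "sample_b": ["xor", "packer"],
    "sample_c": ["mutex"],
}
print(samples_by_detection(detections)["packer"])
print(shared_detections(detections, "sample_a", "sample_b"))
```

In the graph, the same question is answered by following the edges out of a Detection node; the index is just a quick way to sanity-check what was loaded.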

Going beyond static

For this post, we focused on only a handful of static attributes (dead code analysis, not running the malware). We could further enhance the attributes and relationships in the graph by including other data sources such as VirusTotal, Cuckoo Sandbox (dynamic analysis) and Xori output. As always, I hope this post was informative, and happy hunting!

Special Thanks

Thanks to @omgapt and @jeffochan7 for the assist on the post. This blog post is also available on Medium.
