
Revealing malware relationships with GraphDB: Part 1

In this post, we will learn how a graph database like Neo4j can help visualize malware relationships and extend those relationships to identify patterns between samples. Before we dig into Neo4j, let’s start with some fundamental graph terminology:

Nodes represent entities such as a human, car, laptop or phone.
Properties are attributes nodes can contain. A steering wheel or tires would be a property of the “car” node.
Labels are a way to group together nodes of a similar type. For example, a label of “FastFood” may include nodes such as “Taco Bell, McDonald’s, and Chipotle”.
Edges (or relationships) represent the connection between two nodes. Note that “vertices” is another name for nodes, not edges. Relationships can also have their own properties.
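To make the terms above concrete, here is a minimal sketch in plain Python (no database involved). The Node and Edge classes, the OWNS relationship and all of the property values are made up for illustration; they are not part of Neo4j.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                       # groups nodes of a similar type
    properties: dict = field(default_factory=dict)   # attributes of this node


@dataclass
class Edge:
    start: Node                                      # where the relationship begins
    rel_type: str                                    # name of the relationship
    end: Node                                        # where the relationship points
    properties: dict = field(default_factory=dict)   # edges can carry attributes too


# Hypothetical example data: a person who owns a car.
person = Node("Person", {"name": "Alice"})
car = Node("Car", {"wheels": 4, "color": "red"})
owns = Edge(person, "OWNS", car, {"since": 2017})

print(f"({owns.start.label})-[:{owns.rel_type}]->({owns.end.label})")
```

The printed pattern mirrors the arrow notation that Cypher uses for a directed relationship between two labeled nodes.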

Getting started with Neo4j

Neo4j is a graph database known for its simplicity and easy-to-use interface. I find the structure of a graph database quite fascinating, as well as the exercise of normalizing the malware analysis data for each sample into a schema that works for a graph database. To get started, we first need to get a Neo4j instance running. The quickest way to do this is with Docker. Once you have Docker installed, you can pull down the Neo4j Docker image using the following command:

docker pull neo4j
Once you have the image downloaded to your system, you can start the container by running the command below:

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    neo4j

If all goes well, you should see some standard output in your console, including the line:

INFO  Remote interface available at http://localhost:7474/

If you navigate to this URL in your browser, you should be prompted to log in to the Neo4j Docker container using the default credentials “neo4j/neo4j”. After logging in and changing your password, you can begin exploring the interface. If you’re new to Neo4j, I would recommend digging into the “Learning about Neo4j” section, so you can get a handle on the syntax for searching and updating nodes or edges in the database.

Define the schema

In order to load the data into Neo4j, we need to build a schema that defines our nodes, edges and their properties. Like most databases, defining a standardized schema is very important before inserting data. Let’s start by taking a look at what a simple File node looks like below:

MERGE (n:File { md5 : {md5}, sha1 : {sha1}, sha256 : {sha256}, size : {size} }) RETURN n

You can see from the statement above, we have a label of File, which has the properties of md5, sha1, sha256 and size. A Label is a way to group nodes of a similar type together. Notice we don’t have the name or path properties inside the File node. This is because any file can be renamed and moved to a different location but the hashes and size will remain the same. Because of this, we would create another node for each filename and path, as many other malware samples may reuse the same name or file path, thus creating a relationship between two different pieces of malware. However, a relationship based solely on a filename or path on disk is not the strongest relationship unless it’s a very unique name or path. I’ve outlined the other nodes, labels and their properties below in Neo4j’s Cypher syntax.

MERGE (n:FileType { type : {file_type} }) RETURN n
MERGE (n:Compiled { timestamp : {timestamp} }) RETURN n
MERGE (n:Library { name : {library} }) RETURN n
MERGE (n:Function { name : {function} }) RETURN n
MERGE (n:Detection { name : {detection} }) RETURN n
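As a rough sketch, the node statements above could be generated from a small schema table in Python rather than written by hand. The NODE_SCHEMA dictionary and the merge_node_query helper are hypothetical names, and for brevity each parameter is named after its property (so the FileType parameter is type rather than file_type). The sketch also uses the $name parameter style instead of {name}; as far as I know, Neo4j 4.0 and later only accept the $name form, so check this against the version you are running.

```python
# Hypothetical schema table: label -> property names, copied from the MERGE
# statements in this post.
NODE_SCHEMA = {
    "File": ["md5", "sha1", "sha256", "size"],
    "FileType": ["type"],
    "Compiled": ["timestamp"],
    "Library": ["name"],
    "Function": ["name"],
    "Detection": ["name"],
}


def merge_node_query(label):
    """Build a parameterized MERGE statement for one node label."""
    props = ", ".join(f"{name}: ${name}" for name in NODE_SCHEMA[label])
    return f"MERGE (n:{label} {{ {props} }}) RETURN n"


file_query = merge_node_query("File")
filetype_query = merge_node_query("FileType")
print(file_query)
print(filetype_query)
```

Keeping the values in parameters, rather than formatting them into the query string, avoids quoting problems with odd function or file names.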

In addition to creating the various nodes, edges and properties, we also want to define the relationships these nodes can have with each other and the direction of those relationships. Let’s check out an example on creating a relationship between the File and FileType nodes using the Cypher query below:

MATCH (n:File { md5 : {md5}, sha1 : {sha1}, sha256 : {sha256}, size : {size} }), (f:FileType { type : {file_type} }) MERGE (n)-[:HAS_FILETYPE]->(f)

In this query, we first match on the File node (assigned to the variable n), then on the FileType node (assigned to f). Once the matches are collected, we establish a HAS_FILETYPE relationship between these two nodes. In Neo4j, every relationship has exactly one direction; a single relationship cannot point both ways, although a query can traverse it in either direction. You also cannot have a relationship that points to another relationship. To work around this, we can use intermediary nodes to help link nodes together in more complex relationships (intermediary nodes will be covered in a future post). To better show how this will look, let’s view the final schema below in Neo4j:
Nodes, Edges and Relationships
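The relationship query can be templated the same way. This is only a sketch: node_pattern and merge_relationship_query are hypothetical helpers, parameters use the $name style, and the File node is matched on sha256 alone as a simplification of the full property match shown earlier. HAS_FILETYPE is the only relationship name taken from this post.

```python
def node_pattern(var, label, keys):
    """Render a node pattern such as (n:File { sha256: $n_sha256 })."""
    props = ", ".join(f"{key}: ${var}_{key}" for key in keys)
    return f"({var}:{label} {{ {props} }})"


def merge_relationship_query(src, rel_type, dst):
    """Build MATCH ... MERGE for a relationship directed from src to dst."""
    src_label, src_keys = src
    dst_label, dst_keys = dst
    return (
        f"MATCH {node_pattern('n', src_label, src_keys)}, "
        f"{node_pattern('f', dst_label, dst_keys)} "
        f"MERGE (n)-[:{rel_type}]->(f)"
    )


query = merge_relationship_query(
    ("File", ["sha256"]), "HAS_FILETYPE", ("FileType", ["type"])
)
print(query)
```

The arrow in the MERGE clause fixes the direction: File points at FileType, never the other way around.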

Testing a sample set

For this post, I’m using WannaCry ransomware samples to better understand the relationships between these binaries at a static analysis level (not running the malware). To get started, we need to get the malware metadata into Neo4j. To extract the malware’s static attributes, I ran peframe against each sample and saved all the JSON outputs to a single directory. We can then use a little bit of Python to load the JSON data for each sample and generate Cypher queries that create our nodes, edges and relationships. Let’s take a look at a single node and all its relationships below:
Single WannaCry sample

We can see from the image above that our single WannaCry specimen imports six libraries, may use at least 83 functions from these six libraries and has four detections. We also see green nodes, which outline the executable’s FileType, which in this example is “PE32 executable (GUI) Intel 80386, for MS Windows” and has a compiled timestamp of “2017-05-04 22:34:46”, shown in the gray node above.
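The loading step could look roughly like the sketch below. It is not the script used for this post. The JSON key names (md5, sha1, sha256, size, detections) are assumptions about the report layout and will likely need adjusting to the real peframe output, HAS_DETECTION is a hypothetical relationship name, and the hash values are placeholders. The statements are only built here; sending them to Neo4j is left out.

```python
import json
import tempfile
from pathlib import Path


def load_reports(directory):
    """Parse every *.json report found in the directory."""
    reports = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            reports.append(json.load(handle))
    return reports


def report_to_statements(report):
    """Turn one report into (cypher, parameters) pairs. Key names are assumed."""
    file_params = {key: report[key] for key in ("md5", "sha1", "sha256", "size")}
    statements = [(
        "MERGE (n:File { md5: $md5, sha1: $sha1, sha256: $sha256, size: $size })",
        file_params,
    )]
    for detection in report.get("detections", []):
        params = {"sha256": report["sha256"], "detection": detection}
        statements.append(("MERGE (d:Detection { name: $detection })", params))
        statements.append((
            "MATCH (n:File { sha256: $sha256 }), (d:Detection { name: $detection }) "
            "MERGE (n)-[:HAS_DETECTION]->(d)",  # hypothetical relationship name
            params,
        ))
    return statements


# Placeholder report written to a temporary directory to exercise the loader.
sample = {"md5": "0" * 32, "sha1": "0" * 40, "sha256": "0" * 64,
          "size": 1024, "detections": ["mutex", "packer"]}
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.json").write_text(json.dumps(sample), encoding="utf-8")
    reports = load_reports(tmp)

statements = report_to_statements(reports[0])
for cypher, params in statements:
    print(cypher, params)
```

Because every statement is a MERGE, running the loader twice over the same directory should not create duplicate nodes or relationships.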

Researching Relationships

Now that we have a single sample loaded, let’s go ahead and load some additional samples into the graph database to identify other potential relationships or patterns between the various WannaCry specimens.

Compile Times

One common attribute of PE files we can quickly pivot on is the compile timestamp. When working with malware from the same family, you may be able to see a trend in compile times as newer variants are built and used in the wild. We can see a small clustering of these groups below:

Compile time grouping
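The same grouping can be approximated outside the graph. A minimal sketch, assuming timestamps in the "2017-05-04 22:34:46" format shown earlier; the sample names and the second timestamp are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def group_by_compile_date(samples):
    """Map each compile date (YYYY-MM-DD) to the samples built on that day."""
    groups = defaultdict(list)
    for sample_id, timestamp in samples:
        compiled = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        groups[compiled.date().isoformat()].append(sample_id)
    return dict(groups)


# Hypothetical input: (sample name, compile timestamp) pairs.
samples = [
    ("sample_a", "2017-05-04 22:34:46"),
    ("sample_b", "2017-05-04 22:34:46"),
    ("sample_c", "2017-01-01 00:00:00"),
]
clusters = group_by_compile_date(samples)
print(clusters)
```

Keep in mind that the PE compile timestamp can be forged, so a cluster is a lead to investigate rather than proof of a shared build.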


Another attribute we can use to visualize malware relationships is “detections”, which peframe produces as part of its output. The graph below outlines the 19 malware samples and the four main detections (i.e. mutex, antidbg, xor and packer).
Nodes with relationships to detections
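What the graph shows visually is a pivot from each detection back to the samples that share it. A small sketch of that pivot in plain Python follows; the sample names and their detection lists are invented, and only the detection names come from this post.

```python
from collections import defaultdict
from itertools import combinations


def samples_by_detection(detections_per_sample):
    """Invert sample -> detections into detection -> set of samples."""
    index = defaultdict(set)
    for sample_id, detections in detections_per_sample.items():
        for name in detections:
            index[name].add(sample_id)
    return index


def shared_detections(detections_per_sample):
    """Map each pair of samples to the detections they have in common."""
    shared = defaultdict(set)
    for name, sample_ids in samples_by_detection(detections_per_sample).items():
        for left, right in combinations(sorted(sample_ids), 2):
            shared[(left, right)].add(name)
    return dict(shared)


# Hypothetical input data.
data = {
    "sample_a": ["mutex", "xor", "packer"],
    "sample_b": ["mutex"],
    "sample_c": ["xor", "packer"],
}
pairs = shared_detections(data)
print(pairs)
```

Pairs that share several detections are the ones worth a closer look, since a single common detection such as a packer can be coincidental.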

Going beyond static

For this post, we only focused on a handful of static attributes (analysis of the file at rest, not running the malware). We could further enhance the attributes and relationships in the graph by including other data sources such as VirusTotal, Cuckoo Sandbox (dynamic analysis) and Xori output. As always, I hope this post was informative, and happy hunting!

Special Thanks

Thanks to @omgapt and @jeffochan7 for the assist on the post. This post is also available on Medium.
