Using Neo4j Graph Database to map your AWS Infrastructure

Nick Doyle
8 min read · Aug 2, 2018


S3 bucket access visualization: here you can see just one principal (the main account) having full control (yellow, labeled “FULL_CONTROL”) on all buckets (purple), which are in four regions (green)

Update May 2020: now based on Neo4j 4.0

TL;DR — run a live local webserver with:

docker run \
-d \
--name aws_map_myaccount \
--env NEO4J_AUTH=$NEO4J_AUTH \
--env AWS_TO_NEO4J_LIMIT_REGION=ap-southeast-2 \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-p 80:7474 \
-p 7687:7687 \
rdkls/aws_infra_map_neo4j

where NEO4J_AUTH is of the format neo4j/mypassword

For a while I’ve been kicking around the idea of combining two things:

  • Graph Databases
  • Cloud (primarily AWS)

In a certain way, applying the things graph databases are good at to the cloud.
On researching it, I was surprised to see that nobody else seems to be doing this.

This post is a writeup of a quick hack I made in this area, and thoughts about where it might lead.
I think there’s scope to do some very interesting (and potentially lucrative) work in this area, for those with the time and interest.
Basically, what I’ll be covering in this post is this:

Graph & Cloud

Talking to people about graph databases is interesting; many people have no idea what they are, or, beyond that, why they’re actually useful beyond the novelty.
Often people trying to explain graph feel frustrated that others “just don’t get it” (personally I think that yes, some lazy brains just don’t feel like making the effort to think — but that we can always improve our communication).

I’ll try to summarize:

  • Despite the name, “relational” databases are NOT good at relationships
    Graph databases are good at relationships
  • Graphs are better at computation — when your analysis is heavy on relationships
    (though some heretics may debate this)
  • Graphs are better at exploration — humans exploring patterns and relationships in data.
    It’s often said in data analysis that you need to know what you’re looking for in order to conduct the analysis. With graph this isn’t necessarily true.

If you’ve ever worked as a DBA, particularly in BI/analysis, to the point where you have fucking had enough of JOINs, I’m sure you see the appeal.
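
To make the relational-vs-graph point concrete, here’s the same question (“which principals can touch which buckets?”) asked both ways. The table, label and relationship names below are hypothetical, purely for illustration:

# Hypothetical schemas, for illustration only.
# Relational: every hop in the relationship chain costs another JOIN.
sql = """
SELECT p.name, b.name
FROM principals p
JOIN policy_attachments pa ON pa.principal_id = p.id
JOIN policies pol ON pol.id = pa.policy_id
JOIN policy_statements ps ON ps.policy_id = pol.id
JOIN buckets b ON b.arn = ps.resource_arn
"""

# Graph: the same chain of relationships is a single path pattern.
cypher = """
MATCH (p:Principal)-[:HAS_POLICY]->(:Policy)-[:GRANTS]->(b:Bucket)
RETURN p.name, b.name
"""

The longer the chain gets, the more the graph version wins, in both readability and (for many workloads) execution.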

Just to digress on that first point: relationships being as important as (or more important than) the data themselves always brings to mind Systems Thinking — but that’s a whole other Thing.

Now the Cloud.

Common challenges I see are:

  • Nontrivial architectures comprising many interrelated components
    (lambdas, ASGs, caches, dbs, containers etc)
  • Permission hierarchies, often stretching across accounts (i.e. assumed roles), managed by cloud providers or customers, possibly one-off (‘inline’)
  • Network security (ingress/egress) managed both at security group and NACL levels
  • VPC peering and routing
  • Producing accurate as-built documentation to support such environments — understandably a PITA to create, let alone maintain
  • Exploring and understanding existing environments for those who didn’t build them

My primary goal, coming into a new AWS account, was this: if I could run a scraper that jammed all the account’s info into a graph database, it’d be a great way to explore and understand the account.

I should also point out that I wanted something better than static analysis of e.g. CloudFormation or Terraform templates; I wanted to know the real deal as-deployed, warts and all, ideally with scope for realtime updates.

But there are other potential benefits, such as exploration & analysis on:

  • Data lineage and governance (hello GDPR)
  • Network ingress / egress, security
  • Log correlation — especially e.g. tied into ELK

The Beginning — Awless and Google Badwolf

 █████╗ ██╗    ██╗██╗     ███████╗███████╗███████╗
██╔══██╗██║    ██║██║     ██╔════╝██╔════╝██╔════╝
███████║██║ █╗ ██║██║     █████╗  ███████╗███████╗
██╔══██║██║███╗██║██║     ██╔══╝  ╚════██║╚════██║
██║  ██║╚███╔███╔╝███████╗███████╗███████║███████║
╚═╝  ╚═╝ ╚══╝╚══╝ ╚══════╝╚══════╝╚══════╝╚══════╝

After this idea kicked around up there for a while, what really prompted me to have a crack was the excellent tool awless — a supercharged, more standard and less painful AWS CLI.
If you do any engineering at all with AWS I highly recommend trying it.

Once I started using it, I noticed that it locally stored details of your AWS environment in a graph database — namely Google BadWolf.

Right, so my infrastructure data is already there; great, no need to scrape it all with boto.

It’s in a standard format, right? Wrong. BadWolf gives the user flexibility with the schema, being only “loosely modeled” on the W3C-standard RDF. As far as my limited graph knowledge goes, some reasons for this might be the temporal aspect, as well as meta-modeling (relating relationships), and also “just because” … UFO built by insane space aliens …

So I had a look at the data files, and despite not being totally 100% RDF-compliant, they weren’t far off. Unfortunately they differ per resource type/file, so standardising has to be done for each type of resource, but it’s not too bad.
An example from “infrastructure.nt”:

<i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z"^^<xsd:dateTime> .
  1. Problem: the resource identifiers just don’t have schemes, so they’re not valid IRIs
    Solution: Just prepend resource://
  2. Problem: the double-caret is a datatype annotation (a kind of language tag, but not prefixed with an @ symbol), so the RDF import breaks
    Solution: Just remove it

Changing such to:

<resource://i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z" .

Made it valid RDF and able to be loaded in the next step.

(the code I put together for this is in “awless_to_neo.py” in the source repo for this post)
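
If you’d rather see the shape of those two fixes than dig through the repo, a minimal sketch might look like this (my illustration, not the actual repo code — the regexes are assumptions about the awless output format):

import re

def normalise_triple(line: str) -> str:
    """Apply the two fixes above to one N-Triples line from awless."""
    # 1. Prepend resource:// to any bracketed identifier with no scheme,
    #    turning <i-0a74e88bee42b4bb5> into a valid IRI.
    line = re.sub(r"<(?!\w+:)([^>]+)>", r"<resource://\1>", line)
    # 2. Strip the ^^<...> datatype annotation the importer chokes on.
    line = re.sub(r"\^\^<[^>]+>", "", line)
    return line

with open("infrastructure.nt") as src, open("clean.nt", "w") as dst:
    for triple in src:
        dst.write(normalise_triple(triple))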

Loading to Neo4j

When I first realized I had RDF-like data, I looked round for ways to easily load that.

And surprise surprise, this guy Jesús Barrasa has already written Neo4j modules to make this possible.

(I say surprise surprise because only a couple of months back I became quite familiar with his thoughts on ontologies in Neo4j, when I had to do some data lineage and fault-finding for a consulting gig — but that’s another story)

They worked pretty well.

I found I did need to hack around some fields/labels after import to make the display nicer (this code is also in “awless_to_neo.py”, linked at the end of this post).

All good.
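
For Neo4j 4.0 those modules live in the neosemantics (n10s) plugin. Assuming it’s installed, the import boils down to a few procedure calls; here’s a sketch using the official Python driver (credentials and file path are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "mypassword"))

with driver.session() as session:
    # n10s requires a uniqueness constraint on Resource.uri before any import.
    session.run("CREATE CONSTRAINT n10s_unique_uri "
                "ON (r:Resource) ASSERT r.uri IS UNIQUE")
    # Initialise the RDF graph config with defaults.
    session.run("CALL n10s.graphconfig.init()")
    # Pull in the cleaned N-Triples file (path is a placeholder).
    session.run('CALL n10s.rdf.import.fetch("file:///clean.nt", "N-Triples")')
driver.close()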

I hope that right now you are, just as I was, itching to take awless and the RDF import modules right out of the picture, and insert directly into Neo4j with one beautiful script. Yes, that would be a lovely improvement for future work on this; a rough sketch of the idea follows.
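
For the record, a bare-bones version of that one beautiful script could look like this. It only walks VPCs and subnets, and the node labels and relationship names are my own choices, not any standard:

import boto3
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "mypassword"))
ec2 = boto3.client("ec2", region_name="ap-southeast-2")

with driver.session() as session:
    # One node per VPC...
    for vpc in ec2.describe_vpcs()["Vpcs"]:
        session.run("MERGE (v:VPC {vpcId: $id}) SET v.cidr = $cidr",
                    id=vpc["VpcId"], cidr=vpc["CidrBlock"])
    # ...and one per subnet, hung off its VPC.
    for sn in ec2.describe_subnets()["Subnets"]:
        session.run("MATCH (v:VPC {vpcId: $vpcId}) "
                    "MERGE (s:Subnet {subnetId: $id}) SET s.az = $az "
                    "MERGE (v)-[:CONTAINS]->(s)",
                    vpcId=sn["VpcId"], id=sn["SubnetId"],
                    az=sn["AvailabilityZone"])
driver.close()

Extending the same loop to security groups, route tables and IAM is mostly boto3 legwork.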

Packaging — Docker

Having hacked the basics together I wanted a way to hand off to someone to try out.

I could tell them to set up Neo4j, install the script, etc. etc. etc.

Or I could just chuck it all in a docker container, so I did that.

Source is on GitHub, with a built Docker image on Docker Hub here.

docker run \
-d \
--name aws_map_myaccount \
--env NEO4J_AUTH=$NEO4J_AUTH \
--env AWS_TO_NEO4J_LIMIT_REGION=ap-southeast-2 \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-p 80:7474 \
-p 7687:7687 \
rdkls/aws_infra_map_neo4j

Run it with the variables substituted per your environment, and you’ll get your own account mapped and available on port 80.

Some Results

Here you can see regions in dark green. Most resources are concentrated in a primary region, then a secondary, with some (probably unintentional/orphaned) resources in two others.
Which VPCs (green) have Subnets (blue) in which AZs (red), in which Regions (purple)
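
Those screenshots come straight out of the Neo4j browser; to pull the same picture out as rows, a query along these lines works. Treat the labels, properties and relationship pattern as assumptions, since they depend on how the import mapped the awless data:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "mypassword"))

# Undirected matches keep this robust to whichever way the
# import pointed the relationships.
query = """
MATCH (r:Region)--(v:Vpc)--(s:Subnet)
RETURN r.uri AS region, v.uri AS vpc, collect(s.uri) AS subnets
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["region"], record["vpc"], record["subnets"])
driver.close()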

Integrate with other frontends

Another cool thing you can do once you have the data is point other tools at it, such as Linkurious or the new (but basic) (but free!) GraphXR.

I think GraphXR is looking pretty cool. A bit clunky, but it looks sweet, and I’m loving the 3D effect. IMO this sort of interface, “just like in the movies” where we’re exploring data in 3D, will have big benefits in future. Here’s an example of pointing GraphXR at my localhost: basically you go there, create a new project, point it at your local Neo4j instance, “configure search index” and do some sort of search to get things showing up.

Actual AWS Account Infra data loaded into GraphXR. The large aggregation is the VPC, with subnets etc coming out of it. On the left is another VPC, with its own Lambdas (green) and Security Groups (red)

Environment Variables

The container is driven by the variables in the docker run above:

  • NEO4J_AUTH: Neo4j credentials, in the format neo4j/mypassword
  • AWS_TO_NEO4J_LIMIT_REGION: limit the scrape to a single region, e.g. ap-southeast-2
  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN: standard AWS credentials used to read the account

Code

Code for the work in this post can be found on GitHub.
A built Docker image, including Neo4j Community Edition, awless and related utilities, is available on Docker Hub.

Possible future work

Better / custom graph layouts are needed, possibly using Sigma.js.
In my trials, Dagre (with the default NetworkSimplex ranker) gave a decent layout, but to make the best of it I think I’d need to stub in a static hierarchy, i.e. be able to specify Account > VPC > Subnet etc (a sketch of the idea follows the screenshot below).

Dagre with default NetworkSimplexRanker
Getting close, but not as good as I’d like
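
As a sketch of what stubbing in that static hierarchy could mean: assign each node a layer from a fixed ordering up front, and hand that to the layout instead of letting the ranker infer it. The ordering below is my invention:

# Hypothetical fixed hierarchy for a layered, Dagre-style layout.
HIERARCHY = ["Account", "Region", "VPC", "Subnet", "Instance"]
RANK = {label: depth for depth, label in enumerate(HIERARCHY)}

def layer_for(node_labels):
    """Choose a layout layer from a node's most specific known label."""
    known = [RANK[l] for l in node_labels if l in RANK]
    # Unknown node types sink below the fixed hierarchy.
    return max(known) if known else len(HIERARCHY)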

Elasticsearch + Kibana plugin to index + explore AWS accounts

Relate to other log / data indexed in ES for drilldown

Opportunity to commercialize this — particularly when tied into existing ELK data: machine logs, VPC flow logs, firewall etc

Other graph engines

JanusGraph / Gremlin Server backed by DynamoDB — still requires a long-running Gremlin/Janus API on EC2/Docker

AWS Neptune — requires a minimum of db.r4.large @ $0.348/hr

IBM Cloud — run on a free K8s cluster with a JanusGraph backend


References & Further Reading

Jesús Barrasa — Importing RDF data into Neo4j

Andy Robbins — BloodHound: using graphs to automate attack (and defense) paths in an Active Directory environment

Dagre layout

Dagre bindings for Sigma.js

Awless: A Mighty CLI for AWS

Chromeo — ‘Bonafied Lovin’’

That’s It!

If you’ve read this far thanks, hope you found it interesting.
If you have any suggestions, insights, or karaaaaazy konspiracy theories feel free to let me know in the comments, and I will feel free to keep on insulting you. Until next time.
—— — — — — — — —— brrrzzzzzttt —— — — — — — — — -



Written by Nick Doyle

Cloud-Security-Agile, in Melbourne Australia, experience includes writing profanity-laced Perl, surprise Migrations, furious DB Admin and Motorcycle instructing