Mastering the Network Analysis Tool in Alteryx

Alteryx comes packaged with lots of tools. Although there are over 200 tools to choose from, most users will never need to wander far from the standards: Select, Formula, Join, Summarize, and friends.

The great thing about those 200+ tools that Alteryx gives us is that when we do need a specialized tool, it is likely that we have it. Case-in-point: the Network Analysis tool.

Here is Alteryx’s description of the tool:

Generate an interactive visual representation of all the nodes in a network. By mapping names and relationships, this tool provides an easy way to visualize and explore the relationship between the nodes in a network.

In addition to providing an interactive network graph, the tool also provides several statistics that quantify the relationships between the nodes in the graph. These statistics, as we’ll see later in this post, reveal a deeper understanding of these relationships, which may have eluded detection otherwise.

Throughout the rest of this post, we will walk through how to use the tool and how to understand its configuration, outputs, and corresponding statistics. We’ll do this through the lens of a data set that shows the number of passengers that flew on a route between origin and destination airports.


The Data Set and Inputs

For this example, we have used a data set that logs the number of passengers flying between origin and destination airports.

Every record represents a unique origin-destination combination, with columns for the origin and destination airport code, name, city, state, and region.

Note: We are dealing with geographic data in this example, and have included a spatial object field in the data set. For network analysis, this is not necessary. The analysis is performed without regard to geographic orientation.

As with many specialized tools in Alteryx, the structure of your data is very important, and the Network Analysis tool is no exception. The tool requires two inputs, labeled N and E for nodes and edges, respectively.

The N Input

The N input anchor contains a list of all unique nodes in your data set. If you want to size the nodes in the visualization by an existing variable, such as Passengers, this is the input in which it must be included. In this example, we will need to use the Union tool to create a list of unique airports from both the origin and destination columns.

By using the Summarize tool, we created parallel lists of unique origin and destination airports. This is the step where, if you so desire, you can also group by other fields, such as Region, which can be used to color the network graph later on.

The N input should look like this (note that we filtered the original data set to include only the 15 airports seen here). We have one record per airport (both origin and destination), with any desired supplemental data, such as Region or Passengers.
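Outside of Alteryx, the same reshaping can be sketched in a few lines of Python. This is only an illustration of the Union-then-Summarize idea, not what the tool does internally. The route records are made-up toy data, and crediting each route's passengers to both of its endpoints is an assumption made for the example.

```python
# Hypothetical route records: (origin, destination, passengers).
routes = [
    ("Tampa", "Atlanta", 1200),
    ("Atlanta", "Tampa", 1100),
    ("Atlanta", "Roswell", 90),
    ("Roswell", "Tampa", 40),
]

# Stack the origin and destination columns (the Union step), then
# aggregate to one record per airport (the Summarize step).
passengers = {}
for origin, destination, count in routes:
    for airport in (origin, destination):
        passengers[airport] = passengers.get(airport, 0) + count

# One record per unique airport, with a field that could size the nodes.
nodes = [
    {"Airport": name, "Passengers": total}
    for name, total in sorted(passengers.items())
]

for node in nodes:
    print(node)
```

The result has one record per airport regardless of whether the airport appeared as an origin, a destination, or both, which is the shape the N input expects.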

The E Input

The E input lists the start and end points of each edge. Here, each edge connects an origin and a destination airport. In other words, edges here represent flight routes, such as Tampa to Atlanta.

In this example, the E input requires no additional data processing or reshaping. All we need to do is use a Select tool to include only the fields that comprised the Airport field in the N input. Thus, we need to include only the fields Origin Airport Name and Destination Airport Name.

Whereas the N input had only 15 records, the E input here has 152 records. These represent each unique origin-destination combination between the 15 airports.
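As a rough Python sketch of the same step, the E input is just the two endpoint fields from each record. The records below are toy data, and the final check (that every edge endpoint also appears in the node list) is a sanity test we are assuming is worthwhile, not something the tool is documented to require.

```python
# Hypothetical records using the field names from this post.
records = [
    {"Origin Airport Name": "Tampa", "Destination Airport Name": "Atlanta", "Passengers": 1200},
    {"Origin Airport Name": "Atlanta", "Destination Airport Name": "Tampa", "Passengers": 1100},
    {"Origin Airport Name": "Atlanta", "Destination Airport Name": "Roswell", "Passengers": 90},
]

# The Select step: keep only the two fields that identify each edge.
edges = [
    (record["Origin Airport Name"], record["Destination Airport Name"])
    for record in records
]

# Names that would appear in the N input for this toy data.
node_names = {"Tampa", "Atlanta", "Roswell"}

# Any endpoint missing from the node list would point to a data problem.
missing = {name for edge in edges for name in edge} - node_names

# With n nodes there are at most n * (n - 1) ordered origin-destination pairs.
max_routes = len(node_names) * (len(node_names) - 1)

print(edges)
print("missing nodes:", missing)
print("maximum possible routes:", max_routes)
```

By the same arithmetic, 15 airports allow at most 15 × 14 = 210 ordered pairs, so the 152 records here mean that most, but not all, possible routes are present.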

Configuring the Network Analysis Tool

With our input data sets built properly, we can now progress to configuring the Network Analysis tool. The tool’s configuration menu consists of three tabs: Nodes, Edges, and Layout. Although you technically can run the tool without touching the configuration menu, it is more than likely you’ll want to customize the settings for your particular application.

Configuring Nodes

The first tab lets us configure the nodes of our network graph.

Change the shape of each node in the network graph

  • Dot (default)
  • Circle
  • Ellipse
  • Square

Change the size of each node in the network graph

  • Fixed (default) to a specific numeric size. All nodes will be the same size.
  • By variable. Allows you to size each node dynamically according to a variable from your N input data set. Size variables can be continuous (Passengers) or discrete (Region). It seems that sizing by a discrete field doesn’t work very well, so this is not recommended.
  • By statistic. Allows you to size each node dynamically according to one of five statistics {degree, betweenness, closeness, authority, hub}. More on these later.

Group Nodes (optional)

  • By variable. Allows you to color nodes based on a variable from your N data set.
  • By statistic. Colors nodes based on one of two statistics {fastgreedy.community, infomap.community}.

Configuring Edges

Compared to the node configuration, edges are trivially simple. We only have two things to configure:

  • If the graph is directed or not, and
  • The opacity of the edges

A directed graph is one in which the edge from one node to another flows in only one direction. This is depicted by placing an arrow on the edge pointing in the direction of the flow. For this example, we’ll leave it unchecked since we are more interested in the overall relationship between airports rather than the direction of travel.
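To make the distinction concrete, here is a small Python sketch of how the same edge list produces different neighbor sets depending on whether the graph is treated as directed. The `adjacency` helper and the two routes are hypothetical and only illustrate the concept.

```python
def adjacency(edges, directed):
    """Map each node to the set of nodes reachable in one step."""
    neighbors = {}
    for origin, destination in edges:
        neighbors.setdefault(origin, set()).add(destination)
        neighbors.setdefault(destination, set())
        if not directed:
            # An undirected edge can be followed in either direction.
            neighbors[destination].add(origin)
    return neighbors


# Hypothetical routes.
edges = [("Roswell", "Tampa"), ("Tampa", "Atlanta")]

directed = adjacency(edges, directed=True)
undirected = adjacency(edges, directed=False)

print(directed["Tampa"])
print(undirected["Tampa"])
```

In the directed version Tampa only leads to Atlanta, while in the undirected version it is linked to both Roswell and Atlanta.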

Configuring Layout

The final tab of the configuration menu is for the layout of the network graph. Note that the options on this tab do not impact the quantitative or analytic results of the network analysis in any way. The layout tab allows you to alter the way in which nodes are laid out on the graph.

By default, the Specify Layout check box is unchecked. When left in this condition, Alteryx will place the nodes according to some algorithm. However, I have encountered strange behavior with this option unchecked, such as nodes jumping all around the screen like a madhouse.

If you choose to specify your layout, these are the options available. Depending on your data set, some may provide a better glimpse of your network than others. Definitely play around with the options to find what works best for you.

Network Analysis Outputs

With our data structured properly and the Network Analysis tool configured, we can go ahead and run the workflow. The tool has two outputs, D and I.

I Output

The I output is one of the most interactive and exciting outputs from any tool in Alteryx. It shows a dynamic network graph, along with summary statistics and a histogram of the measures of centrality (described in the D Output section below).

Each airport is represented as a node (or point) within the network graph. Nodes are connected via lines (edges) which represent routes between origin and destination airports.

If you click on a node, that node’s betweenness and degree measures appear. In addition, the nodes that it is directly connected to remain highlighted, while all other nodes are greyed out. In the image above, we can see that Roswell International Air Center is connected directly to Tampa, McCarran, Hartsfield-Jackson, Burlington, Detroit, and Albany.

We must observe caution when using network graphs. These are not geographic maps in any way. According to this graph, it looks like the fastest way from Roswell (which is in New Mexico) to Albuquerque (also in New Mexico) is via Tampa (in Florida). Looking at the map below, that is obviously not the optimal route.

If you keep in mind what the network graph is actually showing, namely the relationship between nodes according to some measure (passengers, cost, distance, etc.), you won’t fall into traps like this. The placement of nodes within a network graph is largely arbitrary, and often comes down to aesthetic or practical considerations rather than geographic or analytic ones.

D Output

The D output is a standard data set that contains one record per node. Since our N input contained 15 records, the D output will also have 15 records. Each column from the N input is listed in the D output, along with fields named betweenness, degree, closeness, pagerank, and evcent.

What on earth are these? They are known as centrality measures, and help us understand our network better. In general, these measures of centrality tell us how “important” a node is in the context of the entire network. In this case, we are relating airports based on the number of passengers that fly between them. So we can assume that airports with greater numbers of passengers will be given greater “importance” based on these measures of centrality.

Betweenness

Betweenness is a measure that looks at the length of each path between any two nodes. Given any two nodes, there must exist at least one shortest path that may or may not pass through other nodes. Betweenness measures how frequently a node lies on the shortest path between other nodes. A larger betweenness value indicates that a node acts as a bridge between other, smaller nodes.

The node with the largest betweenness value in our example is Atlanta’s Hartsfield-Jackson International Airport. If you’re not familiar, this is one of the busiest airports on earth. It is the main hub for Delta Air Lines and serves as a layover destination for many smaller and regional airports.

Two airports, Canyonlands Field in Utah and Hancock County-Bar Harbor in Maine, have betweenness scores of 0. This indicates that no shortest paths pass through these nodes. They are “end of the road” locations.
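For the curious, betweenness can be sketched in plain Python. The function below follows Brandes' algorithm for an unweighted, undirected graph. That is an assumption made for illustration, since this post does not establish how the Alteryx tool computes or scales the value. The four-airport network is toy data.

```python
from collections import deque


def betweenness(neighbors):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {node: 0.0 for node in neighbors}
    for source in neighbors:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in neighbors}
        paths = {node: 0 for node in neighbors}
        distance = {node: -1 for node in neighbors}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in neighbors[node]:
                if distance[nxt] < 0:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[node] + 1:
                    paths[nxt] += paths[node]
                    predecessors[nxt].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in neighbors}
        for node in reversed(order):
            for prev in predecessors[node]:
                dependency[prev] += paths[prev] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each unordered pair of endpoints was counted once from each end.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical toy network: Bar Harbor connects only to Atlanta.
network = {
    "Atlanta": ["Bar Harbor", "Tampa", "Roswell"],
    "Bar Harbor": ["Atlanta"],
    "Tampa": ["Atlanta", "Roswell"],
    "Roswell": ["Atlanta", "Tampa"],
}

scores = betweenness(network)
print(scores)
```

In this toy network, the only shortest paths that pass through an intermediate airport are Bar Harbor to Tampa and Bar Harbor to Roswell, both via Atlanta, so Atlanta scores 2 and every other airport scores 0.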

Degree

Degree is a measure based on the number of links that a node possesses. It is the simplest measure of centrality, as we literally count the number of links to calculate degree. Degree is valuable as it shows us nodes that are “highly connected.”

In our example, Hancock County-Bar Harbor has a degree value of 7. To see where that came from, we can search through the origins and destinations from the E input:

Hancock County-Bar Harbor has 7 routes total: 4 as the origin and 3 as the destination. This is its degree.
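The same count is easy to reproduce in Python: tally how often each airport appears as an origin and as a destination, then add the two. The edge list below is toy data, not the routes from our data set.

```python
from collections import Counter

# Hypothetical routes as (origin, destination) pairs.
edges = [
    ("Bar Harbor", "Atlanta"),
    ("Bar Harbor", "Tampa"),
    ("Atlanta", "Bar Harbor"),
    ("Tampa", "Atlanta"),
]

# Count appearances on each side of the edge list.
as_origin = Counter(origin for origin, _ in edges)
as_destination = Counter(destination for _, destination in edges)

# Degree is the total number of edge endpoints at each node.
degree = as_origin + as_destination

print(degree["Bar Harbor"])  # 2 as origin + 1 as destination
```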

Closeness

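Mathematically, closeness is based on the average length of the shortest paths between a node and all other nodes in a graph. The score is the inverse of that average, so a node that can reach every other node in fewer steps gets a greater closeness value. In other words, nodes with greater closeness values are more closely related to all other nodes.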

Nodes with greater closeness scores are best positioned to influence all other nodes in the graph. In our example, Atlanta’s and Las Vegas’ airports have the highest closeness scores, while tiny Hancock County-Bar Harbor trails all others.
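Closeness is commonly computed as the reciprocal of a node's average shortest-path distance to every other node, so that shorter distances give larger scores. The Python sketch below uses that normalized form on a toy network. Whether Alteryx reports the normalized or an unnormalized variant is not something this post establishes, so treat the scaling as an assumption.

```python
from collections import deque


def closeness(neighbors):
    """(n - 1) / sum of shortest-path distances, assuming a connected graph."""
    scores = {}
    for source in neighbors:
        # Breadth-first search gives hop counts in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total = sum(distance.values())
        scores[source] = (len(distance) - 1) / total if total else 0.0
    return scores


# Hypothetical toy network: Bar Harbor connects only to Atlanta.
network = {
    "Atlanta": ["Bar Harbor", "Tampa", "Roswell"],
    "Bar Harbor": ["Atlanta"],
    "Tampa": ["Atlanta", "Roswell"],
    "Roswell": ["Atlanta", "Tampa"],
}

scores = closeness(network)
print(scores)
```

Atlanta reaches every other airport in one hop and gets the maximum score of 1.0, while Bar Harbor needs two hops to reach Tampa and Roswell and ends up with 3 / 5 = 0.6.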

Evcent (Eigenvector Centrality)

Eigenvector centrality measures a node’s influence on the network based not only on how many connections it has, but also on how important the nodes at the other end of those connections are. It is based on the assumption that connections with “more important” nodes contribute more to a node’s value than connections with lesser nodes.

In our example, Atlanta leads the way again in eigenvector centrality. This tells us that, considering the entire network, it is the most important or influential node. This makes sense for such a busy and well-connected airport.
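If you do want to try calculating it on your own, one common approach is power iteration: repeatedly replace each node's score with the sum of its neighbors' scores, then rescale. The sketch below rescales so that the top node scores 1. The toy network, the fixed iteration count, and the scaling are all assumptions for illustration rather than a description of the tool's internals.

```python
def eigenvector_centrality(neighbors, iterations=100):
    """Power iteration on an undirected graph, scaled so the maximum is 1."""
    score = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # A node's new score is the sum of its neighbors' current scores.
        updated = {
            node: sum(score[other] for other in neighbors[node])
            for node in neighbors
        }
        largest = max(updated.values())
        score = {node: value / largest for node, value in updated.items()}
    return score


# Hypothetical toy network: Bar Harbor connects only to Atlanta.
network = {
    "Atlanta": ["Bar Harbor", "Tampa", "Roswell"],
    "Bar Harbor": ["Atlanta"],
    "Tampa": ["Atlanta", "Roswell"],
    "Roswell": ["Atlanta", "Tampa"],
}

scores = eigenvector_centrality(network)
for airport, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{airport}: {value:.3f}")
```

Bar Harbor has only one connection, but because that connection is to the best-connected airport it still receives a meaningful score, which is the "important neighbors count for more" idea in action.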

PageRank

The final measure of centrality here is called PageRank. Named after Google co-founder Larry Page, PageRank was developed by Page and Sergey Brin and was the first search algorithm used by Google in the late 1990s.

PageRank returns a probability value, between 0 and 1, that represents the likelihood of being directed to a desired output node when starting on a random node within the network. In our example, Atlanta has a PageRank of about 0.90. This means that there is about a 90% chance of being directed towards Atlanta when starting on a random node in the network.
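A minimal Python sketch of the standard PageRank iteration is below. It models a traveler who, at each step, either follows a random edge from the current node or, with a small probability, jumps to a random node. In this standard formulation the scores of all nodes sum to 1. The damping factor of 0.85 is the conventional default, and both it and the toy network are assumptions here. How Alteryx parameterizes its own calculation is not covered in this post.

```python
def pagerank(neighbors, damping=0.85, iterations=100):
    """Basic PageRank; assumes every node has at least one neighbor."""
    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        # Every node receives a small share from random jumps.
        updated = {node: (1.0 - damping) / n for node in neighbors}
        # The rest of each node's rank is split evenly among its neighbors.
        for node, targets in neighbors.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated
    return rank


# Hypothetical toy network: Bar Harbor connects only to Atlanta.
network = {
    "Atlanta": ["Bar Harbor", "Tampa", "Roswell"],
    "Bar Harbor": ["Atlanta"],
    "Tampa": ["Atlanta", "Roswell"],
    "Roswell": ["Atlanta", "Tampa"],
}

ranks = pagerank(network)
for airport, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{airport}: {value:.3f}")
```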

Final Thoughts

Obviously, the Network Analysis tool isn’t something that you’ll use in every workflow. Many Alteryx users will likely never use it. But when you need it, you need it.

Whether you need to produce a neat-looking network graph or calculate measures of centrality (have fun calculating Eigenvector Centrality on your own), the tool is invaluable. The real power of network analysis is that it lets analysts abstract and visualize the relationships in their data on their own terms. Stripped of things like geographic context, we can peel away the layers to reveal the crux of the problem. In our simple example here, we quantified the importance of Atlanta and relative obscurity of Canyonlands Field within the filtered data set.

Thanks for reading! If you have any questions or want to reach out, you can contact me at john.emery@tessellationconsulting.com or follow me on Twitter @jemery_dataviz.
