Graphs
Representing data in a graph!
Representing data in a graph!
The next data structure we will introduce is a graph.
Graphs are multidimensional data structures that can represent many different types of data. We can use graphs to represent electronic circuits and wiring, transportation routes, and networks such as the Internet or social groups.
A popular and fun use of graphs is to build connections between people such as Facebook friends or even connections between performers. One example is the parlor game Six Degrees of Kevin Bacon. Players attempt to connect Kevin Bacon to other performers through movie roles in six people or less.
For example, Laurence Fishburne and Kevin Bacon are directly connected via ‘Mystic River’. Keanu Reeves and Kevin Bacon have never performed in the same film, but Keanu Reeves and Laurence Fishburne are connected via ‘The Matrix’. Thus, Keanu and Kevin are connected via Laurence.
In this module we will discuss graphs in more detail and build our own implementation of graphs.
We will discuss some of the basic terminology associated with graphs. Some of this vocabulary should feel familiar from the trees section; trees are a specific type of graph!
Nodes
: Node is the general term for a structure which contains an item.
Size
: The size of a graph is the number of nodes.Capacity
: The capacity of a graph is the maximum number of nodes.Nodes can be, but are not limited to the following examples: - physical locations (IE Manhattan, Topeka, Salina), - computer components (IE CPU, GPU, RAM), or - people (IE Kevin Bacon, Laurence Fishburne, Emma Stone)
Edges
: Edges are the connection between two nodes. Depending on the data, edges can represent physical distance, films, cost, and much more.
Adjacent
: Node A and node B are said to be adjacent if there is an edge from node A to node B.Neighbors
: The neighbors of a node are nodes which are adjacent to the node.Edges can be, but are not limited to: - physical distances, like the distance between cities or wiring between computer components, - cost, like bus fares, and - films, like the Six Degrees of Kevin Bacon example
Cycles
: A cycle is a path where the first and last node are the only repeated nodes. More explicitly, this means that we start at node A and are able to end up back at node A.For example, we can translate the Amtrak Train Station Connections into a graph where the edges represent direct train station connections.
Within this context, we could say that Little Rock and Fort Worth are adjacent
. The neighbors
of San Antonio are Fort Worth, Los Angeles, and New Orleans. The Amtrak Train Graph has multiple cycles
. One of these is Kansas City -> St. Louis -> Chicago -> Kansas City
.
Generated using the Amtrak system map from 2018. This graph does not include all stations or connections. ↩︎
While trees are a type of graph, graphs can have more functionality than trees. For example, recall that to be a single tree, there could be no disconnected pieces.
Connectedness
: Graphs do not require being fully connected. There can be disconnected portions within a graph. For example, the following graph shows all of the students in a sophomore biology class. There is an edge between two student nodes if they are Facebook friends.Graphs can also have loops. In a tree, this would be like a node being its own parent, which is not an allowable condition.
Loops
: Loops are edges which connect a node to itself. These can be useful in depicting graphs that show control flow in programming. In this example, node A is connected to node B and node A is connected to itself.A weighted graph is a graph which has weights associated with the edges. These weights quantify the relationships, so they can represent dollars, minutes, miles, and many other factors which our data may depend upon.
Weights are not limited to physical quantities; they can also be our own defined similarity in text, product types, and anything for which we can create a similarity measure for. Let’s look at concrete weights using the Amtrak example.
We are able to expand the Amtrak graph from the previous page to include approximate distances in miles between cities.
Now that we have weights defined on our edges, we can compare paths in a different way. When we discussed trees, we just looked at the number of edges it took to get to another node. We can also determine the shortest path between nodes with respect to distance. If we wanted to travel from San Antonio to Kansas City, we may be tempted to travel San Antonio -> Los Angeles -> Albuquerque -> Kansas City
as it has the fewest stops. This trip would take us 2,531 miles (1201+640+690). With the edge weights in mind, a much better route would be San Antonio -> Fort Worth -> Little Rock -> St. Louis -> Kansas City
with a total of 1,089 miles(238+320+293+238) traveled.
Generated using the Amtrak system map from 2018. This graph does not include all stations or connections. Distance was calculated approximately ‘as the crow flies’. ↩︎
A directed graph is a graph that has a direction associated with each edge. For example, trees are a directed graph. The edge orientation will imply a fixed direction that we can move about nodes. As with trees, the flat end of the arrow will represent the origin and the arrowhead will represent the destination. If an edge has no arrowheads, then it is assumed that we can traverse both directions.
In the following graph, we have an example distribution network where each store ends up with 5 units in its possession. For example, nine units go from the distribution center to Store A. The distribution center will never receive product from stores as it has no incoming edges.
Unlike trees, directed graphs can have nodes with multiple incoming edges. We can see an example of this at Store B. The distribution center and Store A both send units to Store B.
When discussing directed graphs, we must also talk about undirected graphs. An undirected graph is a graph in which none of the edges have an orientation. If there is at least one directed edge, then it is considered a directed graph.
Undirected Edge
: An undirected edge is an edge which has no defined orientation (IE no arrowheads) which implies that we can traverse in either direction. If node A and node B are connected via an undirected edge then we say node A is adjacent to node B and node B is adjacent to node A.Graph types and appearances can vary wildly. We are not limited to just weighted/unweighted or directed/undirected. We can also have combinations of weighted and directed.
In the following graph, we have an example of a weighted and directed map. This map represents a zoo train where each node represents a station and each edge is a part of the track. Zoo guests can get on and off wherever they desire.
This graph is weighted as guests must pay the associated fee for each part of the track. Our example train also has a one way direction in most cases. The exception to this is the entrance/exit to the aquarium, this part of the track can go either direction.
In this graph, we also have a couple of loops. This would allow for zoo-guests to ride the train around an expansive exhibit such as the elephants or giraffes.
One possible way to tour the zoo for a guest starting at the entrance could be: aquarium, primates, big cats, antelope, giraffes, loop around the giraffes, elephants, aquarium, then exit. Their total payment for just the train would be $14.
The first way that we can represent graphs is as matrices. In a matrix representation of a graph, we will have an array with all the nodes and a matrix to depict the edges. The matrix that depicts the edges is called the adjacency matrix
.
To build the adjacency matrix
, we go through the nodes and edges. If there is an edge with weight w
going from i
to j
, then we put w
in the (i,j)
spot in our adjacency matrix
. If there is no edge from i
to j
then we put infinity in the spot (i,j)
. Let’s look at some examples.
For an unweighted graph, we treat the weights as 1 for all edges in our adjacency matrix.
For an undirected edge between nodes i
and j
, we put an edge from i
to j
and an edge from j
to i
.
Suppose that we have the following graph:
Across the top of the following, we have the array of nodes. This give us the index at which each node is located. For example, node A is in spot 1
, node B is in spot 2
, node C is in spot 3
and so on.
Below that we have the adjacency matrix. For the directed edge with weight 2 that goes from node B to node C, we have the value 2 at (2,3) in the adjacency matrix as B has index 2 and C has index 3. For the directed edge with weight 4 that goes from node A to node F, we have the value 4 at (1,6) in the adjacency matrix as A has index 1 and F has index 6.
Since there is no edge that connects from node A to node B, we have infinity in (1,2)
.
Now suppose we have this graph. We now have some loops present.
For example, we have a loop on node E with weight 12 so we will put the value 12 in spot (5,5) as E has index 5.
Now suppose we have this graph which is undirected and unweighted.
Since this graph is unweighted, we will treat all edges as though they have weight equal to one. Since this graph is undirected, each edge will essentially show up twice.
For example, for the edge that connects nodes A and B, we will have an entry in our adjacency matrix at (1,2)
and (2,1)
.
In the previous module, we introduced graphs and a matrix-based implementation. For this module, we will continue working with graphs and change our implementation to lists.
When using graphs, a lot of situational variation can occur. Some graphs can have a few nodes with many edges, many nodes with few edges, and so on. When we use the matrix implementation, we initialize a matrix with the number of columns and rows equal to the number of nodes. For example, if we have a graph with 20 nodes, our adjacency matrix would have 20 rows and 20 columns, resulting in 400 potential entries.
First let’s look at the implementation and then we will discuss when one may be better than the other.
In the matrix representation, we had an array of the node items. In the list representation, we will have an array of node objects. Each node object will keep track of the node item, the node index, and the outgoing edges.
The item
can be any object and the index
will be a value within our capacity. The edges
will be a list of pairs where the first entry is the index of the target node and the second entry is the weight of the edge.
Since each node will track its neighbors, it is important that we are consistent in our indexing of nodes. If our nodes were to get out of order, then our edges would as well.
Consider the following graph which we saw in the matrix representation.
The following list of nodes depicts the graph above. We can see that each node object has the item and index.
If we look closer at the edges
of the node with item A
and index 1
, we see that the set of edges is equal to [(4, 3.0), (6, 4.0)]
. This corresponds to the fact that there are two edges with the source as node 1
. The first ordered pair, (4, 3.0)
, means that there is an edge with source node 1
(A
) and target node 4
(D
) that has weight 3
. We can confirm that in our graph we do have an edge from A
to D
with weight 3
.
The following includes a couple of examples of loops within our graph.
We have loops on nodes D
, E
, and F
in our graph. Recall that a loop is an edge where the source and target are the same. For example, we have an edge with source D
and target D
that has weight 12
. We see this in our list representation in the node object with item D
and index 4
, where we have the entry (4,12.0)
in the edges.
When considering which implementation to use, we need to consider the connectivity in our graph. The terms that we use to describe the connectedness are dense and sparse.
Dense Graph
: A dense graph is a graph in which there is a large number of edges. Typically in a dense graph, the number of edges is close to the maximum number of edges.Sparse Graph
: A sparse graph is a graph in which there is a small number of edges. In this case the number of edges is considerably less than the maximum number of edges.Intuitively, we can think of dense and sparse in terms of populations. For example, if 100 people lived in a city block, we can consider that to be densely populated. If 100 people lived in 100 square miles we can consider that to be sparsely populated.
Let’s look at some motivating examples to get an idea of how the different structures will handle these cases.
The following is a dense graph. In this case, our graph does have the maximum number of edges. This means that every node is connected to every other node including itself.
The following is a sparse graph.
For dense graphs, the matrix representation will have better qualities as we are already setting aside space for the maximum number of edges. Sparse graphs are better represented in the list representation.
When we initialize the matrix implementation, we initialize the nodes
attribute to have dimension equal to the capacity of the graph. The edges
attribute is initialized to be a square matrix with dimension equal to capacity by capacity. Thus, if we have a sparse matrix, we are representing a lot of non-existent edges.
When we initialize the list implementation, we just have the nodes
attribute which has dimension equal to the capacity and each node tracks its own edges. If we have a dense matrix and we are searching for an edge, we must loop through each edge from the target node to see if the edge exists. In the matrix representation, we can access that edge directly.
If the proportion of edges to the maximum number of edges is greater than 1/64, then the matrix representation is better in terms of space.
nodes
: This will keep track of the nodes which are in our graph as well as the node values. The nodes can have any type of value such as numbers, characters, and even other data structures.size
: This will keep track of the number of nodes that are active in our graph.Upon initialization, we will initialize nodes
to be an empty array with dimension capacity
and size
to be zero as we start with no actual nodes.
get nodes
: returns a list of the nodes with their respective indexes. This will be the same logic from our matrix graph.function GETNODES()
LIST = []
for NODE in NODES
if NODE has a VALUE
append (VALUE, INDEX) to LIST
return LIST
get edges
: returns a list of the edges in the format (source, target, weight).function GETEDGES()
LIST = []
for NODE in NODES
if NODE is not empty
for EDGE in NODE EDGES
TAR = first entry of EDGE
WEIGHT = second entry of EDGE
append (NODE,TAR,WEIGHT) to LIST
return LIST
get node
: returns the node with the given index. If the index is within the possible range, then we return the value of that node. This will be the same logic from our matrix graph.
find node
: returns the index of the given node. We iterate through our nodes and if we find that value, then we return the index. Otherwise, return -1
. This will be the same logic from our matrix graph.
get edge
: returns the weight of the edge between the given indexes of the source node and target node. If one or both of the indexes are out of range, then we should return infinity. From the source node object, we will call the graph node get edge function on the target index.
function GETEDGE(SRC,TAR)
if SRC and TAR are between 0 and capacity
SRCNODE = the node at index SRC of the NODES attribute
WEIGHT = call the graph node GETEDGE from SRCNODE on TAR
return WEIGHT
else
return infinity
get capacity
: returns the maximum number of nodes we are allowed to have. Upon initialization, we will have a fixed number of possible nodes in our node array. We can simply return the size of this array. This will be the same logic from our matrix graph.
get size
: returns the size attribute. This will be the same logic from our matrix graph.
get number of edges
: returns the number of edges currently in the graph.
function NUMBEROFEDGES()
COUNT = 0
for NODE in NODES
if NODE is not empty
for EDGE in NODE EDGES
increment COUNT by one
return COUNT
get neighbors
: returns the neighbors of the given node. We will access our row adjacency matrix that corresponds to the node and return the indexes and values of those entries which are not infinity.function GETNEIGHBORS(IDX)
SRCNODE = the node at index IDX of the NODES attribute
if SRCNODE is not empty
return SRCNODE's edges
else
return nothing
In this module, we have introduced the graph data structure. We also looked at how we would implement a graph using a matrix representation. We introduced the following new concepts in this module:
Directed Graphs
: A directed graph is a graph that has a direction associated with each edge. The flat end of the arrow will represent the origin and the arrowhead will represent the destination. If an edge has no arrowheads, then it is assumed that we can traverse both directions.
Edges
: Edges are the connection between two nodes. Depending on the data, edges can represent physical distance, films, cost, and much more.
Adjacent
: Node A and node B are said to be adjacent if there is an edge from node A to node B.Neighbors
: The neighbors of a node are nodes which are adjacent to the node.Undirected Edge
: An undirected edge is an edge which has no defined orientation (IE no arrowheads). If node A and node B are connected via an undirected edge then we say node A is adjacent to node B and node B is adjacent to node A.Loops
: Loops are edges which connect a node to itself.
Nodes
: Node is the general term for a structure which contains an item.
Size
: The size of a graph is the number of nodes.Capacity
: The capacity of a graph is the maximum number of nodes.Weighted Graphs
: A weighted graph is a graph which has weights associated with the edges. These weights will quantify the relationships so they can represent dollars, minutes, miles, and many other factors which our data may depend on.
Graphs can be represented using a matrix of edges between nodes.
In this module, we also introduced a new way to store the graph data structure. Thus, we now have two ways to work with graphs, in lists and in matrices:
While these methods show the same information, there are cases when one way may be more desirable than the other.
We discussed how a sparse graph is better suited for a list representation and a dense graph is better suited for a matrix representation. We also touched on how working with the edges in a list representation can add complexity to our edge functions. If we are needing to access edge weights or update edges frequently, a matrix representation would be a good choice – especially if we have a lot of nodes.