Category: Network Graphs

Updating Player Networks – Part 2

In our previous post, we looked at how to acquire and load our baseball player data into Gephi. In this second installment, the focus will be on creating a player network graph in Gephi, and customizing many settings to deliver a network graph we can export to the web. Player networks are used to detail the connections between all players who are connected to one another in some fashion. In this instance, it is based on players having played for the same team in one or more common seasons. So let’s begin with the process of creating the graph using our raw data from the first installment.

Importing .csv data into Gephi is quite simple – we create individual node and edge files (as we showed in the previous post), and use the Gephi import functions to pull the data in. I always start with the node file, since it will typically have additional information not included in the edges file. After importing the node data, I then import the edge data, which gives us the information to form our initial graph. If we were to start with the edge file, Gephi will create our node data automatically, and we will not have the detail needed for our graph. This approach may work for simple graphs, but not for our current case.

Once both data files have been imported, we can begin thinking about what we want form our graph. Here are several questions we might pose:

  • How will we use color?
  • What sort of layout will be best?
  • Which measures should we calculate?
  • How should we depict node sizes?

In many cases, the answers to these questions come about through trial and error. We may have some ideas going into the process, but invariably, there will be modifications along the way. So be patient, and be willing to experiment as you create network graphs. The graph you will see in this post went through many of these modifications, which I won’t take the time to detail. Instead, this post will detail my final choices, along with some explanations for why these choices were made. So let’s take a walk through the various facets of the visualization.

Layout

While a network will retain the same underlying structure from a statistical point of view (degrees, centrality, eccentricity, etc.) regardless of our layout choices, it is still important to select a layout that will visually represent the underlying patterns in the network. Otherwise, we could just as well deliver a spreadsheet with all of the network statistics. So layout selection is critical, and often involves an iterative process.

For the baseball network graphs I built in 2014, I eventually settled on the ARF layout algorithm, which ran quickly and created an attractive circular network graph display using the player connection data. Alas, there is no ARF algorithm available for Gephi 0.9.2, so I required a different approach for the updates. Ultimately, this led to a 2-step approach using a pair of layout algorithms – OpenOrd followed by Force Atlas 2. OpenOrd is especially effective at creating a quick layout from large datasets, although with far less precision than some other force-directed approaches. Still, it is a great tool for creating a general understanding of the structure of a network very quickly. Force Atlas 2, is the near opposite of OpenOrd – a very precise approach that can be tweaked easily using the various settings in Gephi. It is ideal for putting the finishing touches on what OpenOrd started.

Here are the settings I eventually settled on for Force Atlas 2, after much trial and error:

Force_Atlas_2

Some of the more important things to note here are the Scaling and Gravity settings. I reduced the scaling to 0.5 so the network would display appropriately in a single window without the need for scrolling. The Gravity setting was increased to 2.5 to force nodes slightly toward the center of the display. The LinLog mode and Prevent Overlap options are also selected in order to make this particular graph more visually effective. For other graphs, I have used the Dissuade Hubs option, forcing large nodes to the perimeter of the graph; in this case, that was not an ideal choice.

Color

The use of color is also important within a network graph display. Color can be used to highlight nuances in the data that distinguish one or more nodes relative to another group of nodes. Often we use color to visually represent clusters within the graph, as grouped using the modularity classes statistic or some similar input. In the case of this series of graphs (ultimately one graph per team), I made a decision to use the official team colors to differentiate each graph. Thus my initial graph for the Boston Red Sox would be based on the two primary hex colors for the current team (these colors do change over time for many teams).

Here are the Red Sox primary colors:

c8102e_Color_Hex_-_2018-06-10_09.15.48 0c2340_Color_Hex_-_2018-06-10_09.16.33

After capturing current team colors in a spreadsheet for easy reference, I used the color-hex.com site to select complementary colors for the Red Sox graph. Using complementary colors allows me to differentiate clusters in the graph while remaining true to the original concept of employing team colors for each graph. So instead of a wide range of colors one would normally see in a Gephi output, I was able to input the complementary colors for each group. Thus, one team color could be used for the graph background, while the other color (and it’s complements) could be used for the graph structure (nodes & edges). We’ll share the effect later in this post.

Statistics

Graph statistics are critical to the full understanding of the structure of a network. While we can view a graph and begin to understanding the general structure of a network, the various statistics will aid and reinforce our initial visual comprehension. Gephi provides a nice range of statistical measures to choose from:

  • Eccentricity (the number of steps needed to traverse the network)
  • Centrality – betweenness, eigenvector, closeness, harmonic closeness (various measures of importance of an individual node)
  • Clustering coefficient (to discern cliques in the network)
  • Number of triangles (a friends of friends measure)
  • Modularity Class (clusters)
  • Degrees (the number of connections)

Sizing

Node sizing is another key element of effective graph design. In this case, there were a few options I could pursue for node sizing – the number of seasons played (I used this in the 2014 graphs), one of the various centrality measures we calculated, or the number of degrees (connections) an individual player possesses. After computing each of these statistics, I eventually decided to use the number of degrees as a representation of influence in the graph. Visually, I want to show how many other players a single individual is related to, and using node size is an effective means of doing so.

Summary

Our final graph in Gephi is shown below; the eventual web-based version will differ slightly and include additional functionality, but that’s for another post.

red_sox_20180610

Next Post

My third and final post in this series will address exporting this graph to the web using the sigma.js plugin, and making some additional customization to the web version. Thanks for reading, and see you soon!

 

 

 

 

 

 

Updating Baseball Team Networks – Part 1

A few years back, I used Gephi and sigma.js to create a series of interactive baseball team networks, one for each current MLB franchise. These networks displayed all players through the 2013 season, going all the way back to 1901 for the original American and National League franchises. Now that we have data through the 2017 season, it’s time for an update, not only from a data perspective, but also stylistically. This post will walk through the process of creating one of these networks using Toad for MySQL, Gephi, and sigma.js to create web-based interactive network visualizations.

Here’s a typical network from the 2013 series; the full list of networks can be found here. We’ll use the existing networks as a baseline for the new networks, although a few modifications will be made.

baseball_team_network_2013
a 2013 baseball team network for the Boston Red Sox

Source Data & MySQL Queries

Let’s start our discussion with the source data. Season-level baseball data is available through the seanlahman.com website, in the form of .csv files or Microsoft Access database tables. I use the .csv format, as it can be easily added to existing MySQL databases on the visual-baseball.com server. MySQL also makes it simple to add derived fields through some simple coding. These fields can be utilized later for a variety of activities.

For the purpose of our network graphs, there are a handful of critical fields we want to use. These include the following:

  • playerID, a unique identifier for every player who ever donned a major league uniform
  • player name, which can be used to provide a meaningful reference based on the playerID field
  • yearID, which refers to the season (or seasons) a player suited up for a specific franchise
  • franchID, a unique identifier for each MLB franchise

We also need to do a little manipulation of the source data in our code to deliver our results in the proper form for use in Gephi. This means we need to create two input files – one for nodes, and a second for edges. The nodes will contain information about each player, the number of seasons played for the franchise and the first and last seasons, which may differ from the number of seasons, as players frequently leave a franchise only to return later in their career. Here’s our node code:

SELECT Id, Label, MAX(Size) as Size
FROM
(SELECT bp.playerID AS Id, CONCAT(bp.name, ” “, MIN(bp.yearID), “-“, MAX(bp.yearID)) AS Label, COUNT(bp.yearID) AS Size
FROM BattingPlus bp
WHERE bp.franchID = ‘BOS’ and bp.yearID >= 1901
GROUP BY bp.name

UNION ALL

SELECT pp.playerID AS Id, CONCAT(pp.name, ” “, MIN(pp.yearID), “-“, MAX(pp.yearID)) AS Label, COUNT(pp.yearID) AS Size
FROM BattingPlus pp
WHERE pp.franchID = ‘BOS’ and pp.yearID >= 1901
GROUP BY pp.name)  a
GROUP BY Id
ORDER BY Id;

Here’s the simple interpretation – since we are attempting to display all players for a given franchise, we are executing a UNION ALL statement to combine batters and pitchers into a single result file. We have used the playerID field to create the required Id value for Gephi, while also creating a Label field by combining the player’s name with their first and last years playing for this franchise. Finally, we have created a Size field based on the number of seasons played for the franchise. We can then choose to use this in Gephi to size each node, if we so choose.

We also need to create the edge file for Gephi. In this case, we want to understand how many seasons two players were on the same team. This code is a bit trickier, since we want to show only one connection between two players, since this will be an undirected graph. More on that distinction later. Here’s our edge code:

SELECT b.playerID AS Source, m.playerID  AS Target,  ‘Undirected’ as Type,  ‘ ‘ as Id, ‘ ‘ as Label, count(*) as weight
FROM
(SELECT a.playerID, CONCAT(m.nameFirst, ” “, m.nameLast) name, a.yearID, a.franchID

FROM Appearances a
INNER JOIN Master m
ON a.playerID = m.playerID

WHERE a.franchID = ‘BOS’ and a.yearID >= 1901) b

INNER JOIN Appearances a
ON b.yearID = a.yearID and b.franchID = a.franchID and b.playerID <> a.playerID and a.playerID > b.playerID
INNER JOIN Master m
ON a.playerID = m.playerID

GROUP BY b.playerID, a.playerID
ORDER BY b.playerID

Here we use the Master table to provide player name information, and we also gather the ID information to match the node values. The critical piece in this code is in our join criteria:

INNER JOIN Appearances a
ON b.yearID = a.yearID and b.franchID = a.franchID and b.playerID <> a.playerID and a.playerID > b.playerID

Here we are matching players based on the same season and the same franchise. We then specify that we do not want to connect any player to himself, and that we want only values where the playerID value from our main query is greater than the playerID value from the sub-query. This gives us a single connection between two players, which is what we need for an undirected graph. We then define a Source node (required by Gephi) and a Target node (also required), as well as specifying ‘Undirected’ as the graph type. We leave the ID and Label values empty, and then summarize the number of seasons played together as an edge weight. This value can be used in Gephi to show the strength of a connection between two nodes (e.g.- did they spend one season together, or 10 seasons together?).

After exporting each of these files to a .csv format, we have our source data for Gephi. In Part 2 our focus will shift to creating the network in Gephi.

Major League Baseball Trade Networks, Part 2

Welcome to Part 2 in our miniseries on building baseball (MLB) trade networks with Gephi. In the first post, the focus was on procuring and preparing the data using MySQL. The goal was to create nodes and edges that could be easily imported to Gephi. Gephi does allow for some data manipulation post-import, but I’ve learned from experience to do the main parts of the job with either SQL code or within spreadsheet software like Excel or Calc.

With our data readied for import, we’ll now move on to the more fun parts of the process, where we get to visualize the data and see any underlying patterns. Gephi is an ideal tool for this, as it allows us to try out many different algorithms, especially in version 0.8.2. The newer 0.9 versions are faster, but have not fully caught up on the plugin side at this writing, so options are a bit more limited. One other caveat – I frequently run into Java issues when using Gephi, so save your work often and be prepared to shut down and restart Gephi periodically.

We’ll kick off this part of the process by importing the data, nodes first, followed by edges. The reason I prefer this order is that nodes will be automatically created if we start with the edge file import, and they won’t contain any extra fields you may have added using your database or spreadsheet processes.

Here’s the node import window, showing the appropriate file input:

Gephi node import window
Gephi node import window

Once the node import process is complete, we turn to the edges file, and follow a similar sequence of steps. Here’s our starting point:

Gephi edge import window
Gephi edge import window

After the data has been imported, it’s time to move to the Overview tab, where we’ll see a dense mess of nodes and edges, especially if we have a fair sized dataset. Something like this:

Impossibly dense hairball network
Impossibly dense hairball network

Gephi offers a variety of interesting algorithms, each more or less appropriate based on the underlying dataset. In our case, the dataset is of a moderate size, with more than 8,000 nodes and nearly 62,000 edges. This immediately rules out the use of simple layouts such as the circular algorithms, as it would prove immensely challenging to display, even when we take it to an interactive output. At the other end of the sophistication level lie the force-directed layouts, which apply a significant dose of science and math within their respective algorithms. In Gephi, the Force Atlas 2 is quite popular, but it tends to run very slowly unless coupled with enormous levels of RAM. So where to go with our choice for this data?

I elected to take a two step approach, using the extremely fast (if less precise) OpenOrd algorithm for the original data. This provides a nice view of the network within a few minutes, making it a good starting point for our next steps.

Trade network with OpenOrd layout
Trade network with OpenOrd layout

The goal of this exercise was to create team level graphs, which will each have a small subset of the entire dataset. One easy way to achieve this is to use the Ego Network filter to select a single team and its connections. Setting the ego network to a depth of 1 limits the display to only first degree connections; in this case players traded to or from our selected team.

Ego network with depth = 1
Ego network with depth = 1

Once this step has been taken, we can then refine the display by applying another algorithm; in this case I have chosen the Yifan Hu option, and adjusted the settings until they created an aesthetically pleasing graph. The Yifan Hu adds further precision within each of the team graphs, and provides them with a common look & feel inasmuch as their respective data allows.

We have now completed our basic graph creation in Gephi, and can output our results to a variety of output formats. Our choice here is to create a GEXF file, which we can then plug in to an existing template. We do have another step with respect to the GEXF data. In order to relate the graphs back to their respective teams, I chose to apply official team colors to elements in the graph. Specifically, each node should reflect the individual team; we want the edges to remain the same across all graphs so that users have a common understanding for the types of connections between players and teams. So to update the node colors, simply use a code editor that can perform batch updates. I typically work with Brackets for this task, but choose your tool of choice. Here’s a view of the GEXF output prior to applying color changes:

GEXF nodes with original colors
GEXF nodes with original colors

Now here’s the updated version that reflects the Tigers navy blue coloring:

GEXF nodes updated with team colors
GEXF nodes updated with team colors

Once this step has been completed, I can upload the files to the web server, where other minor changes can be made. These include any updates to the CSS styling, adjustments to the config.js file, and minor changes to the index.html file so the proper team information is displayed. The easiest way to do this is to create a common set of directories for the basic javascript and CSS files, leaving only the individual config, html, and gexf files in each team’s directory.

After a bit of massaging the config and CSS settings, the result is visually appealing as well as highly functional. Here’s a zoomed in look at the Toronto Blue Jays network on the web:

Blue Jays trade network close-up
Blue Jays trade network close-up

All of the networks can be found by clicking the link below, with new ones being added until all teams have been represented:

Trade Networks

If you want to see each network in its own window or tab, use the right-click options in Firefox or Chrome.

I hope this has been helpful, and please feel free to reach out via the LinkedIn or Facebook Gephi groups, or leave a comment on this site.

Thanks for reading!

A Network Graph of MLB All-Stars

With the Major League Baseball All-Star game this week, I was suddenly possessed by an urge to create a network graph of all the players who have been selected for the game from 1933-2015. The goal, as with most of my network graph efforts, was to see where the data took me, and what stories it might tell. Along for the ride was my trusty companion Gephi, version 0.9.1 in this instance. Data for this exercise comes, as it so often does, from the Lahman baseball database, which happens to have a nice table with all the necessary all-star information.

The challenge inherent in this data, as with many temporal datasets, is to create an interesting graph that isn’t entirely driven by the time element, in this case as baseball seasons. Some otherwise excellent layout algorithms tend to turn this sort of data into long, worm-like displays that are not visually appealing. I could just as easily use a timeline if that were the goal of the visualization. So in an effort to balance aesthetics with the underlying data, I finally settled on the radial axis layout in Gephi. With this layout, we have the ability to create multiple axes radiating from the center of the graph, grouped by something meaningful. In this case, that turns out to be the modularity class, a sort of clustering mechanism that groups nodes together based on common or similar characteristics.

After some trial and error, I wound up with 13 distinct classes, a nice manageable number for this type of display. The end result can be interpreted as some sort of exotic, colorful starfish, or perhaps as a multi-colored fireworks display. In any event, I believe it tells an interesting story in a visually appealing manner, and allows for understanding the common threads within each cluster of players. Here’s the complete graph layout:

allstar_graph_base

I’ll spend the rest of the post with some quick analyses of each group, and then point you to the entire graph in interactive form so you can discover your own patterns and learn more about the all-star connections of individual players. I’ll also provide a quick overview for how to read the sidebar output for the graph when you interact with the data.

Our 13 clusters (you can think of them as cohorts) tell us some interesting things about the history of all-star participants. Let’s walk through each of the 13 (numbered 0 through 12) to learn more. The clusters begin with 0 (in green) at the upper left and move counter-clockwise around the graph. Each is rank ordered from small to large radiating out from the center, so the member with the most years as an all-star will be at the tip of each group. That individual will serve as our focal point in each of the following screenshots, followed by a brief overview of other members of the cohort.

First up is our Cohort 0, headlined by Miguel Cabrera, with 10 selections through 2015. Obviously, this would appear to be a cohort of current or recent all-stars based on Cabrera’s appearance. We can easily navigate the graph to see if that’s the case. Who else is prominent in the group? Yadier Molina, Matt Holliday, and Robinson Cano, to name a few, all big name stars for most of their careers. How about at the low end of the spectrum, players with a single all-star selection? Here’s where we find the likes of Billy Butler, R.A. Dickey, and Melky Cabrera. All long-tenured, noteworthy players, but certainly not in the same category as the first group.

allstar_0

Cohort 1 takes us on some time travel, with Johnny Mize as the representative star, also with 10 all-star selections and Hall of Fame membership as well. Joining Mize in the group are Bobby Doerr, Vern Stephens, Joe Gordon, and Bob Feller, all Hall of Famers with the exception of Stephens. At the other end of the cohort, each with one appearance, are Oscar Grimes, Red Barrett, and Nick Etten, among others. Based on the career arcs of the stars in this group, we could characterize it as primarily a 1940s-based cohort, certainly with overlap into the surrounding decades.

allstar_1

Derek Jeter is our icon for Cohort 2, so it figures to be a group that immediately precedes the Cabrera-led Cohort 0. Perhaps the focus here will be on stars from the early 2000s, at the center point of Jeter’s long career. Mariano Rivera, Albert Pujols, and David Ortiz are among the top stars here, confirming the hypothesis that this group is largely post-2000 in nature. Among the lesser knowns with a single selection each are Gil Meche, Joe Crede, and Ryan Ludwick.

allstar_2

Cohort 3 is headed up by the legendary Stan Musial, whose career covered the entirety of the 40s and 50s. Given that Cohort 1 was largely concentrated on players from the 40s, we might anticipate more of a skew towards 1950 and beyond. We’ll see in a moment if that’s true. Next to Musial we have Ted Williams and Warren Spahn, two more whose careers spanned both decades, so perhaps we have players here with greater longevity compared to the Mize cohort. Let’s go a bit deeper, where we find Roy Campanella, Larry Doby, and Robin Roberts, all with career pinnacles primarily in the 50s. So while there will certainly be connections across the two groups, Cohort 3 does appear to span more of the 1950s compared to Cohort 1.

allstar_3

With Cohort 4 we see a very large group fronted by all-time hits leader Pete Rose. So we could be focused on the 1960s or 1970s here; Rod Carew, Reggie Jackson, and Mike Schmidt, Hall of Famers all, are included, so the 1970s would seem to be the dominant theme. Perhaps we shouldn’t be surprised at the size of this group, as expansion in the 1960s afforded more players the opportunity to become an all-star. A few interesting figures can be found at the single game end of the radian – Bob Horner, Kent Hrbek, and Lonnie Smith, all with at least momentary brushes with greatness, but good enough to qualify for just one all-star nod apiece.

allstar_4

Joe DiMaggio is the lead for our next group, joined by the likes of Mel Ott, Bill Dickey, and Joe Medwick. The skew is toward the late 1930s and beyond; many members of this cohort would have had limited all-star game opportunities, as the game originated only in 1933. As proof of this, we find Hall of Famers Heinie Manush, Goose Goslin, and Kiki Cuyler at the low end of the radian, each with just a single all-star credit.

allstar_5

Hall of Famer Al Kaline heads up Cohort 6, so we know we’re in either the 50s or 60s, or more likely, a bit of both decades. Along with Kaline we have Mickey Mantle, Yogi Berra, and Ernie Banks, each of who had multiple appearances covering both decades. At the more modest end of the group we find Rocky Bridges, Bob Cerv, and perhaps surprisingly, the slugger Joe Adcock, each with just a single season as all-stars.

allstar_6

Cohort 7 is our one group that’s difficult to explain. Our graph modularity settings forced a small cohort of just nine players; Hank Aaron with 21 seasons, and eight others with a single season each. Consider this one a bit of a fluke.

allstar_7

The great Willie Mays leads the relatively small Cohort 8, joined by both Brooks and Frank Robinson, as well as Roberto Clemente. This would indicate a cohort of players who began in the 1950s and perhaps peaked in the 60s. Lower down the list this group features lesser known players like Tito Francona, Dick Howser, and Joey Jay, all one season all-stars.

allstar_8

Cohort 9 is a very large group led by Barry Bonds, Ivan “Pudge” Rodriguez, and Ken Griffey. Here we have three stars with illustrious careers launched around 1990 and extending into the new millenium. At the other extreme we find one-timers such as Jay Buhner, Mark Grudzelianek, and Lance Johnson.

allstar_9

Cohort 10 is a mid-sized group headed up by Alex Rodriguez, and featuring Manny Ramirez, John Smoltz, and Scott Rolen. This would appear to be a very similar group (if a few years later) to the prior cohort, and we should expect to find a great number of crossover connections between the two, as they each cover players from similar time periods.

allstar_10

Down to the final two groups! Cohort 11 is led by Cal Ripken, Ozzie Smith, and Roger Clemens, all major impact players in both the 1980s and 90s. This is another group featuring players who were quite talented but dented the all-star ranks just a single time. Among these were outfielders Jesse Barfield and Kevin Bass, and pitcher Teddy Higuera.

allstar_11

With Cohort 12, we see a large group led by 18-time all-star Carl Yastrzemski, supported by Johnny Bench, Tom Seaver, and Harmon Killebrew. These are all players who were at their productive peaks in the late 1960s through mid-1970s, an era where the National League was dominating the annual game. Chuck Hinton, Jerry Lumpe, and Joe Azcue are among those who can claim a single trip to the all-star game as a career highlight.

allstar_12

Quick note on the sidebar – you’ll see a few measures which I won’t go into too deeply here; there are 3 centrality measures (influence within the network), eccentricity (the number of steps to traverse the network, think six degrees of Kevin Bacon), and size, reflecting the number of seasons as an all-star. The key part of the sidebar lies in the listing of all connected players to the one currently selected, with numbers indicating the number of games as co-all-stars. Use these links to navigate through the network quickly. It’s fun!

So that’s it for our brief analysis. Now it’s time to explore for yourself by opening the MLB All-Star Network visualization. Be patient as the data loads; once it has cached the graph should be fairly fast at zooming, panning, and allowing you to explore to your heart’s content.

Exploratory DataViz Part 2

Having discussed some of Exploratory’s cool features in a prior post, I thought it would be fun to continue the exploration using JSON data as a starting point. I happen to have a fair amount of JSON on hand, thanks to a series of network graphs produced using Gephi and sigma.js, so why not put it to use with Exploratory and start creating a new dataviz?

If you have previously worked with JSON, you’re no doubt aware that it can be a bit fickle – miss a bracket or brace in one place and the entire file fails to load a visualization. However, knowing that my JSON has been successful in producing network graphs (see here for examples), I figured it was worth a shot with Exploratory.

To begin, start with the local import option, selecting the json option, and pointing it to your local file. Give it a name, run the process and cross your fingers! After a few seconds, I’ve got my results, and Exploratory has done a good job categorizing the data:

exploratory_2.1

Since this is network data, we have nodes and edges, as well as any additional attributes, such as color or size. Exploratory has picked up those groupings, first the edges, and now the nodes.

exploratory_2.2

Finally, the attribute values:

exploratory_2.3

Since we’re satisfied with the import, we can move on to the summary data, which in this case doesn’t make a whole lot of sense. No matter, let’s see what can be done with some charts and analysis.

exploratory_2.4

To start with, we have x and y values associated with each node, which sounds like a perfect candidate for a scatter plot. We add the x value to the x-axis (how convenient was that!), the y value to the y-axis, node size as the Size attribute, and finally the Eccentricity attribute for color. FWIW, eccentricity is not a measure of flakiness, but rather the distance between the most remote points in a graph. This is where the six degrees of separation (or Kevin Bacon, take your pick) concept comes into play; an eccentricity value of 6 equates to 6 degrees of distance. Here’s our result:

exploratory_2.5

Not bad, eh? We can also hover over each node to see who it is (after adding Id to the Label field):

exploratory_2.6

We still have a lot of activity in a limited space, so now let’s use a simple filter (see the command line at top) to grab the top 50 values, and see the results:

exploratory_2.7

Now let’s create a new branch to explore further. I would like to sort my dataset using the Betweenness Centrality attribute, but there’s one problem – it’s a character value at the moment, so it doesn’t sort numerically. No matter, we can fix that easily using the Mutate command to convert the variable type. This can be seen in the right margin, where Exploratory conveniently stores all actions. Now we can sort our values in descending order to understand who is most influential in the network (at least by this measure). FYI – Betweenness Centrality tells us which nodes others must pass through most frequently to connect elsewhere within the network. Typically, but not always, it is someone centrally located within a network; sometimes it may be a less influential character (Pedro Borbon in this case) who connects more distant groups to one another.

exploratory_2.9

So there you have it, another quick walk-through with Exploratory. Before I sign off, here’s the live scatter plot you can play with via the Exploratory server. Be sure to use the simple zoom features to traverse the chart!

[iframe src=”https://exploratory.io/chart/kc2519/1e4b6b0c24e1?cb=1467821749896&embed=true” width=”100%” height=”600″ frameborder=”0″]

ODSC: Analyzing Complex Networks Part 2

This is part two of a brief series sharing components of my presentation titled Analyzing Complex Networks Using Open Source Software at ODSC East in Boston on May 21st. The first post looked at a few examples from a Boston Red Sox players network, while this one examines a Miles Davis album and musician network. I’ll share a few examples of network analysis within the context of the Miles Davis graph.

The Miles Davis network could be described as a tripartite network, or one with three layers. Miles is at the center, and connects to each of nearly 50 recordings. Other musicians then connect to the respective recording(s) they played on, but not to one another. This approach provides a very clear look at musical phases in the career of the legendary trumpeter, without the graph being clouded by excessive detail. Here’s a view of the final network, after which we’ll look at some components of the graph.

miles_1

We see some interesting patterns in the graph, specifically in viewing the pink circles, which represent individual albums. Musicians playing on a recording can be seen adjacent to that recording, except in the case of musicians present on multiple albums. We would expect them to be positioned relative to all of the recordings they played on. A quick visual scan leads to five distinct clusters, as seen in the next screenshot.

miles_2

Now that we have identified these clusters, it would be helpful to understand their meaning and relevance to Miles career. Using the graph in interactive fashion, we can learn more about the recordings and musicians, and begin to formulate some insights. These can be confirmed by referring to album links on the web or in Wikipedia, which give context to what we are viewing. Based on these steps, here is a quick overview of the five clusters.

miles_3

A final step might be to add some verbiage using PowerPoint or Inkscape, which I’ve done below in very minimalist fashion. We could also add this to a web version using CSS attributes to position the text, although this could get tricky as we pan and zoom on the graph. We might be better off using some sort of stylized marker (color or shape) to communicate some of this information.

miles_4

There is much more that could be done, but I hope this brief example shed some light on the usefulness of network graphs, especially from a pure visual perspective.

ODSC: Analyzing Complex Networks Using Open Source Software

I’ll be presenting at the 2016 ODSC East event in Boston May 20-22. ODSC stands for Open Data Science Conference, where the focus is on using open data or open source tools to do clever things in the information space. The topic of my presentation is Analyzing Complex Networks Using Open Source Software, where I’ll talk through several example networks built using Gephi and Sigma.js.

While the slides are not all prepared at this stage, I’ll share a few bits that will wind up in the talk. My goal is to convey to the audience how networks can be used to statistically and visually understand complex information. After providing an overview of network analysis (at a very high level), I’ll be sharing slides from three very different networks – a Miles Davis album network (created in 2014 and rebuilt in 2016), a Boston Red Sox player network (also built in 2014), and a brand new example using data from the amazing GDELT Project.

Here’s a glimpse into what I’ll be sharing, starting with the Red Sox examples, where we examine the networks of three well known players from the last 100 years. First, Ted Williams network:

odsc_williams

Followed by Carl Yastrzemski:

odsc_yaz

Now Jason Varitek, longtime catcher and captain for two World Series championship teams:

odsc_varitek

In talking through each of these networks, I will attempt to highlight some differences in their respective structures based on the era in which each player spent time with the Red Sox. For example, there are many more connections in the Varitek network compared to Williams and Yaz, despite a shorter duration with the team. Why would this be the case? Perhaps spending time in the era of higher salaries, larger pitching staffs, and the evolution of free agency might go a long way towards explaining why Jason Varitek crossed paths with far more players than did his earlier predecessors.

Stay tuned for additional posts featuring the Miles Davis and GDELT networks.