Putting Visual Analytics into Practical Use
In this take-home exercise, I will be attempting to explore for patterns in the social networks within the city of Engagement, Ohio USA.
Before I get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, I will install the R packages and load them onto R environment.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'ggmap', 'knitr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports SocialNetwork.csv and
Participants.csv from the data folder, into R by using read_csv()
of readr and
save them as tibble dataframes.
The following steps were done to transform the data,
city_edges_agg_all <- city_edges %>%
mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>%
mutate(Day = wday(timestamp,
label = TRUE,
abbr = FALSE)) %>%
group_by(participantIdFrom, participantIdTo, Day) %>%
summarise(Weight = n(),
.groups = "keep") %>%
filter(participantIdFrom != participantIdTo)%>%
filter(Weight > 0) %>%
ungroup()
Notice the filter(Weight > 0) in the last part of the
above code chunk?
This essentially does not really filter out anything.
However, we will make use of this to gauge where to cut off the data for analysis, as using the whole dataset prove to be “too much info” for any fruitful analysis. In fact, in my trial-and-error for the charts, using all the data will result in a very messy (cluttered) social network graph.
With the summary() function on the Weight
data column, we are able to view the interquantile stats.
summary(city_edges_agg_all$Weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 4.000 6.687 8.000 65.000
As we are keen to find out more on the participants whom have high edge weights, we will filter for edges with weights of at least 8 and above (3rd quantile onwards).
city_edges_agg_all <- city_edges %>%
mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>%
mutate(Day = wday(timestamp,
label = TRUE,
abbr = FALSE)) %>%
group_by(participantIdFrom, participantIdTo, Day) %>%
summarise(Weight = n(), .groups = "keep") %>%
filter(participantIdFrom != participantIdTo)%>%
filter(Weight > 49) %>%
ungroup()
Trial runs were done with varying weight filters (starting from 65 in descending order) and we arrived at 50 as a good balance of data to showcase participants with relatively high connectivity to one another.
As the SocialNetwork.csv was pretty big in size at around 200 MB, it is not possible for us to upload this to Github.
Hence, we will be exporting the cleaned and filtered dataset (done in previous step) instead and used that as our reference import file for the subsequent analysis.
The code below will write (export) the file into a RDS format (native to R) to keep the file’s size small and manageable.
write_rds(city_edges_agg_all, "data/rds/city_edges_agg_all.rds")
The exported RDS files will be imported again (as the original SocialNetwork.csv will not be imported and stored in this R project/GitHub)
city_edges_agg_all <- read_rds("data/rds/city_edges_agg_all.rds")
The following code chunk will be extracting the unique participants
involved in the analysis after our previous step of edge weight
filtering (above 49), via usage of these functions, stack()
and unique().
nodes_involved_all <- city_edges_agg_all %>%
select(participantIdFrom, participantIdTo) %>%
stack() %>%
mutate(participantId = (values)) %>%
select(participantId) %>%
arrange(participantId) %>%
unique()
The following code chunk will then merge (inner join) the
participants dataset (city_nodes after importing) with the
list of unique participants involved in this analysis (after filtering
them based on edge weight). The output will provide a list of unique
participants and their respective details.
participants_nodes_all <- merge(x=city_nodes,y=nodes_involved_all,by="participantId") %>%
mutate(participantId = as.character(participantId)) %>%
rename(`Education Level` = educationLevel,
`Interest Group` = interestGroup,
`Have Kids` = haveKids,
`Household Size` = householdSize,
`Joviality` = joviality,
`Age` = age)
The next code chunk will create a tbl_graph object with the edges and nodes dataset that we have cleaned.
network_graph_all <- graph_from_data_frame(city_edges_agg_all,
vertices = participants_nodes_all) %>%
as_tbl_graph()
network_graph_all
# A tbl_graph: 523 nodes and 10740 edges
#
# A directed multigraph with 94 components
#
# Node Data: 523 x 7 (active)
name `Household Size` `Have Kids` Age `Education Lev~`
<chr> <dbl> <lgl> <dbl> <chr>
1 4 3 TRUE 43 Bachelors
2 5 3 TRUE 32 HighSchoolOrCol~
3 7 3 TRUE 27 Bachelors
4 8 3 TRUE 20 Bachelors
5 10 3 TRUE 48 HighSchoolOrCol~
6 12 3 TRUE 34 HighSchoolOrCol~
# ... with 517 more rows, and 2 more variables: `Interest
# Group` <chr>, Joviality <dbl>
#
# Edge Data: 10,740 x 4
from to Day Weight
<int> <int> <chr> <int>
1 1 49 Sunday 51
2 1 49 Monday 51
3 1 49 Tuesday 51
# ... with 10,737 more rows
Next, we will activate()
the edges so that we can begin to use the weight for our analysis.
network_graph_all %>%
activate(edges) %>%
arrange(desc(Weight))
# A tbl_graph: 523 nodes and 10740 edges
#
# A directed multigraph with 94 components
#
# Edge Data: 10,740 x 4 (active)
from to Day Weight
<int> <int> <chr> <int>
1 4 22 Wednesday 65
2 4 22 Thursday 65
3 6 316 Thursday 65
4 9 420 Thursday 65
5 10 66 Thursday 65
6 11 108 Thursday 65
# ... with 10,734 more rows
#
# Node Data: 523 x 7
name `Household Size` `Have Kids` Age `Education Lev~`
<chr> <dbl> <lgl> <dbl> <chr>
1 4 3 TRUE 43 Bachelors
2 5 3 TRUE 32 HighSchoolOrCol~
3 7 3 TRUE 27 Bachelors
# ... with 520 more rows, and 2 more variables: `Interest
# Group` <chr>, Joviality <dbl>
Here, we will start to plot all of our network graphs, in static and interactive formats.
ggraph
package will be used for the static graphs, while visNetwork
will be used for the interactive graphs.
Degree centrality is the simplest centrality measure to compute as it is essentially just counting the number of edges that each node has. If a participant has 10 social connections, then it will have a degree centrality of 10.
While it is a good measure of how connected the participant is, it does not necessarily reflect the degree of importance of the participant in connecting other nodes.
Nevertheless, it is a simple way to pick out nodes that MIGHT be important to the social network as usually nodes with high degree of edges tends to score well in other centrality tests. (no guarantees though!)
In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with degree centrality above 60 (intent was to only focus on high degree nodes, instead of showing everything which can be quite confusing with all the colours)
g1 <- ggraph(network_graph_all,
layout = "linear",
circular = TRUE) +
geom_edge_arc(aes(width = Weight),
alpha = 0.1) +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(colour = ifelse(centrality_degree() > 60, `Education Level`, 'grey60'),
size = centrality_degree())) +
labs(title = "Network Graph by Education Level wrt Degree Centrality",
subtitle ='Layout: Linear and Circular',
size = 'Degree Centrality') +
theme_graph(base_family = 'Helvetica') +
scale_color_manual(name ='Education Level',
values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
g1 + geom_node_label(aes(label=ifelse(centrality_degree() > 60, name, NA)),
nudge_x = g1$data$x * .15,
nudge_y = g1$data$y * .15,
repel = TRUE,
point.padding = NA,
box.padding = 0,
force = 0.1)
It is interesting to note that the participants with high degree centrality tends to be those with higher education level (Graduate, Bachelors and High School/College).
My guess would be that people with higher education level has more socialization opportunities to build connections compared to people who did not receive higher education (‘Low’ education level)
top_degree <- city_nodes %>%
filter(participantId %in% c(157,195,216,488,503,529,540,563,573,623,624,648,698)) %>%
arrange(by_group=interestGroup)
kable(top_degree)
| participantId | householdSize | haveKids | age | educationLevel | interestGroup | joviality |
|---|---|---|---|---|---|---|
| 503 | 1 | FALSE | 24 | HighSchoolOrCollege | A | 0.8924825 |
| 195 | 1 | FALSE | 60 | Graduate | B | 0.0587027 |
| 623 | 2 | FALSE | 56 | HighSchoolOrCollege | B | 0.5761295 |
| 648 | 2 | FALSE | 49 | Bachelors | B | 0.6349943 |
| 216 | 1 | FALSE | 38 | HighSchoolOrCollege | C | 0.1539006 |
| 540 | 3 | TRUE | 44 | Bachelors | C | 0.8208976 |
| 624 | 2 | FALSE | 31 | Bachelors | C | 0.9161637 |
| 698 | 1 | FALSE | 24 | Bachelors | C | 0.6748777 |
| 529 | 3 | TRUE | 27 | Graduate | D | 0.5684054 |
| 488 | 1 | FALSE | 51 | HighSchoolOrCollege | F | 0.7701043 |
| 563 | 3 | TRUE | 41 | Bachelors | I | 0.9701663 |
| 573 | 3 | TRUE | 49 | Graduate | I | 0.9784062 |
| 157 | 2 | FALSE | 25 | Graduate | J | 0.2540057 |
Taking a quick look at the table (with kable())
after filtering for the respective top degree centrality scoring
participants, these characteristics were observed.
Betweenness centrality highlights which nodes are important in the flow of the network by measuring the importance of indivdual nodes by how critical are they in making the shortest path between 2 other nodes.
Example, if Participant A is in the shortest path between Participant B and C, it means that Participant A is critical to facilitate the flow (of information) between Participant B and C. Hence, if Participant A was not part of the path, that would mean the flow between Participant B and C would need to go through a longer network path.
In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with betweenness centrality above 800 (intent was to only focus on high betweenness scoring nodes, instead of showing everything which can be quite confusing with all the colours).
g2 <- ggraph(network_graph_all,
layout = "linear",
circular = TRUE) +
geom_edge_arc(aes(width = Weight),
alpha = 0.1) +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(colour = ifelse(centrality_betweenness() > 800, `Education Level`, 'grey60'),
size = centrality_betweenness())) +
labs(title = "Network Graph by Education Level wrt Betweenness Centrality",
subtitle ='Layout: Linear and Circular',
size = 'Betweenness Centrality') +
theme_graph(base_family = 'Helvetica') +
scale_color_manual(name ='Education Level',
values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
g2 + geom_node_label(aes(label=ifelse(centrality_betweenness() > 800, name, NA)),
nudge_x = g1$data$x * .15,
nudge_y = g1$data$y * .15,
repel = TRUE,
point.padding = NA,
box.padding = 0,
force = 0.1)
It is interesting to note that the participants with higher betweenness centrality also tends to be those with higher education level (Graduate and Bachelors). However, in this case, participants of High School/College education level seems to tip over the ratio slightly compared to the other Education levels.
top_betweenness <- city_nodes %>%
filter(participantId %in% c(81,390,424,436,488,503,759,851,898,906,978,991)) %>%
arrange(by_group=educationLevel)
kable(top_betweenness)
| participantId | householdSize | haveKids | age | educationLevel | interestGroup | joviality |
|---|---|---|---|---|---|---|
| 390 | 2 | FALSE | 31 | Bachelors | J | 0.7653071 |
| 906 | 2 | FALSE | 22 | Bachelors | E | 0.9306696 |
| 81 | 2 | FALSE | 27 | Graduate | D | 0.8713203 |
| 759 | 3 | TRUE | 54 | Graduate | I | 0.8804688 |
| 851 | 2 | FALSE | 19 | Graduate | D | 0.6777122 |
| 898 | 2 | FALSE | 60 | Graduate | J | 0.4869203 |
| 424 | 1 | FALSE | 49 | HighSchoolOrCollege | D | 0.7235765 |
| 436 | 1 | FALSE | 36 | HighSchoolOrCollege | J | 0.9494120 |
| 488 | 1 | FALSE | 51 | HighSchoolOrCollege | F | 0.7701043 |
| 503 | 1 | FALSE | 24 | HighSchoolOrCollege | A | 0.8924825 |
| 978 | 1 | FALSE | 44 | HighSchoolOrCollege | D | 0.3983793 |
| 991 | 1 | FALSE | 21 | Low | C | 0.9704926 |
Taking a quick look at the table (with kable())
after filtering for the respective top betweenness centrality scoring
participants, these characteristics were observed.
Now that we have covered degree centrality and betweenness centrality analysis, next we will be looking at another centrality measurement, eigenvector centrality.
Eigenvector centrality measures the node’s importance by taking into consideration of its neighbours’ importance too.
For example, a participant with 20 unpopular connections would have lower an eigenvector centrality than another participant who has 20 popular connections, hence implying the influential capability of the node within the network.
In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with eigenvector centrality above 0.8 (intent was to only focus on high eigenvector scoring nodes, instead of showing everything which can be quite confusing with all the colours).
g3 <- ggraph(network_graph_all,
layout = "linear",
circular = TRUE) +
geom_edge_arc(aes(width = Weight),
alpha = 0.1) +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(colour = ifelse(centrality_eigen() > 0.6, `Education Level`, 'grey60'),
size = centrality_eigen())) +
labs(title = "Network Graph by Education Level wrt Eigenvector Centrality",
subtitle ='Layout: Linear and Circular',
size = 'Eigenvector Centrality') +
theme_graph(base_family = 'Helvetica') +
scale_color_manual(name ='Education Level',
values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
g3 + geom_node_label(aes(label=ifelse(centrality_eigen() > 0.6, name, NA)),
nudge_x = g1$data$x * .15,
nudge_y = g1$data$y * .15,
repel = TRUE,
point.padding = NA,
box.padding = 0,
force = 0.1)
It is interesting to note that the participants with higher eigenvector centrality also tends to be those with higher education level (Graduate and Bachelors).
top_eigenvector <- city_nodes %>%
filter(participantId %in% c(157,216,277,312,529,540,563,573,620,624,648)) %>%
arrange(by_group=householdSize)
kable(top_eigenvector)
| participantId | householdSize | haveKids | age | educationLevel | interestGroup | joviality |
|---|---|---|---|---|---|---|
| 216 | 1 | FALSE | 38 | HighSchoolOrCollege | C | 0.1539006 |
| 157 | 2 | FALSE | 25 | Graduate | J | 0.2540057 |
| 620 | 2 | FALSE | 31 | Bachelors | A | 0.6735412 |
| 624 | 2 | FALSE | 31 | Bachelors | C | 0.9161637 |
| 648 | 2 | FALSE | 49 | Bachelors | B | 0.6349943 |
| 277 | 3 | TRUE | 34 | HighSchoolOrCollege | E | 0.8367877 |
| 312 | 3 | TRUE | 42 | HighSchoolOrCollege | A | 0.2607990 |
| 529 | 3 | TRUE | 27 | Graduate | D | 0.5684054 |
| 540 | 3 | TRUE | 44 | Bachelors | C | 0.8208976 |
| 563 | 3 | TRUE | 41 | Bachelors | I | 0.9701663 |
| 573 | 3 | TRUE | 49 | Graduate | I | 0.9784062 |
Taking a quick look at the table (with kable())
after filtering for the respective top betweenness centrality scoring
participants, these characteristics were observed.
Here, we will plot the same network graph, but with a different layout setting (‘nicely’), so that the visualization of the social circles can be enhanced.
set.seed(1234)
g2a <- ggraph(network_graph_all,
layout = "nicely") +
geom_edge_arc(aes(width = Weight),
alpha = 0.1,
color = 'grey') +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(colour = ifelse(centrality_betweenness() > 300, `Education Level`, 'grey60'),
size = centrality_betweenness())) +
labs(title = "Network Graph by Education Level wrt Betweenness Centrality",
subtitle ='Layout: nicely',
size = 'Betweenness Centrality') +
theme_graph(base_family = 'Helvetica') +
scale_color_manual(name ='Education Level',
values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
g2a + geom_node_label(aes(label=ifelse(centrality_betweenness() > 300, name, NA)),
nudge_x = g1$data$x * .15,
nudge_y = g1$data$y * .15,
repel = TRUE,
point.padding = NA,
box.padding = 0,
force = 0.1)
Here, we can actually see these how these important nodes played their part in connecting smaller social circles to one another, to create a larger social network ecosystem.
Here, we will plot the same network graph, but with a different layout setting (‘nicely’), so that the visualization of the social circles can be enhanced.
set.seed(1234)
g3a <- ggraph(network_graph_all,
layout = "nicely") +
geom_edge_arc(aes(width = Weight),
alpha = 0.1,
color = 'grey') +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(colour = ifelse(centrality_eigen() > 0.6, `Education Level`, 'grey60'),
size = centrality_eigen())) +
labs(title = "Network Graph by Education Level wrt Eigenvector Centrality",
subtitle ='Layout: nicely',
size = 'Eigenvector Centrality') +
theme_graph(base_family = 'Helvetica') +
scale_color_manual(name ='Education Level',
values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
g3a + geom_node_label(aes(label=ifelse(centrality_eigen() > 0.6, name, NA)),
nudge_x = g1$data$x * .15,
nudge_y = g1$data$y * .15,
repel = TRUE,
point.padding = NA,
box.padding = 0,
force = 0.1)
It seems like all these high eigenvector scoring participants are from the same social circle. With that, I am guessing that within these group of participants are highly connected to one another within this social group, therefore, boosting their eigenvector scores tremendously.
The following steps were done to transform the data,
participantIdFrom and
participantIdTo for subsequent usage of them in another
code chunkcity_edges_agg_all_v2 <- city_edges %>%
mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>%
mutate(Day = wday(timestamp,
label = TRUE,
abbr = FALSE)) %>%
rename(from = participantIdFrom,
to = participantIdTo) %>%
group_by(from, to, Day) %>%
summarise(Weight = n(),
.groups = "keep") %>%
filter(from != to)%>%
filter(Weight > 49) %>%
ungroup()
The following code chunk will be extracting the unique participants
involved in the analysis after our previous step of edge weight
filtering (above 49), via usage of these functions, stack()
and unique().
nodes_involved_all_v2 <- city_edges_agg_all_v2 %>%
select(from, to) %>%
stack() %>%
mutate(participantId = (values)) %>%
select(participantId) %>%
arrange(participantId) %>%
unique()
The following code chunk will then merge (inner join) the
participants dataset (city_nodes after importing) with the
list of unique participants involved in this analysis (after filtering
them based on edge weight). The output will provide a list of unique
participants and their respective details.
participants_nodes_all_v2 <- merge(x=city_nodes,y=nodes_involved_all_v2,by="participantId") %>%
mutate(participantId = as.character(participantId)) %>%
rename(id = participantId,
group = educationLevel,
`Interest Group` = interestGroup,
`Have Kids` = haveKids,
`Household Size` = householdSize,
`Joviality` = joviality,
`Age` = age)
As mentioned earlier, visNetwork
will be used to chart out an interactive network graph.
With this new found interactivity, we can now select the participant (via ID) to locate their position in their respective social network. Also, we are able to filter for nodes by their group (in this case, set as education level).
visNetwork(participants_nodes_all_v2,
city_edges_agg_all_v2) %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE,
type = "curvedCW")) %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = list(enabled = TRUE),
selectedBy = list(variable = "group")) %>%
visLegend(main = "Education Level Legend",
position = "right") %>%
visEdges(arrows = "to",
color = list(highlight="black")) %>%
visLayout(randomSeed = 123)
For curiosity, I’ve attempted to plot another interactive network graph, but with Interest Group as the group instead.
participants_nodes_all_v3 <- merge(x=city_nodes,y=nodes_involved_all_v2,by="participantId") %>%
mutate(participantId = as.character(participantId)) %>%
rename(id = participantId,
group = interestGroup,
`Education Level` = educationLevel,
`Have Kids` = haveKids,
`Household Size` = householdSize,
`Joviality` = joviality,
`Age` = age)
With this new found interactivity, we can now select the participant (via ID) to locate their position in their respective social network. Also, we are able to filter for nodes by their group (in this case, set as interest group).
visNetwork(participants_nodes_all_v3,
city_edges_agg_all_v2) %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE,
type = "curvedCW")) %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = list(enabled = TRUE),
selectedBy = list(variable = "group")) %>%
visLegend(main = "Interest Group Legend",
position = "right") %>%
visEdges(arrows = "to",
color = list(highlight="black")) %>%
visLayout(randomSeed = 123)
One interesting observation was that in those larger network circles, participants from Interest Group D and J seems to be critical in connecting the smaller social groups together, forming a larger social circle.
In summary, from the graphs plotted, my observations are,
More can be done to analyze additional details of the above observations so that the analysis result can be more holistic.