Take-home Exercise 6

Putting Visual Analytics into Practical Use

Published

June 9, 2022

DOI

Overview

In this take-home exercise, I will be attempting to explore for patterns in the social networks within the city of Engagement, Ohio USA.


Where is the library?

Before I get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, I will install the R packages and load them onto R environment.

packages = c('igraph', 'tidygraph',
             'ggraph', 'visNetwork',
             'lubridate', 'clock',
             'tidyverse', 'ggmap', 'knitr')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p, character.only = T)
  }


Let’s bring in the DATA

The code chunk below imports SocialNetwork.csv and Participants.csv from the data folder, into R by using read_csv() of readr and save them as tibble dataframes.


Munging on the Raw Data

Transforming the Edges data

The following steps were done to transform the data,

city_edges_agg_all <- city_edges %>%
  mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>% 
  mutate(Day = wday(timestamp,
                    label = TRUE,
                    abbr = FALSE)) %>% 
  group_by(participantIdFrom, participantIdTo, Day) %>% 
  summarise(Weight = n(), 
            .groups = "keep") %>%
  filter(participantIdFrom != participantIdTo)%>%
  filter(Weight > 0) %>%
  ungroup()

Notice the filter(Weight > 0) in the last part of the above code chunk?

This essentially does not really filter out anything.

However, we will make use of this to gauge where to cut off the data for analysis, as using the whole dataset prove to be “too much info” for any fruitful analysis. In fact, in my trial-and-error for the charts, using all the data will result in a very messy (cluttered) social network graph.

With the summary() function on the Weight data column, we are able to view the interquantile stats.

summary(city_edges_agg_all$Weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   4.000   6.687   8.000  65.000 

As we are keen to find out more on the participants whom have high edge weights, we will filter for edges with weights of at least 8 and above (3rd quantile onwards).

city_edges_agg_all <- city_edges %>%
  mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>% 
  mutate(Day = wday(timestamp,
                    label = TRUE,
                    abbr = FALSE)) %>% 
  group_by(participantIdFrom, participantIdTo, Day) %>% 
  summarise(Weight = n(), .groups = "keep") %>%
  filter(participantIdFrom != participantIdTo)%>%
  filter(Weight > 49) %>%
  ungroup()

Trial runs were done with varying weight filters (starting from 65 in descending order) and we arrived at 50 as a good balance of data to showcase participants with relatively high connectivity to one another.


Exporting the large SocialNetwork.csv into RDS format

As the SocialNetwork.csv was pretty big in size at around 200 MB, it is not possible for us to upload this to Github.

Hence, we will be exporting the cleaned and filtered dataset (done in previous step) instead and used that as our reference import file for the subsequent analysis.

The code below will write (export) the file into a RDS format (native to R) to keep the file’s size small and manageable.

write_rds(city_edges_agg_all, "data/rds/city_edges_agg_all.rds")

The exported RDS files will be imported again (as the original SocialNetwork.csv will not be imported and stored in this R project/GitHub)

city_edges_agg_all <- read_rds("data/rds/city_edges_agg_all.rds")


Further transformation of the data

The following code chunk will be extracting the unique participants involved in the analysis after our previous step of edge weight filtering (above 49), via usage of these functions, stack() and unique().

nodes_involved_all <- city_edges_agg_all %>% 
  select(participantIdFrom, participantIdTo) %>% 
  stack() %>%
  mutate(participantId = (values)) %>% 
  select(participantId) %>% 
  arrange(participantId) %>%
  unique()

The following code chunk will then merge (inner join) the participants dataset (city_nodes after importing) with the list of unique participants involved in this analysis (after filtering them based on edge weight). The output will provide a list of unique participants and their respective details.

participants_nodes_all <- merge(x=city_nodes,y=nodes_involved_all,by="participantId") %>% 
  mutate(participantId = as.character(participantId)) %>% 
  rename(`Education Level` = educationLevel,
         `Interest Group` = interestGroup,
         `Have Kids` = haveKids,
         `Household Size` = householdSize,
         `Joviality` = joviality,
         `Age` = age)

The next code chunk will create a tbl_graph object with the edges and nodes dataset that we have cleaned.

network_graph_all <- graph_from_data_frame(city_edges_agg_all,
                                           vertices = participants_nodes_all) %>% 
  as_tbl_graph()

network_graph_all
# A tbl_graph: 523 nodes and 10740 edges
#
# A directed multigraph with 94 components
#
# Node Data: 523 x 7 (active)
  name  `Household Size` `Have Kids`   Age `Education Lev~`
  <chr>            <dbl> <lgl>       <dbl> <chr>           
1 4                    3 TRUE           43 Bachelors       
2 5                    3 TRUE           32 HighSchoolOrCol~
3 7                    3 TRUE           27 Bachelors       
4 8                    3 TRUE           20 Bachelors       
5 10                   3 TRUE           48 HighSchoolOrCol~
6 12                   3 TRUE           34 HighSchoolOrCol~
# ... with 517 more rows, and 2 more variables: `Interest
#   Group` <chr>, Joviality <dbl>
#
# Edge Data: 10,740 x 4
   from    to Day     Weight
  <int> <int> <chr>    <int>
1     1    49 Sunday      51
2     1    49 Monday      51
3     1    49 Tuesday     51
# ... with 10,737 more rows

Next, we will activate() the edges so that we can begin to use the weight for our analysis.

network_graph_all %>%
  activate(edges) %>%
  arrange(desc(Weight))
# A tbl_graph: 523 nodes and 10740 edges
#
# A directed multigraph with 94 components
#
# Edge Data: 10,740 x 4 (active)
   from    to Day       Weight
  <int> <int> <chr>      <int>
1     4    22 Wednesday     65
2     4    22 Thursday      65
3     6   316 Thursday      65
4     9   420 Thursday      65
5    10    66 Thursday      65
6    11   108 Thursday      65
# ... with 10,734 more rows
#
# Node Data: 523 x 7
  name  `Household Size` `Have Kids`   Age `Education Lev~`
  <chr>            <dbl> <lgl>       <dbl> <chr>           
1 4                    3 TRUE           43 Bachelors       
2 5                    3 TRUE           32 HighSchoolOrCol~
3 7                    3 TRUE           27 Bachelors       
# ... with 520 more rows, and 2 more variables: `Interest
#   Group` <chr>, Joviality <dbl>


Plotting the network graphs

Here, we will start to plot all of our network graphs, in static and interactive formats.

ggraph package will be used for the static graphs, while visNetwork will be used for the interactive graphs.


Plotting static graphs - Network Graph with Degree Centrality focus

Degree centrality is the simplest centrality measure to compute as it is essentially just counting the number of edges that each node has. If a participant has 10 social connections, then it will have a degree centrality of 10.

While it is a good measure of how connected the participant is, it does not necessarily reflect the degree of importance of the participant in connecting other nodes.

Nevertheless, it is a simple way to pick out nodes that MIGHT be important to the social network as usually nodes with high degree of edges tends to score well in other centrality tests. (no guarantees though!)

In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with degree centrality above 60 (intent was to only focus on high degree nodes, instead of showing everything which can be quite confusing with all the colours)

g1 <- ggraph(network_graph_all,
       layout = "linear", 
       circular = TRUE) +
  geom_edge_arc(aes(width = Weight),
                 alpha = 0.1) +
  scale_edge_width(range = c(0.1, 2)) +
  geom_node_point(aes(colour = ifelse(centrality_degree() > 60, `Education Level`, 'grey60'), 
                      size = centrality_degree())) +
  labs(title = "Network Graph by Education Level wrt Degree Centrality",
       subtitle ='Layout: Linear and Circular',
       size = 'Degree Centrality') +
  theme_graph(base_family = 'Helvetica') +
  scale_color_manual(name ='Education Level',
                     values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))
  
g1 + geom_node_label(aes(label=ifelse(centrality_degree() > 60, name, NA)),
                     nudge_x = g1$data$x * .15, 
                     nudge_y = g1$data$y * .15,
                     repel = TRUE, 
                     point.padding = NA, 
                     box.padding = 0, 
                     force = 0.1)

It is interesting to note that the participants with high degree centrality tends to be those with higher education level (Graduate, Bachelors and High School/College).

My guess would be that people with higher education level has more socialization opportunities to build connections compared to people who did not receive higher education (‘Low’ education level)

top_degree <- city_nodes %>% 
  filter(participantId %in% c(157,195,216,488,503,529,540,563,573,623,624,648,698)) %>%  
  arrange(by_group=interestGroup)

kable(top_degree)
participantId householdSize haveKids age educationLevel interestGroup joviality
503 1 FALSE 24 HighSchoolOrCollege A 0.8924825
195 1 FALSE 60 Graduate B 0.0587027
623 2 FALSE 56 HighSchoolOrCollege B 0.5761295
648 2 FALSE 49 Bachelors B 0.6349943
216 1 FALSE 38 HighSchoolOrCollege C 0.1539006
540 3 TRUE 44 Bachelors C 0.8208976
624 2 FALSE 31 Bachelors C 0.9161637
698 1 FALSE 24 Bachelors C 0.6748777
529 3 TRUE 27 Graduate D 0.5684054
488 1 FALSE 51 HighSchoolOrCollege F 0.7701043
563 3 TRUE 41 Bachelors I 0.9701663
573 3 TRUE 49 Graduate I 0.9784062
157 2 FALSE 25 Graduate J 0.2540057

Taking a quick look at the table (with kable()) after filtering for the respective top degree centrality scoring participants, these characteristics were observed.


Plotting static graphs - Network Graph with Betweenness Centrality focus

Betweenness centrality highlights which nodes are important in the flow of the network by measuring the importance of indivdual nodes by how critical are they in making the shortest path between 2 other nodes.

Example, if Participant A is in the shortest path between Participant B and C, it means that Participant A is critical to facilitate the flow (of information) between Participant B and C. Hence, if Participant A was not part of the path, that would mean the flow between Participant B and C would need to go through a longer network path.

In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with betweenness centrality above 800 (intent was to only focus on high betweenness scoring nodes, instead of showing everything which can be quite confusing with all the colours).

g2 <- ggraph(network_graph_all,
       layout = "linear", 
       circular = TRUE) +
  geom_edge_arc(aes(width = Weight),
                 alpha = 0.1) +
  scale_edge_width(range = c(0.1, 2)) +
  geom_node_point(aes(colour = ifelse(centrality_betweenness() > 800, `Education Level`, 'grey60'), 
                      size = centrality_betweenness())) +
  labs(title = "Network Graph by Education Level wrt Betweenness Centrality",
       subtitle ='Layout: Linear and Circular',
       size = 'Betweenness Centrality') +
  theme_graph(base_family = 'Helvetica') +
  scale_color_manual(name ='Education Level',
                     values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))

g2 + geom_node_label(aes(label=ifelse(centrality_betweenness() > 800, name, NA)),
                     nudge_x = g1$data$x * .15, 
                     nudge_y = g1$data$y * .15,
                     repel = TRUE, 
                     point.padding = NA, 
                     box.padding = 0, 
                     force = 0.1)

It is interesting to note that the participants with higher betweenness centrality also tends to be those with higher education level (Graduate and Bachelors). However, in this case, participants of High School/College education level seems to tip over the ratio slightly compared to the other Education levels.

top_betweenness <- city_nodes %>% 
  filter(participantId %in% c(81,390,424,436,488,503,759,851,898,906,978,991)) %>%  
  arrange(by_group=educationLevel)

kable(top_betweenness)
participantId householdSize haveKids age educationLevel interestGroup joviality
390 2 FALSE 31 Bachelors J 0.7653071
906 2 FALSE 22 Bachelors E 0.9306696
81 2 FALSE 27 Graduate D 0.8713203
759 3 TRUE 54 Graduate I 0.8804688
851 2 FALSE 19 Graduate D 0.6777122
898 2 FALSE 60 Graduate J 0.4869203
424 1 FALSE 49 HighSchoolOrCollege D 0.7235765
436 1 FALSE 36 HighSchoolOrCollege J 0.9494120
488 1 FALSE 51 HighSchoolOrCollege F 0.7701043
503 1 FALSE 24 HighSchoolOrCollege A 0.8924825
978 1 FALSE 44 HighSchoolOrCollege D 0.3983793
991 1 FALSE 21 Low C 0.9704926

Taking a quick look at the table (with kable()) after filtering for the respective top betweenness centrality scoring participants, these characteristics were observed.


Plotting static graphs - Network Graph with Eigenvector Centrality focus

Now that we have covered degree centrality and betweenness centrality analysis, next we will be looking at another centrality measurement, eigenvector centrality.

Eigenvector centrality measures the node’s importance by taking into consideration of its neighbours’ importance too.

For example, a participant with 20 unpopular connections would have lower an eigenvector centrality than another participant who has 20 popular connections, hence implying the influential capability of the node within the network.

In the code chunk below, the network graph is plotted with codes that will only highlight and label nodes with eigenvector centrality above 0.8 (intent was to only focus on high eigenvector scoring nodes, instead of showing everything which can be quite confusing with all the colours).

g3 <- ggraph(network_graph_all,
       layout = "linear", 
       circular = TRUE) +
  geom_edge_arc(aes(width = Weight),
                 alpha = 0.1) +
  scale_edge_width(range = c(0.1, 2)) +
  geom_node_point(aes(colour = ifelse(centrality_eigen() > 0.6, `Education Level`, 'grey60'), 
                      size = centrality_eigen())) +
  labs(title = "Network Graph by Education Level wrt Eigenvector Centrality",
       subtitle ='Layout: Linear and Circular',
       size = 'Eigenvector Centrality') +
  theme_graph(base_family = 'Helvetica') +
  scale_color_manual(name ='Education Level',
                     values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))

g3 + geom_node_label(aes(label=ifelse(centrality_eigen() > 0.6, name, NA)),
                     nudge_x = g1$data$x * .15, 
                     nudge_y = g1$data$y * .15,
                     repel = TRUE, 
                     point.padding = NA, 
                     box.padding = 0, 
                     force = 0.1)

It is interesting to note that the participants with higher eigenvector centrality also tends to be those with higher education level (Graduate and Bachelors).

top_eigenvector <- city_nodes %>% 
  filter(participantId %in% c(157,216,277,312,529,540,563,573,620,624,648)) %>%  
  arrange(by_group=householdSize)

kable(top_eigenvector)
participantId householdSize haveKids age educationLevel interestGroup joviality
216 1 FALSE 38 HighSchoolOrCollege C 0.1539006
157 2 FALSE 25 Graduate J 0.2540057
620 2 FALSE 31 Bachelors A 0.6735412
624 2 FALSE 31 Bachelors C 0.9161637
648 2 FALSE 49 Bachelors B 0.6349943
277 3 TRUE 34 HighSchoolOrCollege E 0.8367877
312 3 TRUE 42 HighSchoolOrCollege A 0.2607990
529 3 TRUE 27 Graduate D 0.5684054
540 3 TRUE 44 Bachelors C 0.8208976
563 3 TRUE 41 Bachelors I 0.9701663
573 3 TRUE 49 Graduate I 0.9784062

Taking a quick look at the table (with kable()) after filtering for the respective top betweenness centrality scoring participants, these characteristics were observed.



Plotting static graphs with different layout - Network Graph with Degree Centrality focus

Here, we will plot the same network graph, but with a different layout setting (‘nicely’), so that the visualization of the social circles can be enhanced.

set.seed(1234)

g2a <- ggraph(network_graph_all,
       layout = "nicely") +
  geom_edge_arc(aes(width = Weight),
                alpha = 0.1,
                color = 'grey') +
  scale_edge_width(range = c(0.1, 2)) +
  geom_node_point(aes(colour = ifelse(centrality_betweenness() > 300, `Education Level`, 'grey60'), 
                      size = centrality_betweenness())) +
  labs(title = "Network Graph by Education Level wrt Betweenness Centrality",
       subtitle ='Layout: nicely',
       size = 'Betweenness Centrality') +
  theme_graph(base_family = 'Helvetica') +
  scale_color_manual(name ='Education Level',
                     values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))

g2a + geom_node_label(aes(label=ifelse(centrality_betweenness() > 300, name, NA)),
                     nudge_x = g1$data$x * .15, 
                     nudge_y = g1$data$y * .15,
                     repel = TRUE, 
                     point.padding = NA, 
                     box.padding = 0, 
                     force = 0.1)

Here, we can actually see these how these important nodes played their part in connecting smaller social circles to one another, to create a larger social network ecosystem.


Plotting static graphs with different layout - Network Graph with Eigenvector Centrality focus

Here, we will plot the same network graph, but with a different layout setting (‘nicely’), so that the visualization of the social circles can be enhanced.

set.seed(1234)

g3a <- ggraph(network_graph_all,
       layout = "nicely") +
  geom_edge_arc(aes(width = Weight),
                alpha = 0.1,
                color = 'grey') +
  scale_edge_width(range = c(0.1, 2)) +
  geom_node_point(aes(colour = ifelse(centrality_eigen() > 0.6, `Education Level`, 'grey60'), 
                      size = centrality_eigen())) +
  labs(title = "Network Graph by Education Level wrt Eigenvector Centrality",
       subtitle ='Layout: nicely',
       size = 'Eigenvector Centrality') +
  theme_graph(base_family = 'Helvetica') +
  scale_color_manual(name ='Education Level',
                     values = c('Graduate'='green','Bachelors'='red','HighSchoolOrCollege'='cyan','Low'='pink'))

g3a + geom_node_label(aes(label=ifelse(centrality_eigen() > 0.6, name, NA)),
                     nudge_x = g1$data$x * .15, 
                     nudge_y = g1$data$y * .15,
                     repel = TRUE, 
                     point.padding = NA, 
                     box.padding = 0, 
                     force = 0.1)

It seems like all these high eigenvector scoring participants are from the same social circle. With that, I am guessing that within these group of participants are highly connected to one another within this social group, therefore, boosting their eigenvector scores tremendously.


Preparation to plot interactive graphs

The following steps were done to transform the data,

city_edges_agg_all_v2 <- city_edges %>%
  mutate(Yr_Mth = format(as.Date(timestamp), "%Y-%m")) %>%
  mutate(Day = wday(timestamp,
                    label = TRUE,
                    abbr = FALSE)) %>% 
  rename(from = participantIdFrom, 
         to = participantIdTo) %>% 
  group_by(from, to, Day) %>% 
  summarise(Weight = n(), 
            .groups = "keep") %>%
  filter(from != to)%>%
  filter(Weight > 49) %>%
  ungroup()


Further transformation of the data

The following code chunk will be extracting the unique participants involved in the analysis after our previous step of edge weight filtering (above 49), via usage of these functions, stack() and unique().

nodes_involved_all_v2 <- city_edges_agg_all_v2 %>% 
  select(from, to) %>% 
  stack() %>%
  mutate(participantId = (values)) %>% 
  select(participantId) %>% 
  arrange(participantId) %>%
  unique()

The following code chunk will then merge (inner join) the participants dataset (city_nodes after importing) with the list of unique participants involved in this analysis (after filtering them based on edge weight). The output will provide a list of unique participants and their respective details.

participants_nodes_all_v2 <- merge(x=city_nodes,y=nodes_involved_all_v2,by="participantId") %>% 
  mutate(participantId = as.character(participantId)) %>%
  rename(id = participantId,
         group = educationLevel,
         `Interest Group` = interestGroup,
         `Have Kids` = haveKids,
         `Household Size` = householdSize,
         `Joviality` = joviality,
         `Age` = age)


Plotting interactive graphs - Network Graph with Betweenness Centrality focus

As mentioned earlier, visNetwork will be used to chart out an interactive network graph.

With this new found interactivity, we can now select the participant (via ID) to locate their position in their respective social network. Also, we are able to filter for nodes by their group (in this case, set as education level).

visNetwork(participants_nodes_all_v2,
           city_edges_agg_all_v2) %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(arrows = "to",
           smooth = list(enabled = TRUE,
                         type = "curvedCW")) %>% 
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = list(enabled = TRUE),
             selectedBy = list(variable = "group")) %>%
  visLegend(main = "Education Level Legend",
            position = "right") %>%
  visEdges(arrows = "to", 
           color = list(highlight="black")) %>% 
  visLayout(randomSeed = 123)

Education Level Legend


Plotting interactive graphs - Network Graph with Betweenness Centrality focus

For curiosity, I’ve attempted to plot another interactive network graph, but with Interest Group as the group instead.

participants_nodes_all_v3 <- merge(x=city_nodes,y=nodes_involved_all_v2,by="participantId") %>% 
  mutate(participantId = as.character(participantId)) %>%
  rename(id = participantId,
         group = interestGroup,
         `Education Level` = educationLevel,
         `Have Kids` = haveKids,
         `Household Size` = householdSize,
         `Joviality` = joviality,
         `Age` = age)

With this new found interactivity, we can now select the participant (via ID) to locate their position in their respective social network. Also, we are able to filter for nodes by their group (in this case, set as interest group).

visNetwork(participants_nodes_all_v3,
           city_edges_agg_all_v2) %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(arrows = "to",
           smooth = list(enabled = TRUE,
                         type = "curvedCW")) %>% 
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = list(enabled = TRUE),
             selectedBy = list(variable = "group")) %>%
  visLegend(main = "Interest Group Legend",
            position = "right") %>%
  visEdges(arrows = "to",
           color = list(highlight="black")) %>% 
  visLayout(randomSeed = 123)

Interest Group Legend

One interesting observation was that in those larger network circles, participants from Interest Group D and J seems to be critical in connecting the smaller social groups together, forming a larger social circle.


Conclusion

In summary, from the graphs plotted, my observations are,

More can be done to analyze additional details of the above observations so that the analysis result can be more holistic.