Creating data visualisation beyond default
In this take-home exercise, I will be reviewing and providing my critics (in terms of clarity and aesthetics) on one of my classmate’s Take-home Exercise. On top of that, I will attempt to improve on the original visualization, by using the data visualization principles and best practices learnt in Lesson 1 and 2.
Before I get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, I will install the R packages and load them onto R environment.
The chunk code below will do the trick.
packages = c('tidyverse')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports Participants.csv from the data
folder, into R by using read_csv()
of readr and
save it as an tibble dataframe called participants_data.
participants_data <- read_csv("data/Participants.csv")
In this plot below, the author wanted to see the relationship between participants (grouped by their education level) and their choice of having kids.
pk1 <- ggplot(data = participants_data,
aes(x = educationLevel, fill = haveKids))+
geom_bar() +
ggtitle("Have kids according to Education Level")
pk1
Here, the bar plots are displayed with respect to their count. It is clear to see that there is a larger proportion of High School/College participants compared to the other education qualification groups.
While it does show that there were more people not having kids, it does not tell us whether the proportion of having kids or not, is similar across various education level.
Aesthetically, it can be improved by,
With that, below are some of the adjustments that I’ve made.
Changes:
scale_y_continuousscales::percent in scale_y_continuousscale_x_discretemutate
functiongeom_text
to add the counts inlabs
to alter the title, axis and legend labelsthemeparticipants_data %>%
mutate(`Education Level` = fct_relevel(educationLevel,"Low","HighSchoolOrCollege","Graduate","Bachelors")) %>%
ggplot(aes(x = `Education Level`,
fill = haveKids)) +
geom_bar(position = 'fill') +
geom_text(stat = 'count',
aes(label = stat(count)),
position = position_fill(vjust=0.9)) +
scale_y_continuous(breaks = seq(0,1, by = 0.1),
labels = scales::percent) +
scale_x_discrete(labels = c("Low", "High School Or College", "Bachelors", "Graduate")) +
labs(y = 'Percentage\nof\nParticipants',
title = "Percentage Distribution of Participants' Education Level",
subtitle ='With respect to having kids or not',
fill ='Have Kids?') +
theme(axis.title.y = element_text(angle = 0),
axis.ticks.x = element_blank(),
panel.background = element_blank(),
axis.line = element_line(color = 'grey'))
With this modified plot, we are able to see that the education level seems to play a part in whether a participant is likely to have kids or not. The higher the education level, the less likely that the participant would have kids.
In this plot below, the same author wanted to see the relationship between participants (grouped by their interest group) and their respective interest groups.
ggplot(data = participants_data,
aes(x = interestGroup, fill = haveKids))+
geom_bar() +
ggtitle("Have kids according to Interest Group")
Similarly, the bar plots are displayed with respect to their count. On a high level, it does show that there were more people not having kids, but it does not tell us whether the proportion of having kids or not, is similar across various interest groups.
Aesthetically, similar improvements can be made too by,
With that, below are some of the adjustments that I’ve made.
Changes:
scale_y_continuoussec_axis
with scale_y_continuousscales::percent in scale_y_continuousmutate
functiongeom_text
to add the countslabs
to alter the title, axis and legend labelsthemeparticipants_data %>%
mutate(`Interest Group` = fct_relevel(interestGroup,"D","F","B","C","I","E","G","H","J","A")) %>%
ggplot(aes(x = `Interest Group`,
fill = haveKids)) +
geom_bar(position = 'fill') +
geom_text(stat = 'count',
aes(label = stat(count)),
position = position_fill(vjust = 0.9)) +
scale_y_continuous(breaks = seq(0,1, by = 0.1),
labels = scales::percent,
sec.axis = sec_axis(trans = ~.,
labels = scales::percent,
breaks = seq(0,1, by = 0.1))) +
labs(y = 'Percentage\nof\nParticipants',
title = "Percentage Distribution of Participants' Interest Group",
subtitle = 'With respect to having kids or not',
fill = 'Have Kids?') +
theme(axis.title.y = element_text(angle = 0),
axis.ticks.x = element_blank(),
panel.background = element_blank(),
axis.line = element_line(color= 'grey'))
With this modified plot, we are able to see that the interest groups does not seem to have a strong influence to whether a participant is likely to have kids or not. However, it is interesting to note that ~40% of the participants in Interest Group D have kids, nearly twice as much in proportion to participants in Interest Group A.
As I was making the improvements for the previous plot, I had a thought to combine the plots with facet_grid, in an attempt to see what other interesting insights I can pull out from these data.
On top of applying similar aesthetics from previous attempts, additional modifications were added too,
facet_grid
to lay out the plots in grids
labeller
parameter within facet_gridparticipants_data %>%
mutate(`Interest Group` = interestGroup) %>%
ggplot(aes(x = `Interest Group`,
fill = haveKids)) +
geom_bar(position ='fill') +
facet_grid(~educationLevel,
labeller = labeller(educationLevel = c("HighSchoolOrCollege" = "High School Or College",
"Bachelors" = "Bachelors",
"Graduate" = "Graduate",
"Low" = "Low"))) +
scale_y_continuous(breaks = seq(0,1, by = 0.1),
labels = scales::percent,
sec.axis = sec_axis(trans = ~.,
labels = scales::percent,
breaks = seq(0,1, by = 0.1))) +
labs(y = 'Percentage\nof\nParticipants',
title = "Percentage Distribution of Participants' Interest Group and Education Level",
subtitle ='With respect to having kids or not',
fill = 'Have Kids?') +
theme(axis.title.y = element_text(angle = 0),
axis.ticks.x = element_blank(),
panel.background = element_blank(),
axis.line = element_line(color = 'grey'))
Interestingly, within the Low education level participants,