Analysis and display of Engagement’s economy.
This takehome exercise aims to sharpen the skill of building data visualisations programmatically using the appropriate tidyverse family of packages, and of preparing statistical graphics with ggplot2 and its extensions. The specific requirements can be found in the screenshot below.
You can find the links to the datasets here.
My analysis will focus on Question 2.
“How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns? Describe your rationale for your answers. Limit your response to 10 images and 500 words.”
I will focus on the dataset FinancialJournal since it contains all the information I need.
To guide my exploration, I will remove all ambiguity by defining what financial health means.
Financial health is defined as a state of:
1. A steady flow of income. To determine this, we can use the category "Wage" from the dataset FinancialJournal.
2. Rare changes in expenses. To determine this, we can use the categories "Education", "Food", "Recreation" and "Shelter" from the dataset FinancialJournal.
3. A growing cash balance. To determine this, we can use the difference between income and the different types of costs in the dataset FinancialJournal.
I will use Trelliscopejs to survey the data and pick out certain peculiarities.
Based on the insights discovered, I will try to create a unifying visualisation solution.
Having locked down the approach, we use it to explore the data in R, with R for Data Science as a reference.
The packages required are tidyverse, trelliscopejs, ggplot2, gapminder, lubridate, dplyr, plotly, ggdist, and reshape.
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('tidyverse', 'trelliscopejs', 'ggplot2', 'gapminder',
             'lubridate', 'dplyr', 'plotly', 'ggdist', 'reshape')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
The main dataset used is FinancialJournal. The read_csv() function of the readr package is used to import the data.
financialJournal <- read_csv("data/FinancialJournal.csv")
glimpse(financialJournal)
We also convert the data into RDS format since the CSV exceeds Git's recommended file size limit.
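The conversion can be sketched as follows. A small stand-in data frame is used here so the chunk is self-contained; in the report this would be the imported financialJournal tibble, written under data/ rather than a temporary directory.

```r
# Stand-in for the imported financialJournal tibble.
df <- data.frame(participantId = c(0, 1), amount = c(10.5, -3.2))

# Write the data as a compressed RDS file and read it back.
rds_path <- file.path(tempdir(), "FinancialJournal.rds")
saveRDS(df, rds_path)
df2 <- readRDS(rds_path)

identical(df, df2)   # the round-trip is lossless
```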
Instead of going through the trouble of mutating data and creating facet plots, we use Trelliscopejs.
qplot(timestamp, amount, data = financialJournal) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellis/",
self_contained = TRUE)
The Trelliscopejs result from using the full FinancialJournal dataset was not useful, because all categories were lumped together. Also, a quick look at the original CSV file shows that there were duplicate values.
For this analysis, I manually removed the duplicates from the CSV file. Next, I created sub-datasets for each category using the code chunk below.
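Such duplicate rows could also be dropped programmatically with dplyr's distinct(). A sketch, using a small stand-in tibble with the assumed FinancialJournal columns (note the repeated second row):

```r
library(dplyr)

# Stand-in for financialJournal; the last two rows are duplicates.
journal <- tibble(
  participantId = c(0, 0, 0),
  timestamp = as.POSIXct(c("2022-03-01", "2022-03-02", "2022-03-02"), tz = "UTC"),
  category = c("Wage", "Food", "Food"),
  amount = c(100, -4.5, -4.5)
)

deduped <- distinct(journal)   # keep only unique rows
nrow(deduped)                  # 2
```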
# Creating sub-dataset for the category "Education"
educationExpense <- filter(financialJournal, category == 'Education')
head(educationExpense)
# Creating sub-dataset for the category "Food"
foodExpense <- filter(financialJournal, category == 'Food')
head(foodExpense)
# Creating sub-dataset for the category "Recreation"
recreationExpense <- filter(financialJournal, category == 'Recreation')
head(recreationExpense)
# Creating sub-dataset for the category "Shelter"
shelterExpense <- filter(financialJournal, category == 'Shelter')
head(shelterExpense)
# Creating sub-dataset for the category "Wage"
wageIncome <- filter(financialJournal, category == 'Wage')
head(wageIncome)
Lastly, I create individual Trelliscopejs plots for each category using the following code chunks. I run them individually to make it easier to see which is which.
# For education expense
qplot(timestamp, amount, data = educationExpense) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellisedu/",
self_contained = TRUE)
# For food expense
qplot(timestamp, amount, data = foodExpense) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellisfd/",
self_contained = TRUE)
# For recreation expense
qplot(timestamp, amount, data = recreationExpense) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellisrec/",
self_contained = TRUE)
# For shelter expense
qplot(timestamp, amount, data = shelterExpense) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellisshel/",
self_contained = TRUE)
# For wage income
qplot(timestamp, amount, data = wageIncome) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trelliswage/",
self_contained = TRUE)
I then noticed that I was missing something important: the total expense of each participant in Engagement.
First, I filter away all income data using the code chunk below.
Then, I convert the timestamps into months using the code chunk below.
Now, we sum the amounts across all expense categories.
Lastly, I create a Trelliscopejs plot for this too.
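These three steps can be sketched with dplyr and lubridate. The tibble below is a small stand-in with the assumed FinancialJournal columns; monthly bucketing via floor_date() is an assumption about how "converting timestamps into months" was done.

```r
library(dplyr)
library(lubridate)

# Stand-in for financialJournal (assumed columns).
journal <- tibble(
  participantId = c(0, 0, 0),
  timestamp = as.POSIXct(c("2022-03-01 08:00:00", "2022-03-15 12:00:00",
                           "2022-03-20 19:00:00"), tz = "UTC"),
  category = c("Wage", "Food", "Shelter"),
  amount = c(100, -4.5, -700)
)

allExpense <- journal %>%
  filter(category != "Wage") %>%                         # drop income rows
  mutate(timestamp = floor_date(timestamp, "month")) %>% # bucket by month
  group_by(participantId, timestamp) %>%
  summarise(amount = sum(amount), .groups = "drop")      # total expense per participant-month

allExpense$amount   # -704.5
```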
# For total expense
qplot(timestamp, amount, data = allExpense) +
theme_bw() +
facet_trelliscope(~ participantId,
nrow = 2,
ncol = 5,
path = "trellisexp/",
self_contained = TRUE)
Wage seems to largely follow a uniform distribution on a per-participant basis.
Most of the variance seems to come from one-off changes in wages.
However, there are also cases where the change in income is more constant.
In general, the maximum income earned was $4096.52 and the least amount earned was $0.83.
Combining the expenses shows that, in general, there is a baseline amount spent each day by the participants, punctuated by occasional high spending.
This is supported by the overall average breakdown, where each household spends at least some amount greater than zero. We can tell this from the *amount_max* table because expense values are negative.
Overall, the largest expense is $1637.71 and the smallest expense is $9.92 on average.
Education expense looks like it follows a uniform distribution per participant, and it is always part of a participant's expenses. A closer look shows that it ranges from $12.81 to $91.14 on average.
Lastly, it seems to be a one-time expense for certain users.
It would be premature for me to determine any form of distribution here based on the Trelliscopejs plot.
That being said, it seems to centre around $4.75 and have minimal variance. Also, the most spent on food is $14.84 and the least spent is $4.22 on average.
It is also interesting to note that food is a one-time expense for some participants.
It would be premature for me to determine any form of distribution here based on the Trelliscopejs plot.
That being said, it seems to have 2 peaks and might follow a bimodal distribution. In this case, we can hypothesise that recreation is both a daily necessity and a once-in-a-while splurge for most participants. The largest amount spent on recreation is $36.53 and the smallest amount spent is $0.53 on average.
In addition, there are instances of participants where recreation is a one-time expense.
In general, shelter seems to follow a uniform distribution per participant as well.
That being said, it seems to have some sort of variance.
Upon closer inspection, it seems to be because of a one-time upgrade by certain participants.
However, there are also cases with more than one change in shelter expense. These participants could be renters, though it should be noted that they are a minority.
In general, shelter expense seems to follow a normal distribution with the maximum expense being $1556.36 and the minimum expense being $231.71.
We now have an understanding of the income and expenses of participants in Engagement.
There seem to be "dud" values where a participant simply trails off and has no income or expense information over some period of time; in most cases, once a participant trails off, no new data appears. For income, this could mean that the participant has stopped working. For expenses, a stop in recreation expense is fine, but it is difficult to account for a stop in shelter, education and especially food expense.
Income is constant for almost all participants. So is education expense and shelter expense. Hence, most volatility in a participant’s available balance will be due to changes in other expenses.
The exploration so far does not investigate the relationship between income and expense.
I focus on evaluating the difference between income and expense here.
The goal here is to create a dataframe with expense, income and balance.
First, we convert the wageIncome dataframe so that it uses the same timestamps as the allExpense dataframe using the code chunk below.
We change the column names too with the following code chunk. We also need to check that the two dataframes have the same number of rows.
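A sketch of these two steps, using a small stand-in wage tibble: the monthly flooring mirrors the conversion applied to allExpense, and the renamed column incomeAmount matches the name used in the summarise and mutate calls later on.

```r
library(dplyr)
library(lubridate)

# Stand-in for wageIncome (assumed columns).
wageIncome <- tibble(
  participantId = c(0, 0),
  timestamp = as.POSIXct(c("2022-03-04 08:00:00", "2022-03-18 08:00:00"), tz = "UTC"),
  amount = c(50, 50)
)

wageIncome <- wageIncome %>%
  mutate(timestamp = floor_date(timestamp, "month")) %>% # match allExpense's monthly timestamps
  rename(incomeAmount = amount)                          # match the naming used later

nrow(wageIncome)   # compare against nrow(allExpense)
```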
We also group all wages so that there is at most 1 wage datapoint for each timestamp.
# Group wages so there is at most one income datapoint per participant per timestamp
wageIncome <- wageIncome %>%
group_by(participantId, timestamp) %>%
summarise(incomeAmount = sum(incomeAmount, na.rm = TRUE))
Now we create a completely new dataframe using the wageIncome and allExpense dataframes with the code chunk below.
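A sketch of this step, on small stand-in tibbles: both data frames are assumed to share participantId and monthly timestamp columns, with expenses held in expenseAmount. A full join is one reasonable choice here, since it keeps participant-months that appear in only one of the two tables.

```r
library(dplyr)

# Stand-ins for the two aggregated data frames.
wageIncome <- tibble(
  participantId = 0,
  timestamp = as.POSIXct("2022-03-01", tz = "UTC"),
  incomeAmount = 100
)
allExpense <- tibble(
  participantId = 0,
  timestamp = as.POSIXct("2022-03-01", tz = "UTC"),
  expenseAmount = -704.5
)

# One row per participant-month, with income and expense side by side.
incomeAndExpenseTable <- full_join(wageIncome, allExpense,
                                   by = c("participantId", "timestamp"))
```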
Lastly, we add a balance column to the dataframe.
incomeAndExpenseTable <- incomeAndExpenseTable %>%
mutate(balance = incomeAmount + expenseAmount)
head(incomeAndExpenseTable)
To create a Trelliscopejs plot for each participant's balance, we first create a dataset for balance.
# Get all balance related data
allBalance <- data.frame(incomeAndExpenseTable$participantId, incomeAndExpenseTable$timestamp, incomeAndExpenseTable$balance)
head(allBalance)
Now, we create the Trelliscopejs plot.
# For balance
qplot(incomeAndExpenseTable.timestamp, incomeAndExpenseTable.balance, data = allBalance) +
theme_bw() +
facet_trelliscope(~ incomeAndExpenseTable.participantId,
nrow = 2,
ncol = 5,
path = "trellisbal/",
self_contained = TRUE)
Balance seems to have a uniform distribution for each participant.
This is backed up by the low variance, which tends to centre around 0.
Also, most variance seems to come from occasional changes in balance.
While I cannot tell as much from these tables, it is important to note that some people do go into debt. The lowest balance is -$1335.67 while the highest balance is $4081.58.
To recap, the 3 questions are:
1. How does the financial health of the residents change over the period covered by the dataset?
2. How do wages compare to the overall cost of living in Engagement?
3. Are there groups that appear to exhibit similar patterns? Describe your rationale for your answers.
Questions 1 and 2 seem to be the most straightforward, and can be quickly visualised with a multi-line plot using the dataframe incomeAndExpenseTable and its variants.
Let us create 3 plots, one each for income, expense and balance, using the code chunks below.
# Dynamic plot for income
plot_ly(
data = incomeAndExpenseTable,
x = ~timestamp,
y = ~incomeAmount,
type = "scatter",
mode = "lines",
color = ~participantId) %>%
layout(title = 'Income for all Participants in Engagement over time',
xaxis = list(
range = c(as.numeric(as.POSIXct("2022-03-01", format="%Y-%m-%d"))*1000, as.numeric(as.POSIXct("2022-11-05", format="%Y-%m-%d"))*1000),
type = "date"))
Note: Adding the static screenshot as I was facing memory issues during knit
# Dynamic plot for expense
plot_ly(
data = incomeAndExpenseTable,
x = ~timestamp,
y = ~expenseAmount,
type = "scatter",
mode = "lines",
color = ~participantId) %>%
layout(title = 'Expense for all Participants in Engagement over time',
xaxis = list(
range = c(as.numeric(as.POSIXct("2022-03-01", format="%Y-%m-%d"))*1000, as.numeric(as.POSIXct("2022-11-05", format="%Y-%m-%d"))*1000),
type = "date"))
Note: Adding the static screenshot as I was facing memory issues during knit
# Dynamic plot for balance
plot_ly(
data = incomeAndExpenseTable,
x = ~timestamp,
y = ~balance,
type = "scatter",
mode = "lines",
color = ~participantId) %>%
layout(title = 'Balance for all Participants in Engagement over time',
xaxis = list(
range = c(as.numeric(as.POSIXct("2022-03-01", format="%Y-%m-%d"))*1000, as.numeric(as.POSIXct("2022-11-05", format="%Y-%m-%d"))*1000),
type = "date"))
Note: Adding the static screenshot as I was facing memory issues during knit
Honestly, there are too many data points and the result is a mess. I try to aggregate and plot the average, minimum, maximum and total values over time, by first creating new datasets with the code chunks below.
# Creating data frame for total income against time
totalIncome <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(totalIncome = sum(incomeAmount, na.rm = TRUE))
head(totalIncome)
# Creating data frame for total expense against time
totalExpense <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(totalExpense = sum(expenseAmount, na.rm = TRUE))
head(totalExpense)
# Creating data frame for total balance against time
totalBalance <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(totalBalance = sum(balance, na.rm = TRUE))
head(totalBalance)
I also merge the dataframes because I would like to use the plotly add_trace function to plot all 3 series on the same graph.
# First, we multiply totalExpense by -1 to get absolute values. This makes comparison in the plotly graph easier.
totalExpense$totalExpense <- totalExpense$totalExpense * -1
head(totalExpense)
completeDataframe <- merge(x = totalIncome, y = totalExpense, c("timestamp"), all.x = TRUE)
completeDataframeFinal <- merge(x = completeDataframe, y = totalBalance, c("timestamp"), all.x = TRUE)
head(completeDataframeFinal)
Now, we create an interactive multi-line plot for completeDataframeFinal
plot_ly(
data = completeDataframeFinal,
x = ~timestamp,
y = ~totalIncome,
name = 'totalIncome',
type = "scatter",
mode = "lines"
) %>%
add_trace(y = ~totalExpense,
name = 'totalExpense') %>%
add_trace(y = ~totalBalance,
name = 'totalBalance') %>%
layout(title = 'Total for all Participants in Engagement over time',
yaxis = list(title = 'Amount'))
Note: Adding the static screenshot as I was facing memory issues during knit
Let us do the same for average, maximum and minimum values using the code chunks below.
# Creating plotly graph for average
avgIncome <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(avgIncome = mean(incomeAmount, na.rm = TRUE))
# head(avgIncome)
avgExpense <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(avgExpense = mean(expenseAmount, na.rm = TRUE))
# head(avgExpense)
avgBalance <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(avgBalance = mean(balance, na.rm = TRUE))
# head(avgBalance)
avgExpense$avgExpense <- avgExpense$avgExpense * -1
# head(avgExpense)
completeDataframeAvg <- merge(x = avgIncome, y = avgExpense, c("timestamp"), all.x = TRUE)
completeDataframeAvgFinal <- merge(x = completeDataframeAvg, y = avgBalance, c("timestamp"), all.x = TRUE)
# head(completeDataframeAvgFinal)
plot_ly(
data = completeDataframeAvgFinal,
x = ~timestamp,
y = ~avgIncome,
name = 'avgIncome',
type = "scatter",
mode = "lines"
) %>%
add_trace(y = ~avgExpense,
name = 'avgExpense') %>%
add_trace(y = ~avgBalance,
name = 'avgBalance') %>%
layout(title = 'Average for all Participants in Engagement over time',
yaxis = list(title = 'Amount'))
Note: Adding the static screenshot as I was facing memory issues during knit
# Creating plotly graph for maximum
maxIncome <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(maxIncome = max(incomeAmount, na.rm = TRUE))
# head(maxIncome)
maxExpense <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(maxExpense = max(expenseAmount, na.rm = TRUE))
# head(maxExpense)
maxBalance <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(maxBalance = max(balance, na.rm = TRUE))
# head(maxBalance)
maxExpense$maxExpense <- maxExpense$maxExpense * -1
# head(maxExpense)
completeDataframemax <- merge(x = maxIncome, y = maxExpense, c("timestamp"), all.x = TRUE)
completeDataframemaxFinal <- merge(x = completeDataframemax, y = maxBalance, c("timestamp"), all.x = TRUE)
# head(completeDataframemaxFinal)
plot_ly(
data = completeDataframemaxFinal,
x = ~timestamp,
y = ~maxIncome,
name = 'maxIncome',
type = "scatter",
mode = "lines"
) %>%
add_trace(y = ~maxExpense,
name = 'maxExpense') %>%
add_trace(y = ~maxBalance,
name = 'maxBalance') %>%
layout(title = 'Max for all Participants in Engagement over time',
yaxis = list(title = 'Amount'))
Note: Adding the static screenshot as I was facing memory issues during knit
# Creating plotly graph for minimum
minIncome <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(minIncome = min(incomeAmount, na.rm = TRUE))
# head(minIncome)
minExpense <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(minExpense = min(expenseAmount, na.rm = TRUE))
# head(minExpense)
minBalance <- incomeAndExpenseTable %>%
group_by(timestamp) %>%
summarise(minBalance = min(balance, na.rm = TRUE))
# head(minBalance)
minExpense$minExpense <- minExpense$minExpense * -1
# head(minExpense)
completeDataframemin <- merge(x = minIncome, y = minExpense, c("timestamp"), all.x = TRUE)
completeDataframeminFinal <- merge(x = completeDataframemin, y = minBalance, c("timestamp"), all.x = TRUE)
# head(completeDataframeminFinal)
plot_ly(
data = completeDataframeminFinal,
x = ~timestamp,
y = ~minIncome,
name = 'minIncome',
type = "scatter",
mode = "lines"
) %>%
add_trace(y = ~minExpense,
name = 'minExpense') %>%
add_trace(y = ~minBalance,
name = 'minBalance') %>%
layout(title = 'Minimum for all Participants in Engagement over time',
yaxis = list(title = 'Amount'))
Note: Adding the static screenshot as I was facing memory issues during knit
Next, I would like to get the summary statistics for the total, average, maximum and minimum income, expense and balance. We do that using the code chunks below.
# Getting summary statistics for total
summary(completeDataframeFinal)
# Getting summary statistics for average
summary(completeDataframeAvgFinal)
# Getting summary statistics for max
summary(completeDataframemaxFinal)
# Getting summary statistics for min
summary(completeDataframeminFinal)
To identify a pattern, we need a solid basis of comparison. Since the crux of this question lies in a participant's financial health, I would like to use each participant's balance over time. To make any pattern easier to see, I will use a heatmap.
We start by renaming the columns of allBalance using the code chunk below.
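A sketch of the renaming: allBalance was built with data.frame() from the incomeAndExpenseTable columns, which gives it the long dotted names shown below, so we replace them with the short names used in the ggplot() call.

```r
# Stand-in for allBalance as created earlier with data.frame(),
# whose columns inherit the long dotted names.
allBalance <- data.frame(
  incomeAndExpenseTable.participantId = 0,
  incomeAndExpenseTable.timestamp = as.POSIXct("2022-03-01", tz = "UTC"),
  incomeAndExpenseTable.balance = 120.5
)

# Replace the dotted names with the short names used by the heatmap code.
colnames(allBalance) <- c("participantId", "timestamp", "balance")
names(allBalance)
```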
Then we build a heatmap of each participant over time using the code chunk below, taking reference from the official RPubs guidelines.
balanceHeatMap <- ggplot(allBalance,
                         aes(timestamp, participantId)) +
  geom_tile(aes(fill = balance))
balanceHeatMap
ggsave("heatmap.png")
Just to recap, these are the questions.
1. How does the financial health of the residents change over the period covered by the dataset?
2. How do wages compare to the overall cost of living in Engagement?
Answering Question 1
On average, participant financial health is consistently positive, subject to frequent drops caused by spikes in expense.
Answering Question 2
On average, wage is higher than overall expense. Wage is also more consistent, while expense is more volatile. The volatility is attributable to expenses other than education and shelter, as those two are rather steady for every participant.
Answering Question 1
If we assume that the richest participants have both the highest income and highest expense, then the graph above can be a reflection of the richest participants. In general, the richest participants never face a loss, and know how to minimise their expense spikes.
Answering Question 2
The richer participants know how to keep their income high while moderating the volatility that is commonly associated with expense.
Answering Question 1
If we assume that the poorest participants have the lowest income and expense, then we see that they are consistently earning the bare minimum or in debt. In addition, they are hit harder than average by the spikes in expense.
Answering Question 2
The poorer participants earn lower wages and incur more expense. This could keep them in a poverty cycle.
Are there groups that appear to exhibit similar patterns?
Admittedly, the current heatmap does not solve the problem. A lot more has to be done, and I will continue to work on this over time. More specifically, I would like to