Introduction

The primary objective of this study is to conduct a comprehensive statistical analysis of study hours among Analytics students enrolled in three distinct programs: Business Analytics (BAPG), Crime Analytics (CAGC), and Health Analytics (HAGC). By employing hypothesis testing methods and time series models, the goal is to derive meaningful insights that can help understand the study patterns and differences across these programs.

Data Sources

This analysis leverages two key data sets:

1. Personal Data Set: Collected as part of my coursework in QMM-1001 and QMM-1002, this data set records my daily activities from January 12, 2024, to August 13, 2024. The aim was to capture how I spend my time and evaluate whether I am spending my days the way I want to. The variables recorded include:

Variable                      Type
Date                          Identifier
Hours on Zoom/Class           Quantitative
Hours Studying                Quantitative
Sleep Hours                   Quantitative
Cups of Milk                  Quantitative
Went to College (Yes/No)      Boolean
T-Shirt Color                 Categorical
Productivity Level            Categorical
Walking Distance (in meters)  Quantitative
Semester (1, 2)               Categorical

2. Combined Data Set: This data set includes the study hours of Analytics students enrolled in the three Analytics programs, across two terms. The variables collected are:

Variable                    Type
Date                        Identifier
Hours Studying              Quantitative
Term (F22, W23)             Categorical
Program (BAPG, CAGC, HAGC)  Categorical

Summary Statistics

Summary statistics provide an initial overview of the data, allowing for a quick comparison of my study hours among different student groups. Below are the key statistics for the hours spent studying:

Data Set      Mean       Standard Deviation
Personal      2.0564103  1.5202745
All Students  3.5666611  2.3061209
BAPG          3.3382567  2.2996478
CAGC          4.079602   2.545816
HAGC          3.8118149  2.1271117

From the summary, it is evident that my average daily study hours are lower than those of other Analytics students, including those in my own program (BAPG). Additionally, the standard deviation indicates that my study hours exhibit less variability compared to those of other students.

Crime Analytics students have the highest average study hours and the greatest variability, while Business Analytics students study the least on average. Health Analytics students fall in between, and their relatively low variability indicates more consistent study habits.

Objective

This case study seeks to address the following key questions:

1. Are there differences in the average study times for students in the different analytics streams?

The average study hours appear to differ among students in the three Analytics programs. Given the variability in workload and individual capability, it is important to determine whether these differences are statistically significant or could have arisen by chance. To answer this, I will perform an Analysis of Variance (ANOVA) test, which is suited to comparing the means of more than two groups.

2. Is the distribution of days studied more than 3.13 hours (the average daily study time for students at McGill) the same for students in the different analytics streams (or in other words, independent of program stream)?

At McGill University, the average daily study time is 3.13 hours. By categorizing study days as above or below this threshold, I aim to determine whether the distribution of these days is the same across the programs. A Chi-square test for independence will be used to assess whether the distribution of above and below days is independent of the program.

3. How does your personal study time change over time?

Over the course of two semesters, my study hours likely fluctuated due to various factors. By employing time series analysis, I aim to uncover any underlying trends or patterns in my study habits. This analysis will help in understanding how my study time has evolved and may provide a basis for forecasting future study patterns.

Data Analysis

Part 1 - ANOVA

To understand how study time varies among the three programs, I first take a random sample of 50 days from each program, for 150 days in total. I can then use ANOVA to test whether the average study times differ across the programs.
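This sampling step can be sketched in base R as follows; the name of the full combined data frame, `analytics`, is an assumption, since only `analytics_50` appears in the analysis:

```r
# Hedged sketch of the sampling step. `analytics` (the full combined data
# set with columns Program and Study) is an assumed name.
set.seed(2024)  # make the random sample reproducible
rows <- unlist(lapply(split(seq_len(nrow(analytics)), analytics$Program),
                      function(i) sample(i, 50)))  # 50 random row indices per program
analytics_50 <- analytics[rows, ]
table(analytics_50$Program)  # 50 days each for BAPG, CAGC, HAGC
```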

\(H_0: \mu_B = \mu_C = \mu_H\)

\(H_A:\) At least one mean is different

The null hypothesis states that the mean study hours are the same across all three programs.

The alternative hypothesis states that at least one program has different mean study hours compared to the others.
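The F statistic behind this test compares between-group variation to within-group variation:

\[
F = \frac{MSG}{MSE} = \frac{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \,/\, (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N-k)}
\]

Here \(k = 3\) programs and \(N = 150\) sampled days, giving \(k - 1 = 2\) and \(N - k = 147\) degrees of freedom. A large F indicates the group means differ by more than within-group noise would explain.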

study.anova<-aov(Study~Program, data = analytics_50)

summary(study.anova)
##              Df Sum Sq Mean Sq F value Pr(>F)
## Program       2   27.4  13.722   2.188  0.116
## Residuals   147  921.9   6.271
qf(0.05, df1=2, df2 = 147, lower.tail = FALSE)
## [1] 3.057621

p-value = 0.116

\(\alpha\) = 0.05

p-value > \(\alpha\)

F statistic = 2.188

\(F^{*}\) = 3.057621

F statistic < \(F^{*}\)

Since the p-value is greater than the level of significance (\(\alpha\)), we fail to reject the null hypothesis.

There is no evidence that the average study hours differ for students in the Business Analytics, Crime Analytics, and Health Analytics programs; their mean study hours are statistically indistinguishable. Despite slight variations in the sample means, the ANOVA test indicates that the differences are not statistically significant and could be attributed to random chance. Given that the programs have similar structures, students across them appear to dedicate similar amounts of time to studying, regardless of their specific program.

In order to determine if the test results are valid, we can check the ANOVA conditions:

1. Independence Assumption

As stated above, the samples are randomly selected for all three programs. This random selection ensures that one student's study hours should not affect another student's. Hence, the independence assumption is satisfied.

2. Similar Variance Assumption

We can use box plots to check the variance of study hours for the three programs:

boxplot(analytics_50$Study~analytics_50$Program , col = c("blue", "green", "yellow"),  main = "Variance Comparison for the three Programs")

The three box plots look similar, except that BAPG is slightly larger with a longer upper whisker. This minor variation does not affect the validity of the analysis, so it is safe to assume the similar-variance condition is satisfied.

3. Normal Population Assumption

To check normality, we can plot a histogram of the residuals:

hist(study.anova$residuals, col = "navyblue")

The histogram shows a uni-modal distribution with a right skew, so the normality assumption is not strictly met. However, ANOVA with equal sample sizes works well even when the normality assumption is violated, and since the variances of the three programs are similar (as checked above), the ANOVA results can still be considered valid.

Tukey’s HSD is a post-hoc test that determines which means differ by comparing all possible pairs of means. Although the ANOVA test above found no evidence of any difference in means, Tukey’s test reports a p-value for each pair of programs, letting us compare the means pairwise.
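Each Tukey interval is based on the studentized range distribution. With groups \(i\) and \(j\), the confidence interval for the difference in means takes the form:

\[
(\bar{y}_i - \bar{y}_j) \pm \frac{q_{\alpha,\,k,\,N-k}}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}
\]

As a check, with \(MSE = 6.271\), \(n_i = n_j = 50\), and \(q_{0.05,\,3,\,147} \approx 3.35\), the margin is approximately \(\pm 1.186\), consistent with the intervals in the output below.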

TukeyHSD(study.anova, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Study ~ Program, data = analytics_50)
## 
## $Program
##            diff        lwr      upr     p adj
## CAGC-BAPG 0.704 -0.4818765 1.889876 0.3406516
## HAGC-BAPG 1.024 -0.1618765 2.209876 0.1053188
## HAGC-CAGC 0.320 -0.8658765 1.505876 0.7989048

Program Pair  p-value
CAGC-BAPG     0.3406516
HAGC-BAPG     0.1053188
HAGC-CAGC     0.7989048

The p-values for each pair of programs are greater than the significance level (\(\alpha\)), further supporting the conclusion that the average study hours do not differ significantly across the programs.

A bar plot with error bars representing 95% confidence intervals is a good way to visualize the differences in means.

library(ggplot2)  # mean_cl_normal also requires the Hmisc package

g <- ggplot(data = analytics_50, aes(Program, Study, fill = Program))

g <- g + stat_summary(fun = "mean", geom = "bar")

g <- g + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.2)

g + labs(x = "Program", y = "Study Hours", title = "Comparison of average study hours for 3 programs")

The overlapping error bars in the bar plot show that the 95% confidence intervals for the three programs intersect, indicating no statistically significant difference in average study hours among the programs. This visualization reinforces the idea that students in the three programs dedicate similar amounts of time to studying.

Part 2 - Chi-Square Test

To determine whether there is evidence that the distribution of above- and below-average study days is independent of the program, we will perform a Chi-square test for independence on the sample of 150 random days from students in the three programs.

\(H_0:\) The distribution of above and below days is independent of the program

\(H_A:\) The distribution of above and below days is dependent on the program
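The test compares the observed cell counts to the counts expected under independence:

\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}, \qquad \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

With 2 rows (above/below) and 3 columns (programs), the degrees of freedom are \((2-1)(3-1) = 2\).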

chisq.test(observed.hours)
## 
##  Pearson's Chi-squared test
## 
## data:  observed.hours
## X-squared = 3.4158, df = 2, p-value = 0.1812
qchisq(0.05, 2, lower.tail = FALSE)
## [1] 5.991465

\(\alpha\) = 0.05

p-value = 0.1812

p-value > \(\alpha\)

\(\chi^2\) statistic = 3.4158

\(\chi^{2*}\) = 5.991465

\(\chi^2\) statistic < \(\chi^{2*}\)

Since the p-value is greater than the level of significance (\(\alpha\)), we fail to reject the null hypothesis.

There is no evidence that the distribution of days above and below the average study time depends on the program. Despite variations in study patterns, the test shows that these differences are not statistically significant. This suggests that students across all programs are equally likely to have study days above or below the average; the program does not appear to influence the distribution of above and below study days.

In order to determine if the test results are valid, we can check the Chi-square test conditions:

1. Counted Data Condition: The above and below days are counted categories, so the condition is met.

2. Independence Assumption: The 150 sampled days are randomly selected, so the counts of days are independent of each other; the assumption is satisfied.

3. Randomization Condition: The sample of 150 days is randomly selected, so the condition is met.

4. Sample Size Condition: The frequency in each cell of the table is greater than 5, so the condition is satisfied.

observed.hours 
##        
##         BAPG CAGC HAGC
##   Above   19   27   27
##   Below   31   23   23
chisq.test(observed.hours)$residuals
##        
##               BAPG       CAGC       HAGC
##   Above -1.0811798  0.5405899  0.5405899
##   Below  1.0527227 -0.5263614 -0.5263614

Looking at the table of standardized residuals above, no cell has a value greater than 3 or less than -3; all values fall within 3 standard deviations of the mean, suggesting there are no unusual cells. This further supports the decision to fail to reject the null hypothesis.
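These standardized (Pearson) residuals are computed as:

\[
r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}
\]

For example, for the Above/BAPG cell, \(E = 73 \times 50 / 150 \approx 24.33\), so \(r = (19 - 24.33)/\sqrt{24.33} \approx -1.08\), matching the table above.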

mosaicplot(observed.hours, shade = TRUE, xlab = "Above/Below", ylab = "Program", main ='Mosaic Plot') 

Eyeballing the mosaic plot, all the boxes are white (not shaded). This indicates that the corresponding residuals fall within 2 standard deviations of the mean and none of the values are unusual. This is consistent with the earlier decision to fail to reject the null hypothesis: the distribution of above and below study days does not differ significantly across the three programs.

assocplot(observed.hours, xlab = "Above/Below", ylab = "Program", main ='Association Plot')

In the association plot, the bars represent the z-scores, with black bars indicating positive values and red bars indicating negative values. The bars are short, and while it is hard to judge precisely from the plot whether any z-score exceeds the threshold of 3, the residuals above confirm that none of the values are unusual.

Part 3 - Time Series Analysis

To gain insights into how my study habits have evolved throughout the semester, I will plot a time series of my study hours.

personal.ts<-ts(na.contiguous(personal.all), frequency = 7)


plot.ts(personal.ts, xlab="Days Since 12 January 2024", ylab="Hours Studied", main = "Time Series of Study hours")

The time series reveals significant fluctuations in my study hours over time, which is not surprising given that my study habits have been inconsistent, often influenced by submission deadlines and my mental health.

plot(decompose(personal.ts))

Components of the Time Series

Observed: The overall series shows no clear long-run trend but high volatility throughout the period.

Trend: Initially there is no significant trend; a sharp rise is followed by a quick plunge to the low point. After a rapid recovery, the series trends upward towards the end of the period.

Seasonal : A strong seasonal component is visible, with recurring high peaks and low points on a consistent cycle.

Irregularities: Some irregularities remain after subtracting the combined trend and seasonal components from the actual data. These residuals primarily reflect random noise and do not convey meaningful patterns.

In order to forecast future values of my time series, we want to identify the underlying consistent behaviour of the series. To achieve that, we can use moving average models to smooth away the high volatility and capture the underlying pattern of the series.

Moving average models MA-L can be constructed with different lengths (L) depending on the purpose of the analysis. After checking the errors for several lengths, the MA-9 model performs best, with the lowest error for my time series.
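An MA-L model replaces each value with the average of the most recent L observations:

\[
\widehat{y}_t = \frac{1}{L} \sum_{i=0}^{L-1} y_{t-i}
\]

With \(L = 9\), each smoothed value averages the past nine days of study hours, damping day-to-day spikes while following slower shifts in the series.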

library(TTR)
## Warning: package 'TTR' was built under R version 4.3.3
personal.ma9<-SMA(personal.ts, n=9)

plot.ts(cbind(personal.ts, personal.ma9), plot.type = "single" , col = c("black", "red"), ylab = "Study Hours", main = "Actual vs MA-9 Model")  
legend("top", legend = c("Actual", "MA-9"), col = c("black", "red"), lty = 1, cex = 0.8 )

We can see that the MA-9 model has smoothed out the rapid fluctuations in my hours-studied series. We can later compare its errors with those of other models to see how it stands out.

Another smoothing method that can help forecast future values is exponential smoothing. Because more recent values are more relevant for forecasting, this model assigns weights that decline exponentially into the past.
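In its simplest form, the smoothed value is a weighted blend of the newest observation and the previous smoothed value:

\[
\tilde{y}_t = \alpha y_t + (1-\alpha)\tilde{y}_{t-1} = \alpha y_t + \alpha(1-\alpha) y_{t-1} + \alpha(1-\alpha)^2 y_{t-2} + \cdots
\]

Expanding the recursion shows the weight on an observation \(k\) periods back is \(\alpha(1-\alpha)^k\), which decays exponentially with age.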

The time series data does not exhibit a clear trend component, but the observed high volatility may signal the presence of an underlying seasonal pattern. Given this, the Holt-Winters exponential smoothing model is the most suitable for revealing a consistent pattern in my study habits, providing a basis for more accurate future forecasts.
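For reference, the additive Holt-Winters form (which `HoltWinters()` fits by default) updates level, trend, and seasonal components at each step; with weekly seasonality (frequency 7):

\[
\ell_t = \alpha (y_t - s_{t-7}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})
\]
\[
b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1}
\]
\[
s_t = \gamma (y_t - \ell_t) + (1 - \gamma)\, s_{t-7}
\]

The h-step forecast is then \(\hat{y}_{t+h} = \ell_t + h\, b_t + s_{t+h-7}\), combining the current level and trend with the matching day-of-week seasonal term.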

personal.hw<-HoltWinters(personal.ts)

plot.ts(cbind(personal.ts, personal.hw$fitted[,1]), col = c("Black", "Red"), plot.type = "single", ylab = "Study Hours", main = "Actual vs HW Model")
legend("top", legend = c("Actual Time Series", "HW Model"), col = c("Black", "Red"), lty = 1, cex = 0.6)

The Holt-Winters model closely follows the data, effectively capturing its seasonal behaviors. It responds well to fluctuations, with some peaks exceeding the actual data slightly, while still adjusting reasonably to the lower values, revealing some consistent pattern in my study hours series. To evaluate the model’s accuracy, we can calculate the errors and compare with those of the MA-9 model.

The forecast error at any time is the difference between each data value and forecast value at that point. By comparing the summaries of the forecast errors, we can choose the model with smaller error values. We will consider the following summaries:

1. Root Mean Squared Error (RMSE)

2. Mean Absolute Deviation (MAD) / Mean Absolute Error (MAE)
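With forecast errors \(e_t = y_t - \hat{y}_t\), these summaries are defined as:

\[
RMSE = \sqrt{\frac{1}{n}\sum_{t} e_t^2}, \qquad MAD = \frac{1}{n}\sum_{t} |e_t|
\]

RMSE penalizes large errors more heavily because of the squaring, while MAD weights all errors equally.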

(errors.ma9<-ERRORS(personal.ts, 9))
##        errors
## RMSE 1.509185
## MAD  1.212346
## MAPE      Inf
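`ERRORS()` is a course-provided helper rather than a base R or `forecast` function; a minimal sketch of what it likely computes (its exact definition is an assumption):

```r
# Hypothetical reconstruction of the ERRORS() helper: summarize the fit
# errors of a trailing moving average of length L.
ERRORS <- function(series, L) {
  x <- as.numeric(series)
  fit <- stats::filter(x, rep(1 / L, L), sides = 1)  # trailing MA-L; first L-1 values are NA
  keep <- !is.na(fit)
  e <- x[keep] - fit[keep]                           # fit error at each point
  data.frame(errors = c(
    RMSE = sqrt(mean(e^2)),                          # penalizes large errors
    MAD  = mean(abs(e)),                             # average absolute error
    MAPE = mean(abs(e / x[keep])) * 100              # Inf whenever a day has 0 hours studied
  ))
}
```

The infinite MAPE in the output above is consistent with days on which no studying occurred, since the percentage error divides by the actual value.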
(errors.HW<-accuracy(forecast(personal.hw)))
##                      ME     RMSE      MAE MPE MAPE      MASE       ACF1
## Training set -0.1560744 1.564861 1.191976 NaN  Inf 0.8371128 0.08287641

Model                   RMSE      MAD/MAE
Moving Average (MA-9)   1.509185  1.212346
Holt-Winters            1.564861  1.191976

The MA-9 model has the lower RMSE, while the HW model has the lower MAE, so either could be effective depending on the kind of forecast we want to make. However, a moving average model only supports a simple one-step (naive-style) forecast: the predicted value equals the last smoothed value, so I could only predict my study hours for the next day. Since I would like to predict my study hours several days into the future, Holt-Winters is the more appropriate model.

To gain some insight into how my study hours might vary throughout the week, I will forecast the next 5 days.

personal.forecast<-forecast(personal.hw, h=5)

personal.forecast
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## 15.14286       1.851469 -0.1549157 3.857853 -1.2170315 4.919969
## 15.28571       2.529415  0.4776810 4.581148 -0.6084411 5.667270
## 15.42857       2.333816  0.2331395 4.434492 -0.8788913 5.546522
## 15.57143       1.985715 -0.1674735 4.138903 -1.3073025 5.278732
## 15.71429       2.089563 -0.1196677 4.298794 -1.2891640 5.468291
plot(personal.forecast)

Based on the Holt-Winters model, accurately forecasting my study hours for the next five days is challenging. The 95% prediction interval suggests that my daily study hours could rise to as much as 5-5.5 hours or drop to no studying at all during this period. The 80% interval narrows this range slightly, indicating a possible increase to 4-4.5 hours or a decrease to less than half an hour of studying on the second and third days, with no study predicted for the remaining days. Given that my study time was only 1 hour on the last day of the series, these wide intervals make precise predictions difficult, allowing only for general speculation.

Conclusion

This comprehensive statistical analysis of study hours among students in Business Analytics (BAPG), Crime Analytics (CAGC), and Health Analytics (HAGC) programs has provided valuable insights into study patterns and differences across these streams.

Summary of Findings

ANOVA Analysis: The ANOVA test revealed no statistically significant differences in average study hours among students from the three programs. The variations observed in study hours across the programs could be attributed to random chance. This outcome suggests that, despite some variations, students in all three programs tend to allocate similar amounts of time to studying, reflecting the similar structural demands of the programs.

Chi-Square Test: The Chi-square test demonstrated that the distribution of study days above and below the average study time of 3.13 hours is independent of the program. This result indicates that students across different programs exhibit similar patterns in their study time distributions, reinforcing the notion that the program does not significantly influence study day distributions.

Time Series Analysis: The time series analysis of personal study habits revealed significant fluctuations over time, possibly influenced by factors such as academic demands and my mental health. The MA-9 moving average model effectively smoothed out these fluctuations, highlighting the underlying pattern of study hours with a focus on immediate trends. The Holt-Winters Exponential Smoothing model, on the other hand, captured seasonal behaviors within the data, providing a view of potential future patterns.

Although the MA-9 model exhibited lower RMSE, indicating precision in capturing short-term trends, the Holt-Winters model demonstrated lower Mean Absolute Error (MAE) and proved more suitable for forecasting study hours over multiple days. The wide confidence intervals generated by the Holt-Winters model, however, suggest challenges in making precise predictions, particularly given the variability in past study habits. Nevertheless, this model offers valuable insights for planning study schedules over an extended period, forecasting a range of possible outcomes for study hours in the coming days.

The analysis highlights that study habits among Analytics students are generally consistent across different programs, with no significant differences in average study hours or distribution patterns. Furthermore, the insights gained from the time series analysis and forecasting models can help in understanding and predicting future study patterns.