Model Performance Comparison

Below is a summary of the training- and test-set performance for each model.

| Model | Train \(R^2\) | Train RMSE | Train MAE | Test \(R^2\) | Test RMSE | Test MAE |
|---|---|---|---|---|---|---|
| Linear Regression (LM) | 0.826 | 360.692 | 251.108 | 0.830 | 347.020 | 245.548 |
| GLM (Negative Binomial) | 0.817 | 368.451 | 254.049 | 0.819 | 359.205 | 252.532 |
| GAM (Negative Binomial) | 0.829 | 355.891 | 246.153 | 0.833 | 343.998 | 246.577 |
| Random Forest (Default) | 0.965 | 174.114 | 122.431 | 0.845 | 336.344 | 250.219 |
| Random Forest (CV-Tuned) | 0.952 | 193.634 | 132.163 | 0.846 | 330.028 | 236.966 |
| XGBoost (Default) | 0.978 | 129.561 | 85.818 | 0.830 | 347.255 | 247.811 |
| XGBoost (CV-Tuned) | 0.905 | 268.267 | 184.694 | 0.845 | 331.139 | 238.871 |

Model Descriptions

Linear Regression

Linear regression was used as a baseline model to understand how bike usage responds to weather features under the assumption of a linear relationship. After stepwise variable selection and a Box-Cox transformation to stabilize variance, the final model included key predictors such as hour of day, temperature, humidity, and weather condition. Its main limitation is the linearity assumption itself, which may oversimplify the complex interactions between weather and bike usage.
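
For concreteness, a minimal sketch of this workflow in R is shown below. It assumes a hypothetical data frame bike_data with a trips count column and the weather features named above; the variable names, the Box-Cox grid, and the exact transform are illustrative, not the project's actual code.

```r
library(MASS)  # boxcox()

# Hypothetical data frame: hourly trip counts plus weather features
base_lm <- lm(trips ~ hour + temperature + humidity + weather_condition,
              data = bike_data)

# Box-Cox scan for a variance-stabilizing power transform of the response
# (requires a strictly positive response; zero-trip hours would need handling)
bc     <- boxcox(base_lm, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Refit on the transformed response, then prune via stepwise AIC selection
bc_lm    <- lm(I((trips^lambda - 1) / lambda) ~ hour + temperature +
                 humidity + weather_condition, data = bike_data)
final_lm <- step(bc_lm, direction = "both", trace = FALSE)
summary(final_lm)
```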

GLM

Initially, a Poisson GLM was applied to model hourly bike trip counts, treating them as count data. Model diagnostics, however, revealed substantial overdispersion, violating the Poisson assumption that the variance equals the mean. To address this, the final model used a Negative Binomial distribution, which accommodates the excess variance through an additional dispersion parameter. The fitted model included hour, temperature, humidity, and weather condition, and it confirmed expected effects such as higher demand during commuting hours and lower usage in unfavorable weather.
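
This sequence maps onto a short R sketch, again assuming the hypothetical bike_data frame and column names from the previous example:

```r
library(MASS)  # glm.nb()

# Poisson fit first; under the Poisson assumption, the Pearson chi-square
# statistic divided by the residual degrees of freedom should be near 1
pois_fit <- glm(trips ~ hour + temperature + humidity + weather_condition,
                data = bike_data, family = poisson)
sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual

# A ratio well above 1 motivates the Negative Binomial, whose dispersion
# parameter theta is estimated alongside the regression coefficients
nb_fit <- glm.nb(trips ~ hour + temperature + humidity + weather_condition,
                 data = bike_data)
summary(nb_fit)
```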

GAM

To flexibly capture nonlinear patterns, a GAM with a Negative Binomial family was fitted using smooth terms for key continuous variables. The final model incorporated smooth splines for temperature and wind speed, allowing the effect of these variables to vary across their range. The results showed a sharp increase in ridership with temperature up to about 20°C before flattening, while wind speed had a steady negative impact. By relaxing linearity assumptions, the GAM uncovered important behavioral dynamics that were missed by simpler models.
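
In R this corresponds to an mgcv fit along the following lines. This is a sketch: the split between smooth and parametric terms is inferred from the description, and wind_speed is an assumed column name.

```r
library(mgcv)

# nb() estimates the Negative Binomial theta internally; REML smoothing
gam_fit <- gam(trips ~ s(temperature) + s(wind_speed) + hour +
                 humidity + weather_condition,
               data = bike_data, family = nb(), method = "REML")

summary(gam_fit)          # effective df and significance of each smooth
plot(gam_fit, pages = 1)  # partial effects, e.g. the rise-then-plateau
                          # in ridership near 20°C
```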

Random Forest

Random Forests combine the predictions of many individual decision trees, using bootstrapped samples and random subsets of features to reduce variance and limit overfitting. In this project, the Random Forest was tuned using 5-fold cross-validation, with the final configuration using mtry = 4, ntree = 500, and nodesize = 10. This model delivered the strongest test-set performance in the comparison (test \(R^2\) = 0.846, RMSE = 330.028, MAE = 236.966). The variable importance analysis revealed that hour of day and temperature were by far the most influential factors, indicating that temporal patterns and thermal comfort are key drivers of bikeshare usage in Toronto. The full hyperparameter grid used for tuning is detailed in Appendix B.
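
A sketch of the final fit in R is shown below; the tuning grid itself lives in Appendix B, and train_data, test_data, and the seed are hypothetical names for the project's train/test split.

```r
library(randomForest)

set.seed(42)  # illustrative seed for reproducibility
# Final configuration selected by 5-fold cross-validation
rf_fit <- randomForest(trips ~ ., data = train_data,
                       mtry = 4, ntree = 500, nodesize = 10,
                       importance = TRUE)

varImpPlot(rf_fit)  # hour of day and temperature dominate the rankings

# Held-out predictions behind the test-set metrics in the table above
rf_preds <- predict(rf_fit, newdata = test_data)
```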

XGBoost

XGBoost (Extreme Gradient Boosting) is a tree-based boosting algorithm that builds an ensemble sequentially, with each new tree fitted to correct the errors of the trees before it. After extensive cross-validated tuning, the best-performing model used eta = 0.05, max_depth = 6, min_child_weight = 3, subsample = 0.8, and colsample_bytree = 1. Tuning markedly reduced overfitting relative to the default fit (train \(R^2\) fell from 0.978 to 0.905 while test \(R^2\) rose from 0.830 to 0.845). Like Random Forest, XGBoost identified hour of day and temperature as the most critical predictors, though its feature importance plot showed influence concentrated more steeply on fewer variables, reflecting the model’s ability to prioritize the strongest signals in the data. The full hyperparameter grid used for tuning is detailed in Appendix B.
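
The tuned configuration translates to roughly the following xgboost call. This is a sketch: only the tuned hyperparameters come from the text, so the objective, number of boosting rounds, and encoding step are assumptions.

```r
library(xgboost)

# xgboost requires a numeric matrix; model.matrix one-hot encodes factors
X_train <- model.matrix(trips ~ . - 1, data = train_data)
dtrain  <- xgb.DMatrix(X_train, label = train_data$trips)

params <- list(objective = "reg:squarederror",
               eta = 0.05, max_depth = 6, min_child_weight = 3,
               subsample = 0.8, colsample_bytree = 1)

# nrounds is illustrative; in practice it would be chosen with xgb.cv()
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 500)

# Gain-based importance, concentrated on hour and temperature
head(xgb.importance(model = xgb_fit))
```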