Data Analysis & Statistical Modeling
Course: University of Michigan STATS 415: Data Mining and Statistical Learning (Fall 2023)
Stock Market Direction Prediction (R) September 2023
Focus Areas: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression
-
Implemented Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) to classify daily stock market direction on the Smarket dataset from lagged returns and trading volume, comparing the resulting decision boundaries.
-
Built a logistic regression model to estimate market direction probabilities, interpreting coefficients and statistical significance.
-
Evaluated model accuracy using confusion matrices, error rates, and cross-validation.
Vehicle Fuel Efficiency (R) September 2023
Focus Areas: Polynomial Regression, Cross-Validation, Bias-Variance Tradeoff
-
Built polynomial regression models to predict miles per gallon (mpg) and optimized model complexity using cross-validation.
Financial Risk Prediction (R) October 2023
Focus Areas: Logistic Regression, Cross-Validation, Financial Risk Modeling
-
Developed a logistic regression model to predict loan default probabilities, applying k-Fold Cross-Validation for performance evaluation.
Housing Price Analysis (R) October 2023
Focus Areas: Bootstrap Resampling, Confidence Intervals, Statistical Analysis
-
Used bootstrap resampling to estimate confidence intervals for median housing prices, analyzing neighborhood price variations.
College Acceptance Rate Prediction (R) October 2023
Focus Areas: Ridge & Lasso Regression, Principal Component Regression (PCR), Gradient Boosting
-
Applied ridge and lasso regression to predict college acceptance rates, comparing them against Principal Component Regression (PCR) to select the final model.
-
Developed a gradient boosting model, identifying graduation rate as the most influential predictor.
Crabs Species Classification (R) October 2023
Focus Areas: Linear Classification, Hyperplane Distance, Misclassification Analysis
-
Built a linear classifier for crab species identification, calculating Euclidean distances from observations to the separating hyperplane and analyzing misclassification errors.
Air Pollution & Industrialization Analysis (R) October 2023
Focus Areas: Nonlinear Regression, Natural Splines, Smoothing Splines, Bootstrapping
-
Modeled the nonlinear relationship between industrialization and air pollution (NOx) using natural splines, smoothing splines, and bootstrapping.
Health Data Analysis (R) October 2023 – November 2023
Focus Areas: Ridge Regression, Bootstrap Resampling, k-Nearest Neighbors (KNN), Cross-Validation
-
Examined the impact of maternal smoking on infant birth weight using NHANES data (5,000+ entries).
-
Applied Ridge Regression to predict birth weight while addressing collinearity.
-
Performed 10-fold Cross-Validation to optimize the ridge penalty parameter (λ) using glmnet.
-
Used Bootstrap Resampling (1,000 samples) to construct 99% Confidence Intervals for the effect of maternal smoking on birth weight.
-
Found that babies of mothers who smoked had significantly lower birth weights, with no overlap between the two groups' confidence intervals.
-
Compiled a comprehensive report detailing findings and ensuring analysis reproducibility.
Crime Rate Clustering & Analysis (R) November 2023
Focus Areas: Hierarchical Clustering, K-Means Clustering
-
Performed hierarchical clustering on the USArrests dataset, grouping U.S. states based on crime statistics.
-
Compared complete-linkage and single-linkage hierarchical clustering with K-Means clustering to determine the best cluster structure.
-
Analyzed cluster consistency by computing silhouette coefficients, identifying the best-performing method.
Course: University of Michigan STATS 413: Applied Regression Analysis (Fall 2023)
Prostate Cancer Risk Prediction (R) November 2023
Focus Areas: Feature Selection, Model Evaluation, Backward Elimination, AIC, Adjusted R², Mallows’ Cp
-
Developed multiple regression models to predict log prostate-specific antigen (lpsa) levels, a key marker for prostate cancer progression.
-
Applied forward selection, backward elimination, and best subset selection to determine the optimal feature set.
-
Evaluated models using Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Mallows’ Cp.
-
Compared model performance using Adjusted R² and residual analysis, selecting the best model based on interpretability and generalization.
Industrial Emissions & Stack Loss Analysis (R) November 2023
Focus Areas: Robust Regression Methods, Outlier Detection
-
Analyzed industrial emissions from an ammonia oxidation plant, examining the effects of airflow, water temperature, and acid concentration on stack loss.
-
Fitted Least Squares, Least Absolute Deviations (LAD), Huber, and Least Trimmed Squares (LTS) regressions to compare their robustness to outliers.
-
Identified outliers and influential observations using Cook’s distance and residual analysis, comparing the effectiveness of the robust regression methods.
Ozone Concentration Prediction & Climate Trends (R) November 2023
Focus Areas: Multiple Regression, Box-Cox Transformation
-
Modeled ozone levels using multiple regression, applying the Box-Cox transformation to improve fit.
Hip Center Prediction (R) November 2023
Focus Areas: Principal Component Regression (PCR), Partial Least Squares (PLS), Model Optimization
-
Predicted hip center position from body measurements such as height, arm length, and leg length.
-
Applied Principal Component Regression (PCR) and Partial Least Squares (PLS) for dimensionality reduction and model optimization.
-
Determined optimal component count and validated models using cross-validation.
Body Fat Prediction (R) November 2023
Focus Areas: Linear Regression, Ridge Regression, Principal Component Regression (PCR), Partial Least Squares (PLS), Model Selection
-
Compared Linear, Ridge, PCR, and PLS regression models for estimating body fat percentage, selecting ridge regression as the best-performing model.
Regional Income Disparities (R) November 2023
Focus Areas: ANOVA, Tukey’s HSD, Box-Cox Transformation
-
Used ANOVA and Tukey’s HSD to analyze income gaps between world regions, applying the Box-Cox transformation to confirm that the findings held on the transformed scale.