Data Analysis & Statistical Modeling

Course: University of Michigan STATS 415: Data Mining and Statistical Learning (Fall 2023)

Stock Market Direction Prediction (R)                                                                                                                                                      September 2023

Focus Areas: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression

  • Developed classification models to predict daily stock market direction using the Smarket dataset.

  • Implemented Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) to classify stock movements based on lagged returns and trading volume, comparing decision boundaries.

  • Built a logistic regression model to estimate market direction probabilities, interpreting coefficients and statistical significance.

  • Evaluated model accuracy using confusion matrices, error rates, and cross-validation.

Vehicle Fuel Efficiency (R)                                                                                                                                                                       September 2023

Focus Areas: Polynomial Regression, Cross-Validation, Bias-Variance Tradeoff

  • Built polynomial regression models to predict miles per gallon (mpg) and optimized model complexity using cross-validation.
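
A minimal sketch of how cross-validation can be used to pick a polynomial degree. The project's data set is not named, so the built-in mtcars data (mpg against horsepower) serves as a stand-in here; the choice of 5 folds and degrees 1 to 4 is likewise an assumption for illustration.

```r
# Choose a polynomial degree for mpg ~ hp by k-fold cross-validation (base R only).
set.seed(1)

# Stand-in data: the built-in mtcars set, not necessarily the project's data.
dat <- mtcars[, c("mpg", "hp")]

# Randomly assign each row to one of k folds.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

# For each degree, average the held-out mean squared error over the folds.
cv_mse <- sapply(1:4, function(degree) {
  fold_mse <- sapply(1:k, function(f) {
    train <- dat[folds != f, ]
    test  <- dat[folds == f, ]
    fit <- lm(mpg ~ poly(hp, degree), data = train)
    mean((test$mpg - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
})
names(cv_mse) <- paste0("degree_", 1:4)

print(cv_mse)
best_degree <- which.min(cv_mse)
print(best_degree)
```

Low degrees tend to underfit (high bias) and high degrees tend to overfit (high variance); the degree with the smallest cross-validated error is one reasonable compromise.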

Financial Risk Prediction (R)                                                                                                                                                                      October 2023

Focus Areas: Logistic Regression, Cross-Validation, Financial Risk Modeling

  • Developed a logistic regression model to predict loan default probabilities, evaluating its performance with k-fold cross-validation.

Housing Price Analysis (R)                                                                                                                                                                          October 2023

Focus Areas: Bootstrap Resampling, Confidence Intervals, Statistical Analysis

  • Used bootstrap resampling to estimate confidence intervals for median housing prices, analyzing neighborhood price variations.
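
A minimal sketch of a percentile bootstrap interval for a median. The prices below are simulated purely for illustration, and the 95% level and 1,000 resamples are assumptions rather than details taken from the project.

```r
# Percentile bootstrap confidence interval for a median (base R only).
set.seed(1)

# Hypothetical stand-in for housing prices: simulated, right-skewed values.
prices <- rlnorm(200, meanlog = 12, sdlog = 0.4)

# Resample with replacement and recompute the median each time.
n_boot <- 1000
boot_medians <- replicate(n_boot, median(sample(prices, replace = TRUE)))

# The 2.5% and 97.5% quantiles of the bootstrap medians give a 95% interval.
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
print(ci)
```

The bootstrap is convenient here because the sampling distribution of a median has no simple closed form. Repeating the procedure per neighborhood would give intervals that can be compared side by side.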

College Acceptance Rate Prediction (R)                                                                                                                                                      October 2023

Focus Areas: Ridge & Lasso Regression, Principal Component Regression (PCR), Gradient Boosting

  • Applied ridge regression, lasso regression, and Principal Component Regression (PCR) to predict college acceptance rates.

  • Developed a gradient boosting model, identifying graduation rate as the most influential predictor.

Crabs Species Classification (R)                                                                                                                                                                  October 2023

Focus Areas: Linear Classification, Hyperplane Distance, Misclassification Analysis

  • Built a linear classifier for crab species identification, calculating Euclidean distances to the separating hyperplane and analyzing misclassification errors.

Air Pollution & Industrialization Analysis (R)                                                                                                                                               October 2023

Focus Areas: Nonlinear Regression, Natural Splines, Smoothing Splines, Bootstrapping

  • Modeled the relationship between industrialization and air pollution (NOx) using nonlinear regression, splines, and bootstrapping.

Health Data Analysis (R)                                                                                                                                                October 2023 – November 2023

Focus Areas: Ridge Regression, Bootstrap Resampling, k-Nearest Neighbors (KNN), Cross-Validation

  • Examined the impact of maternal smoking on infant birth weight using NHANES data (5,000+ entries).

  • Applied ridge regression to predict birth weight while addressing collinearity.

  • Performed 10-fold cross-validation to tune the ridge penalty parameter (λ) using glmnet.

  • Used bootstrap resampling (1,000 samples) to construct 99% confidence intervals for the effect of smoking.

  • Found that babies of mothers who smoked had significantly lower birth weights (the confidence intervals did not overlap).

  • Compiled a comprehensive report detailing findings and ensuring analysis reproducibility.

Crime Rate Clustering & Analysis (R)                                                                                                                                                      November 2023

Focus Areas: Hierarchical Clustering, K-Means Clustering

  • Performed hierarchical clustering on the USArrests dataset, grouping U.S. states based on crime statistics.

  • Compared complete-linkage and single-linkage hierarchical clustering with K-Means clustering to determine the best cluster structure.

  • Analyzed cluster consistency by computing silhouette coefficients, identifying the best-performing method.
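
The comparison above can be sketched roughly as follows on the built-in USArrests data. Standardizing the variables and using k = 3 clusters are assumptions for illustration, not settings taken from the project. The silhouette step uses the cluster package and is skipped if that package is not installed.

```r
# Hierarchical clustering and K-Means on the built-in USArrests data.
set.seed(1)

# Standardize so that no single variable dominates the Euclidean distances.
x <- scale(USArrests)
d <- dist(x)

# Hierarchical clustering with two linkage choices.
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

# Cut each dendrogram into k groups (k = 3 is an assumption).
k <- 3
cl_complete <- cutree(hc_complete, k = k)
cl_single   <- cutree(hc_single, k = k)

# K-Means with several random starts to avoid a poor local optimum.
km <- kmeans(x, centers = k, nstart = 20)

# Cross-tabulate assignments to see how far the methods agree.
print(table(complete = cl_complete, kmeans = km$cluster))
print(table(single = cl_single))

# Average silhouette width per method (higher suggests tighter, better-separated clusters).
if (requireNamespace("cluster", quietly = TRUE)) {
  avg_sil <- function(cl) mean(cluster::silhouette(cl, d)[, "sil_width"])
  print(c(complete = avg_sil(cl_complete),
          single   = avg_sil(cl_single),
          kmeans   = avg_sil(km$cluster)))
}
```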

Course: University of Michigan STATS 413: Applied Regression Analysis (Fall 2023)

Prostate Cancer Risk Prediction (R)                                                                                                                                                         November 2023

Focus Areas: Feature Selection, Model Evaluation, Backward Elimination, AIC, Adjusted R², Mallows' Cp

  • Developed multiple regression models to predict log prostate-specific antigen (lpsa) levels, a key marker for prostate cancer progression.

  • Applied forward selection, backward elimination, and best subset selection to determine the optimal feature set.

  • Evaluated models using Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Mallows’ Cp.

  • Compared model performance using Adjusted R² and residual analysis, selecting the best model based on interpretability and generalization.

Industrial Emissions & Stack Loss Analysis (R)                                                                                                                                       November 2023

Focus Areas: Robust Regression Methods, Outlier Detection

  • Analyzed industrial emissions from an ammonia oxidation plant, examining the effects of airflow, water temperature, and acid concentration on stack loss.

  • Implemented multiple regression techniques (Least Squares, Least Absolute Deviations (LAD), Huber Regression, and Least Trimmed Squares (LTS)) to assess model robustness.

  • Identified outliers using Cook’s Distance and Residual Analysis, comparing the effectiveness of robust regression methods.
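
A minimal sketch of part of this workflow on the built-in stackloss data, covering only least squares, Cook's distance, and Huber regression. The 4/n cutoff for Cook's distance is a common rule of thumb assumed here, not a threshold stated in the project. LAD and LTS fits are omitted.

```r
# Least squares vs. Huber regression on the built-in stackloss data.
fit_ls <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Cook's distance measures how much each observation influences the LS fit.
# The 4/n cutoff is a rule of thumb, used here as an assumption.
cd <- cooks.distance(fit_ls)
influential <- which(cd > 4 / nrow(stackloss))
print(influential)

# Huber M-estimation via MASS::rlm, whose default psi function is Huber's.
# Skipped if MASS is not installed.
if (requireNamespace("MASS", quietly = TRUE)) {
  fit_huber <- MASS::rlm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                         data = stackloss)
  # Side-by-side coefficients show how far down-weighting outliers moves the fit.
  print(cbind(least_squares = coef(fit_ls), huber = coef(fit_huber)))
}
```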

Ozone Concentration Prediction & Climate Trends (R)                                                                                                                             November 2023

Focus Areas: Multiple Regression, Box-Cox Transformation

  • Modeled ozone levels using multiple regression, applying the Box-Cox transformation to improve fit.

Hip Center Prediction (R)                                                                                                                                                                         November 2023

Focus Areas: Principal Component Regression (PCR), Partial Least Squares (PLS), Model Optimization

  • Predicted hip center position from body measurements such as height, arm length, and leg length.

  • Applied Principal Component Regression (PCR) and Partial Least Squares (PLS) for dimensionality reduction and model optimization.

  • Determined optimal component count and validated models using cross-validation.

Body Fat Prediction (R)                                                                                                                                                                         November 2023

Focus Areas: Linear Regression, Ridge Regression, Principal Component Regression (PCR), Partial Least Squares (PLS), Model Selection

  • Compared linear, ridge, PCR, and PLS regression models for estimating body fat percentage, selecting ridge regression as the best-performing model.

Regional Income Disparities (R)                                                                                                                                                              November 2023

Focus Areas: ANOVA, Tukey’s HSD, Box-Cox Transformation

  • Used ANOVA and Tukey’s HSD to analyze income gaps between world regions, applying the Box-Cox transformation to check the robustness of the findings.
