Huber Regression (Week 5 – Friday)

Since my data had some outliers, I wanted a model that is less sensitive to them, so today I tried Huber Regression after experimenting with a couple of other regression models. Cross-validation gave lower R-squared values, which may indicate that the Huber Regression model explains less of the variance in the data than the Multiple Linear Regression model I built earlier, but I found it to perform better in the presence of outliers.
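
A minimal sketch of what this comparison could look like in scikit-learn; the file name and column names below are placeholders rather than the exact project code.

```python
# Hedged sketch: compare HuberRegressor with ordinary LinearRegression
# using cross-validated R-squared. "cdc_data.csv" and the column names
# are hypothetical stand-ins for the actual CDC dataset.
import pandas as pd
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cdc_data.csv")
X = df[["inactivity", "obesity"]]
y = df["diabetes"]

for name, model in [("OLS", LinearRegression()),
                    ("Huber", HuberRegressor(epsilon=1.35))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```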

The dilemma is whether I want a model that is less sensitive to outliers or one that captures the overall trend of the data; if the overall trend is more critical, the Multiple Linear Regression model might be more suitable.

Also, since R-squared is just one metric, it is essential to consider other evaluation metrics such as MSE and MAE to get a holistic view of predictive performance.

Week 5 – Wednesday

I worked on my project today, focusing on the analysis of CDC data. I used inactivity percentage and obesity percentage as predictors, with diabetes percentage as the dependent variable. I experimented with multiple linear regression and polynomial regression, varying the degree to find the optimal R-squared value. Subsequently, I applied the Breusch-Pagan test to examine the hypothesis that the variance of the residuals in my model is constant across all levels of the independent variables.

I constructed a new regression model with the squared residuals as the dependent variable and the independent variables from my original model, the goal being to determine whether the squared residuals could be predicted by the independent variables. After calculating the test statistic, I obtained a critical value from the chi-squared distribution with 1 degree of freedom (df = 1). Since the test statistic was greater than the critical value, I concluded that heteroskedasticity was present.
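
A minimal sketch of the manual Breusch-Pagan procedure described above (statsmodels also provides het_breuschpagan, which automates the same steps); the file and column names are placeholders, and the degrees of freedom in the sketch follow the number of predictors in the auxiliary regression.

```python
# Hedged sketch of a manual Breusch-Pagan test; data loading is hypothetical.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

df = pd.read_csv("cdc_data.csv")                     # hypothetical file name
X = sm.add_constant(df[["inactivity", "obesity"]])   # placeholder column names
y = df["diabetes"]

ols = sm.OLS(y, X).fit()

# Auxiliary regression: squared residuals on the original predictors.
aux = sm.OLS(ols.resid ** 2, X).fit()

# Lagrange-multiplier statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared critical value whose degrees of freedom
# equal the number of non-constant predictors in the auxiliary model.
lm_stat = len(y) * aux.rsquared
critical = chi2.ppf(0.95, df=X.shape[1] - 1)

print(f"LM statistic = {lm_stat:.3f}, critical value = {critical:.3f}")
if lm_stat > critical:
    print("Reject homoskedasticity: heteroskedasticity is present")
```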

Week 5 – Monday

Last week, I used the R-squared metric for cross-validation, which measures the proportion of variance in the dependent variable that is predictable from the predictors. In further exploration today, I tried evaluating my models using alternative scoring metrics and read about their differences. Notably, when no scoring metric is specified, the cross_val_score function falls back to the estimator's default score method, which for regressors is R-squared; to score each fold with Mean Squared Error (MSE), a metric that is particularly sensitive to outliers, you pass scoring='neg_mean_squared_error' (negated because scikit-learn treats higher scores as better). Furthermore, I gained an understanding of the Mean Absolute Error (MAE) metric, which is preferable when equal weight for all errors is desired.
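
Here is a minimal sketch of running cross_val_score with different scoring strings; again, the data loading and column names are placeholders.

```python
# Hedged sketch: evaluate one model under several scoring metrics.
# scikit-learn negates error metrics so that higher is always better.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cdc_data.csv")          # hypothetical file name
X = df[["inactivity", "obesity"]]         # placeholder column names
y = df["diabetes"]

model = LinearRegression()
for scoring in ["r2", "neg_mean_squared_error", "neg_mean_absolute_error"]:
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(f"{scoring}: mean = {scores.mean():.3f}")
```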

Week 4 – Friday

I’ve learned how to implement polynomial regression and decided to experiment with the CDC data on inactivity and obesity. My objective was to observe the differences between polynomial regression and simple linear regression. I compared them by assessing their respective R-squared values and further validated the models using k-fold cross-validation. The exercise gave me a clearer understanding of how a lower R-squared value indicates that a model explains less of the variance in the data, which helps in evaluating and judging a model’s performance.
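
A minimal sketch of how the polynomial-versus-linear comparison could be set up with a Pipeline and k-fold cross-validation; the degrees tried and the choice of obesity as the response are illustrative assumptions, not the exact analysis.

```python
# Hedged sketch: compare polynomial degrees (degree 1 = simple linear
# regression) by cross-validated R-squared. The file name, column names,
# and the obesity-on-inactivity framing are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("cdc_data.csv")
X = df[["inactivity"]]
y = df["obesity"]

for degree in [1, 2, 3]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree {degree}: mean R^2 = {scores.mean():.3f}")
```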