I’m delighted to present our project report, an in-depth examination of 2018 data from the Centers for Disease Control and Prevention (CDC). Our focus on the prevalence of diabetes, obesity, and physical inactivity across US counties has yielded some insightful findings, and I’m eager to share what we’ve learned!
MTH522 Diabetes Report

Huber Regression (Week 5 – Friday)
Since my data had some outliers, I wanted a model that is less sensitive to them, so today I tried Huber Regression after experimenting with a couple of other regression models. Cross-validation gave lower R-squared values than the Multiple Linear Regression model I made earlier, which may indicate that the Huber model explains less of the variance in the data, but I found it performed better in the presence of outliers.
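To illustrate the comparison, here is a minimal sketch using synthetic data with injected outliers (not the actual CDC file), scoring both HuberRegressor and LinearRegression by cross-validated R-squared:

```python
# A minimal sketch comparing HuberRegressor with plain LinearRegression
# on synthetic data containing deliberate outliers; the two feature
# columns are made-up stand-ins for the inactivity/obesity percentages.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
y[:10] += 25  # inject a few large outliers

for model in (LinearRegression(), HuberRegressor()):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, "mean CV R^2:", round(scores.mean(), 3))
```

Huber loss down-weights the injected outliers during fitting, which is exactly the robustness trade-off discussed above.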
The dilemma is this: if I want a model that is less sensitive to outliers, Huber Regression is the better choice; if capturing the overall trend of the data is more critical, the Multiple Linear Regression model might be more suitable.
Also, R-squared is just one metric; it’s essential to consider other evaluation metrics, such as MSE and MAE, to get a holistic view of predictive performance.
Week 5 – Wednesday
I worked on my project today, focusing on the analysis of CDC data. I used inactivity percentage and obesity percentage as predictors, with diabetes percentage as the dependent variable. I experimented with multiple linear regression and polynomial regression, varying the degree to find the optimal R-squared value. Subsequently, I applied the Breusch-Pagan test to examine the hypothesis that the variance of the residuals in my model is constant across all levels of the independent variables.
I constructed a new regression model with the squared residuals as the dependent variable and the independent variables from my original model. The goal was to determine whether the squared residuals could be predicted by the independent variables. After calculating a test statistic, I obtained a critical value from the chi-squared distribution with 1 degree of freedom (df = 1). Since the test statistic was greater than the critical value, I concluded that heteroskedasticity was present.
Week 5 – Monday
Last week, I used the R-squared metric for cross-validation, which measures the proportion of variance in the dependent variable that is predictable from the predictors. Today, I tried evaluating my models with alternative scoring metrics and read about their differences. Notably, if no scoring metric is specified, the cross_val_score function falls back to the estimator’s own score method (R-squared for regressors); passing scoring=’neg_mean_squared_error’ instead returns the negative Mean Squared Error (MSE) for each fold, a metric that is particularly sensitive to outliers. I also gained an understanding of the Mean Absolute Error (MAE) metric, which is preferred when equal weight for all errors is desired.
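As a small sketch of trying the different scoring options (on synthetic data, not the project dataset), the same model can be scored under each metric by passing explicit scoring strings:

```python
# A sketch of scoring one model under several metrics; the data is
# made up, and the scoring strings are standard sklearn identifiers.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.3, size=150)

model = LinearRegression()
r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
# Error metrics are returned negated so "larger is better" holds for all.
print("R^2:", r2.mean(), "MSE:", -neg_mse.mean(), "MAE:", -neg_mae.mean())
```

Negating the returned error scores recovers the usual MSE and MAE values for reporting.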
Week 4 – Friday
I’ve learnt how to implement polynomial regression and decided to experiment with the CDC data on inactivity and obesity. My objective was to observe the differences between polynomial regression and simple linear regression. I compared their respective R-squared values and further validated the models using k-fold cross-validation. The exercise gave me a clearer understanding of how a lower cross-validated R-squared can flag a model that generalizes poorly, aiding in the evaluation and judgment of a model’s performance.
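A sketch of that degree comparison, assuming a single synthetic predictor with a quadratic ground truth rather than the real CDC columns:

```python
# A hedged sketch: PolynomialFeatures + LinearRegression in a pipeline,
# compared across degrees by k-fold cross-validated R^2. The data is
# synthetic with a known quadratic relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + rng.normal(scale=0.5, size=200)

for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5)
    print(f"degree {degree}: mean CV R^2 = {scores.mean():.3f}")
```

On this data the degree-1 model underfits and scores noticeably lower than degree 2, which is the kind of gap the comparison above is meant to surface.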
Week 4 – Wednesday
I have learnt how to conduct both z-tests and t-tests using Python. For a normal distribution, a fundamental requirement for performing a z-test is that the sample size is at least 30 and the population standard deviation is known. When the test statistic surpasses the critical value from the z-table, per the stated decision rule, the null hypothesis is rejected. When information about the population standard deviation is unavailable, the one-sample t-test is employed instead, with critical values referenced from the t-table.
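A minimal sketch of both tests, with a made-up sample and a hypothetical null mean mu0 (the values are illustrative, not from any real dataset):

```python
# A hedged sketch of a one-sample z-test and t-test in Python.
# The sample, mu0, and sigma are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(loc=52, scale=10, size=40)  # n >= 30, as required
mu0 = 50     # hypothesized population mean under the null
sigma = 10   # known population standard deviation (z-test requirement)

# z-test: compare the statistic with the critical value from the z-table
z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
z_crit = stats.norm.ppf(0.975)  # two-sided test at alpha = 0.05
reject_z = abs(z) > z_crit

# t-test: used when sigma is unknown; critical values come from the t-table
t_stat, p_value = stats.ttest_1samp(sample, mu0)
print(f"z = {z:.2f} (critical {z_crit:.2f}), reject null: {reject_z}")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

The z-test here follows the decision rule described above, while scipy’s t-test reports a p-value directly.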
Week 4 – Monday
I performed an analysis on a dataset with a focus on outlier detection using the five-number summary. I defined a function, detect_outliers(data), which uses the z-score method to find outliers, with the threshold set at 3 standard deviations from the mean. I calculated the quartiles (Q1 and Q3) and the interquartile range (IQR) to understand the central tendency and spread of the data, and then obtained the upper and lower fences.
Lastly, using seaborn, I made a boxplot to visualize the distribution and potential outliers. I also learnt how to implement the z-test and t-test using Python, which I will try tomorrow.
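The steps above can be sketched roughly as follows; the data array is synthetic, with two extreme values appended so both methods have something to find, and the seaborn boxplot step is noted in a comment rather than drawn:

```python
# A sketch of detect_outliers plus the five-number summary and fences,
# assuming a 1-D numeric array; the z-score threshold of 3 matches
# the approach described above, and the data itself is made up.
import numpy as np

def detect_outliers(data, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()
    return data[np.abs(z) > threshold]

data = np.append(np.random.default_rng(5).normal(50, 5, 200), [95, 3])
outliers = detect_outliers(data)
print("z-score outliers:", outliers)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f"min={data.min():.1f}, Q1={q1:.1f}, median={median:.1f}, "
      f"Q3={q3:.1f}, max={data.max():.1f}")
print(f"fences: ({lower_fence:.1f}, {upper_fence:.1f})")
# seaborn.boxplot(x=data) would draw the distribution and flag points
# beyond these fences as potential outliers.
```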
Week 3 – Friday
Implementation of multiple linear regression, continued.
After encoding the data, I divided it into training and test sets using train_test_split from sklearn (test size 0.2), as I did for simple linear regression. Once I created a multiple linear regression model using sklearn’s LinearRegression class, I trained it on the training data so it could learn the relationships between the predictors (R&D Spend, Administration, Marketing Spend, State) and the profit. Using the model, I made profit predictions on the test data.
So overall, I gained insights into the importance of data preprocessing (encoding categorical data) and the application of regression models in real-world scenarios.
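The whole workflow can be sketched end to end; the inline DataFrame below imitates the 50-startups columns with made-up numbers, so the fitted coefficients are illustrative only:

```python
# A hedged end-to-end sketch of the multiple linear regression workflow
# described above: one-hot encoding of State, train/test split, fit,
# and evaluation. The dataset is synthetic, not the real 50-startups file.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(6)
n = 50
df = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 470_000, n),
    "State": rng.choice(["New York", "California", "Florida"], n),
})
# Made-up linear relationship so the fit is easy to sanity-check.
df["Profit"] = 0.8 * df["R&D Spend"] + 0.05 * df["Marketing Spend"] + 50_000

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Profit"), df["Profit"], test_size=0.2, random_state=0)

model = Pipeline([
    ("encode", ColumnTransformer(
        [("state", OneHotEncoder(), ["State"])], remainder="passthrough")),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```

Wrapping the encoder and regressor in a Pipeline keeps the preprocessing tied to the model, so the test data is encoded with exactly the categories learned from the training data.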
Week 3 – Thursday
Today I tried implementing a more advanced form of regression analysis to model the relationship between multiple predictors and a dependent variable. The data consisted of 50 startups, each described by the columns R&D Spend, Administration, Marketing Spend, State, and Profit.
The dataset included a categorical column. I learned to use OneHotEncoder and ColumnTransformer to convert it into numerical format for use in my regression model.
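A minimal sketch of just the encoding step, with a tiny made-up frame standing in for the real dataset:

```python
# A sketch of one-hot encoding a categorical column with
# ColumnTransformer, assuming "State" is the only categorical feature;
# the three rows here are invented stand-ins for the 50-startups data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.5],
    "State": ["New York", "California", "Florida"],
})

ct = ColumnTransformer(
    [("state", OneHotEncoder(), ["State"])], remainder="passthrough")
encoded = ct.fit_transform(df)
print(encoded)
# Each row now starts with one 0/1 indicator per state (in alphabetical
# order), followed by the passed-through R&D Spend column.
```

remainder="passthrough" keeps the numeric columns untouched while only "State" is expanded into indicator columns.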
Week 3 – Wednesday
I learnt to implement simple linear regression, using salary data to find the relationship between years of experience and salary. I imported the essential libraries (numpy, matplotlib, and pandas) and loaded the CSV file. I divided the data into two arrays, X and Y, the former holding the independent variable and the latter the dependent one. I then split the dataset into training and test sets using scikit-learn’s train_test_split function. Using sklearn’s LinearRegression class, I created a regression model and trained it on the training data to learn the relationship between years of experience and salary. After that, I made predictions on the test data and stored the results in a variable. Finally, I used matplotlib to create scatter plots of the training and test data.
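A compact, runnable sketch of that workflow, with a few made-up experience/salary pairs in place of the CSV file:

```python
# A minimal sketch of the simple linear regression workflow above;
# the experience/salary pairs are invented stand-ins for the CSV data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([[1.1], [2.0], [3.2], [4.0], [5.1],
              [6.0], [7.1], [8.2], [9.0], [10.5]])  # years of experience
y = np.array([39_000, 43_000, 57_000, 63_000, 72_000,
              83_000, 91_000, 101_000, 105_000, 121_000])  # salary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
# matplotlib.pyplot.scatter(X_train, y_train) plus a line through
# model.predict(X_train) reproduces the visualization step in the text.
```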