Week 4 – Wednesday

I learnt how to conduct both z-tests and t-tests using Python. For a z-test on a normally distributed variable, the sample size should be at least 30 and the population standard deviation must be known. If the test statistic exceeds the critical value given by the decision rule (looked up from the z-table), the null hypothesis is rejected. When the population standard deviation is unknown, the one-sample t-test is used instead, with critical values taken from the t-table.
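As a rough sketch of what this looks like in Python (the sample values, hypothesised mean of 100 and known population standard deviation of 15 are all made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data and test parameters (illustrative values only)
sample = np.array([112, 98, 105, 110, 103, 99, 115, 108, 101, 107,
                   104, 96, 109, 113, 100, 102, 111, 97, 106, 114,
                   95, 108, 103, 110, 99, 105, 112, 101, 107, 104])
mu0 = 100      # hypothesised population mean
sigma = 15     # known population standard deviation (z-test case)
alpha = 0.05

# One-sample z-test: compare the test statistic with the critical z value
z_stat = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-tailed critical value from the z-table
print("z =", z_stat, "reject H0:", abs(z_stat) > z_crit)

# One-sample t-test: population standard deviation unknown
t_stat, p_value = stats.ttest_1samp(sample, mu0)
print("t =", t_stat, "p =", p_value, "reject H0:", p_value < alpha)
```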

Week 4 – Monday

I performed an analysis on a dataset focused on outlier detection using the five-number summary. I defined a function, detect_outliers(data), which uses the z-score method to find outliers, with the threshold set at 3 standard deviations from the mean. I also calculated the quartiles (Q1 and Q3) and the interquartile range (IQR) to understand the central tendency and spread of the data, and then obtained the upper and lower fences.
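A minimal sketch of the two approaches, assuming the data is a one-dimensional NumPy array (the function name detect_outliers matches what I wrote; the values are illustrative):

```python
import numpy as np

def detect_outliers(data, threshold=3):
    """Flag points more than `threshold` standard deviations from the mean (z-score method)."""
    z_scores = (data - np.mean(data)) / np.std(data)
    return data[np.abs(z_scores) > threshold]

data = np.array([12, 15, 14, 10, 8, 13, 15, 9, 11, 120])  # illustrative values with one outlier

# Z-score method
print("z-score outliers:", detect_outliers(data))

# Five-number summary / IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower_fence) | (data > upper_fence)])
```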

Lastly, using seaborn, I made a boxplot to visualise the distribution and spot potential outliers. I also learnt how to implement the z-test and t-test in Python, which I will be trying tomorrow.
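The boxplot itself is essentially a one-liner in seaborn; this sketch assumes the same illustrative data array as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# The boxplot shows the five-number summary; points beyond the fences appear as individual dots
sns.boxplot(x=data)
plt.title("Distribution and potential outliers")
plt.show()
```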

Week 3 – Friday

Implementation of multiple linear regression, continued...
After encoding the data, I divided it into training and test sets using train_test_split from sklearn (test size 0.2), as I did for simple linear regression. Once I had created a multiple linear regression model using sklearn’s LinearRegression class, I trained it on the training data so it could learn the relationships between the predictors (R&D Spend, Administration, Marketing Spend, State) and the profit. Using the model, I then made profit predictions on the test data.
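A minimal sketch of those steps, assuming X holds the encoded predictors and y the Profit column from the 50-startups data:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the encoded predictors and target into training and test sets (20% held out)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a multiple linear regression model on the training data
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict profits for the unseen test startups
y_pred = regressor.predict(X_test)
print(y_pred[:5])
```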

So overall, I gained insights into the importance of data preprocessing (encoding categorical data) and the application of regression models to real-world scenarios.


Week 3 – Thursday

Today I tried implementing a more advanced form of regression analysis to model the relationship between multiple predictors and a dependent variable. The data I used listed 50 startups, each characterized by its spending in different departments along with its state and profit (R&D Spend, Administration, Marketing Spend, State, Profit).

The dataset included categorical data (the State column). I learned to use OneHotEncoder and ColumnTransformer to convert it into a numerical format so it could be used in my regression model.
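A short sketch of the encoding step, assuming the startups data is loaded into a DataFrame and that State sits in the fourth column (the file name is an assumption):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("50_Startups.csv")        # illustrative file name
X = df.drop(columns=["Profit"]).values     # predictors, with State in the last column
y = df["Profit"].values

# One-hot encode the categorical State column (index 3) and pass the numeric columns through
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X = ct.fit_transform(X)
```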

Week 3 – Wednesday

I learnt to implement simple linear regression, using salary data to find a relationship between years of experience and salary. I imported the essential libraries (numpy, matplotlib and pandas) and loaded the csv file. I divided it into two arrays, X and Y, the former being the independent variable and the latter the dependent one. Then I split the dataset into training and test sets using the train_test_split function of scikit-learn. Using the LinearRegression class of sklearn, I created a regression model and trained it on the training data to learn the relationship between years of experience and salary. After that, I made predictions on the test data and stored the results in a variable. Finally, I used matplotlib to create scatter plots of the training and test data.
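A compact sketch of that workflow, assuming a csv with YearsExperience and Salary columns (the file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("Salary_Data.csv")            # assumed file and column names
X = df[["YearsExperience"]].values             # independent variable
y = df["Salary"].values                        # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)                # learn the experience -> salary relationship
y_pred = regressor.predict(X_test)             # predictions on the test data

# Scatter of the training points with the fitted regression line
plt.scatter(X_train, y_train, color="red")
plt.plot(X_train, regressor.predict(X_train), color="blue")
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.show()
```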

Week 3 – Monday

I gained an understanding of the assumptions underlying linear regression (linearity, homoscedasticity, multivariate normality, independence of observations, lack of multicollinearity) and of situations where it may not be applicable. I also explored the importance of avoiding the dummy variable trap. Additionally, I conducted multiple linear regression analysis on sample data using the scikit-learn library in Python within Jupyter Lab. Furthermore, I familiarized myself with five methods of model building: All-in (where you throw in all the predictors), Backward Elimination, Forward Selection, Bidirectional Elimination and All-possible-models.
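As a rough, illustrative sketch of Backward Elimination (assuming the predictors are in a pandas DataFrame X and the target in y; this is not code I ran today):

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Repeatedly drop the predictor with the largest p-value until all remaining ones are significant."""
    X = sm.add_constant(X)                        # add the intercept column
    while True:
        model = sm.OLS(y, X).fit()
        pvals = model.pvalues.drop("const")       # never drop the intercept itself
        if pvals.empty or pvals.max() <= significance_level:
            return model
        X = X.drop(columns=[pvals.idxmax()])      # remove the least significant predictor
```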

Week 1 – Wednesday

I read about hypothesis testing and the null and alternative hypotheses (H0 and H1). Essentially, hypothesis testing helps us assess whether our sample is extreme enough to reject the null hypothesis (H0). This also helped me grasp p-values better, which measure how extreme our sample is. I delved into how p-values are calculated: the smaller the p-value, the more extreme the sample must have been (or the closer it was to the extreme). Then I did some cursory reading on the additional material provided on the Breusch-Pagan test, and I learned the difference between R-squared and the Standard Error of the Estimate.
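A small illustration of where a p-value comes from, assuming a two-tailed z-test with a made-up test statistic:

```python
from scipy import stats

z_stat = 2.1                                   # hypothetical test statistic
p_value = 2 * stats.norm.sf(abs(z_stat))       # two-tailed area beyond |z| under the standard normal
print(p_value)                                 # ~0.036: the more extreme z is, the smaller this gets
```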

Week 1 – Mon, Tue

I went through all the basics of statistics, like data types, distributions, sampling and estimation, hypothesis testing and p-values. In addition to that, I read about new topics like kurtosis and heteroskedasticity. I went through the notes and learnt why extreme values or outliers in the diabetes data might require a statistician to consider alternative statistical methods that are more accommodating of deviations from normality.
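A tiny sketch of how kurtosis can be checked in Python, with made-up numbers (this was reading only, not something I coded):

```python
from scipy import stats

values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]        # illustrative data with one extreme value
print(stats.kurtosis(values))                   # excess kurtosis; heavy tails push this well above 0
```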