Day 4 with “TSFP”

If I were you, I would question why we haven’t discussed AR or MA in isolation before turning to S/ARIMA. The reason is that ARIMA is simply a combination of the two: you can turn an ARIMA into a pure AR model by setting the MA order to zero, and vice versa. Before we get started, let’s take a brief look back. We have seen how the ARIMA model operates and how to determine its parameters manually through trial and error. I have two questions for you: first, is there a better way to go about this? Secondly, given models like AR, MA, and their combination ARIMA, how do you choose the best one?

Simply put, an AR model uses past values of the time series to predict future values, whereas an MA model uses past forecast errors (residuals) to do so. The AR model assumes that future values are a linear combination of past values, with coefficients representing the weight of each past value. The MA model assumes that future values are a linear combination of past forecast errors.

Choosing Between AR and MA Models:

Understanding the type of data is necessary in order to select between AR and MA models. An AR model could be appropriate if the data shows distinct trends, but MA models are better at capturing transient fluctuations. Model order selection entails temporal dependency analysis using statistical tools such as the Partial AutoCorrelation Function (PACF) and AutoCorrelation Function (ACF). Exploring both AR and MA models and contrasting their performance using information criteria (AIC, BIC) and diagnostic tests may be part of the iterative process.

A thorough investigation of the features of a given dataset, such as temporal dependencies, trends, and fluctuations, is essential for making the best decision. Furthermore, taking into account ARIMA models—which integrate both AR and MA components—offers flexibility for a variety of time series datasets.

In order to produce the most accurate and pertinent model, the selection process ultimately entails a nuanced understanding of the complexities of the data and an iterative refinement approach.

Let’s get back to our data from “Analyze Boston”. To discern the optimal AutoRegressive (AR) and Moving Average (MA) model orders, Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots are employed.

Since the first notable spike in our plots occurs at lag 1, we first tried to model the AR component with an order of 1. And guess what? After a brief period of optimism, the forecast stalled at zero. The MA model of the same order produced comparable results: a flat line at 0. For this reason, a comprehensive parameter search is necessary to identify the optimal pairing of AR and MA orders.

As demonstrated in my earlier post, these plots can provide us with important insights and aid in the development of a passable model. However, we can never be too certain of anything, and we don’t always have the time for the labour-intensive process of experimenting with the parameters. Why not just have our code handle it?

To maximise the model fit, the grid search methodically assessed different orders under the guidance of the Akaike Information Criterion (AIC). This rigorous analysis identified the most accurate AR and MA models, each precisely calibrated before forecasting. I could thus conclude that (0,0,7) and (8,0,0) were the optimal orders for the MA and AR models respectively.
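For anyone who wants to reproduce this step, here is a minimal sketch of what such an AIC-guided grid search might look like with statsmodels, assuming the second-order differenced series lives in a pandas Series named diff2 (a hypothetical name):

```python
from itertools import product

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Try candidate AR orders p and MA orders q on the differenced series
results = []
for p, q in product(range(0, 9), range(0, 8)):
    try:
        fit = SARIMAX(diff2, order=(p, 0, q)).fit(disp=False)
        results.append((p, q, fit.aic))
    except Exception:
        continue  # skip combinations that fail to converge

# The lowest AIC identifies the best-fitting (p, q) pair
results.sort(key=lambda r: r[2])
print(results[:5])
```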

Interesting!

The best-fitting AR and MA models’ forecasting outcomes for the second-order differenced Logan International flights data are shown in the plot. The forecast from the best AR model, which captures the autoregressive temporal dependencies, is shown by the blue dashed line. The forecast from the best MA model, which captures short-term fluctuations, is indicated by the orange dashed line. By attempting to predict the differenced data, both models demonstrate their distinct strengths. Comparing the models lets us evaluate how accurately each one depicts the underlying patterns in the time series, which helps determine which forecasting strategy is best.

Looking back, our experience serves as a prime example of why time series analysis must progress beyond visual exploration. The combination of statistical analysis, model fitting, and iterative refinement proved crucial, even though visual cues provide a foundational understanding. Even though the AR and MA models were a part of the ARIMA framework, their behaviours were subtle. ACF, PACF, and the grid search all interacted to determine the orders selected for the AR and MA components, which were critical to the models’ performance.

Now that we are working with the combination corresponding to the lowest AIC, without having to perform the work manually, it is finally time to build a model about which we can be optimistic.

 

The main performance evaluation metric was Mean Absolute Error (MAE). With the lowest MAE of 147.32, the AR model demonstrated the highest predictive accuracy. But the decision-making process went beyond MAE: model interpretability, complexity, and robustness were also taken into account, acknowledging that choosing a model necessitates a careful assessment.

Call this a win-win!

With both AR and MA components combined, the ARIMA model showed a competitive MAE of 175.82. This all-encompassing strategy guarantees a thorough choice that strikes a balance between practical concerns and statistical measurements, realising that lower MAE, although important, is only one factor in the larger picture of successful forecasting.

What if I told you there are other approaches as well?

I hope that I was able to adequately introduce you to all of the time series concepts that we covered. We will now go over something called the General Modelling Procedure (GMP). It’s like a forecasting road map, guiding us to the best way to model our data and make predictions.

General Modelling Procedure:

  • The steps for identifying a stationary ARMA(p,q) process were covered in the previous section.
  • Our time series can be modelled by an ARMA(p,q) process if both the ACF and PACF plots show a sinusoidal or decaying pattern.
  • Neither plot, however, was useful in determining the orders p and q. In both plots of our simulated ARMA(1,1) process, we found that coefficients remained significant after lag 1, and the fitted model’s forecast flatlined.
  • As a result, we had to devise a method for determining the orders p and q. But this procedure we are going to talk about now has the advantage of being applicable in cases where our time series is non-stationary and has seasonal effects. It will also be appropriate in cases where p or q are equal to zero.

Wait a minute, does that mean everything we’ve been doing has been a waste of time when we could have gotten right to this? No!

  • The first few steps are the same as the ones we gradually built up in the first half of this post, as we still need to collect data, test for stationarity, and apply transformations as needed. Then we list the various possible values of p and q. Using that list, we can fit every unique combination of ARMA(p,q) to our data.
  • After that, we can calculate the Akaike information criterion (AIC). This measures the quality of each model in comparison to the others. After that, the model with the lowest AIC is chosen.
  • The residuals of the model, which are the differences between the model’s actual and predicted values, can then be examined. Ideally, the residuals should resemble white noise, implying that any difference between predicted and actual values is due to randomness. As a result, the residuals must be uncorrelated and independent.
  • We can evaluate those properties by examining the quantile-quantile plot (Q-Q plot) and performing the Ljung-Box test (a short sketch of both checks follows this list).
  • If the analysis leads us to the conclusion that the residuals are completely random, we have a forecasting model.
  • This is all you need!
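Here is a minimal sketch of those residual checks, assuming best_fit is the fitted statsmodels model chosen by the AIC search (a hypothetical name):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

residuals = best_fit.resid

# Ljung-Box test: large p-values suggest the residuals are uncorrelated (white noise)
print(acorr_ljungbox(residuals, lags=[10]))

# Q-Q plot: points lying close to the line suggest approximately normal residuals
sm.qqplot(residuals, line="s")
plt.show()
```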

Understanding AIC (Akaike information criterion)

  • A model’s quality in comparison to other models is estimated by the AIC.
  • The AIC measures the relative amount of information lost by the model, taking into account that some information will always be lost during the fitting process. The better the model, the lower the AIC value and the less information lost.
  • AIC = 2k – 2log(L)
  • The AIC’s value is determined by the maximised value of the likelihood function L and the number of parameters (k) in the model; the lower the AIC, the better the model. Selecting models based on the AIC lets us maintain a balance between a model’s complexity and its goodness of fit to the data.
  • An ARMA(p,q) model’s order (p,q) directly determines the number of estimated parameters, k. If we fit an ARMA(2,2) model, we have 2 + 2 = 4 parameters to estimate (a small numeric illustration follows this list).
  • It is evident how fitting a more complicated model can penalise the AIC score: the AIC rises as the order (p,q), and therefore the number of parameters (k), increases.
  • The likelihood function calculates a model’s goodness of fit. It can be thought of as the distribution function’s opposite. The probability of observing a data point is determined by the distribution function, given a model with fixed parameters.
  • The logic is inverted by the likelihood function. It will calculate the likelihood that various model parameters will produce the observed data given a set of observed data.
  • We can think of the likelihood function as an answer to the question “How likely is it that my observed data is coming from an ARMA(2,2) model?” If it is very likely, meaning that L is large, then the ARMA(2,2) model fits the data well.
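As a small numeric illustration of the formula (the likelihood values below are made up purely for illustration):

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2*ln(L), written in terms of the maximised log-likelihood ln(L)."""
    return 2 * k - 2 * log_likelihood

# Two hypothetical fits: the second has a slightly better likelihood but two extra parameters
print(aic(log_likelihood=-1200.0, k=4))  # 2408.0
print(aic(log_likelihood=-1199.5, k=6))  # 2411.0 -> the added complexity is not worth it
```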

 

 

Day 3 with “TSFP” and “Analyze Boston”

The Moving Average Model: or MA(q), is a time series model that predicts the current observation by taking into account the influence of random or “white noise” terms from the past. It is a part of the larger category of models referred to as ARIMA (Autoregressive Integrated Moving Average) models. Let’s examine the specifics:

  • Important Features of MA(q):
    1. Order (q): The moving average model’s order is indicated by the term “q” in MA(q). It represents the quantity of historical white noise terms taken into account by the model. In MA(1), for instance, the most recent white noise term is taken into account.
    2. White Noise: The current observation is the result of a linear combination of the most recent q white noise terms and the current white noise term. White noise is a sequence of independent and identically distributed random variables with a mean of zero and constant variance.
    3. Mathematical Equation: An MA(q) model’s general form is expressed as follows:
      Y(t) = μ + E(t) + θ(1)E(t−1) + θ(2)E(t−2) + … + θ(q)E(t−q)
      Y(t) is the current observation.
      The time series mean is represented by μ.
      E(t) is the white noise term at time t.
      The weights allocated to previous white noise terms are represented by the model’s parameters θ(1), θ(2), …, θ(q). (A short simulation of this process follows this list.)
  • Key Concepts and Considerations:
    1. Constant Mean (μ): The moving average model is predicated on the time series having a constant mean (μ). 
    2. Stationarity: The time series must be stationary in order for MA(q) to be applied meaningfully. Differencing can be used to stabilise the statistical characteristics of the series in the event that stationarity cannot be attained. 
    3. Model Identification: The order q is a crucial aspect of model identification. It is ascertained using techniques such as statistical criteria or autocorrelation function (ACF) plots.
  • Application to Time Series Analysis:
    1. Estimation of Parameters: Using statistical techniques like maximum likelihood estimation, the parameters θ(1), θ(2), …, θ(q) are estimated from the data.
    2. Model Validation: Diagnostic checks, such as residual analysis and model comparison metrics, are used to assess the MA(q) model’s performance.
    3. Forecasting: Following validation, future values can be predicted using the model. Based on the observed values and historical white noise terms up to time t−q, the forecast at time t is made.
  • Use Cases:
    1. Capturing Short-Term Dependencies: When recent random shocks have an impact on the current observation, MA(q) models are useful for detecting short-term dependencies in time series data. 
    2. Complementing ARIMA Models: To create ARIMA models, which are strong and adaptable in capturing a variety of time series patterns, autoregressive (AR) and differencing components are frequently added to MA(q) models.
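To make the definition concrete, here is a small NumPy sketch that simulates an MA(1) process; μ and θ(1) are illustrative values chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
n, mu, theta1 = 500, 0.0, 0.6

eps = rng.normal(loc=0.0, scale=1.0, size=n)  # white noise: i.i.d., mean 0, constant variance
y = np.empty(n)
y[0] = mu + eps[0]
for t in range(1, n):
    # MA(1): current value = mean + current noise + weighted previous noise
    y[t] = mu + eps[t] + theta1 * eps[t - 1]
```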

Let’s try to fit an MA(1) model to the ‘logan_intl_flights’ time series from Analyze Boston. But before that it’s important to assess whether the ‘logan_intl_flights’ time series is appropriate for this type of model. The ACF and PACF plots illustrate the relationship between the time series and its lag values, which aids in determining the possible order of the moving average component (q).
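A minimal sketch of how those plots might be produced, assuming the dataset has been loaded into a pandas DataFrame df with a ‘logan_intl_flights’ column (hypothetical variable names):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = df["logan_intl_flights"].dropna()

plot_acf(series, lags=20)    # autocorrelation with confidence bands
plot_pacf(series, lags=20)   # partial autocorrelation with confidence bands
plt.show()
```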

Okay, now what?

Perplexing, isn’t it? In a time series model, understanding the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots can help determine the Moving Average (MA) component’s order.

  • Reading the ACF (Autocorrelation Function) plot:
    • Lags on the x-axis: The x-axis represents the number of lags, indicating the time span between the current observation and its lagged values.
    • Correlation values on the y-axis: The y-axis displays the correlation coefficient between the time series and its lagged values.
    • Interpretation: A significant spike at a particular lag indicates a strong correlation with the observations at that lag. In the ACF plot, we’re looking for significant spikes that decay as lags increase.
  • Reading the PACF (Partial Autocorrelation Function) plot:
    • Lags on the x-axis: Similar to the ACF plot, the x-axis represents the number of lags. 
    • Partial correlation values on the y-axis: The y-axis displays the partial correlation coefficient, which measures the unique correlation between the current observation and its lagged values, removing the influence of intermediate lags. 
    • Interpretation: Similar to the ACF plot, significant spikes in the PACF plot indicate a strong partial correlation with the observations at those lags. PACF helps identify the direct influence of each lag on the current observation.

In essence, the ACF tells you “how each lag influences subsequent observations”, while the PACF tells you “the direct influence of each lag on the current observation”.

To interpret our plots, we look for two things. ACF decay: if the ACF plot shows a rapid decay after a few lags, the series probably does not require a large MA order; conventionally, the last significant lag before the ACF cuts off suggests the MA order q. PACF cutoff: a significant spike at lag 1 followed by a cutoff for subsequent lags conventionally points to the AR order p. In either case, a cutoff at lag 1 indicates a potential direct relationship between the current observation and its immediate past value.
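Before walking through the output, here is a minimal sketch of how an MA(1) fit like the one summarised below might be produced with statsmodels, assuming diff2 holds the differenced flight series (a hypothetical name); trend='c' adds the constant term that appears in the results table:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# MA(1): order = (p, d, q) = (0, 0, 1), fitted on the already-differenced series
ma1_fit = SARIMAX(diff2, order=(0, 0, 1), trend="c").fit(disp=False)
print(ma1_fit.summary())
```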

Let’s go through each component of the SARIMAX results:

  • Coefficients Table:
    1. const (Constant):
      • Coefficient (coef): -0.6513
      • Standard Error (std err): 1.579
      • z-value: -0.412
      • P-value (P>|z|): 0.680
      • 95% Confidence Intervals: [-3.746, 2.444]
      • Interpretation: The constant term represents the intercept of the model. In this case, the estimated intercept is approximately -0.6513. The p-value (0.680) suggests that the constant is not statistically significant.
    2. ma.L1 (Moving Average Order 1):
      • Coefficient (coef): -0.9989
      • Standard Error (std err): 2.897
      • z-value: -0.345
      • P-value (P>|z|): 0.730
      • 95% Confidence Intervals: [-6.678, 4.680]
      • Interpretation: The ma.L1 coefficient represents the strength of the moving average effect at lag 1. In this case, it is estimated as approximately -0.9989. The p-value (0.730) suggests that this coefficient is not statistically significant.
    3. sigma2 (Residual Variance):
      • Coefficient (coef): 1.142e+05
      • Standard Error (std err): 3.4e+05
      • z-value: 0.335
      • P-value (P>|z|): 0.737
      • 95% Confidence Intervals: [-5.53e+05, 7.81e+05]
      • Interpretation: Sigma2 represents the estimated variance of the model residuals. The large standard error and p-value (0.737) suggest uncertainty about the accuracy of this estimate.
  • Diagnostic Tests:
    1. Ljung-Box (Q) Statistic:
      • Value: 0.86
      • P-value (Prob(Q)): 0.35
      • Interpretation: The Ljung-Box test assesses the autocorrelation of residuals at lag 1. A non-significant p-value (0.35) suggests that there is no significant autocorrelation in the residuals.
    2. Jarque-Bera (JB) Statistic:
      • Value: 1.45
      • P-value (Prob(JB)): 0.48
      • Interpretation: The Jarque-Bera test assesses the normality of residuals. A non-significant p-value (0.48) suggests that the residuals do not significantly deviate from a normal distribution.
  • Other Statistics:
    1. Heteroskedasticity (H):
      • Value: 1.18
      • P-value (Prob(H)): 0.68
      • Interpretation: The test for heteroskedasticity assesses whether the variance of the residuals is constant. A non-significant p-value (0.68) suggests that there is no significant evidence of heteroskedasticity.
    2. Skewness:
      • Value: 0.26
      • Interpretation: Skewness measures the asymmetry of the residuals. A value around 0 suggests a symmetric distribution.
    3. Kurtosis:
      • Value: 2.61
      • Interpretation: Kurtosis measures the “tailedness” of the residuals. A value of 2.61 suggests moderate tailedness.
  • Summary:
    • The coefficients table provides estimates for the model parameters, including the constant, moving average effect, and residual variance.
    • Diagnostic tests indicate that the residuals do not exhibit significant autocorrelation, do not deviate significantly from a normal distribution, and show no evidence of heteroskedasticity.
    • Overall, while the model provides parameter estimates, the lack of statistical significance for some coefficients and the uncertainty in the residual variance estimation may indicate that the model might need further refinement or exploration. 

 

  • Refinement? In the field of time series modelling, choosing the right values for an ARIMA (AutoRegressive Integrated Moving Average) model, often represented by the letters p, d, and q, necessitates a careful process of trial and error. These parameters determine, respectively, the model’s autoregressive, differencing, and moving average components.
    • Step 1: An examination of the autocorrelation function (ACF) plot sets the stage for the journey. The correlation between a time series and its lagged values can be understood by looking at this graphical representation. Potential values for p, the autoregression order, are represented by peaks in the ACF plot, and a good value for q, the moving average order, is indicated by the first notable dip.
    • Step 2: Iterations and comparisons against statistical metrics are part of the process. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are two of the most notable of these. Lower values of these information criteria indicate a better model. They assess the trade-off between model fit and complexity. 
    • Step 3: Equally important is the coefficients’ statistical significance. Each model coefficient denotes the impact of values that are lags on the current observation. We are looking for a middle ground, avoiding an unduly complicated model and choosing statistically significant coefficients (p-values usually ≤ 0.05). For the model to be both statistically reliable and comprehensible, this equilibrium is essential.

To help you understand it much better, allow me to walk you through the procedure once more using our plot. Take another look at the ACF. Do you see how significant the spikes are up until the third lag, before the graph disappears into the confidence interval? Indeed, it rises once more after 10 lags. Thus, there are roughly six options for “p”. Now look at the PACF plot: 1, 7, 10, 13, and so forth are the spikes that fall outside the confidence interval, so these are the potential values for “q”. We know that “d” should be 2 because our data isn’t stationary and, as we saw in the previous blog, second-order differencing fixes it.

A common strategy is to start with a simple model, gradually introducing complexity until reaching a point where further additions do not substantially improve fit or interpretability. This careful calibration ensures that the model captures essential patterns without succumbing to overfitting.

In essence, the journey to find the right combination of p, d, and q in ARIMA modeling is a dance between visual exploration, statistical metrics, and the art of striking the delicate balance between model complexity and effectiveness. 

If you’re still reading, you probably already understand how to interpret this.

  • ARIMA Order: The model is specified as ARIMA(0, 2, 7), indicating no autoregressive terms (p = 0), non-seasonal differencing of order 2 (d = 2), and a moving average component of order 7 (q = 7).
  • MA Coefficients (ma.L1 to ma.L7)
    • Interpretation: Some of the MA coefficients are statistically significant (p-value < 0.05) while others are not. This implies that different lags have different relative importance in explaining the current observation.
  • Sigma 2 Residual Variance: sigma2 (p-value < 0.05) = 6.606e+04
    • Interpretation: The statistical significance of the estimated residual variance indicates the presence of variability that cannot be explained by the model.
  • Diagnostic Tests
    • Ljung-Box (L1) Q-statistic: 0.04 (p-value > 0.05)
    • Jarque-Bera (JB) statistic: 2.50 (p-value > 0.05)
    • Heteroskedasticity (H): 1.10 (p-value > 0.05)
    • Interpretation: The Ljung-Box test reveals no significant autocorrelation at lag 1, and the Jarque-Bera and heteroskedasticity tests indicate no significant problems with normality or heteroskedasticity, respectively.

We’re probably at our best here. Please let me know if you are aware of a more effective way to achieve the desired combination without repeatedly iterating. Our next goal is to try and forecast the future values of flights at Logan International Airport using the SARIMA(0, 2, 7) model.

I tested the model’s performance on unseen data (10% of the dataset) after training it on historical data, which accounted for 90% of the dataset. The Mean Absolute Error (MAE), a metric used to assess the accuracy of predictions, came out to 386.20.

In order to forecast future flights on the test set, the code first divided the data into a training set and a test set, then fitted the SARIMA(0, 2, 7) model to the training set. Finally, the forecasts were compared with the actual test set values.
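Here is a minimal sketch of that procedure, assuming the raw (undifferenced) flight counts live in a pandas Series named series (a hypothetical name); the order (0, 2, 7) performs the second-order differencing internally:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

n_train = int(len(series) * 0.9)          # 90% train, 10% test
train, test = series[:n_train], series[n_train:]

fit = SARIMAX(train, order=(0, 2, 7)).fit(disp=False)
pred = fit.forecast(steps=len(test))      # forecast over the test horizon

mae = np.mean(np.abs(test.values - pred.values))
mape = np.mean(np.abs((test.values - pred.values) / test.values)) * 100
print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```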

The mean absolute error (MAE) of 386.20 indicates that our predictions were off by about 386 flights on average from the real values. Although the MAE is a helpful metric, it’s crucial to understand its interpretation in relation to the particular dataset and the flight domain at Logan International Airport. Additionally, the Mean Absolute Percentage Error (MAPE) of 8.96% suggests that, on average, our predictions deviated by about 8.96% from the actual values.

MAPE of 8.96% suggests that the model is providing reasonably accurate predictions on the test set.

It’s critical to remember that the particular context and demands of your forecasting task determine how MAPE should be interpreted. A MAPE of less than 5% is usually desired, but the acceptable degree of error can change depending on the application and industry.

 

Day 2 with “TSFP” by Marco Peixeiro and “Analyze Boston”

Stationarity:

  • In time series analysis, stationarity is an essential concept. A time series that exhibits constant statistical attributes over a given period of time is referred to as stationary. The modelling process is made simpler by the lack of seasonality or trends. Two varieties of stationarity exist:
    • Strict Stationarity: The entire probability distribution of the data is time-invariant.
    • Weak Stationarity: The mean, variance, and autocorrelation structure remain constant over time.
  • Transformations like differencing or logarithmic transformations are frequently needed to achieve stationarity in order to stabilise statistical properties.
  • Let’s check the stationarity of the ‘logan_intl_flights’ time series. Is the average number of international flights constant over time?
  • A visual inspection of the plot will tell you it’s not. But let’s try performing an ADF test.
  • The Augmented Dickey-Fuller (ADF) test is a prominent solution to this problem. This statistical tool determines whether a unit root is present, which would indicate non-stationarity. The null hypothesis of a unit root is rejected if the p-value is less than the traditional 0.05 threshold, suggesting stationarity. Combining domain expertise with this statistical rigour improves our comprehension of the dataset’s temporal dynamics (a minimal code sketch follows this list).
  • The ADF test starts with a standard autoregressive (AR) model: Y(t) = pY(t−1) + α + E(t), where Y(t) is the value of the time series at time t, p is the autoregressive coefficient, α is a constant, and E(t) is a white noise error term. The presence of a unit root is indicated by p = 1.
  • The ADF test extends this by including lagged differences of the time series: ΔY(t) = γY(t−1) + β(1)ΔY(t−1) + … + β(p)ΔY(t−p) + α + E(t), where:
    • ΔY(t): the differenced series at time t.
    • Y(t−1): the lagged value at time t−1.
    • β(i): the coefficient of the lagged difference term ΔY(t−i).
    • γ: the coefficient of the lagged level term Y(t−1), which represents the value of the time series at the previous time step.
    • α: the constant term.
    • E(t): the white noise error term at time t.
  • The difference between a time series variable’s current value and its value at a prior time point (referred to as the “lagged value”) is known as the “lagged difference“.
  • The ADF test typically involves choosing the number of lags (p) to include in the model to account for autocorrelation in the time series. The full equation would be:
    • ΔY(t) = β(1)ΔY(t−1) + β(2)ΔY(t−2) + … + β(p)ΔY(t−p) + γY(t−1) + α + E(t)
    • β(1), …, β(p) are the coefficients for the lagged differences up to lag p. The parameter p is determined based on statistical criteria or domain knowledge.
  • You might wonder what exactly this p does in relation to our ADF test. Here it is nothing but the number of lags of the differenced series that are included in the model. These lagged differences are added to account for autocorrelation, which is the correlation between a time series and its own historical values; including them lets the model capture any residual autocorrelation in the differenced series.
  • The goal is to find an appropriate value for p such that the differenced series ΔY(t) becomes stationary.
  • With a Unit Root (p = 1):
    • If p = 1, the autoregressive process has a unit root, leading to non-stationarity.
    • With p = 1, Y(t) is solely dependent on the immediately preceding value plus a random noise term E(t).
    • In the ADF equation, this corresponds to the coefficient of the lagged level term, γ, being zero.
  • Without a Unit Root (p < 1):
    • If p < 1, the autoregressive process does not have a unit root, indicating stationarity.
    • The series lacks a stochastic trend, and its statistical properties are more likely to be constant over time.
    • In that case the model may only need a small number of lagged differences to capture any residual autocorrelation.
    • Do not confuse this coefficient p with the test’s p-value: the hypothesis test and its p-value determine stationarity or non-stationarity, and a lower p-value points towards stationarity, not non-stationarity.
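Here is the minimal ADF sketch referred to above, assuming the flight counts are in a pandas Series named series (a hypothetical name):

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, used_lags, n_obs, crit_values, _ = adfuller(series.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# A p-value >= 0.05 means we cannot reject the unit-root null, so the series looks non-stationary
```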

Differencing:

  • We have already seen how differencing works: it is simply a method for achieving stationarity in a time series, and it entails calculating the differences between subsequent observations.
  • By removing seasonality and trends, this procedure makes the time series easier to analyse. The first-order difference is calculated as Y(t) − Y(t−1).
  • How does the first-order differenced series look compared to the original series?
  • Has differencing stabilised the statistical properties of the time series? A visual assessment tells me that the differencing has been successful in making the series more stationary, as I observe a stabilisation of the mean and a removal of trends. But determining stationarity may not always be possible through visual inspection.
  • NO!
  • Let’s try a higher-order differencing method, more precisely second-order differencing, and then run the ADF test once more (sketched in code after this list).
  • YES!
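And the sketch for the second-order differencing mentioned above, reusing the same hypothetical series:

```python
from statsmodels.tsa.stattools import adfuller

diff2 = series.diff().diff().dropna()      # second-order differencing
adf_stat, p_value, *_ = adfuller(diff2)
print(f"p-value after second-order differencing: {p_value:.4f}")
# A p-value below 0.05 lets us reject the unit-root null: the differenced series is stationary
```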

Autocorrelation Function (ACF):

  • A statistical tool called the Autocorrelation Function (ACF) calculates the correlation between a time series and its own lagged values. It facilitates the discovery of trends, patterns, and seasonality in the data. Correlation coefficients for various lags are shown in the ACF plot, making it possible to identify important lags and possible autocorrelation structures in the time series.
  • The ACF at lag k is the correlation between the time series and itself at lag k.
  • Mathematically, ACF(k) = Cov(Y(t), Y(t−k)) / sqrt(Var(Y(t)) · Var(Y(t−k)))
  • Positive ACF: High values at one time point may be related to high values at another time point if the ACF is positive, indicating a positive correlation.
  • Negative ACF: An inverse relationship between values at different times is suggested by a negative ACF, which denotes a negative correlation.
  • ACF values in the neighbourhood of zero signify little to no linear correlation.​
  • Lag Structure: Plotting the ACF against various lags usually reveals the correlation structure over time.
    Correlation between adjacent observations is represented by the first lag (lag 1), correlation between observations two time units apart by the second lag (lag 2), and so on.
  • Significance Analysis: ACF values are frequently subjected to significance testing in order to ascertain their statistical significance. Each lag is plotted with confidence intervals and an ACF value may be deemed statistically significant if it is outside of these intervals.
  • Seasons and Recurring Trends: Seasonality or recurring patterns in the time series may be indicated by peaks or valleys in the ACF plot at particular lags. A notable peak at lag 12 in a monthly time series, for instance, can indicate annual seasonality
  • How to Interpret Noisy ACF:  Autocorrelation values may indicate that the time series is less significant and more random if they oscillate around zero without any discernible pattern.
  • Identification of the Model: In time series models, the autoregressive (AR) and moving average (MA) terms can have potential parameters that can be found using the ACF plot. An AR term may be required if there is positive autocorrelation at lag 1, whereas an MA term may be required if there is negative autocorrelation at lag 1.
  • To give you a condensed explanation:
    • The presence of a significant positive peak at lag 1 implies a correlation between the value at time t and the value at time t−1, suggesting the possibility of autoregressive behaviour.
    • The presence of a noteworthy negative peak at lag 1 implies a negative correlation between the value at time t and the value at time t−1, suggesting the possibility of moving average behaviour.
    • In upcoming blogs, I’ll try to go into much more detail about these behaviours. Stay tuned!
  • Let’s analyse the Economic Indicators to see if we can answer these questions:
    • Are there significant autocorrelations at certain lags?
    • Does the ACF reveal any seasonality or repeating patterns?
  • This provides insights into the temporal dependencies within the ‘logan_intl_flights’ time series. The ACF plot indicates a strong positive correlation with the past one month (lag 1), meaning that the number of international flights in a given month tends to correlate positively with the number in the month before. This discovery can direct additional research and modelling, particularly when considering techniques like autoregressive models that capture these temporal dependencies.
  • Peaks or spikes signify significant autocorrelation at particular lags. The time series exhibits a strong correlation with its historical values at these spikes.
  • The autocorrelation’s direction and strength are shown on the y-axis. Perfect positive correlation is represented by a value of 1, perfect negative correlation by a value of -1, and no correlation is represented by a value of 0.
  • Generally speaking, autocorrelation values get lower as you get farther away from lag 0. A quick decay indicates that the impact of previous observations fades quickly, whereas a slow decay could point to longer-term dependencies.
  • Values of Autocorrelation: lag 0: The series’ correlation with itself is represented by the autocorrelation, which is always 1.
    lag 1 through lag 10: As the lag grows, the autocorrelation values progressively drop. Significantly, there is a strong positive correlation between consecutive observations at lag 1, as indicated by the high autocorrelation (0.87) at this point.
  • Confidence Intervals of 95%: The confidence intervals give the range within which the true population autocorrelation values are most likely to fall. Autocorrelation values are regarded as statistically significant if they lie outside of these ranges. The absence of zero in the lag 1 confidence interval ([0.6569, 1.0846]) indicates that the autocorrelation at lag 1 is statistically significant. This is consistent with the strong positive correlation clearly visible in the ACF plot (see the sketch after this list).
  • The significant positive correlation between the number of international flights in consecutive months is indicated by the high autocorrelation at lag 1 (0.87). This might suggest a level of momentum or forward motion in the sequence.
    As the lag lengthens, the autocorrelation values may show a diminishing impact from earlier observations. The statistical significance of the autocorrelation values is evaluated with the aid of the confidence intervals. Values that fall outside of these ranges are probably not random fluctuations but rather real correlations.

    • The ACF values gradually decrease as the lag increases, indicating a declining influence of past observations.
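The numeric autocorrelation values and their 95% confidence intervals quoted above can be obtained directly, again assuming the flight counts sit in a pandas Series named series (a hypothetical name):

```python
from statsmodels.tsa.stattools import acf

acf_values, conf_int = acf(series, nlags=10, alpha=0.05)
for lag, (value, (lo, hi)) in enumerate(zip(acf_values, conf_int)):
    print(f"lag {lag}: ACF = {value:.2f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```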

Forecasting a Random Walk:

  • We have already seen what a Random Walk is in the previous blog.
  • The underlying premise of the concept is that any deviation from the most recent observed value in a time series is essentially random and that future values in the series are dependent on this value.
  • The random walk, in spite of its simplicity, is a standard by which forecasting models are measured, particularly when predicting intricate patterns is difficult. 
  • Forecasting Process:
    • Initialization: The last observed value in the historical data is frequently used as the starting point for the forecasting process.
    • Iterative Prediction: For each successive time period, the forecast for the following observation is just the most recent observed value. Any deviations or modifications are presumed by the model to be random and unpredictable.
    • Evaluation: By comparing the predicted values to the actual observations, metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE) are commonly used to evaluate the performance of the random walk model (a sketch follows this list).
  • Use Cases and Limitations:
    • Baseline Comparison: Random walk forecasting is frequently used as a benchmark model to assess how well more complex forecasting methods perform. If a more sophisticated model is unable to beat the random walk, it indicates that it may be difficult to capture the inherent randomness in the data.
    • Short-Term Predictions: These models work well for short-term forecasts when the persistence assumption (refer to the last blog post) is somewhat true.
    • Limitations: The random walk model’s simplicity is both a strength and a weakness. Although it offers a baseline, it might miss intricate patterns or modifications to the time series’ underlying structure.
  • Time to answer some questions.
    • How well does the random walk model predict future international flights?
    • Does the model capture the inherent randomness in the time series?
  • What’s interesting to me is that, by its very nature, stationarity is not required for the random walk model. Actually, non-stationary time series are frequently subjected to the random walk method, and the effectiveness of this method is assessed by its capacity to capture short-term dynamics as opposed to long-term trends.
  • I went on to evaluate the model using the MAE metric and found that, on average, our random walk model’s predictions deviate by approximately 280.16 units from the actual international flight counts.
  • The interpretation of the MAE is dependent on the scale of our data. In this case, the MAE value of 280.16 should be interpreted in the context of the total international flight counts.
  • The performance of more complex forecasting models is measured against this MAE value. When a more complex model attains a lower mean absolute error (MAE), it signifies an enhancement in forecast precision in contrast to the naive random walk.
  • The evaluation metrics assist you in determining how well the random walk captures the underlying patterns in the ‘logan_intl_flights’ time series, even though it offers a baseline.
  • When considering the scale of the data, the Random Walk model’s average prediction error is approximately 5.33% of the maximum flight count.
  • On average, the international flights at Logan Airport amount to around 3940.51.
  • The MAE for the Mean Benchmark is approximately 578.06. That is, if a simple benchmark model that predicts the mean were used, the average prediction error would be approximately 578.06 units.
  • The Random Walk model, with an MAE of 280.16, outperforms a simple benchmark model that predicts the mean (MAE of 578.06). This indicates that the Random Walk captures more information than a naive model that predicts the mean for every observation.
  • The scale-adjusted MAE of 5.33% provides a relative measure, suggesting that, on average, the Random Walk’s prediction errors are modest in comparison to the maximum flight count.
  • When interpreting MAE, it’s essential to consider the specific context of your data and the requirements of your forecasting task.
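A minimal sketch of the random walk (naive) forecast and the mean benchmark comparison described in this list, assuming a 90/10 split of the hypothetical series:

```python
import numpy as np

n_train = int(len(series) * 0.9)
train, test = series[:n_train], series[n_train:]

# Random walk / naive forecast: each prediction is simply the previous observed value
rw_pred = series.shift(1).iloc[n_train:]

# Mean benchmark: every prediction is the historical average of the training data
mean_pred = np.full(len(test), train.mean())

mae_rw = np.mean(np.abs(test.values - rw_pred.values))
mae_mean = np.mean(np.abs(test.values - mean_pred))
print(f"Random walk MAE: {mae_rw:.2f}, mean benchmark MAE: {mae_mean:.2f}")
```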

Moving Average Model: MA(q):

  • The Moving Average Model, also known as MA(q), is a time series model that takes into account how previous random or “white noise” terms may have affected the current observation. The moving average model’s order, or the number of previous white noise terms that are taken into account, is indicated by the notation MA(q). For instance, MA(1) takes the most recent white noise term’s impact into account.
  • What is the optimal order (q) for the moving average model?
  • How does the MA(q) model compare to simpler models in predicting international flights?
  • We will be answering these questions next time.

 

Day 1 with “Time Series Forecasting with Python” by Marco Peixeiro

I feel compelled to share the pearls of knowledge this insightful book on time series analysis has bestowed upon me as I delve deeper into its pages.

  1. Defining Time Series: A data point sequence arranged chronologically is called a time series. It is a compilation of measurements or observations made at regular intervals that are equally spaced apart. Time series data are widely used in many disciplines, such as environmental science, biology, finance, and economics. Understanding the underlying patterns, trends, and behaviours that might be present in the data over time is the main goal when working with time series. Time series analysis is the study of modelling, interpreting, and projecting future values from past trends.
  2. Time Series Decomposition: A technique for dissecting a time series into its fundamental elements (trend, seasonality, and noise) is called time series decomposition. These elements help us comprehend data patterns more clearly.
    • Trend: The data’s long-term movement or direction. It aids in determining if the series is steadily rising, falling, or staying the same over time.
    • Seasonality: Seasonal components identify recurring, regular patterns in the data that happen on a regular basis. Retail sales, for instance, may show seasonality, with higher values around the holidays.
    • Noise (or Residuals): This element stands for the sporadic variations or anomalies in the data that are not related to seasonality or trends. It is the time series’ “unexplained” portion, in essence.

    Decomposing a time series into these components aids in better understanding the data’s structure, facilitating more accurate forecasting and analysis.

  3. Forecasting Project Lifecycle: A forecasting project entails making predictions about future trends or outcomes using historical data. This lifecycle usually has multiple stages:
    • Data Collection
    • Exploratory Data Analysis (EDA): Examine the data to find patterns, outliers, and other features that might have an impact on the forecasting model.
    • Model Selection: Choose an appropriate forecasting model based on the nature of the data. Common models include ARIMA (AutoRegressive Integrated Moving Average), Exponential Smoothing, and machine learning algorithms.
    • Training the Model: Utilise past data to train the chosen model. This entails fitting the parameters of the model to the historical observations.
    • The usuals: Validation and Testing, Deployment, Monitoring and Maintenance

    In order to guarantee precise and current forecasts, the forecasting project lifecycle is iterative, requiring frequent updates and modifications.

  4. Baseline Models: Simple benchmarks or reference points for more complex models are provided by baseline models. They offer a minimal level of prediction, which is expected to be surpassed by more sophisticated models. In order to determine whether the increased complexity of a more complex model is warranted by its superior performance over a simpler method, baseline models are essential.
    • Mean or Average Baseline: This is projecting a time series’ future value using the mean of its historical observations. The mean baseline, for instance, would be the average temperature over a given historical period if you were forecasting the daily temperature.
    • Naive Baseline: Using the most recent observation as a basis, this model forecasts the future value. This refers to the time series assumption that the subsequent value will coincide with the most recent observed value.
    • Seasonal Baseline: The seasonal baseline is a method of forecasting future values for time series that exhibit a distinct seasonal pattern by utilising the average historical values of the corresponding season.

    Baseline models are essential for establishing a performance baseline and ensuring that any advanced model provides a significant improvement over these simple approaches.

    1. Random Walk Model: For time series forecasting, the random walk model is a straightforward but surprisingly powerful baseline. It assumes that any variations are totally random and that a time series’ future value will be equal to its most recent observed value. It can be stated mathematically as Y(t) = Y(t−1) + E(t), where Y(t) is the value at time t, Y(t−1) is the most recent observed value, and E(t) is a random error term.

      Key characteristics of the random walk model:

      1. Persistence: According to the model, the present is the best indicator of the future. In the event that the series exhibits non-stationarity, a trend will be followed by the random walk.
      2. Noisy Movements: The model can capture noise or short-term fluctuations because of the random error term E, which adds randomness.
      3. Usefulness as a Baseline: Even though it is straightforward, the random walk model can be surprisingly successful for some kinds of time series, particularly those that have erratic, unpredictable movements.

      When determining whether more intricate models result in appreciable increases in prediction accuracy, the random walk model is frequently employed as a benchmark. If an advanced model is unable to beat the random walk, it may be difficult to identify the underlying patterns in the data.

Let’s try exploring the ‘Economic Indicators’ dataset from Analyze Boston and try to see what the baseline(mean) for Total International flights at Logan Airport looks like.

The computation of the historical average of international flights at Logan Airport forms the basis of the baseline model. For this simple measure, depending on the temporal granularity of the dataset, the mean of international flights is calculated for each unit of time, such as months or years. The basis for forecasting upcoming international flights is the computed historical average.

This baseline assumes that future values will mirror the historical average, in contrast to the naive formula introduced earlier, Y(t) = Y(t−1) + E(t), which simply carries the last observed value forward.
The blue and red lines’ alignment in the visualization, or lack thereof, offers a quick indicator of how well the baseline model is performing (a sketch of this comparison follows below). A tight harmony implies that the model encapsulates the essence of historical trends, providing a strong foundation for more complex models that come after. On the other hand, deviations invite a more thoughtful analysis, as they indicate possible shortcomings in the baseline model and encourage the investigation of more sophisticated forecasting techniques.
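A minimal sketch of that blue-versus-red comparison, assuming the flight counts are in a pandas Series named series indexed by date (a hypothetical name):

```python
import matplotlib.pyplot as plt

baseline = series.mean()  # historical average of international flights

plt.plot(series.index, series.values, color="blue", label="Observed international flights")
plt.axhline(baseline, color="red", linestyle="--", label="Mean baseline")
plt.legend()
plt.title("Logan international flights vs. mean baseline")
plt.show()
```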

A Comparative Exploration of Time Series Analysis and Predictive Modeling Techniques

In the vast landscape of data analysis, knowing the subtle differences between different methods is essential. Time series analysis is one area of expertise that has been specifically developed for analysing data that has been gathered over time. The unique advantages and disadvantages of each strategy are revealed as we compare time series methods with conventional predictive modelling.

The sentinel for tasks where temporal dependencies weave the narrative is time series analysis, represented by models such as ARIMA and Prophet. Moving averages, differencing, and autoregression are the three elegant ways that ARIMA captures elusive trends and seasonality, while Prophet—which was developed by Facebook—manages missing data and unusual events with skill.

On the flip side, conventional predictive models such as random forests, decision trees, and linear regression offer a flexible set of tools. Their broad strokes, however, might become unwieldy in the complex brushwork of dynamic temporal patterns. The assumption of linearity in linear regression may cause it to miss the complex dance of trends, and decision trees and random forests may fail to capture subtle long-term dependencies.

Heavyweights in machine learning, like neural networks and support vector machines, are adept at a variety of tasks, but they might not have a sophisticated perspective on temporal nuances. It may be difficult for even K-Nearest Neighbours to understand the language of time due to its simplicity.

In summary, the selection between time series analysis and conventional predictive modelling is contingent upon the characteristics of the available data. When it comes to figuring out the complexities of temporal sequences, time series methods come out on top. They offer a customised method for identifying and forecasting patterns over time that generic models might miss. Knowing the strengths and weaknesses of each data navigation technique helps us select the appropriate tool for the job at hand as we navigate the data landscape.

Let’s dive deeper into a few important concepts

  1. Stationarity: A key idea in time series analysis is stationarity. A time series that exhibits constant statistical attributes over time, like mean, variance, and autocorrelation, is referred to as stationary. A time series may be non-stationary if it exhibits seasonality or a trend.
    • Strict Stationarity: Distribution moments (mean, variance, etc.) are constant over time.
    • Trend Stationarity: Only mean is constant over time, but the variance may change.
    • Difference Stationarity: By differencing, the time series is made stationary. It is a first-order difference stationary series if the series becomes stationary after differencing once.
    • How do you check for stationarity?
      • Visual Inspection: Plot the time series data and look for trends or seasonality.
      • Summary Statistics: Compare mean and variance across different time periods.
      • Statistical Tests: Augmented Dickey-Fuller (ADF) test and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
  2. KPSS Test: It is used to test for stationarity of a time series around a deterministic trend. It has null and alternative hypotheses:
    • Null Hypothesis (H0): The time series is stationary around a deterministic trend.
    • Alternative Hypothesis (H1): The time series has a unit root and is non-stationary.
    • When the test’s p-value is less than the predetermined significance level (usually 0.05), the null hypothesis is rejected and non-stationarity is suggested.
  3. Unit Root Test: A unit root test determines if a time series has or does not have a unit root, which is a feature of a non-stationary time series. One common unit root test is the Augmented Dickey-Fuller (ADF) test.
    • ADF Test:
      • Null Hypothesis (H0): The time series has a unit root and is non-stationary.
      • Alternative Hypothesis (H1): The time series is stationary.
      • If the p-value is less than the significance level, the null hypothesis is rejected and it is concluded that the time series is stationary (both tests are sketched in code after this list).
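Both tests are available in statsmodels; here is the sketch referred to above, again assuming a pandas Series named series (a hypothetical name). Note that the two tests have opposite null hypotheses, so they complement each other:

```python
from statsmodels.tsa.stattools import adfuller, kpss

adf_stat, adf_p, *_ = adfuller(series.dropna())
kpss_stat, kpss_p, _, _ = kpss(series.dropna(), regression="ct")  # "ct": stationary around a trend

print(f"ADF p-value:  {adf_p:.4f}  (small value -> reject unit root -> stationary)")
print(f"KPSS p-value: {kpss_p:.4f}  (small value -> reject stationarity)")
```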

Report – Beyond the Headlines: Deep Dive into Police Shootings Data and Patterns

I am delighted to present this comprehensive report, offering a nuanced and insightful analysis of police shooting incidents. Through meticulous examination of data, statistical tests, and machine learning techniques, we aim to shed light on patterns, demographic disparities, and predictive insights. This report represents a dedicated effort to unravel the complexities surrounding police shootings, contributing to a more informed understanding of this critical issue.

Report_2_MTH522 (1)

Decoding Patterns: Analyzing Age-Related Trends in Weapon Use

I used a dataset from The Washington Post covering the years 2015 to 2023 to investigate age-related trends in the usage of weapons in fatal occurrences. I thoroughly analysed the data using Python, and ANOVA statistical testing showed a statistically significant age-related trend in the sorts of weapons used. Although some data overlap was visible, the first box plot visualisation showed age variances across various weapon kinds.

I used a violin plot to improve the visualisation for a sizable dataset. After encoding categorical characteristics, I then investigated using a logistic regression model to predict a person’s likelihood of carrying a weapon based on their age. Problems surfaced, such as convergence warnings and vague metrics, which led to changes like feature scaling and report editing.

The final model had a 58% accuracy rate, and the classification report that was produced indicated areas that needed work, especially with handling imbalanced classes and multi-label prediction. My recommendations for improvement included correcting imbalances, investigating more intricate models, and taking into account further features to improve prediction accuracy. The thorough investigation of weapon usage patterns highlighted how iterative data analysis and model construction are processes.

 

The classification report comprises three key components: F1-score (harmonic mean of precision and recall), recall (sensitivity, capturing actual positives), and precision (accuracy of positive predictions). These metrics evaluate the model’s precision in identifying and categorising occurrences, which is essential for understanding its efficacy across the dataset’s diverse classes.

Simple Stats of Random Forest Success

Let’s dive into the statistical nuances of the random forest algorithm. Essentially, a random forest is an ensemble learning technique that works by building a large number of decision trees during training and producing the mean prediction (regression) or the mode of the classes (classification) of the individual trees. To break it down statistically:

  • Decision Trees
    • In essence, every tree in the forest is a sequence of binary choices made in response to the input data. To make these choices, the best split is chosen at each node, typically using entropy or Gini impurity for classification and mean squared error for regression. Individual trees are highly sensitive to the training data, which can result in high variance.
  • Bootstrapping
    • Random forests introduce randomness by training each tree on a bootstrapped sample of the original data. Bootstrapping is the process of creating several datasets, some of which may contain duplicates, by sampling with replacement.
  • Feature Randomness:
    • The approach takes into account only a random selection of features at each split in a tree, not all of them. By doing this, the trees become more diverse and overfitting is less likely to occur.
  • Voting or Averaging:
    • A majority vote among the trees determines the final forecast for classification. It is the mean of all the trees’ predictions in regression.
  • Tree Correlation:
    • The randomness introduced during tree construction, together with the inherent randomness in the data, reduces the correlation between any two trees. This is advantageous, since less-correlated trees lower overfitting and enhance overall prediction performance.
  • Out of Bag Errors:
    • There are data points that are not part of every tree’s training set because every tree is trained using a bootstrapped sample. These out-of-bag samples can be used to estimate the performance of the random forest without a separate validation set.
  • Tuning Parameters
    • The number of trees, the depth of each tree, and the size of the feature subsets utilised at each split are some of the characteristics that affect random forests. To maximise the random forest’s performance, certain settings must be tuned.

I used the random forest algorithm to see if I can predict the armed status in incidents using the age and race. The Random Forest model achieved an overall accuracy of 57% in predicting the ‘armed’ status. It excelled in identifying instances of ‘gun’ (97% recall, 73% F1-Score), but struggled with other classes, often resulting in 0% precision, recall, and F1-Score.  The classification report suggests limitations in recognizing diverse ‘armed’ scenarios. Improvements may involve hyperparameter tuning, addressing class imbalances, and exploring alternative models. Or maybe if I combine all kinds of arms and encode it, the model would perform better.
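For reference, here is a minimal sketch of that experiment with scikit-learn; the DataFrame df and the column names ('age', 'race', 'armed') are assumptions based on the description above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df[["age", "race"]], columns=["race"])  # one-hot encode race, keep age numeric
y = df["armed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```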

The Metrics:

  • Precision: It represents the ratio of true positive predictions to the total predicted positives. For example, for the class ‘gun,’ the precision is 59%, meaning that 59% of instances predicted as ‘gun’ by the model were actually ‘gun’ incidents.
  • Recall (Sensitivity): It represents the ratio of true positive predictions to the total actual positives. For ‘gun,’ the recall is 97%, indicating that the model captured 97% of the actual ‘gun’ incidents.
  • F1-Score: It is the harmonic mean of precision and recall. It provides a balanced measure between precision and recall. For ‘gun,’ the F1-Score is 73%.

Philosophical Foundations of Statistical Inference

Let’s talk about two of the common approaches to statistics today.

  1. Probability:
    • In Frequentist Statistics (FS), probability is interpreted as nothing but the long-run frequency of events in a repeated, hypothetically infinite sequence of trials. It’s based on the idea of objective randomness.
    • Bayesian statistics (BS) views probability as a measure of belief or uncertainty. It incorporates prior beliefs and updates them based on new evidence using Bayes’ theorem.
  2. Parameter estimation:
    • FS: The focus is on estimating fixed, unknown parameters from observed data. This estimation is done using methods like maximum likelihood estimation (MLE).
    • BS: Bayesian inference provides a probability distribution for the parameters, incorporating prior knowledge and updating it with observed data to get a posterior distribution.
  3. Hypothesis testing:
    • FS: Frequentist hypothesis testing involves making decisions about population parameters based on sample data. It often uses p-values to determine the level of significance.
    • BS: Bayesian hypothesis testing involves comparing the probabilities of different hypotheses given the data. It uses posterior probabilities and Bayes factors to make decisions.

I used a Bayesian t-test strategy to take this prior knowledge into account, because I firmly believe that the difference in average ages is approximately 7 and that it is statistically significant. The results, however, revealed an intriguing discrepancy: the observed difference was located towards the posterior distribution’s tail. I did not appreciate this disparity, but it demonstrated how sensitive Bayesian analysis is to the prior specification.

To investigate the difference in average ages between black and white men further, I used frequentist statistical methods, namely Welch’s t-test, in addition to the Bayesian approach. Welch’s t-test revealed a very important finding:

 

In addition to the Bayesian studies, this frequentist method consistently and clearly indicates a large difference in mean ages. My confidence in the observed difference in average ages between the two groups is increased by the convergence of results obtained using frequentist and Bayesian approaches.

The investigation of average age differences reveals a complex terrain when comparing the Bayesian and frequentist approaches. Frequentist methods give clear interpretability, while Bayesian methods allow flexibility and measurement of uncertainty. The tension seen in the Bayesian t-test highlights how crucial it is to carefully specify the model and take previous data into account. Future analyses can gain from a comprehensive strategy that combines frequentist and Bayesian methodology’s advantages.
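For completeness, a Welch’s t-test like the one described above can be run with SciPy; ages_black and ages_white are hypothetical arrays holding the ages for the two groups:

```python
from scipy import stats

# equal_var=False turns the standard t-test into Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(ages_black, ages_white, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```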

Efficient Geospatial Visualization of Police Shooting Incidents: Navigating Complexity with HeatMap and Marker Clusters

The problem in visualising a big collection of geocoordinated police shooting incidents is to effectively transmit information without overloading the viewer. First, I tried using the Folium library to create separate markers on a map to represent each incidence. But as the dataset expanded, the computing cost of making markers for each data point increased.

I decided to use a HeatMap as a more effective solution to this problem. With the help of the HeatMap, incident concentration may be shown more succinctly and the distribution of events on the map can be seen more clearly. I improved the heatmap’s interpretability by controlling its size and intensity using settings like blur and radius.

Alongside the HeatMap, I initially also used a MarkerCluster layer to further refine the visualisation, but later decided against it. Although the MarkerCluster increases the general legibility of the map by clustering nearby incidents together, I excluded it to maintain simplicity and reduce processing time, especially when dealing with a substantial dataset. The HeatMap alone provides a more concise representation of incident concentration while addressing the computational challenges associated with handling a large number of individual markers. When clustering is used, individual markers stay accessible within the clusters, enabling users to zoom in and examine individual instances.
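A minimal sketch of that HeatMap setup with Folium; the DataFrame, column names, and map centre are placeholders rather than the exact values used for the report:

```python
import folium
from folium.plugins import HeatMap

# coords: list of [latitude, longitude] pairs extracted from the incidents dataset
coords = df[["latitude", "longitude"]].dropna().values.tolist()

m = folium.Map(location=[39.8, -98.6], zoom_start=4)  # rough centre of the continental US
HeatMap(coords, radius=12, blur=15).add_to(m)         # radius and blur control size and intensity
m.save("police_shootings_heatmap.html")
```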

The outcome of this thorough process was an engaging and educational map: the HeatMap offers a visual summary of incident concentration, while a MarkerCluster layer remains an option for examining individual episodes within clusters. This visualisation method takes into account both specific occurrences and large-scale patterns in the dataset, providing a nuanced perspective on the geographical distribution of police shooting incidents while maintaining efficiency and detail.