Trend Analysis: STL Decomposition

Let us return to our Economic Indicators dataset to discuss trends and seasonality.

We can determine whether there is a long-term trend in any of the economic indicators (like the number of international flights at Logan International Airport) using time series analysis techniques. A common first step is to plot the historical international flight counts over time. This plot lets you visually inspect the data for patterns, trends, and potential seasonality.

  • From 2013 to 2015, there is an initial upward trend, indicating a gradual increase in the number of international flights.
  • From 2015 to 2016, the plot steepens, indicating a faster increase in international flights during this time period. This could indicate a period of rapid growth or a shift in influencing factors.
  • From 2016 to 2018, the plot returns to roughly the same slope as in the first period (2013 to 2015). This indicates sustained growth at a relatively consistent rate, as opposed to the sharper increase seen from 2015 to 2016.

To extract underlying trends, we could also calculate and plot a moving average or use more advanced time series decomposition methods. These techniques can assist in identifying any long-term patterns or fluctuations in international flight counts, providing valuable insights into the airport’s long-term dynamics.

Moving Average: A moving average is a statistical calculation that is used to analyse data points by calculating a series of averages from different subsets of the entire dataset. A “window” denotes the number of data points in each average. The term “rolling” refers to the calculation being performed iteratively for each successive subset of data points, resulting in a smooth trend line.

  • Let’s take the window as 12.
  • Averages and subsets: Each subset of 12 consecutive points in the dataset is considered by the moving average.
    • For each subset, it computes the average of these 12 points.
    • As the window moves across your flight data, there will be N-12+1 subsets (N is the total number of data points you have).
    • Interpreting the average (x): Assume the first window’s average is x. This x is the smoothed value for that window’s time point: the 12-month average of the ‘logan_intl_flights’ value.
  • Highlighting a Trend: The moving average’s purpose is to smooth out short-term fluctuations or noise in the data.
  • The Orange Line in Context: The orange line’s peaks and valleys represent data trends over the specified window size. If the orange line rises, it indicates an increasing trend over the selected time period. If it is falling, it indicates a downward trend.
    When compared to the original data, the fluctuations are less abrupt, making it easier to identify overall trends.
  • The light orange region typically represents a confidence interval around the moving average. A wider band indicates that the data points within each window are more variable, so the average is less certain.
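The moving-average steps above can be sketched in pandas as follows. This is a minimal illustration, not the code behind the plot: the data is synthetic (a made-up stand-in for the ‘logan_intl_flights’ column), and the shaded band is assumed to be the moving average plus or minus one rolling standard deviation, which may differ from how the original light orange region was computed.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the real 'logan_intl_flights' column
# (assumption: 72 monthly observations with a trend, a yearly cycle, and noise).
idx = pd.date_range("2013-01-01", periods=72, freq="MS")
t = np.arange(72)
rng = np.random.default_rng(0)
flights = pd.Series(
    3000 + 15 * t + 300 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 50, 72),
    index=idx,
    name="logan_intl_flights",
)

window = 12
rolling = flights.rolling(window=window)
moving_avg = rolling.mean()  # the first window-1 entries are NaN (incomplete windows)
moving_std = rolling.std()   # spread of the 12 points inside each window

# Assumed band: +/- one rolling standard deviation around the moving average.
lower = moving_avg - moving_std
upper = moving_avg + moving_std

# Number of complete windows should match N - 12 + 1.
n_windows = int(moving_avg.notna().sum())
print(n_windows, len(flights) - window + 1)
```

Note that pandas labels each window’s average with the last date in that window by default, so the smoothed line starts at the 12th observation.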

STL, or Seasonal-Trend decomposition using LOESS, is a time series decomposition method that divides a time series into three major components: trend, seasonal, and residual. Each component is described below, after a short note on frequency:

  • Before we do the STL decomposition, let’s understand what frequency is. Because STL is based on the assumption of regular intervals between observations, having a well-defined frequency is critical.
    • Not every time series inherently has a well-defined frequency. While some time series data may naturally exhibit a regular and consistent pattern at specific intervals, others may not follow a clear frequency.
    • How do you check whether your time series has a well-defined frequency?
    • After you set the Date as your index, you can check df.index.freq to see what the frequency is. If it’s ‘None’, you will have to set it explicitly.
    • df.index.to_series().diff().value_counts(): This shows the counts of the different time intervals between consecutive observations. The most common interval is likely the frequency of your time series, and the output helps verify that the chosen frequency aligns with the actual structure of the data. For monthly data, expect a mix of 28- to 31-day gaps rather than a single value.
    • df.index.freq = ‘MS’: This explicitly sets the frequency of the time series to ‘MS’, which stands for “Month Start.”
    • Based on this frequency, STL decomposes the time series into seasonal, trend, and residual components. By setting the frequency explicitly, you ensure that STL interprets the data correctly and captures the desired patterns.
  • Trend Component: The trend component in time series data represents the long-term movement or underlying pattern. It is created by smoothing the original time series with locally estimated scatterplot smoothing (LOESS). LOESS is a non-parametric regression technique that fits a smooth curve to the data.

  • Seasonal Component: The seasonal component captures recurring patterns or cycles that occur over a set period of time.
    STL, unlike classical seasonal decomposition, does not force the seasonal pattern to be identical every year: the period stays fixed (for example, 12 months), but the shape and size of the seasonal pattern can change gradually over time. The seasonal component values represent systematic and repetitive patterns that occur at a specific frequency, typically associated with seasons or other regular data cycles. These values, which can be positive or negative, indicate the magnitude and direction of the seasonal effect.

  • Residual Component: After removing the trend and seasonal components, the residual component represents the remaining variability in the data. It is the “noise” or irregular fluctuations in the time series that are not accounted for by trend and seasonal patterns.

Now what is this residual component? If you look, your original time series data has various patterns, including seasonal ups and downs, a long-term trend, and some random fluctuations. Residuals are essentially the leftover part of your data that wasn’t explained by the identified patterns. They are like the “random noise” or “unpredictable” part of your data. Each residual is nothing but the difference between the observed value and the value predicted by the trend and seasonal components together. I plan on addressing the entire logic behind this in a separate blog.

So technically, if your model is good, the residuals should resemble random noise with no discernible structure or pattern.
And if you see a pattern in your residuals, it means your model did not capture all of the underlying dynamics.

Which months do you believe have the highest number of international flights? Take a wild guess. Is it around Christmas, Thanksgiving, umm.. Valentine’s Day?

We can isolate and emphasise the recurring patterns inherent in international flight numbers by using the seasonal component obtained from STL decomposition rather than the original data. The seasonal component represents the regular, periodic fluctuations, allowing us to concentrate on the recurring variations associated with different months.

By analysing this component, we can identify specific months with consistently higher or lower international flight numbers, allowing us to gain a better understanding of the dataset’s seasonal patterns and trends. This method aids in the discovery of recurring behaviours that may be masked or diluted in raw, unprocessed data.

The month with the highest average seasonal component is represented by the tallest bar. This indicates that, on average, that particular month sits furthest above the trend, so it tends to have the most international flights. Shorter bars, on the other hand, represent months with lower average seasonal effects, indicating periods with lower international flight numbers. Comparing the heights of these bars shows which months have consistently higher or lower international flight numbers.
