ARIMA (AutoRegressive Integrated Moving Average) is a popular time series forecasting method that models a time series as a combination of autoregressive (AR) and moving average (MA) components, with the added ability to handle non-stationary time series through differencing.
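For reference, the description above is often written compactly using a backshift operator. The notation below is an assumption of this sketch, not something taken from the example that follows: B is the backshift operator (B y_t = y_{t-1}), the phi_i are AR coefficients, the theta_j are MA coefficients, c is a constant, and epsilon_t is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

Here p is the number of autoregressive lags, d is the number of times the series is differenced, and q is the number of moving average lags.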
Let’s consider an example: forecasting the monthly sales of a particular product from two years of history, January 2019 to December 2020. Here is the data:
Month Sales
Jan-19 100
Feb-19 120
Mar-19 125
Apr-19 130
May-19 140
Jun-19 145
Jul-19 150
Aug-19 155
Sep-19 160
Oct-19 170
Nov-19 180
Dec-19 200
Jan-20 210
Feb-20 220
Mar-20 230
Apr-20 240
May-20 250
Jun-20 260
Jul-20 270
Aug-20 280
Sep-20 290
Oct-20 300
Nov-20 310
Dec-20 320
We want to build an ARIMA model to forecast the sales for the next 6 months (Jan-21 to Jun-21).
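The code later in this example reads the data from a file named sales_data.csv. As a minimal sketch, assuming that file simply holds the table above with a Month and a Sales column, it could be created with the standard library as follows. The helper names month_labels and write_sales_csv are hypothetical and exist only for this illustration.

```python
import csv

# Monthly sales copied from the table above (Jan-19 to Dec-20).
SALES = [100, 120, 125, 130, 140, 145, 150, 155, 160, 170, 180, 200,
         210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320]

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_labels(start_year=2019, n_months=24):
    """Build labels such as 'Jan-19' for consecutive months from January."""
    return [
        f"{MONTH_NAMES[i % 12]}-{(start_year + i // 12) % 100:02d}"
        for i in range(n_months)
    ]


def write_sales_csv(path="sales_data.csv"):
    """Write the table to a CSV file with a Month,Sales header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Month", "Sales"])
        writer.writerows(zip(month_labels(), SALES))
```

Calling write_sales_csv() once produces the file in the current working directory.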
First, we need to check if the time series is stationary. The AR and MA components of an ARIMA model assume a stationary series, which is why the model differences the data when needed. A stationary time series has a constant mean and variance, and an autocorrelation that depends only on the lag between observations, not on when they occur. We can check for stationarity using a statistical test like the Augmented Dickey-Fuller (ADF) test. Here’s the Python code to perform the test:
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Load the data
data = pd.read_csv('sales_data.csv')

# Convert the Month column to datetime format
data['Month'] = pd.to_datetime(data['Month'], format='%b-%y')

# Set the Month column as the index
data.set_index('Month', inplace=True)

# Perform the ADF test
result = adfuller(data['Sales'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
The output of the ADF test is:
ADF Statistic: -1.781072297230788
p-value: 0.3949925809654875
The p-value is greater than 0.05, so we fail to reject the null hypothesis that the time series has a unit root, i.e., that it is non-stationary. This suggests that we need to difference the time series to achieve stationarity.
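The decision rule used above can be written down explicitly. This is a minimal sketch: the function name interpret_adf and the default 0.05 threshold are assumptions for illustration and are not part of statsmodels.

```python
def interpret_adf(p_value, alpha=0.05):
    """Return a short verdict for an ADF p-value at significance level alpha."""
    # Null hypothesis of the ADF test: the series has a unit root,
    # i.e., it is non-stationary. A small p-value is evidence against it.
    if p_value < alpha:
        return "reject the unit-root null: evidence of stationarity"
    return "fail to reject the unit-root null: treat as non-stationary"


# p-value reported for the original sales series
print(interpret_adf(0.3949925809654875))
```

Note the direction of the test: a large p-value does not prove non-stationarity, it only means the data give no strong evidence against it.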
To difference the time series, we can take the first difference (i.e., the difference between each observation and the previous observation). Here’s the Python code to do that:
# Take the first difference
data_diff = data.diff().dropna()

# Perform the ADF test again
result = adfuller(data_diff['Sales'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
The output of the ADF test is:
ADF Statistic: -4.249799122750079
p-value: 0.0005452049051191423
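To make the differencing step above concrete, the sketch below computes the first differences by hand from the values in the table, without pandas. The function name first_difference is hypothetical. It should yield the same values as data.diff().dropna(), which returns them as floats.

```python
# Monthly sales copied from the table (Jan-19 to Dec-20).
sales = [100, 120, 125, 130, 140, 145, 150, 155, 160, 170, 180, 200,
         210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320]


def first_difference(values):
    """Return each observation minus the one before it."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


diffs = first_difference(sales)

# Month-over-month changes for Feb-19 through Jun-19
print(diffs[:5])  # [20, 5, 5, 10, 5]
```

The differenced series has one fewer observation than the original (23 instead of 24), because the first month has no predecessor to subtract.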
The p-value is now less than 0.05, so we reject the null hypothesis and treat the differenced series as stationary. Next, we need to determine the appropriate values for the ARIMA parameters: p, d, and q. The p parameter