class: center, middle, inverse, title-slide

.title[
# Nested Forecasting Approach and modeltime Methods
]
.author[
### Dr. Kam Tin Seong
Assoc. Professor of Information Systems
]
.institute[
### School of Computing and Information Systems,
Singapore Management University
]
.date[
### 2022-07-16 (updated: 2022-07-30)
]

---
# Content

.vlarge[
+ The motivation
+ Principles of the nested forecasting approach
+ Nested forecasting processes
+ Nested forecasting with modeltime
]

---
## Motivation of the Nested Forecasting Approach

.large[
In real-world practice, it is very common that a forecaster is required to forecast multiple time series by fitting multiple models.
]

---
class: center, middle

# Nested Forecasting

.small[
Source: [Getting Started with Modeltime](https://business-science.github.io/modeltime/articles/nested-forecasting.html)
]

---
## Setting Up R Environment

.pull-left[
For the purpose of this hands-on exercise, the following R packages will be used.

```r
pacman::p_load(tidyverse, tidymodels,
               timetk, modeltime)
```

+ [**tidyverse**](https://www.tidyverse.org/) provides a collection of commonly used functions for importing, wrangling and visualising data. In this hands-on exercise, the main packages used are readr, dplyr, tidyr and ggplot2.
]

.pull-right[
+ [**modeltime**](https://business-science.github.io/modeltime/index.html) is a new time series forecasting package designed to speed up model evaluation, selection, and forecasting. modeltime does this by integrating the [**tidymodels**](https://www.tidymodels.org/) machine learning ecosystem of packages into a streamlined workflow for tidyverse forecasting.
]

---
## The data

.pull-left[
In this sharing, data from the Kaggle competition [**Store Sales - Time Series Forecasting: Use machine learning to predict grocery sales**](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview) will be used. The main data set used is *train.csv*. It consists of six columns. They are:

+ *id* contains a unique id for each record.
+ *date* gives the sales dates.
+ *store_nbr* identifies the store at which the products are sold.
+ *family* identifies the type of product sold.
+ *sales* gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
+ *onpromotion* gives the total number of items in a product family that were being promoted at a store at a given date.
]

--
.pull-right[
For the purpose of this sharing, I will focus on grocery sales instead of all products. The code chunk below is used to extract grocery sales from *train.csv* and save the output in rds file format for subsequent use.

```r
grocery <- read_csv(
  "data/store_sales/train.csv") %>%
  filter(family == "GROCERY") %>%
  write_rds(
    "data/store_sales/grocery.rds")
```
]

---
### Step 1: Data Import and Wrangling

.pull-left[
In the code chunk below, `read_rds()` of the **readr** package is used to import grocery.rds into the R environment. Then, `mutate()`, `across()` and `as.factor()` are used to convert all values in columns 1, 3 and 4 into factor data type.

```r
grocery <- read_rds(
  "data/store_sales/grocery.rds") %>%
  mutate(across(c(1, 3, 4), as.factor)) %>%
  filter(date >= "2015-01-01")
```
]

--
.pull-right[
In the code chunk below, `read_csv()` is used to import the *stores.csv* file into the R environment. Then, `mutate()`, `across()` and `as.factor()` are used to convert values in columns 1 to 5 into factor data type.

```r
stores <- read_csv(
  "data/store_sales/stores.csv") %>%
  mutate(across(c(1:5), as.factor)) %>%
  select(store_nbr, cluster)
```
]

---
### Data integration and wrangling

.pull-left[
In the code chunk below, `left_join()` of the **dplyr** package is used to join the *grocery* and *stores* tibble data frames by using *store_nbr* as the unique key field.

```r
grocery_stores <- left_join(
  x = grocery,
  y = stores,
  by = "store_nbr")
```
]

--
.pull-right[
In the code chunk below, a new tibble data frame called *grocery_cluster* is derived by summing sales by the cluster and date fields.
```r
grocery_cluster <- grocery_stores %>%
  group_by(cluster, date) %>%
  summarise(value = sum(sales)) %>%
  select(cluster, date, value) %>%
  set_names(c("id", "date", "value")) %>%
  ungroup()
```
]

---
### Visualising the time series data: The code chunk

.pull-left[
It is always good practice to visualise the time series graphically.

```r
grocery_cluster %>%
  group_by(id) %>%
  plot_time_series(
    date, value,
    .line_size = 0.4,
    .facet_ncol = 5,
    .facet_scales = "free_y",
    .interactive = FALSE,
    .smooth_size = 0.4)
```
]

---
### Visualising the time series data: The plot

---
## Preparation for Nested Forecasting

.pull-left[
Before fitting the nested forecasting models, there are two key components that we need to prepare:

+ **Nested Data Structure:** Most critical to ensure your data is prepared correctly (covered next).
+ **Nested Modeltime Workflow:** This stage is where we create many models, fit the models to the data, and generate forecasts at scale.
]

---
### Step 2: Preparing the Nested Time Series Data Frame

.pull-left[
.large[
There are three major steps in preparing the nested time series data frame. They are:

+ Creating an initial data frame and extending it to the future,
+ Transforming the tibble data frame into a nested modeltime data frame, and
+ Splitting the nested data frame into training and test (hold-out) data sets.
]]

---
### **Creating initial data frame and extending to the future**

.pull-left[
First, we will create a new data table and extend the time frame 60 days into the future by using [`extend_timeseries()`](https://business-science.github.io/modeltime/reference/prep_nested.html) of modeltime.
```r
nested_tbl <- grocery_cluster %>%
  extend_timeseries(
    .id_var = id,
    .date_var = date,
    .length_future = "60 days")
```
]

---
### Nesting the tibble data frame

.pull-left[
Next, [`nest_timeseries()`](https://business-science.github.io/modeltime/reference/prep_nested.html) is used to transform the data frame created in the previous slide into a nested data frame by grouping on the values in the *id* field.

```r
nested_tbl <- nested_tbl %>%
  nest_timeseries(
    .id_var = id,
    .length_future = 60,
    .length_actual = 17272)
```

Notice that the nested data frame consists of three fields, namely *id*, *.actual_data* and *.future_data*.
]

---
### Data sampling

.pull-left[
Lastly, `split_nested_timeseries()` is used to split the original data into training and testing (or hold-out) data sets.

```r
nested_tbl <- nested_tbl %>%
  split_nested_timeseries(
    .length_test = 60)
```
]

---
### Step 3: Creating Tidymodels Workflows

In this step, we will first apply the tidymodels approach to create four forecasting models by using [`recipe()`](https://recipes.tidymodels.org/reference/recipe.html) of the [**recipes**](https://recipes.tidymodels.org/) package and [`workflow()`](https://workflows.tidymodels.org/reference/workflow.html) of the [**workflows**](https://workflows.tidymodels.org/) package. Both packages are members of [**tidymodels**](https://www.tidymodels.org/), a family of R packages specially designed for modeling and machine learning using [**tidyverse**](https://www.tidyverse.org/) principles.

.pull-left[
**Model 1: Exponential Smoothing (Modeltime)**

An Error-Trend-Season (ETS) model by using [`exp_smoothing()`](https://business-science.github.io/modeltime/reference/exp_smoothing.html).
```r
rec_autoETS <- recipe(
  value ~ date,
  extract_nested_train_split(
    nested_tbl))

wflw_autoETS <- workflow() %>%
  add_model(
    exp_smoothing() %>%
      set_engine("ets")) %>%
  add_recipe(rec_autoETS)
```
]

--
.pull-right[
**Model 2: Auto ARIMA (Modeltime)**

An auto ARIMA model by using [`arima_reg()`](https://business-science.github.io/modeltime/reference/arima_reg.html).

```r
rec_autoARIMA <- recipe(
  value ~ date,
  extract_nested_train_split(
    nested_tbl))

wflw_autoARIMA <- workflow() %>%
  add_model(
    arima_reg() %>%
      set_engine("auto_arima")) %>%
  add_recipe(rec_autoARIMA)
```
]

---
### Step 3: Creating Tidymodels Workflows (cont')

.pull-left[
**Model 3: XGBoost (Tidymodels)**

An XGBoost model by using [`boost_tree()`](https://parsnip.tidymodels.org/reference/boost_tree.html), with calendar predictors generated by `step_timeseries_signature()`.

```r
rec_xgb <- recipe(
  value ~ .,
  extract_nested_train_split(
    nested_tbl)) %>%
  step_timeseries_signature(date) %>%
  step_rm(date) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors(),
             one_hot = TRUE)

wflw_xgb <- workflow() %>%
  add_model(
    boost_tree(
      "regression") %>%
      set_engine("xgboost")) %>%
  add_recipe(rec_xgb)
```
]

--
.pull-right[
**Model 4: prophet (Modeltime)**

A prophet model using [`prophet_reg()`](https://business-science.github.io/modeltime/reference/prophet_reg.html).

```r
rec_prophet <- recipe(
  value ~ date,
  extract_nested_train_split(
    nested_tbl))

wflw_prophet <- workflow() %>%
  add_model(
    prophet_reg(
      "regression",
      seasonality_yearly = TRUE) %>%
      set_engine("prophet")
  ) %>%
  add_recipe(rec_prophet)
```
]

---
### Prophet

[**Prophet**](https://facebook.github.io/prophet/) is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. The general formula is defined as follows:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$

where $g(t)$ is the trend function, $s(t)$ models periodic (yearly, weekly and daily) seasonality, $h(t)$ captures holiday effects, and $\varepsilon_t$ is the error term.

It works best with time series that have strong seasonal effects and several seasons of historical data.
Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

.small[
Source: [Forecasting at Scale](https://peerj.com/preprints/3190/)
]

---
### XGBoost forecasting algorithm

**XGBoost (Extreme Gradient Boosting)** is an implementation of the gradient boosting ensemble algorithm for classification and regression. Time series datasets can be transformed into supervised learning problems by using a sliding-window representation.

.small[
Reference: [XGBOOST AS A TIME-SERIES FORECASTING TOOL](https://filip-wojcik.com/talks/xgboost_forecasting_eng.pdf)
]

---
### What is so special about XGBoost?

---
## Nested Forecasting with modeltime

---
### Step 4: Fitting Nested Forecasting Models

.pull-left[
In this step, [`modeltime_nested_fit()`](https://business-science.github.io/modeltime/reference/modeltime_nested_fit.html) is used to fit the four models we created in Step 3. Note that the input must be in the form of a nested modeltime table (i.e. *nested_tbl*).

```r
nested_tbl <- modeltime_nested_fit(
  nested_data = nested_tbl,
  wflw_autoETS,
  wflw_autoARIMA,
  wflw_prophet,
  wflw_xgb)
```
]

---
### The nested modeltime data frame

The output object *nested_tbl* is a nested modeltime data frame.

--
Click on the table icon in the first row of the *.modeltime_tables* field and its corresponding modelling data frame appears.

Notice that the four models were fitted by using the test data set.

---
### Step 5: Model Accuracy Assessment with Test Logged Attributes

.pull-left[
Before going ahead to select the best model, it is good practice to compare the performance of the models by using accuracy metrics.

```r
nested_tbl %>%
  extract_nested_test_accuracy() %>%
  table_modeltime_accuracy(
    .interactive = FALSE)
```

What can we learn from the code chunk above?

+ [`extract_nested_test_accuracy()`](https://business-science.github.io/modeltime/reference/log_extractors.html) is used to extract the accuracy metrics computed on the test data set.
+ [`table_modeltime_accuracy()`](https://business-science.github.io/modeltime/reference/table_modeltime_accuracy.html) is used to display the accuracy report in tabular form.
]

.pull-right[
Note that the `.interactive` argument controls whether an interactive or static table is returned. If TRUE, an interactive `reactable` table is returned. If FALSE, a static `gt` table is returned.
]

---
### Step 5: Model Accuracy Assessment with Test Logged Attributes
**Accuracy Table**

| id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---:|---:|---|---|---:|---:|---:|---:|---:|---:|
| 1 | 1 | ETSMNA | Test | 980.30 | 9.44 | 0.59 | 9.66 | 1275.14 | 0.52 |
| 1 | 2 | ARIMA | Test | 1297.01 | 13.00 | 0.78 | 12.87 | 1602.60 | 0.11 |
| 1 | 3 | PROPHET | Test | 869.20 | 8.43 | 0.52 | 8.53 | 1061.91 | 0.60 |
| 1 | 4 | XGBOOST | Test | 774.08 | 7.47 | 0.46 | 7.53 | 964.33 | 0.66 |
| 2 | 1 | ETSANA | Test | 749.22 | 11.58 | 0.82 | 10.48 | 1116.95 | 0.21 |
| 2 | 2 | ARIMA | Test | 1103.26 | 17.13 | 1.21 | 15.41 | 1342.42 | 0.00 |
| 2 | 3 | PROPHET | Test | 633.69 | 9.24 | 0.69 | 9.04 | 982.70 | 0.29 |
| 2 | 4 | XGBOOST | Test | 497.78 | 7.82 | 0.54 | 7.28 | 678.29 | 0.73 |
| 3 | 1 | ETSANA | Test | 2349.65 | 12.20 | 0.85 | 12.03 | 2929.23 | 0.33 |
| 3 | 2 | ARIMA | Test | 3056.48 | 14.87 | 1.11 | 15.60 | 3817.20 | 0.01 |
| 3 | 3 | PROPHET | Test | 2211.61 | 11.59 | 0.80 | 11.26 | 2593.66 | 0.49 |
| 3 | 4 | XGBOOST | Test | 2113.59 | 10.77 | 0.77 | 10.88 | 2622.32 | 0.52 |
| 4 | 1 | ETSANA | Test | 1022.66 | 9.59 | 0.50 | 9.30 | 1380.97 | 0.66 |
| 4 | 2 | ARIMA | Test | 1063.64 | 10.03 | 0.52 | 9.63 | 1419.50 | 0.66 |
| 4 | 3 | PROPHET | Test | 1039.09 | 10.00 | 0.51 | 9.59 | 1317.76 | 0.70 |
| 4 | 4 | XGBOOST | Test | 760.45 | 7.32 | 0.37 | 7.11 | 988.27 | 0.82 |
| 5 | 1 | ETSANA | Test | 1137.69 | 10.65 | 0.52 | 10.21 | 1591.22 | 0.71 |
| 5 | 2 | ARIMA | Test | 1182.35 | 11.11 | 0.54 | 10.50 | 1636.24 | 0.71 |
| 5 | 3 | PROPHET | Test | 970.66 | 9.09 | 0.44 | 8.95 | 1493.63 | 0.70 |
| 5 | 4 | XGBOOST | Test | 799.55 | 7.31 | 0.36 | 7.14 | 1146.20 | 0.83 |
| 6 | 1 | ETSANA | Test | 4544.43 | 13.59 | 0.92 | 13.19 | 5613.06 | 0.51 |
| 6 | 2 | ARIMA | Test | 5757.96 | 18.19 | 1.16 | 16.46 | 6698.34 | 0.52 |
| 6 | 3 | PROPHET | Test | 4352.88 | 13.29 | 0.88 | 12.90 | 5360.33 | 0.49 |
| 6 | 4 | XGBOOST | Test | 2867.44 | 8.46 | 0.58 | 8.29 | 3647.97 | 0.81 |
| 7 | 1 | ETSANA | Test | 625.28 | 10.94 | 0.82 | 11.39 | 800.48 | 0.27 |
| 7 | 2 | ARIMA | Test | 835.66 | 16.05 | 1.10 | 14.82 | 983.17 | 0.02 |
| 7 | 3 | PROPHET | Test | 679.32 | 11.79 | 0.89 | 12.51 | 885.92 | 0.20 |
| 7 | 4 | XGBOOST | Test | 539.32 | 9.72 | 0.71 | 9.82 | 646.86 | 0.52 |
| 8 | 1 | ETSANA | Test | 2200.69 | 13.18 | 0.97 | 12.07 | 2656.93 | 0.54 |
| 8 | 2 | ARIMA | Test | 2032.78 | 12.15 | 0.90 | 11.17 | 2526.41 | 0.54 |
| 8 | 3 | PROPHET | Test | 1268.20 | 7.29 | 0.56 | 7.12 | 1658.77 | 0.62 |
| 8 | 4 | XGBOOST | Test | 1241.69 | 7.32 | 0.55 | 6.93 | 1649.55 | 0.73 |
| 9 | 1 | ETSANA | Test | 602.63 | 11.06 | 0.69 | 10.31 | 765.24 | 0.58 |
| 9 | 2 | ARIMA | Test | 597.25 | 10.87 | 0.69 | 10.16 | 762.89 | 0.59 |
| 9 | 3 | PROPHET | Test | 458.22 | 8.11 | 0.53 | 8.05 | 605.74 | 0.65 |
| 9 | 4 | XGBOOST | Test | 412.86 | 7.35 | 0.48 | 7.26 | 507.81 | 0.75 |
| 10 | 1 | ETSANA | Test | 2660.18 | 10.10 | 0.51 | 10.76 | 3945.18 | 0.65 |
| 10 | 2 | ARIMA | Test | 5579.39 | 20.18 | 1.08 | 23.83 | 7545.19 | 0.26 |
| 10 | 3 | PROPHET | Test | 2804.60 | 10.45 | 0.54 | 11.09 | 4119.26 | 0.67 |
| 10 | 4 | XGBOOST | Test | 2126.00 | 8.47 | 0.41 | 8.68 | 2658.25 | 0.83 |
| 11 | 1 | ETSANA | Test | 2508.10 | 9.77 | 0.51 | 9.42 | 3147.47 | 0.71 |
| 11 | 2 | ARIMA | Test | 3469.80 | 13.60 | 0.71 | 13.38 | 4320.72 | 0.44 |
| 11 | 3 | PROPHET | Test | 2383.68 | 8.68 | 0.48 | 9.01 | 3308.05 | 0.68 |
| 11 | 4 | XGBOOST | Test | 2013.79 | 8.01 | 0.41 | 7.83 | 2535.41 | 0.78 |
| 12 | 1 | ETSANA | Test | 679.43 | 14.70 | 0.87 | 13.41 | 845.52 | 0.64 |
| 12 | 2 | ARIMA | Test | 759.62 | 16.50 | 0.97 | 14.74 | 958.62 | 0.62 |
| 12 | 3 | PROPHET | Test | 641.66 | 14.40 | 0.82 | 13.29 | 747.16 | 0.65 |
| 12 | 4 | XGBOOST | Test | 679.48 | 15.44 | 0.87 | 13.94 | 833.40 | 0.71 |
| 13 | 1 | ETSANA | Test | 1665.20 | 11.82 | 0.90 | 11.00 | 1997.53 | 0.54 |
| 13 | 2 | ARIMA | Test | 2040.20 | 14.46 | 1.11 | 13.49 | 2439.06 | 0.15 |
| 13 | 3 | PROPHET | Test | 1258.50 | 8.03 | 0.68 | 8.47 | 1733.33 | 0.59 |
| 13 | 4 | XGBOOST | Test | 850.80 | 5.79 | 0.46 | 5.59 | 1110.86 | 0.81 |
| 14 | 1 | ETSANA | Test | 4712.31 | 12.48 | 0.62 | 12.05 | 6418.03 | 0.69 |
| 14 | 2 | ARIMA | Test | 5471.91 | 14.87 | 0.72 | 13.95 | 6898.97 | 0.69 |
| 14 | 3 | PROPHET | Test | 4542.00 | 12.67 | 0.60 | 12.19 | 6028.11 | 0.69 |
| 14 | 4 | XGBOOST | Test | 3415.03 | 9.28 | 0.45 | 8.83 | 4809.18 | 0.82 |
| 15 | 1 | ETSANA | Test | 1433.52 | 9.97 | 0.72 | 9.69 | 1763.29 | 0.49 |
| 15 | 2 | ARIMA | Test | 3182.94 | 20.81 | 1.60 | 24.17 | 3864.81 | 0.08 |
| 15 | 3 | PROPHET | Test | 1426.38 | 9.79 | 0.72 | 9.70 | 1799.58 | 0.44 |
| 15 | 4 | XGBOOST | Test | 973.03 | 6.77 | 0.49 | 6.56 | 1241.86 | 0.78 |
| 16 | 1 | ETSANA | Test | 520.36 | 17.79 | 0.92 | 16.24 | 631.58 | 0.54 |
| 16 | 2 | ARIMA | Test | 517.15 | 19.38 | 0.91 | 17.47 | 618.96 | 0.38 |
| 16 | 3 | PROPHET | Test | 426.77 | 15.04 | 0.75 | 14.07 | 508.18 | 0.54 |
| 16 | 4 | XGBOOST | Test | 699.41 | 24.16 | 1.23 | 29.72 | 855.65 | 0.52 |
| 17 | 1 | ETSANA | Test | 507.83 | 9.60 | 0.39 | 9.31 | 680.97 | 0.63 |
| 17 | 2 | ARIMA | Test | 758.96 | 15.76 | 0.58 | 14.23 | 963.44 | 0.38 |
| 17 | 3 | PROPHET | Test | 552.44 | 9.82 | 0.42 | 10.05 | 737.03 | 0.63 |
| 17 | 4 | XGBOOST | Test | 491.53 | 8.91 | 0.37 | 9.06 | 661.54 | 0.68 |
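
The accuracy log can also be summarised programmatically rather than eyeballed. The code below is a supplementary sketch, not part of the original hands-on steps: it uses `slice_min()` and `count()` of **dplyr** to keep the lowest-RMSE row per series and tally how often each algorithm wins.

```r
# Supplementary sketch: summarise the accuracy log programmatically.
# Assumes nested_tbl has been fitted with modeltime_nested_fit() as in Step 4.
best_by_rmse <- nested_tbl %>%
  extract_nested_test_accuracy() %>%
  group_by(id) %>%
  slice_min(rmse, n = 1) %>%  # keep the lowest-RMSE model for each series
  ungroup()

best_by_rmse %>%
  count(.model_desc, sort = TRUE)  # how often each algorithm wins
```

This mirrors what `modeltime_nested_select_best()` will do in Step 8, but as a quick tabular check.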
---
### Step 6: Extracting and Visualising Nested Test Forecast

.pull-left[
In this step, [`extract_nested_test_forecast()`](https://business-science.github.io/modeltime/reference/log_extractors.html) is used to extract the forecasted values from the nested modeltime data frame and [`plot_modeltime_forecast()`](https://business-science.github.io/modeltime/reference/plot_modeltime_forecast.html) is used to plot the forecasted values graphically.

```r
nested_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol = 4,
    .interactive = FALSE)
```
]

---
### Static multiple small line graphs

---
### Step 7: Extracting nested error logs

.pull-left[
Before going ahead to choose the best model, it is always good practice to examine whether any errors occurred during model fitting. This task can be accomplished by using [`extract_nested_error_report()`](https://business-science.github.io/modeltime/reference/log_extractors.html).

```r
nested_tbl %>%
  extract_nested_error_report()
```

```
## # A tibble: 0 × 4
## # … with 4 variables: id <fct>, .model_id <int>, .model_desc <chr>,
## #   .error_desc <chr>
```
]

---
### Step 8: Selecting the Best Model

.pull-left[
Now we are ready to select the best model by using [`modeltime_nested_select_best()`](https://business-science.github.io/modeltime/reference/modeltime_nested_select_best.html).

```r
best_nested_tbl <- nested_tbl %>%
  modeltime_nested_select_best(
    metric = "rmse",
    minimize = TRUE,
    filter_test_forecasts = TRUE)
```

Note that because lower RMSE values indicate better forecasts, the `minimize` argument must be set to *TRUE*.
]

---
### Extracting and displaying nested best model report

.pull-left[
After selecting the best model for each time series, we can display the best model report by using [`extract_nested_best_model_report()`](https://business-science.github.io/modeltime/reference/log_extractors.html).
```r
best_nested_tbl %>%
  extract_nested_best_model_report() %>%
  table_modeltime_accuracy(
    .interactive = FALSE)
```
]

---
### Extracting and displaying nested best model report
**Accuracy Table**

| id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---:|---:|---|---|---:|---:|---:|---:|---:|---:|
| 1 | 4 | XGBOOST | Test | 774.08 | 7.47 | 0.46 | 7.53 | 964.33 | 0.66 |
| 2 | 4 | XGBOOST | Test | 497.78 | 7.82 | 0.54 | 7.28 | 678.29 | 0.73 |
| 3 | 3 | PROPHET | Test | 2211.61 | 11.59 | 0.80 | 11.26 | 2593.66 | 0.49 |
| 4 | 4 | XGBOOST | Test | 760.45 | 7.32 | 0.37 | 7.11 | 988.27 | 0.82 |
| 5 | 4 | XGBOOST | Test | 799.55 | 7.31 | 0.36 | 7.14 | 1146.20 | 0.83 |
| 6 | 4 | XGBOOST | Test | 2867.44 | 8.46 | 0.58 | 8.29 | 3647.97 | 0.81 |
| 7 | 4 | XGBOOST | Test | 539.32 | 9.72 | 0.71 | 9.82 | 646.86 | 0.52 |
| 8 | 4 | XGBOOST | Test | 1241.69 | 7.32 | 0.55 | 6.93 | 1649.55 | 0.73 |
| 9 | 4 | XGBOOST | Test | 412.86 | 7.35 | 0.48 | 7.26 | 507.81 | 0.75 |
| 10 | 4 | XGBOOST | Test | 2126.00 | 8.47 | 0.41 | 8.68 | 2658.25 | 0.83 |
| 11 | 4 | XGBOOST | Test | 2013.79 | 8.01 | 0.41 | 7.83 | 2535.41 | 0.78 |
| 12 | 3 | PROPHET | Test | 641.66 | 14.40 | 0.82 | 13.29 | 747.16 | 0.65 |
| 13 | 4 | XGBOOST | Test | 850.80 | 5.79 | 0.46 | 5.59 | 1110.86 | 0.81 |
| 14 | 4 | XGBOOST | Test | 3415.03 | 9.28 | 0.45 | 8.83 | 4809.18 | 0.82 |
| 15 | 4 | XGBOOST | Test | 973.03 | 6.77 | 0.49 | 6.56 | 1241.86 | 0.78 |
| 16 | 3 | PROPHET | Test | 426.77 | 15.04 | 0.75 | 14.07 | 508.18 | 0.54 |
| 17 | 4 | XGBOOST | Test | 491.53 | 8.91 | 0.37 | 9.06 | 661.54 | 0.68 |
---
### Extracting and Visualising Nested Best Test Forecasts

.pull-left[
We can also plot multiple small line graphs by using `plot_modeltime_forecast()`.

```r
best_nested_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol = 4,
    .interactive = FALSE)
```
]

---
### Extracting and Visualising Nested Best Test Forecasts

---
### Step 9: Refitting and forecasting forward

.pull-left[
The last step of the forecasting process is to refit the best models with the full data set and forecast into the future by using [`modeltime_nested_refit()`](https://business-science.github.io/modeltime/reference/modeltime_nested_refit.html).

```r
nested_refit_tbl <- best_nested_tbl %>%
  modeltime_nested_refit(
    control = control_nested_refit(
      verbose = TRUE))
```

Note that `control_nested_refit(verbose = TRUE)` is used to display the modelling results as each model is refit. This is a useful way to follow the nested model refitting process.
]

---
### Extracting and Visualising Nested Future Forecast

.pull-left[
Similarly, `plot_modeltime_forecast()` can be used to visualise the forecasts. However, instead of `extract_nested_test_forecast()`, `extract_nested_future_forecast()` is used.

```r
nested_refit_tbl %>%
  extract_nested_future_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .interactive = FALSE,
    .facet_ncol = 4)
```
]

---
### Extracting and Visualising Nested Future Forecast

---
### Interactive Line Graph of Future Forecast

For effective data discovery, interactive data visualisation can be used as shown in the figure below.
---
### Interactive Line Graph of Future Forecast

.pull-left[
The code chunk below is used to create the interactive line graph shown in the previous slide.

```r
nested_refit_tbl %>%
  extract_nested_future_forecast() %>%
  filter(id == 1) %>%
  plot_modeltime_forecast(
    .interactive = TRUE,
    .facet_ncol = 4,
    .plotly_slider = TRUE)
```
]
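
---
### Saving the refitted models

As a supplementary housekeeping step (not part of the original exercise), the refitted nested modeltime table can be written to rds, mirroring how the grocery data was saved earlier, so that forecasts can be reloaded without refitting. The file path used here is purely illustrative.

```r
# Supplementary sketch: persist the refitted nested table for later reuse.
# The path "data/store_sales/nested_refit.rds" is a hypothetical example.
write_rds(nested_refit_tbl,
          "data/store_sales/nested_refit.rds")

# Later, reload and extract the future forecast without refitting:
nested_refit_tbl <- read_rds(
  "data/store_sales/nested_refit.rds")
```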