Forecast modeling

To explore forecasting methods, we will again use the ghost package, which provides an R interface to the GHO OData API, to access data on blood pressure. We start by loading data for the USA and Great Britain, both of which have full time series from 1975 to 2015. The join at the end of the pipeline extends each series to 2017, so that there are empty rows to forecast into.

library(augury)

df <- ghost::gho_data("BP_04", query = "$filter=SpatialDim in ('USA', 'GBR') and Dim1 eq 'MLE' and Dim2 eq 'YEARS18-PLUS'") %>%
  billionaiRe::wrangle_gho_data() %>%
  dplyr::right_join(tidyr::expand_grid(iso3 = c("USA", "GBR"),
                                       year = 1975:2017))
#> Warning: Some of the rows are missing a source value.
#> Joining, by = c("iso3", "year")

head(df)
#> # A tibble: 6 × 13
#>   iso3   year ind   value lower upper use_dash use_calc source type  type_detail
#>   <chr> <int> <chr> <dbl> <dbl> <dbl> <lgl>    <lgl>    <lgl>  <chr> <chr>      
#> 1 GBR    1975 bp     37.8  26.7  49.1 TRUE     TRUE     NA     NA    NA         
#> 2 GBR    1976 bp     37.6  27.4  48   TRUE     TRUE     NA     NA    NA         
#> 3 GBR    1977 bp     37.3  27.9  46.8 TRUE     TRUE     NA     NA    NA         
#> 4 GBR    1978 bp     37.1  28.4  45.9 TRUE     TRUE     NA     NA    NA         
#> 5 GBR    1979 bp     36.9  28.8  45.2 TRUE     TRUE     NA     NA    NA         
#> 6 GBR    1980 bp     36.7  29.2  44.4 TRUE     TRUE     NA     NA    NA         
#> # … with 2 more variables: other_detail <chr>, upload_detail <chr>

With this data, we can now use the predict_forecast() function, like any of the other predict_... functions from augury, to forecast out to 2017. First, we will do this on the USA data only, using forecast::holt, which forecasts with exponential smoothing (Holt's linear trend method).

usa_df <- dplyr::filter(df, iso3 == "USA")

predict_forecast(usa_df,
                 forecast::holt,
                 "value",
                 sort_col = "year") %>%
  dplyr::filter(year >= 2012)
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo
#> # A tibble: 6 × 16
#>   iso3   year ind   value lower upper use_dash use_calc source type  type_detail
#>   <chr> <int> <chr> <dbl> <dbl> <dbl> <lgl>    <lgl>    <lgl>  <chr> <chr>      
#> 1 USA    2012 bp     15.7  11.7  20.3 TRUE     TRUE     NA     NA    NA         
#> 2 USA    2013 bp     15.5  11.2  20.8 TRUE     TRUE     NA     NA    NA         
#> 3 USA    2014 bp     15.4  10.8  21.3 TRUE     TRUE     NA     NA    NA         
#> 4 USA    2015 bp     15.3  10.4  21.8 TRUE     TRUE     NA     NA    NA         
#> 5 USA    2016 NA     15.2  NA    NA   NA       NA       NA     NA    NA         
#> 6 USA    2017 NA     15.1  NA    NA   NA       NA       NA     NA    NA         
#> # … with 5 more variables: other_detail <chr>, upload_detail <chr>, pred <dbl>,
#> #   pred_upper <dbl>, pred_lower <dbl>
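
To make the exponential smoothing step less of a black box, here is a minimal base R sketch of Holt's linear trend recurrences. It is not the code that forecast::holt runs: the function name holt_sketch, the fixed alpha and beta values, and the simple initial states are all assumptions made for illustration, whereas forecast::holt estimates its parameters from the data.

```r
# Minimal sketch of Holt's linear trend method with hand-picked parameters.
# Assumption: alpha, beta and the initial states are fixed for illustration;
# forecast::holt estimates them from the data instead.
holt_sketch <- function(y, h, alpha = 0.8, beta = 0.2) {
  stopifnot(length(y) >= 2, !anyNA(y))
  level <- y[1]          # initial level: first observation
  trend <- y[2] - y[1]   # initial trend: first difference
  for (t in 2:length(y)) {
    prev_level <- level
    # Level: weighted average of the new observation and the one-step forecast.
    level <- alpha * y[t] + (1 - alpha) * (prev_level + trend)
    # Trend: weighted average of the latest level change and the previous trend.
    trend <- beta * (level - prev_level) + (1 - beta) * trend
  }
  # The h-step-ahead forecast extends the final level along the final trend.
  level + seq_len(h) * trend
}

# A perfectly linear series continues along its line (6 and 7, up to floating point).
holt_sketch(c(1, 2, 3, 4, 5), h = 2)
```

The point forecasts are a straight line from the last smoothed level, which is why the predicted values above change by a roughly constant amount each year.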

Of course, we might want to run these models for each country individually in a single call. In this case, we can set the group_models = TRUE argument to fit a separate forecast for each country. To save some typing, let’s use the wrapper predict_holt(), which automatically supplies forecast::holt as the forecasting function.

predict_holt(df,
             response = "value",
             group_col = "iso3",
             group_models = TRUE,
             sort_col = "year") %>%
  dplyr::filter(year >= 2014, year <= 2017)
#> # A tibble: 8 × 16
#>   iso3   year ind   value lower upper use_dash use_calc source type  type_detail
#>   <chr> <int> <chr> <dbl> <dbl> <dbl> <lgl>    <lgl>    <lgl>  <chr> <chr>      
#> 1 GBR    2014 bp     18.5  14    23.3 TRUE     TRUE     NA     NA    NA         
#> 2 GBR    2015 bp     17.9  13    23.2 TRUE     TRUE     NA     NA    NA         
#> 3 GBR    2016 NA     17.3  NA    NA   NA       NA       NA     NA    NA         
#> 4 GBR    2017 NA     16.7  NA    NA   NA       NA       NA     NA    NA         
#> 5 USA    2014 bp     15.4  10.8  21.3 TRUE     TRUE     NA     NA    NA         
#> 6 USA    2015 bp     15.3  10.4  21.8 TRUE     TRUE     NA     NA    NA         
#> 7 USA    2016 NA     15.2  NA    NA   NA       NA       NA     NA    NA         
#> 8 USA    2017 NA     15.1  NA    NA   NA       NA       NA     NA    NA         
#> # … with 5 more variables: other_detail <chr>, upload_detail <chr>, pred <dbl>,
#> #   pred_upper <dbl>, pred_lower <dbl>
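
Conceptually, fitting grouped models amounts to splitting the data by the grouping column and forecasting each piece from its own history. The base R sketch below illustrates that idea only; it makes no claim about how augury implements group_models internally. The toy data are made up, and drift_forecast is a hypothetical stand-in forecaster (last value plus average change), not a function from augury or forecast.

```r
# Toy data: two groups with different trends (values are made up).
toy <- data.frame(
  iso3  = rep(c("AAA", "BBB"), each = 4),
  year  = rep(2001:2004, times = 2),
  value = c(10, 9, 8, 7, 20, 22, 24, 26)
)

# Hypothetical stand-in forecaster: the drift method
# (last value plus the average change per step).
drift_forecast <- function(y, h) {
  slope <- (y[length(y)] - y[1]) / (length(y) - 1)
  y[length(y)] + slope * seq_len(h)
}

# Fit one model per group, so each forecast uses only that group's history.
by_country <- split(toy, toy$iso3)
forecasts <- lapply(by_country, function(g) {
  g <- g[order(g$year), ]   # sort within the group before forecasting
  drift_forecast(g$value, h = 2)
})

forecasts
```

Each element of forecasts follows its own group's trend: downward for AAA and upward for BBB.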

Et voilà, we get the same results for the USA and have also run the forecast for Great Britain. However, you should be careful with the data that is supplied for forecasting. The forecast package functions default to using the longest contiguous stretch of non-missing data. augury instead automatically uses the latest contiguous observed data, to ensure that older data is not prioritized over newer data. However, this means that any break in a time series prevents the data before the break from being used.

bad_df <- dplyr::tibble(x = c(1:4, NA, 3:2, rep(NA, 4)))

predict_holt(bad_df, "x", group_col = NULL, sort_col = NULL, group_models = FALSE)
#> # A tibble: 11 × 6
#>         x   pred pred_upper pred_lower upper lower
#>     <dbl>  <dbl>      <dbl>      <dbl> <dbl> <dbl>
#>  1  1     NA          NA        NA        NA    NA
#>  2  2     NA          NA        NA        NA    NA
#>  3  3     NA          NA        NA        NA    NA
#>  4  4     NA          NA        NA        NA    NA
#>  5 NA     NA          NA        NA        NA    NA
#>  6  3     NA          NA        NA        NA    NA
#>  7  2     NA          NA        NA        NA    NA
#>  8  1.17   1.17        2.55     -0.217    NA    NA
#>  9  0.338  0.338       2.33     -1.66     NA    NA
#> 10 -0.494 -0.494       2.14     -3.12     NA    NA
#> 11 -1.32  -1.32        1.98     -4.63     NA    NA
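
The selection rule described above can be sketched in base R as follows. This is an illustration of "latest contiguous observed data", not augury's actual implementation, and latest_contiguous is a hypothetical helper name.

```r
# Return the last unbroken run of observed (non-NA) values in x.
# Hypothetical helper for illustration only.
latest_contiguous <- function(x) {
  observed <- !is.na(x)
  if (!any(observed)) return(x[0])
  last_obs <- max(which(observed))
  # Find the most recent gap before the last observed value, if any.
  gaps_before <- which(!observed[seq_len(last_obs)])
  start <- if (length(gaps_before) == 0) 1 else max(gaps_before) + 1
  x[start:last_obs]
}

x <- c(1:4, NA, 3:2, rep(NA, 4))
latest_contiguous(x)   # only the values 3 and 2 remain
```

Under this rule the first four observations are dropped, so a forecast would rest on just the two points after the gap.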

It’s advisable to consider whether other infilling or imputation methods should be used to generate a full time series before forecasting, to prevent issues like the one above from affecting predictive accuracy.
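
As one hedged example of such infilling, the sketch below uses linear interpolation from base R (stats::approx) to fill the interior gap in the series from the example above, leaving the trailing missing values for the forecast. Linear interpolation is only one option and is chosen here as an assumption; whether it is appropriate depends on the data.

```r
x <- c(1:4, NA, 3:2, rep(NA, 4))

idx      <- seq_along(x)
obs      <- !is.na(x)
last_obs <- max(which(obs))

# Interpolate only up to the last observed value; later NAs are left
# in place so that they can be forecast rather than interpolated.
filled   <- x
interior <- idx <= last_obs
filled[interior] <- approx(idx[obs], x[obs], xout = idx[interior])$y

filled
```

After infilling, the observed part of the series has no break, so all seven leading values are available to the forecasting step rather than only the last two.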