While we most often want to directly build models on our original dataset to generate predicted values, we might instead want to generate average trends across larger groups instead, and then apply this to our original data. For instance, generating trends by region, and then applying those regional trends back to the country level. The predict_...avg_trend()
functions in augury allow us to do just that, applying any of the models we are used to a grouped set of columns.
These work across specific groups, specified by average_cols
, and averaging numeric values specified as the response variable or variables extracted from a formula
. The specified model is then fit to this averaged data, and the predicted values are joined back up to the original data frame. Let’s look at an example using blood pressure data, which has a comprehensive time series.
library(augury)
df <- ghost::gho_data("BP_04", query = "$filter=Dim1 eq 'MLE' and Dim2 eq 'YEARS18-PLUS'") %>%
billionaiRe::wrangle_gho_data() %>%
dplyr::right_join(covariates_df) %>%
dplyr::select(iso3, year, year_n, value) %>%
dplyr::filter(whoville::is_who_member(iso3), # keep WHO member states
year >= 2000, year <= 2023) %>% # get relevant years
dplyr::mutate(who_region = whoville::iso3_to_regions(iso3)) # get WHO regions
#> Warning: Some of the rows are missing a source value.
#> Joining, by = c("iso3", "year")
ur <- unique(df$who_region)
ur
#> [1] "EMR" "AFR" "EUR" "AMR" "WPR" "SEAR"
Alright, so, here we have 6 WHO regions. We will use these regions to fit a model to and use INLA to predict out to 2023, then apply these trends to input countries.
pred_df <- df %>%
predict_inla_avg_trend(formula = value ~ f(year_n, model = "rw2"),
average_cols = c("who_region", "year_n"),
group_models = TRUE,
group_col = "iso3",
sort_col = "year_n")
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
#> Warning in is.null(x) || is.na(x): 'length(x) = 24 > 1' in coercion to
#> 'logical(1)'
pred_df %>%
dplyr::filter(iso3 == "AFG", year >= 2013)
#> # A tibble: 11 × 10
#> iso3 year year_n value who_region pred pred_upper pred_lower upper lower
#> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AFG 2013 14 30.4 EMR 29.4 29.4 29.4 NA NA
#> 2 AFG 2014 15 30.4 EMR 29.4 29.4 29.4 NA NA
#> 3 AFG 2015 16 30.4 EMR 29.4 29.4 29.4 NA NA
#> 4 AFG 2016 17 29.4 EMR 29.4 29.4 29.4 NA NA
#> 5 AFG 2017 18 29.4 EMR 29.4 29.4 29.4 NA NA
#> 6 AFG 2018 19 29.4 EMR 29.4 29.4 29.4 NA NA
#> 7 AFG 2019 20 29.4 EMR 29.4 29.4 29.4 NA NA
#> 8 AFG 2020 21 29.4 EMR 29.4 29.4 29.4 NA NA
#> 9 AFG 2021 22 29.4 EMR 29.4 29.4 29.4 NA NA
#> 10 AFG 2022 23 29.4 EMR 29.4 29.4 29.4 NA NA
#> 11 AFG 2023 24 29.4 EMR 29.4 29.4 29.4 NA NA
Above, we can see we have a generated a model using 2nd order random walk with INLA, however, the model was generated by averaging data to WHO regions first, fitting the random walk to each reach (since group_models = TRUE
) and then fitting those trends to the original data. Note some specifics of what had to be set, as the predict_..._avg_trend()
functions are slightly more complex than others:
-
average_cols
must contain thesort_col
. So, since we useyear_n
in the time series rather thanyear
, we will sort by that this time. -
average_cols
refers to the groupings used for averaging (we take the average for each WHO region and year in this case). Then, the model is fit toaverage_cols
that are NOT thesort_col
. -
group_col
is the groupings used on the original data frame, which is still necessary here when applying the trend back to the original data. - If a variable is in
formula
, it must either be inaverage_cols
or it must be a numeric column that can be averaged. This is because the formula is applied to the data frame afterdplyr::group_by()
anddplyr::summarize()
have reduced it.
To highlight this point, in the above example, what’s actually happening is we are actually fitting the model on the summarized data:
df %>%
dplyr::group_by(who_region, year_n) %>%
dplyr::summarize(value = mean(value, na.rm = T)) %>%
head()
#> `summarise()` has grouped output by 'who_region'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: who_region [1]
#> who_region year_n value
#> <chr> <dbl> <dbl>
#> 1 AFR 1 29.7
#> 2 AFR 2 29.6
#> 3 AFR 3 29.5
#> 4 AFR 4 29.4
#> 5 AFR 5 29.3
#> 6 AFR 6 29.2
Since average_cols = c("who_region", "year_n")
, we took the mean of all values in formula
not in average_cols
, in this case just value
. If for instance, we tried to specify a model using iso3
in the formula
:
predict_inla_avg_trend(df,
formula = value ~ iso3 + f(year_n, model = "rw2"),
average_cols = c("who_region", "year_n"),
group_models = TRUE,
group_col = "iso3",
sort_col = "year_n")
#> Error: iso3 must be numeric columns for use in averaging, or included in `average_cols` for grouping.
We get an error message indicating that iso3 must be numeric or included in average_cols
for grouping. This is because without it being numeric or in the average_cols
, there’s no way dplyr::group_by() %>% dplyr::summarize()
a non-numeric column automatically (how would we reduce country-level ISO3 codes to the regional level?).
While slightly complex, ensuring you follow the above means you should easily and successfully get out meaningful trend predictions for your data frames using trends generated on grouped data.