Use predict_inla on groups to generate average trend and apply to original data

predict_inla_avg_trend() uses an integrated nested Laplace approximation (INLA) to fit a model to groups within the data, and then brings that fitted prediction back to the original data. The function uses INLA::inla() to perform the model fitting and prediction, and full details and explanations of the arguments it accepts are available on that page. The function also allows data type and source information to be written directly into the data frame if type_col and source_col are specified, respectively.

Usage

predict_inla_avg_trend(
  df,
  formula,
  average_cols = NULL,
  weight_col = NULL,
  group_models = FALSE,
  control.predictor = list(compute = TRUE),
  ...,
  ret = c("df", "all", "error", "model"),
  scale = NULL,
  probit = FALSE,
  test_col = NULL,
  test_period = NULL,
  test_period_flex = NULL,
  group_col = "iso3",
  obs_filter = NULL,
  sort_col = "year",
  sort_descending = FALSE,
  pred_col = "pred",
  pred_upper_col = "pred_upper",
  pred_lower_col = "pred_lower",
  upper_col = "upper",
  lower_col = "lower",
  filter_na = c("predictors", "response", "all", "none"),
  type_col = NULL,
  types = c("imputed", "imputed", "projected"),
  source_col = NULL,
  source = NULL,
  scenario_detail_col = NULL,
  scenario_detail = NULL,
  replace_obs = c("missing", "all", "none"),
  error_correct = FALSE,
  error_correct_cols = NULL,
  shift_trend = FALSE
)

Arguments

df

Data frame of model data.

formula

A formula that will be supplied to the model, such as y~x.

average_cols

Column name(s) of column(s) for use in grouping data for averaging, such as regions. If missing, uses global average of the data for infilling.

weight_col

Column name of column of weights to be used in averaging, such as country population.
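As a rough illustration of what averaging over average_cols with a weight_col means, the base R sketch below computes a population-weighted mean per region and year. The data frame, its column names (region, iso3, year, value, pop) and the use of weighted.mean() are assumptions made for illustration; this is not the function's internal code.

```r
# Hypothetical example data: two regions observed over two years.
df <- data.frame(
  region = c("A", "A", "A", "A", "B", "B"),
  iso3   = c("AAA", "AAB", "AAA", "AAB", "BBA", "BBA"),
  year   = c(2000, 2000, 2001, 2001, 2000, 2001),
  value  = c(10, 20, 12, 24, 5, 7),
  pop    = c(1, 3, 1, 3, 2, 2)
)

# Split the data by the averaging columns (region and year here) and take a
# population-weighted mean of the response within each cell.
cells <- split(df, list(df$region, df$year), drop = TRUE)
avg <- do.call(rbind, lapply(cells, function(d) {
  data.frame(
    region = d$region[1],
    year   = d$year[1],
    value  = weighted.mean(d$value, w = d$pop, na.rm = TRUE)
  )
}))
rownames(avg) <- NULL
avg
```

With average_cols = NULL, the analogous computation would collapse to a single global average rather than one per region.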

group_models

Logical, if TRUE, fits and predicts models individually onto each group_col. If FALSE, a general model is fit across the entire data frame.

control.predictor

Used to set compute = TRUE to ensure that the posterior marginals of the fitted values are obtained and the mean and standard deviation of the fitted values returned for use in the infilling and predictions. Additional arguments can be passed in the control.predictor list, but must always include compute = TRUE. See INLA::control.predictor() for details.

...

Additional arguments passed to INLA::inla().

ret

Character vector specifying what values the function returns. Defaults to returning a data frame, but can return a vector of model error, the model itself or a list with all 3 as components.

scale

Either NULL or a numeric value. If a numeric value is provided, the response variable is scaled by the value passed to scale prior to model fitting and prior to any probit transformation, so it can be used to put the response onto a 0 to 1 scale. Scaling is done by dividing the response by the scale and using the scale_transform() function. The response, as well as the fitted values and confidence bounds, are unscaled prior to error calculation and returning to the user.

probit

Logical value on whether or not to probit transform the response prior to model fitting. Probit transformation is performed after any scaling determined by scale but prior to model fitting. The response, as well as the fitted values and confidence bounds, are untransformed prior to error calculation and returning to the user.
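The order of operations described for scale and probit can be sketched in base R as follows. Plain division stands in for scale_transform(), and qnorm()/pnorm() stand in for the probit transform and its inverse; the response values are hypothetical.

```r
# Hypothetical response measured in percent.
response <- c(12, 45, 80)
scale <- 100

# 1. Scale first, so the response sits strictly between 0 and 1.
scaled <- response / scale

# 2. Probit transform after scaling (qnorm is the standard normal quantile).
transformed <- qnorm(scaled)

# ... the model would be fitted to `transformed` here ...

# 3. Undo the steps in reverse order: inverse probit, then unscale.
back <- pnorm(transformed) * scale
back
```

The same back-transformation would apply to the fitted values and their bounds, so that everything returned is on the original scale.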

test_col

Name of logical column specifying which response values to remove for testing the model's predictive accuracy. If NULL, ignored. See model_error() for details on the methods and metrics returned.

test_period

Length of period to test for RMChE. If NULL, beginning and end points of each group in group_col are compared. Otherwise, test_period must be set to an integer n and for each group, comparisons are made between the end point and n periods prior.

test_period_flex

Logical value indicating whether change error should still be calculated for a point when test_period is less than the full length of the series. Defaults to FALSE.

group_col

Column name(s) of group(s) to use in dplyr::group_by() when supplying type, calculating mean absolute scaled error on data involving time series, and, if group_models is TRUE, fitting and predicting models. If NULL, not used. Defaults to "iso3".

obs_filter

String value of the form "logical operator integer" that specifies the number of observations required to fit the model and replace observations with predicted values. This is done in conjunction with group_col. So, if group_col = "iso3" and obs_filter = ">= 5", then for this model, predictions will only be used for iso3 values that have 5 or more observations. Possible logical operators to use are >, >=, <, <=, ==, and !=.

If group_models = FALSE, then obs_filter is only used to determine when predicted values replace observed values; it is not used to restrict which values are used in model fitting. If group_models = TRUE, then a model is only fit for a group if that group meets the obs_filter requirements. This provides speed benefits, particularly when running INLA time series using predict_inla().
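To make the "logical operator integer" form concrete, the sketch below parses such a string and applies it to per-group observation counts in base R. It assumes the operator and integer are separated by whitespace, uses hypothetical data, and is not the parser used by the function itself.

```r
# Hypothetical data: three groups with differing numbers of observations.
df <- data.frame(
  iso3  = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"),
  value = c(1, 2, 3, NA, 5, 6, 7, NA)
)
obs_filter <- ">= 2"

# Split the string into its operator and integer parts and validate them.
parts <- strsplit(trimws(obs_filter), "\\s+")[[1]]
stopifnot(
  length(parts) == 2,
  parts[1] %in% c(">", ">=", "<", "<=", "==", "!=")
)
compare   <- match.fun(parts[1])
threshold <- as.integer(parts[2])

# Count non-missing observations per group and apply the comparison.
n_obs <- vapply(split(!is.na(df$value), df$iso3), sum, integer(1))
groups_kept <- names(n_obs)[compare(n_obs, threshold)]
groups_kept
```

Here BBB has a single observation and so fails the filter, while AAA and CCC pass.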

sort_col

Column name(s) to use to dplyr::arrange() the data prior to supplying type and calculating mean absolute scaled error on data involving time series. If NULL, not used. Defaults to "year".

sort_descending

Logical value on whether the sorted values from sort_col should be sorted in descending order. Defaults to FALSE.

pred_col

Column name to store predicted value.

pred_upper_col

Column name to store upper bound of confidence interval generated by the predict_... function. This stores the full set of generated values for the upper bound.

pred_lower_col

Column name to store lower bound of confidence interval generated by the predict_... function. This stores the full set of generated values for the lower bound.

upper_col

Column name that contains upper bound information, including upper bound of the input data to the model. Values from pred_upper_col are put into this column in the exact same way the response is filled by pred based on replace_obs (by default, only when there is a missing value in the response).

lower_col

Column name that contains lower bound information, including lower bound of the input data to the model. Values from pred_lower_col are put into this column in the exact same way the response is filled by pred based on replace_obs (by default, only when there is a missing value in the response).

filter_na

Character value specifying how, if at all, to filter NA values from the dataset prior to applying the model. By default, only observations with missing predictors are removed. It can instead remove rows with a missing response, remove rows with any missing dependent or independent variable, or apply no filtering at all. Model prediction and fitting are done in one pass with INLA::inla(), so there will be no predictions if observations with missing dependent variables are removed.

type_col

Column name specifying data type.

types

Vector of length 3 that provides the types to assign to data produced by the model. These values are only used to fill in type values where the dependent variable is missing. The first value is given to missing observations that precede the first observation, the second to those between the first and last observations, and the third to those following the final observation.
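The positional meaning of the three types values can be sketched for a single sorted series as below. The sketch assumes the second value applies to gaps between the first and last observations, and it uses three distinct hypothetical labels (rather than the defaults) so that each position is visible in the result.

```r
# Hypothetical sorted series for one group, with gaps before, between and
# after the observed values.
value <- c(NA, NA, 3, NA, 5, NA, NA)

# Hypothetical labels, chosen to be distinct for illustration only.
types <- c("back_extrapolated", "interpolated", "projected")

obs_idx   <- which(!is.na(value))
first_obs <- min(obs_idx)
last_obs  <- max(obs_idx)
idx       <- seq_along(value)
missing   <- is.na(value)

# Only rows with a missing response receive a type from `types`.
type <- rep(NA_character_, length(value))
type[missing & idx < first_obs]                  <- types[1]
type[missing & idx > first_obs & idx < last_obs] <- types[2]
type[missing & idx > last_obs]                   <- types[3]
type
```

Rows with an observed response keep NA here; in practice they would retain whatever type they already had in type_col.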

source_col

Column name containing source information for the data frame. If provided, the argument in source is used to fill in where predictions have filled in missing data.

source

Source to add to missing values.

scenario_detail_col

Column name containing scenario_detail information for the data frame. If provided, the argument in scenario_detail is used to fill in where predictions have filled in missing data.

scenario_detail

Scenario details to add to missing values (usually the name of the model being used to generate the projection, optionally with relevant parameters).

replace_obs

Character value specifying how, if at all, observations should be replaced by fitted values. Defaults to replacing only missing values, but can be used to replace all values or none.

error_correct

Logical value indicating whether or not mean error should be used to adjust predicted values. If TRUE, the mean error between observed and predicted data points will be used to adjust predictions. If error_correct_cols is not NULL, mean error will be used within those groups instead of overall mean error.

error_correct_cols

Column names of data frame to group by when applying error correction to the predicted values.
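A minimal sketch of mean-error correction, with and without grouping, is shown below in base R. It assumes the error is defined as observed minus predicted and that the correction is additive; the data and the output column names (pred_overall, pred_grouped) are hypothetical.

```r
# Hypothetical observed values and model predictions for two groups.
df <- data.frame(
  iso3  = rep(c("AAA", "BBB"), each = 3),
  value = c(10, 12, NA, 20, NA, 26),
  pred  = c(9, 11, 13, 22, 24, 27)
)

# Error is only defined where an observation exists (NA elsewhere).
err <- df$value - df$pred

# Overall correction: one mean error applied to every prediction.
overall_me <- mean(err, na.rm = TRUE)
df$pred_overall <- df$pred + overall_me

# Grouped correction: a separate mean error per group, as would be the case
# when grouping columns are supplied.
group_me <- ave(err, df$iso3, FUN = function(e) mean(e, na.rm = TRUE))
df$pred_grouped <- df$pred + group_me
df
```

In this example the overall mean error is small because the two groups err in opposite directions, which is the situation where grouped correction matters most.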

shift_trend

Logical value specifying whether or not to shift predictions so that the trend matches up to the last observation. If error_correct and shift_trend are both TRUE, shift_trend takes precedence.
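One way to picture trend shifting is as a constant offset that makes the prediction pass through the last observed value. The base R sketch below assumes an additive shift on a single sorted series with hypothetical values; the function's own implementation may differ.

```r
# Hypothetical sorted series: observed up to 2002, predictions for all years.
year  <- 2000:2005
value <- c(10, 11, 13, NA, NA, NA)
pred  <- c(9, 10.5, 12, 13.5, 15, 16.5)

# Offset between the last observation and the prediction at that point.
last_obs <- max(which(!is.na(value)))
offset   <- value[last_obs] - pred[last_obs]

# Shift the whole predicted trend, then fill only the missing values.
shifted <- pred + offset
filled  <- ifelse(is.na(value), shifted, value)
data.frame(year, value, pred, shifted, filled)
```

The filled series continues from the last observation without a jump, while the slope of the predicted trend is unchanged.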

Value

Depending on the value passed to ret, either a data frame with predicted data, a vector of errors from model_error(), a fitted model, or a list with all 3.

Details

predict_..._avg_trend() functions need to be used carefully. Ensure that average_cols and the variables in the formula match, and that any formula variables not in average_cols are numeric and can be averaged. Even though the modeling won't use the group_col, it should be provided where needed for error metric calculations and for supplying types to type_col. Similarly, the sort_col is necessary for types, but also needs to be in average_cols if error_correct, group_models, or shift_trend is going to be used.
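The overall average-then-apply workflow can be sketched in base R as below. Note the loud assumptions: lm() is used as a simple stand-in for INLA::inla(), the averaged levels (rather than a shifted trend) are joined back, and the data and column names are hypothetical. The point is only to show why the formula variables (year, region) must also be the averaging columns.

```r
# Hypothetical country-level data with some missing values.
df <- data.frame(
  region = rep(c("A", "B"), each = 6),
  iso3   = rep(c("AAA", "AAB", "BBA", "BBB"), each = 3),
  year   = rep(2000:2002, times = 4),
  value  = c(1, 2, NA, 3, 4, 5, 10, NA, 14, 12, 14, 16)
)

# 1. Average the response over the averaging columns (region and year).
#    The formula interface drops rows with a missing response.
avg <- aggregate(value ~ region + year, data = df, FUN = mean)

# 2. Fit a model to the averaged data. The formula only uses variables that
#    are also averaging columns. lm() stands in for INLA::inla() here.
fit <- lm(value ~ year * region, data = avg)
avg$pred <- predict(fit, newdata = avg)

# 3. Bring the group-level prediction back to the original rows.
out <- merge(df, avg[, c("region", "year", "pred")],
             by = c("region", "year"), all.x = TRUE)

# 4. Replace only missing values, mirroring replace_obs = "missing".
out$value_filled <- ifelse(is.na(out$value), out$pred, out$value)
out
```

Because the join in step 3 is keyed on the averaging columns, any formula variable missing from them would have no well-defined value on the averaged data.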