Get modeling error from a data frame

model_error() calculates modeling error from the observed and fitted values in a data frame. If test_col is provided, the error is calculated only on observations that were held out of modeling for testing. Otherwise, the error is calculated on all non-missing values.
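The row selection described above can be sketched in base R. This is only an illustration, not augury's implementation: rows_for_error() and the toy columns value, pred and test are hypothetical, and the sketch assumes the observed response is still present on the held-out rows.

```r
# Hypothetical helper, not part of augury: pick the rows an error
# metric would be calculated on.
rows_for_error <- function(df, response, test_col = NULL, pred_col = "pred") {
  # Keep rows where both the observed and the fitted value are present.
  keep <- !is.na(df[[response]]) & !is.na(df[[pred_col]])
  if (!is.null(test_col)) {
    # Restrict further to rows flagged as held out for testing;
    # %in% TRUE treats a missing flag as "not a test row".
    keep <- keep & df[[test_col]] %in% TRUE
  }
  df[keep, , drop = FALSE]
}

toy <- data.frame(
  value = c(1, 2, NA, 4, 5),
  pred  = c(1.1, 1.9, 3.0, 4.2, 4.8),
  test  = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

all_rows  <- rows_for_error(toy, "value")                     # non-missing rows
test_rows <- rows_for_error(toy, "value", test_col = "test")  # held-out rows only
```

With no test_col the sketch keeps every row that has both an observation and a prediction; with test_col it keeps only the flagged rows.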

Usage

model_error(
  df,
  response,
  test_col = NULL,
  test_period = NULL,
  test_period_flex = FALSE,
  group_col = NULL,
  sort_col = NULL,
  sort_descending = FALSE,
  pred_col = "pred",
  pred_upper_col = "pred_upper",
  pred_lower_col = "pred_lower"
)

Arguments

df: Data frame of model data.
response: Column name of response variable.
test_col: Name of logical column specifying which response values were removed for testing the model's predictive accuracy. If NULL, ignored. See Details for the methods and metrics returned.
test_period: Length of the period used to calculate root mean change error (RMChE). If NULL, the beginning and end points of each group in group_col are compared. Otherwise, test_period must be set to an integer n, and for each group the end point is compared with the point n periods prior.
test_period_flex: Logical value indicating whether change error should still be calculated for a group when the full length of its series is less than test_period. Defaults to FALSE.
group_col: Column name(s) of group(s) to use in dplyr::group_by() when calculating group-wise metrics, such as mean absolute scaled error, on data involving time series. If NULL, not used. Defaults to NULL.
sort_col: Column name(s) to use to dplyr::arrange() the data before calculating mean absolute scaled error on data involving time series. If NULL, not used. Defaults to NULL.
sort_descending: Logical value on whether the sorted values from sort_col should be sorted in descending order. Defaults to FALSE.
pred_col: Name of the column that stores the predicted values.
pred_upper_col: Name of the column that stores the upper bound of the confidence interval generated by the predict_... function. It holds the full set of generated values for the upper bound.
pred_lower_col: Name of the column that stores the lower bound of the confidence interval generated by the predict_... function. It holds the full set of generated values for the lower bound.
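How test_period and test_period_flex choose the two points compared for RMChE can be sketched as follows. This is a rough base R illustration under stated assumptions, not augury's code: change_error() and the toy data are hypothetical, a series counts as too short when it has no observation n periods before its end, and the per-group change errors are combined as the root of their mean square.

```r
# Hypothetical helper, not augury's code: change error for one sorted series.
change_error <- function(obs, pred, test_period = NULL, test_period_flex = FALSE) {
  n <- length(obs)
  if (is.null(test_period)) {
    start <- 1                 # compare the beginning and end of the series
  } else if (n > test_period) {
    start <- n - test_period   # compare the end with the point n periods prior
  } else if (test_period_flex) {
    start <- 1                 # series too short: fall back to beginning vs end
  } else {
    return(NA_real_)           # series too short and no fallback requested
  }
  # Observed change minus predicted change between the same two points.
  (obs[n] - obs[start]) - (pred[n] - pred[start])
}

toy <- data.frame(
  iso3  = rep(c("AAA", "BBB"), times = c(5, 3)),
  year  = c(2015:2019, 2017:2019),
  value = c(10, 12, 14, 16, 18, 5, 6, 8),
  pred  = c(10, 11, 13, 16, 17, 5, 7, 9)
)
toy <- toy[order(toy$iso3, toy$year), ]  # sort within groups, as sort_col would

# One change error per group, then the root of the mean squared change error.
errs <- sapply(split(toy, toy$iso3), function(g) {
  change_error(g$value, g$pred, test_period = 3, test_period_flex = TRUE)
})
rmche <- sqrt(mean(errs^2, na.rm = TRUE))
```

In this toy data group BBB has only three points, so with test_period = 3 it is compared beginning to end because test_period_flex is TRUE; with FALSE the sketch would return NA for that group.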

Value

A named vector of errors: RMSE, MAE, MdAE, MASE, CBA, R2, COR and RMChE.

Details

The error metrics generated from model_error() are the following:

RMSE: root mean squared error
MAE: mean absolute error
MdAE: median absolute error
MASE: mean absolute scaled error. Only calculated if test_col is provided, as it is test error scaled by in-sample error.
CBA: confidence bound accuracy, % of observations lying within the confidence bounds. Should be very near to 95%. Only calculated if both pred_upper_col and pred_lower_col are provided.
R2: R-squared or coefficient of determination. Calculated only on test values if test_col is provided. Due to the variety of models available within augury, as well as the predict_..._avg_trend() functions, adjusted R-squared is not currently available.
COR: Pearson correlation coefficient of fitted values to observations. Useful as a measure of general trend matching beyond the point error measurements above. If group_col is provided, correlation coefficients are calculated within each group and the average across all groups is returned. Calculated on all data, but interpret with care when applied to non-time series data.
RMChE: root mean change error. Since the GPW13 infilling and projections are designed to estimate change over time, RMChE measures the accuracy of that change. It is based on the difference between the observed change between two time points and the predicted change between the same two points. If test_period is NULL, the two points are the beginning and end of each group from group_col, sorted by sort_col. If test_period is provided as an integer n, the change is instead measured between the end point and the point n periods prior. test_period_flex determines whether the change is still calculated when the full length of the series is less than test_period. If TRUE, the change is again measured between the beginning and end of the series for that group.
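Several of the metrics above can be reproduced by hand on a toy data frame. The base R sketch below is illustrative only and is not augury's implementation. The column names and values are made up, R2 is assumed to be 1 minus the ratio of residual to total sum of squares, MASE is assumed to scale test MAE by the in-sample MAE of a naive one-step forecast (one common definition), and CBA is assumed to be evaluated on the test rows.

```r
# Toy data with hypothetical columns mirroring the defaults in Usage.
toy <- data.frame(
  value = c(10, 12, 13, 15, 18, 20, 23, 25),
  pred  = c(10, 11, 14, 15, 17, 21, 22, 25),
  test  = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)
toy$pred_lower <- toy$pred - 0.5
toy$pred_upper <- toy$pred + 0.5

held  <- toy[toy$test, ]            # observations held out for testing
resid <- held$value - held$pred     # test residuals

rmse <- sqrt(mean(resid^2))         # root mean squared error
mae  <- mean(abs(resid))            # mean absolute error
mdae <- median(abs(resid))          # median absolute error

# MASE (assumed definition): test MAE scaled by the in-sample MAE of a
# naive forecast that repeats the previous observation.
train_value <- toy$value[!toy$test]
mase <- mae / mean(abs(diff(train_value)))

# CBA: percentage of observations lying within the confidence bounds.
cba <- 100 * mean(held$value >= held$pred_lower & held$value <= held$pred_upper)

# R2 (assumed definition): 1 - residual sum of squares / total sum of squares.
r2 <- 1 - sum(resid^2) / sum((held$value - mean(held$value))^2)

# COR: Pearson correlation of fitted values to observations, on all data.
cor_all <- cor(toy$value, toy$pred)

errors <- c(RMSE = rmse, MAE = mae, MdAE = mdae, MASE = mase,
            CBA = cba, R2 = r2, COR = cor_all)
```

The result is a named numeric vector in the same spirit as the Value section. RMChE is left out here because it depends on grouping and sorting.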