model_error() calculates modeling error from the observed and fitted values in a data frame. If test_col is provided, error is calculated only on the observations that were excluded from modeling for testing purposes. Otherwise, error is calculated on all non-missing values.
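The effect of test_col can be illustrated with a minimal base R sketch. This is not augury's implementation: the helper `point_errors()`, the toy data, and the column names `value` and `test` are hypothetical, and only three of the point metrics are shown.

```r
# Toy data: `test` marks observations held out from model fitting.
# Column names and values are illustrative only.
df <- data.frame(
  value = c(10, 12, NA, 15, 18, 20),
  pred  = c(11, 12, 13, 14, 19, 23),
  test  = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: point errors on either all non-missing rows
# or, when test_col is given, only the held-out rows.
point_errors <- function(df, response, pred_col = "pred", test_col = NULL) {
  keep <- !is.na(df[[response]]) & !is.na(df[[pred_col]])
  if (!is.null(test_col)) {
    keep <- keep & df[[test_col]] %in% TRUE
  }
  err <- df[[pred_col]][keep] - df[[response]][keep]
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)), MdAE = median(abs(err)))
}

all_err  <- point_errors(df, "value")                     # all non-missing rows
test_err <- point_errors(df, "value", test_col = "test")  # held-out rows only
```

In this toy data the held-out rows carry the largest residuals, so `test_err` is larger than `all_err`, which is the usual reason for reporting test error separately.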

Usage

model_error(
  df,
  response,
  test_col = NULL,
  test_period = NULL,
  test_period_flex = FALSE,
  group_col = NULL,
  sort_col = NULL,
  sort_descending = FALSE,
  pred_col = "pred",
  pred_upper_col = "pred_upper",
  pred_lower_col = "pred_lower"
)

Arguments

df

Data frame of model data.

response

Column name of response variable.

test_col

Name of logical column specifying which response values to remove for testing the model's predictive accuracy. If NULL, ignored. See Details for the metrics returned.

test_period

Length of period to test for RMChE. If NULL, beginning and end points of each group in group_col are compared. Otherwise, test_period must be set to an integer n and for each group, comparisons are made between the end point and n periods prior.

test_period_flex

Logical value indicating whether change error should still be calculated for a group when the full length of its series is less than test_period. Defaults to FALSE.

group_col

Column name(s) of group(s) to use in dplyr::group_by() when supplying type, calculating mean absolute scaled error on data involving time series, and if group_models, then fitting and predicting models too. If NULL, not used. Defaults to NULL.

sort_col

Column name(s) passed to dplyr::arrange() to sort the data prior to supplying type and calculating mean absolute scaled error on data involving time series. If NULL, not used. Defaults to NULL.

sort_descending

Logical value on whether the sorted values from sort_col should be sorted in descending order. Defaults to FALSE.

pred_col

Column name to store predicted value.

pred_upper_col

Column name to store upper bound of confidence interval generated by the predict_... function. This stores the full set of generated values for the upper bound.

pred_lower_col

Column name to store lower bound of confidence interval generated by the predict_... function. This stores the full set of generated values for the lower bound.

Value

A named vector of errors: RMSE, MAE, MdAE, MASE, CBA, R2, COR and RMChE.
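As a rough illustration of how a few of these entries could be computed, the base R sketch below builds CBA, R2, and COR for a toy data frame. It is an approximation under stated assumptions, not augury's code: the data are invented, R2 is taken as 1 - SS_res / SS_tot, and CBA is expressed as a proportion rather than a percentage.

```r
# Toy data with two groups; all values are illustrative only.
df <- data.frame(
  iso3       = rep(c("AAA", "BBB"), each = 4),
  year       = rep(2018:2021, times = 2),
  value      = c(1, 2, 3, 4, 10, 9, 7, 6),
  pred       = c(1.5, 2, 2.5, 4.5, 9, 9, 8, 5),
  pred_lower = c(1, 1.5, 2, 4.2, 8, 8, 7.5, 4),
  pred_upper = c(2, 2.5, 3, 5, 10, 10, 8.5, 6)
)

# CBA: share of observations inside [pred_lower, pred_upper].
cba <- mean(df$value >= df$pred_lower & df$value <= df$pred_upper)

# R2 as 1 - SS_res / SS_tot (assumed definition for this sketch).
r2 <- 1 - sum((df$value - df$pred)^2) / sum((df$value - mean(df$value))^2)

# COR: Pearson correlation within each group, averaged across groups.
cor_by_group <- sapply(split(df, df$iso3), function(g) cor(g$value, g$pred))
cor_avg <- mean(cor_by_group)

errors <- c(CBA = cba, R2 = r2, COR = cor_avg)
```

Averaging the within-group correlations, rather than pooling all rows, keeps level differences between groups from inflating the coefficient.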

Details

The error metrics generated from model_error() are the following:

  • RMSE: root mean squared error

  • MAE: mean absolute error

  • MdAE: median absolute error

  • MASE: mean absolute scaled error. Only calculated if test_col is provided, as it is test error scaled by in-sample error.

  • CBA: confidence bound accuracy, the percentage of observations lying within the confidence bounds. Should be very near 95%. Only calculated if both pred_upper_col and pred_lower_col are provided.

  • R2: R-squared or coefficient of determination. Calculated only on test values if test_col is provided. Due to the variety of models available within augury, as well as the predict_..._avg_trend() functions, adjusted R-squared is not currently available.

  • COR: Pearson correlation coefficient of fitted values to observations. Useful as a measure of general trend matching beyond the point error measurements used above. If group_col is provided, correlation coefficients are calculated within each group and the average across all groups is returned. Calculated on all data, but be careful when interpreting it for non-time series data.

  • RMChE: root mean change error. Since the GPW13 infilling and projections are designed to estimate change over time, RMChE measures the accuracy of this change. It is calculated as the difference between the observed change between two time periods and the predicted change across those same time periods. If test_period is NULL, the two periods are the beginning and end of each group from group_col, sorted by sort_col. If test_period is provided as an integer n, change is instead calculated between the end and n periods prior. test_period_flex determines whether to calculate the change when the full length of the series is less than test_period. If TRUE, change is again compared between the beginning and end of the series for that group.
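The RMChE logic described above can be sketched in base R as follows. This is a hypothetical reading of the description, not augury's implementation: the helper `rmche()` is invented, consecutive rows within a group are assumed to be one period apart, a series is treated as too short when it has fewer than test_period + 1 rows, and the per-group change errors are assumed to be combined as a root mean square.

```r
# Hypothetical helper: root mean change error across groups.
rmche <- function(df, response, group_col, sort_col, pred_col = "pred",
                  test_period = NULL, test_period_flex = FALSE) {
  change_err <- sapply(split(df, df[[group_col]]), function(g) {
    g <- g[order(g[[sort_col]]), ]
    n <- nrow(g)
    if (is.null(test_period)) {
      start <- 1                  # compare beginning and end of the series
    } else if (n > test_period) {
      start <- n - test_period    # compare end with n periods prior
    } else if (test_period_flex) {
      start <- 1                  # series too short: fall back to full span
    } else {
      return(NA_real_)            # series too short: skip this group
    }
    obs_change  <- g[[response]][n] - g[[response]][start]
    pred_change <- g[[pred_col]][n] - g[[pred_col]][start]
    pred_change - obs_change
  })
  sqrt(mean(change_err^2, na.rm = TRUE))
}

# Toy data: group AAA has five periods, group BBB only two.
df <- data.frame(
  iso3  = c(rep("AAA", 5), rep("BBB", 2)),
  year  = c(2015:2019, 2018:2019),
  value = c(1, 2, 3, 4, 5, 10, 12),
  pred  = c(1, 2, 3, 5, 7, 10, 11)
)

full_span <- rmche(df, "value", "iso3", "year")
strict    <- rmche(df, "value", "iso3", "year", test_period = 2)
flexible  <- rmche(df, "value", "iso3", "year", test_period = 2,
                   test_period_flex = TRUE)
```

With test_period = 2 and test_period_flex = FALSE, group BBB is dropped because it has only two rows, so `strict` reflects group AAA alone. Setting test_period_flex = TRUE brings BBB back using its full span.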