Skip to contents

Introduction

The data pipeline of the Triple Billion store multiple data sets corresponding to different stages of transformation before combining them for the final calculations. This enables a flexible approach where one scenario can be reused by others to fill in the gaps. For more information on the scenarios see the scenarios vignette.

As stated in the scenario definition, scenarios should contain all and only the data that is strictly needed for calculations with billionaiRe. In order to have accurate Triple Billion calculations, it is then necessary to recycle the data to recombine different datasets into single timeseries that then can be used in the calculations. .If a scenario does not contain the required data, the functions will fail.

billionaiRe transform_ and calculate_ functions (as well as rapporteur export functions) use the scenario column as a dplyr::group() in dplyr::group_by.If a scenario does not contain the required data, the functions will fail.

For most users of billionaiRe, accessing external data tables (on xMart or World Health Data Hub) is more resource intensive than computation. This means that large tables should be avoided. Recycling the data between scenarios is then key to avoid storing identical data multiple times that will then be used by different scenarios. This is what data recycling aims to achieve: minimal storage before computation.

Data recycling infers the existence of a reference scenario. This reference scenario is called default by default, and it is a parameter (default_scenario) of recycle_data() and all functions that rely on full data sets. The name of default_scenario can then be modified as required.

default scenario provides values when they are absent from scenarios, along with:

  • scenario_reported_estimated for reported/estimated values (routine by default),
  • scenario_reference_infilling for values imputed/projected by technical programs (reference_infilling by default)
  • and scenario_covid_shock for COVID-19 shock values.

Together scenario_reported_estimated, scenario_reference_infilling, and scenario_covid_shock are the base scenarios.

Data recycling works by adding to all scenarios present in the scenario_col column values that are missing from first default_scenario, then looks in scenario_reported_estimated, scenario_reported_estimated and scenario_covid_shock to add values that are not present in the scenario, nor any of the preceding scenarios. This is done through a series of dplyr::anti_join:

Implementation of data recycling in billionaiRe

Data recycling is implemented in billionaiRe through the recycle_data() function. This function wraps around recycle_data_scenario_single() (not exported for external use) to run the recycling over all the scenarios present in the input data frame.

recycle_data uses similar parameters than other exported billionaiRe functions. However it introduces specific parameters:

  • default_scenario: sets the default parameter (see above). default by default
  • scenario_reported_estimated: sets the reported/estimated scenario (see above). routine by default
  • scenario_reference_infilling: sets the projected/imputed scenarios. reference_infilling by default.
  • scenario_covid_shock: sets the data that correspond to the COVID-19 shock. covid_shock by default.
  • include_projection: Boolean to set if projections should be included in the recycling. TRUE by default.
  • recycle_campaigns: Boolean to set if campaign data should be included in the recycling. TRUE by default.

A recycle and ...parameters were added to the transform_ functions to ease recycling. If recycle is TRUE, data will be recycled, using the eventual parameters passed through the ....

In order to facilitate the cleaning of data, a recycled column is added by the recycling function to identify data points that have been recycled. They can then be removed by the remove_recycled_data() function that takes into account a few specific scenarios.

To avoid adding ellipses to all billionaiRe functions which could have opened a number of unforeseeable issues, not all formally recycled data can be identified and thus removed with remove_recycled_data(), especially for the HEP billion. This includes mostly carried over campaign data.

A make_default_scenario() function is provided to combine the default_scenario, scenario_reported_estimated, scenario_reference_infilling and scenario_covid_shock efficiently.