Title: | A Tidy Data Pipeline to Construct, Compare, and Analyse Indexes |
---|---|
Description: | Construct and analyse indexes in a tidy pipeline workflow. 'tidyindex' contains modules for transforming variables, aggregating variables across time, reducing data dimension through weighting, and fitting distributions. A manuscript describing the methodology can be found at <https://github.com/huizezhang-sherry/paper-tidyindex>. |
Authors: | H. Sherry Zhang [aut, cre, cph] , Dianne Cook [aut] , Ursula Laa [aut] , Nicolas Langrené [aut] , Patricia Menéndez [aut] |
Maintainer: | H. Sherry Zhang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.9000 |
Built: | 2025-01-10 04:32:40 UTC |
Source: | https://github.com/huizezhang-sherry/tidyindex |
The function joins the parameter table to the 'paras' element of an index table object.
add_paras(data, para_tbl, by)
data |
a |
para_tbl |
a tibble or data frame object with the parameters of the variables |
by |
a single column name (supports tidyselect) in 'para_tbl' that maps to the variable name in the data |
an index object
init(gggi) |> add_paras(gggi_weights, by = "variable")
Data for constructing the Air Quality Index (AQI), extracted from the Technical Assistance Document for the Reporting of Daily Air Quality.
aqi_ref_tbl
pollutant_ref_tbl
aqi
The aqi data contains daily PM2.5 values in Travis County, Austin, Texas, USA in 2024, measured at three monitoring sites. The data is a tibble with 272 rows and 9 variables:
name of pollutant (PM2.5)
a five-digit code assigned to each pollutant
date of measurement
the measured value of PM2.5
the calculated AQI value
longitude of the monitor site
latitude of the monitor site
site code
site name
The aqi_ref_tbl and pollutant_ref_tbl data contain the breakpoints for the AQI and for each of the six pollutants (Ozone, PM2.5, PM10, CO, SO2, NO2). The aqi_ref_tbl data is a tibble with 5 rows and 3 variables:
corresponding group category, from "Good" to "Very Unhealthy"
the low breakpoint of a certain pollutant group
the high breakpoint of a certain pollutant group
The pollutant_ref_tbl data is a tibble with 30 rows and 5 variables.
https://document.airnow.gov/technical-assistance-document-for-the-reporting-of-daily-air-quailty.pdf
Calculate multiple indexes at once
compute_indexes(.data, ...)

## S3 method for class 'idx_res'
augment(x, .id = ".id", ...)
.data |
an |
... |
Unused, included for generic consistency only |
x |
an |
.id |
a character string, the name of the first column |
an idx_res object
library(dplyr)
library(lmomco)
library(generics)
res <- tenterfield |>
  mutate(month = lubridate::month(ym)) |>
  init(id = id, time = ym, group = month) |>
  compute_indexes(
    spi = idx_spi(),
    spei = idx_spei(.lat = lat, .tavg = tavg),
    edi = idx_edi()
  )
The module combines multiple variables into a new variable. The new variable can be a linear combination of the original variables, aggregate_linear(); a geometric mean of the original variables, aggregate_geometrical(); or created from a user-supplied formula, aggregate_manual().
dimension_reduction(data, ...)

aggregate_linear(formula, weight)

aggregate_geometrical(formula)

aggregate_manual(formula)
data |
used in |
... |
used in |
formula |
the formula to evaluate |
weight |
used in |
an index table object
dt <- gggi |>
  dplyr::select(country, sex_ratio_at_birth:healthy_life_expectancy) |>
  init()

dt |>
  dimension_reduction(health = aggregate_manual(
    ~sex_ratio_at_birth * 0.693 + healthy_life_expectancy * 0.307))

dt |>
  add_paras(gggi_weights, by = variable) |>
  dimension_reduction(health = aggregate_linear(
    ~sex_ratio_at_birth:healthy_life_expectancy, weight = var_weight))

dt |>
  dimension_reduction(health = aggregate_geometrical(
    ~sex_ratio_at_birth:healthy_life_expectancy))
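As a rough illustration of what a linear combination and a geometric mean compute, the base R sketch below combines two made-up variables. The data frame `scores` and its values are hypothetical, and the formulas (a weighted sum, and an unweighted geometric mean computed on the log scale) are textbook definitions assumed for illustration rather than the package's internals. The weights 0.693 and 0.307 are the ones used in the manual-formula example.

```r
# Hypothetical scores for three units (illustrative values only)
scores <- data.frame(
  sex_ratio_at_birth      = c(0.94, 0.98, 0.92),
  healthy_life_expectancy = c(1.00, 0.97, 1.03)
)
w <- c(0.693, 0.307)  # weights taken from the manual-formula example

# Linear aggregation: weighted sum of the columns
health_linear <- as.numeric(as.matrix(scores) %*% w)

# Geometric aggregation: n-th root of the product, computed on the log scale
health_geometric <- exp(rowMeans(log(as.matrix(scores))))

cbind(scores, health_linear, health_geometric)
```

Working on the log scale avoids overflow when many variables are multiplied, but it requires strictly positive inputs.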
This module fits a distribution to the variable of interest. The currently implemented distributions are: gamma, dist_gamma(); generalized logistic, dist_glo(); generalized extreme value, dist_gev(); and Pearson Type III, dist_pe3().
distribution_fit(data, ...)

dist_gamma(var, method = "lmoms", .n_boot = 1, .boot_seed = 123)

dist_glo(var, method = "lmoms", .n_boot = 1, .boot_seed = 123)

dist_gev(var, method = "lmoms", .n_boot = 1, .boot_seed = 123)

dist_pe3(var, method = "lmoms", .n_boot = 1, .boot_seed = 123)
data |
an index table object |
... |
a distribution fit object, currently implemented are
|
var |
used in |
method |
used in |
.n_boot |
the number of bootstrap replicates, defaults to 1 |
.boot_seed |
the seed used to generate the bootstrap replicates, defaults to 123 |
an index table object
library(dplyr)
library(lmomco)
tenterfield |>
  mutate(month = lubridate::month(ym)) |>
  init(id = id, time = ym, group = month) |>
  temporal_aggregate(.agg = temporal_rolling_window(prcp, scale = 12)) |>
  distribution_fit(.fit = dist_gamma(.agg, method = "lmoms"))
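To give a feel for what a distribution fit step produces, the base R sketch below fits a gamma distribution to simulated data and returns one cumulative probability per observation. Note the swap: it uses method-of-moments estimates as a simple stand-in, not the L-moments method ("lmoms") named in the usage above. The simulated series `agg` and the names `shape_hat` and `rate_hat` are hypothetical.

```r
set.seed(42)
# Simulated positive series standing in for aggregated precipitation
agg <- rgamma(200, shape = 4, rate = 0.01)

# Method-of-moments estimates (a stand-in for the L-moments fit)
shape_hat <- mean(agg)^2 / var(agg)
rate_hat  <- mean(agg) / var(agg)

# Fitted cumulative probabilities, one per observation
fit <- pgamma(agg, shape = shape_hat, rate = rate_hat)
summary(fit)
```

By construction the fitted mean, `shape_hat / rate_hat`, equals the sample mean, which is a quick sanity check on the estimates.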
The Global Gender Gap Index combines 14 variables from four dimensions to measure the gender parity across 146 countries in the world.
gggi
gggi_weights
An object of class tbl_df (inherits from tbl, data.frame) with 146 rows and 22 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 14 rows and 7 columns.
The dataset includes country, region, GGGI score and rank, the combined four dimensions (Economic Participation and Opportunity, Educational Attainment, Health and Survival, and Political Empowerment), and the variables under each dimension. The variable composition of each dimension is as follows:
* Economic Participation and Opportunity: Labour force participation, Wage equality for similar work, Estimated earned income, Legislators, senior officials and managers, and Professional and technical workers
* Educational attainment: Literacy rate, Enrolment in primary education, Enrolment in secondary education, Enrolment in tertiary education
* Health and survival: Sex ratio at birth and Healthy life expectancy
* Political empowerment: Women in parliament, Women in ministerial positions, and Years with female head of state
Variable names are cleaned with [janitor::clean_names()].
The weight data is extracted from page 65 of the Global Gender Gap Report (see reference); see page 61 for the region classification.
https://www3.weforum.org/docs/WEF_GGGR_2023.pdf
Human Development Index (2022)
hdi
hdi_scales
A tibble with seven columns:
the row number
191 countries with computed HDI
the HDI index value
life expectancy
expected schooling
average schooling
GNI per capita, logged
An object of class tbl_df (inherits from tbl, data.frame) with 4 rows and 5 columns.
https://hdr.undp.org/data-center/human-development-index#/indicies/HDI
Initialise an index table object with a data frame or a tibble.
init(data, ...)

## S3 method for class 'idx_tbl'
print(x, ...)
data |
a tibble or data frame to be converted into an index object |
... |
arguments to give variables roles, recorded in the |
x |
an index object |
an index table object
init(hdi)
init(gggi)
The normalise module takes a probability value from a distribution fit and converts it with the normal quantile function, norm_quantile().
normalise(data, ...)

norm_quantile(var)
data |
an index table object |
... |
the expression to be evaluated |
var |
used in |
an index table object
library(dplyr)
library(lmomco)
tenterfield |>
  mutate(month = lubridate::month(ym)) |>
  init(id = id, time = ym, group = month) |>
  temporal_aggregate(.agg = temporal_rolling_window(prcp, scale = 12)) |>
  distribution_fit(.fit = dist_gamma(.agg, method = "lmoms")) |>
  normalise(index = norm_quantile(.fit))
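The conversion by the normal quantile function can be pictured with base R's `qnorm()`. In this minimal sketch the probabilities in `p` are made up and stand in for the output of a distribution fit; it is assumed, not confirmed, that norm_quantile() applies the standard normal quantile function in the same way.

```r
# Hypothetical cumulative probabilities, standing in for a distribution-fit output
p <- c(0.02, 0.16, 0.50, 0.84, 0.98)

# Normal quantile function: maps probabilities in (0, 1) to standard normal scores
index <- qnorm(p)
round(index, 2)
```

A probability of 0.5 maps to 0, and probabilities symmetric about 0.5 map to scores of equal size and opposite sign, which is what makes the resulting index comparable across locations and seasons.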
The rescale module changes the scale of the variable(s) using one of the available rescaling functions: rescale_zscore(), rescale_minmax(), and rescale_center().
rescaling(data, ...)

rescale_zscore(var, na.rm = TRUE)

rescale_minmax(var, min = NULL, max = NULL, na.rm = TRUE, censor = TRUE)

rescale_center(var, na.rm = TRUE)
data |
an index table object, see [tidyindex::init()] |
... |
used in |
var |
the variable(s) to rescale, accept tidyselect syntax |
na.rm |
used in |
min , max
|
used in |
censor |
used in |
an index table object
dt <- hdi |> init()
dt |> rescaling(life_exp = rescale_zscore(life_exp))
dt |> rescaling(life_exp2 = rescale_minmax(life_exp, min = 20, max = 85))

hdi_init <- hdi |>
  init(id = country) |>
  add_paras(hdi_scales, by = "var")
hdi_init |>
  rescaling(rescale_minmax(c(life_exp, exp_sch, avg_sch, gni_pc),
                           min = min, max = max))
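For readers who want to see the arithmetic behind the three rescaling options, here is a base R sketch using the textbook definitions of z-score, min-max, and centring. The vector `life_exp` holds made-up values, and the formulas are assumptions for illustration: the package functions may differ in detail, for example in how the `censor` argument treats values outside the range. The goalposts 20 and 85 come from the example above.

```r
life_exp <- c(52.5, 61.0, 70.3, 78.9, 84.6)  # made-up values

# z-score: centre on the mean, then divide by the standard deviation
zscore <- (life_exp - mean(life_exp, na.rm = TRUE)) / sd(life_exp, na.rm = TRUE)

# min-max with fixed goalposts, as in the example above (min = 20, max = 85)
minmax <- (life_exp - 20) / (85 - 20)

# centring: subtract the mean only
centred <- life_exp - mean(life_exp, na.rm = TRUE)

data.frame(life_exp, zscore, minmax, centred)
```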
The two functions allow you to substitute a value or expression in the pipeline with other options. These functions will evaluate the modified pipeline step, as well as its prior and subsequent steps, to create different versions of the index.
swap_values(data, .var, .param, .values)

swap_exprs(data, .var, .exprs)
data |
an |
.var |
the name of the variable whose pipeline step is tested for alternatives |
.param |
the name of the parameter to swap |
.values , .exprs
|
a list of values or expressions |
an index table
library(generics)
hdi_paras <- hdi_scales |>
  dplyr::add_row(dimension = "Education", name = "Education",
                 var = "sch", min = 0, max = 0) |>
  dplyr::mutate(weight = c(1/3, 0, 0, 1/3, 1/3),
                weight2 = c(0.1, 0, 0, 0.8, 0.1),
                weight3 = c(0.8, 0, 0, 0.1, 0.1),
                weight4 = c(0.1, 0, 0, 0.1, 0.8))

dt <- hdi |>
  init(id = country) |>
  add_paras(hdi_paras, by = var) |>
  rescaling(life_exp = rescale_minmax(life_exp, min = min, max = max)) |>
  rescaling(exp_sch = rescale_minmax(exp_sch, min = min, max = max)) |>
  rescaling(avg_sch = rescale_minmax(avg_sch, min = min, max = max)) |>
  rescaling(gni_pc = rescale_minmax(gni_pc, min = min, max = max)) |>
  dimension_reduction(sch = aggregate_manual(~(exp_sch + avg_sch)/2)) |>
  dimension_reduction(index = aggregate_linear(~c(life_exp, sch, gni_pc),
                                               weight = weight))

dt2 <- dt |>
  swap_values(.var = "index", .param = weight,
              .values = list(weight2, weight3, weight4))
augment(dt2)

dt3 <- dt |>
  swap_exprs(.var = index, .exprs = list(
    aggregate_geometrical(~c(life_exp, sch, gni_pc))))
augment(dt3)
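The idea of swapping a parameter to obtain several versions of an index can be sketched in base R by recomputing a weighted sum under alternative weight sets. Everything here is hypothetical: the matrix `dims`, its values, and the three weight sets in `weights` are made up for illustration and do not reproduce the package's implementation or the weight columns above.

```r
# Made-up rescaled dimensions for three units
dims <- cbind(life_exp = c(0.90, 0.70, 0.50),
              sch      = c(0.80, 0.60, 0.40),
              gni_pc   = c(0.85, 0.55, 0.45))

# Hypothetical alternative weight sets, each summing to one
weights <- list(equal = c(1/3, 1/3, 1/3),
                alt_a = c(0.8, 0.1, 0.1),
                alt_b = c(0.1, 0.1, 0.8))

# One index version per weight set: a units-by-versions matrix
index_versions <- sapply(weights, function(w) as.numeric(dims %*% w))
index_versions
```

Each column is one version of the index, so comparing columns shows how sensitive a unit's value, and hence its rank, is to the choice of weights.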
The temporal processing module is used to aggregate data along the temporal dimension. The currently available aggregation recipe is temporal_rolling_window().
temporal_aggregate(data, ...)

temporal_rolling_window(
  var,
  scale,
  .before = 0L,
  .step = 1L,
  .complete = TRUE,
  rm.na = TRUE,
  ...
)
data |
an index table object, see [tidyindex::init()] |
... |
a temporal processing object of class |
var |
the variable to aggregate |
scale |
numeric, the scale (window) of the aggregation |
.before , .step , .complete
|
see |
rm.na |
logical, whether to remove the first few rows with NAs |
an index table object
tenterfield |>
  init(time = ym) |>
  temporal_aggregate(.agg = temporal_rolling_window(prcp, scale = 12))

# multiple ids (groups), and multiple scales
queensland |>
  dplyr::filter(id %in% c("ASN00029038", "ASN00029127")) |>
  init(id = id, time = ym) |>
  temporal_aggregate(temporal_rolling_window(prcp, scale = c(12, 24)))
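A rolling-window aggregation with a scale of 12 can be pictured with the base R sketch below. It assumes, without confirmation from this page, that the window is trailing (the current month plus the 11 before it) and that the values are summed. The series `prcp` is simulated.

```r
set.seed(1)
prcp  <- round(runif(36, 0, 150), 1)  # made-up monthly precipitation, 3 years
scale <- 12

# Trailing rolling sum: each value covers the current month and the 11 before it
agg <- as.numeric(stats::filter(prcp, rep(1, scale), sides = 1))

# The first scale - 1 entries are NA because the window is incomplete
head(agg, 13)
```

The leading NA values correspond to the incomplete windows at the start of the series, which is the kind of row the `rm.na` argument above refers to.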
Weather data for in-situ stations in Queensland from 1990 to 2020
tenterfield
aus_climate
queensland
A tibble with 9 columns:
station ID, ASN000xxxxx
date in 'tsibble::yearmonth' format
aggregated monthly precipitation from daily data
maximum/minimum/average temperature
longitude and latitude of the station
station name
An object of class tbl_df (inherits from tbl, data.frame) with 52373 rows and 9 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 11252 rows and 9 columns.
A ggplot2 theme for benchmarking the index series
theme_benchmark(yintercept = -2, linetype = "dashed")
yintercept |
intercept |
linetype |
linetype |
a ggplot2 object
if (require("ggplot2", quietly = TRUE)) {
  dplyr::tibble(x = 1:100, y = rnorm(100, sd = 2)) |>
    ggplot(aes(x = x, y = y)) +
    geom_line() +
    theme_benchmark()
}
These functions quickly compute some common drought indexes and are built as wrappers of the underlying modules. For more customised needs, users may build their own indexes from the modules.
trans_thornthwaite(var, lat, na.rm = FALSE, verbose = TRUE)

idx_spi(data, .prcp, .dist = dist_gamma(), .scale = 12)

idx_spei(
  data,
  .tavg,
  .lat,
  .prcp,
  .pet_method = trans_thornthwaite(),
  .scale = 12,
  .dist = dist_glo()
)

idx_rdi(
  data,
  .tavg,
  .lat,
  .prcp,
  .pet_method = trans_thornthwaite(),
  .scale = 12
)

idx_edi(data, .tavg, .lat, .prcp, .scale = 12)
var |
the variable to be transformed, see [tidyindex::variable_trans()] and [SPEI::thornthwaite()] |
lat , na.rm , verbose
|
see [SPEI::thornthwaite] |
data |
an |
.dist |
the distribution used for distribution fit, see [tidyindex::distribution_fit()] |
.scale |
the temporal aggregation scale, see [tidyindex::temporal_aggregate()] |
.tavg , .lat , .prcp
|
variables to be used in the index calculation, see Details |
.pet_method |
the method used for calculating potential evapotranspiration, currently only |
The steps wrapped in each index, and the intermediate variables created, are explained below.
The idx_spi() function performs
* a temporal aggregation on the input precipitation series, .prcp, as .agg,
* a distribution fit step on the aggregated precipitation, .agg, as .fit, and
* a normalising step on the fitted values, .fit, as .index.
The idx_spei() function performs
* a variable transformation step on the input average temperature, .tavg, to obtain the potential evapotranspiration, .pet,
* a dimension reduction step to calculate the difference series, .diff, between the input precipitation series, .prcp, and .pet,
* a temporal aggregation step on the difference series, .diff, as .agg,
* a distribution fit step on the aggregated series, .agg, as .fit, and
* a normalising step on the fitted value, .fit, to obtain .index.
The idx_rdi() function performs
* a variable transformation step on the input average temperature, .tavg, to obtain the potential evapotranspiration, .pet,
* a dimension reduction step to calculate the ratio of input precipitation, .prcp, to .pet as .ratio,
* a temporal aggregation step on the ratio series, .ratio, as .agg,
* a variable transformation step to take the log10 of the aggregated series, .agg, as .y, and
* a rescaling step to rescale .y by zscore to obtain .index.
The idx_edi() function performs
* a dimension reduction step to aggregate the input precipitation series, .prcp, as .mult,
* a temporal aggregation step on the aggregated precipitation series, .mult, as .ep, and
* a rescaling step to rescale .ep by zscore to obtain .index.
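The chain of steps listed for idx_rdi() can be traced end to end with a small base R sketch. This is only an illustration under stated assumptions: the `pet` series is simulated rather than derived from temperature by the Thornthwaite method, the temporal aggregation is assumed to be a trailing 12-month sum, and the z-score uses the textbook definition. The variable names mirror the intermediate names in the list above without their leading dots.

```r
set.seed(7)
n    <- 48
prcp <- runif(n, 10, 150)  # made-up monthly precipitation
pet  <- runif(n, 40, 120)  # made-up potential evapotranspiration (not Thornthwaite)

# Dimension reduction: ratio of precipitation to PET (.ratio)
ratio <- prcp / pet

# Temporal aggregation: assumed trailing 12-month sum (.agg)
agg <- as.numeric(stats::filter(ratio, rep(1, 12), sides = 1))

# Variable transformation: log10 of the aggregated series (.y)
y <- log10(agg)

# Rescaling: z-score of the transformed series (.index)
index <- (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)

tail(data.frame(ratio, agg, y, index), 3)
```

Writing the steps out this way makes it easy to see where a custom index could deviate, for instance by changing the window length or replacing the log10 transformation.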
an index table object
library(dplyr)
library(lmomco)
dt <- tenterfield |>
  mutate(month = lubridate::month(ym)) |>
  init(id = id, time = ym, group = month)

dt |> idx_spi()
dt |> idx_spi(.scale = c(12, 24))
dt |> idx_spei(.lat = lat, .tavg = tavg)
dt |> idx_rdi(.lat = lat, .tavg = tavg)
dt |> idx_edi(.lat = lat, .tavg = tavg)
The variable transformation module is used to transform a single variable in the index table object. The transformation is specified by a variable transformation object of class var_trans, created by the trans_* functions. Currently, the following transformation functions are supported: trans_log10(), trans_quadratic(), trans_square_root(), trans_cubic_root(), and trans_affine().
variable_trans(data, ...)

trans_log10(var)

trans_quadratic(var)

trans_square_root(var)

trans_cubic_root(var)

trans_affine(var, a = NULL, b = NULL)
data |
an index table object |
... |
a variable transformation recipe of class |
var |
used in |
a |
used in |
b |
used in |
an index table object
hdi |> init() |> variable_trans(gni_pc = trans_log10(gni_pc))
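As a plain base R counterpart, the sketch below applies the transformations that the function names suggest to a made-up vector. The mapping from each trans_* name to a formula is an assumption based on the names alone (for instance, that trans_quadratic squares its input), and the vector `gni_pc` holds illustrative values.

```r
gni_pc <- c(1000, 10000, 100000)  # made-up incomes

t_log10 <- log10(gni_pc)    # log base 10
t_quad  <- gni_pc^2         # quadratic (square)
t_sqrt  <- sqrt(gni_pc)     # square root
t_cubic <- gni_pc^(1/3)     # cubic root, valid for non-negative input

data.frame(gni_pc, t_log10, t_quad, t_sqrt, t_cubic)
```

The log transformation compresses the large gaps between high incomes, which is why it is the one used on gni_pc in the example above.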