Computes balance statistics for multiple variables across different groups and optional weighting schemes. This function generalizes balance checking by supporting multiple metrics (SMD, variance ratio, Kolmogorov-Smirnov, weighted correlation, energy distance) and returns results in a tidy format.
Usage
check_balance(
.data,
.vars,
.group,
.wts = NULL,
.metrics = c("smd", "vr", "ks", "energy"),
include_observed = TRUE,
reference_group = 1L,
na.rm = FALSE,
make_dummy_vars = TRUE,
squares = FALSE,
cubes = FALSE,
interactions = FALSE
)
Arguments
- .data
A data frame containing the variables to analyze.
- .vars
Variables for which to calculate metrics. Can be unquoted variable names, a character vector, or a tidyselect expression.
- .group
Grouping variable, e.g., treatment or exposure group.
- .wts
Optional weighting variables. Can be unquoted variable names, a character vector, or NULL. Multiple weights can be provided to compare different weighting schemes.
- .metrics
Character vector specifying which metrics to compute. Available options: "smd" (standardized mean difference), "vr" (variance ratio), "ks" (Kolmogorov-Smirnov), "correlation" (for continuous exposures), "energy" (multivariate energy distance). Defaults to c("smd", "vr", "ks", "energy").
- include_observed
Logical. If using .wts, also calculate observed (unweighted) metrics? Defaults to TRUE.
- reference_group
The reference group level to use for comparisons. Defaults to 1 (first level).
- na.rm
A logical value indicating whether to remove missing values before computation. If FALSE (default), missing values in the input will produce NA in the output.
- make_dummy_vars
Logical. Transform categorical variables to dummy variables using model.matrix()? Defaults to TRUE. When TRUE, categorical variables are expanded into separate binary indicators for each level.
- squares
Logical. Include squared terms for continuous variables? Defaults to FALSE. When TRUE, adds squared versions of numeric variables.
- cubes
Logical. Include cubed terms for continuous variables? Defaults to FALSE. When TRUE, adds cubed versions of numeric variables.
- interactions
Logical. Include all pairwise interactions between variables? Defaults to FALSE. When TRUE, creates interaction terms for all variable pairs, excluding interactions between levels of the same categorical variable and between squared/cubed terms.
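The transformation arguments (make_dummy_vars, squares, cubes, interactions) can be pictured with a small base R sketch. This is not the function's internal code: the toy data frame, the `_squared`/`_cubed` column suffixes, and the way the pieces are combined are assumptions made for illustration only.

```r
# Toy data; the column names are hypothetical.
df <- data.frame(
  age = c(30, 45, 52, 61),
  sex = factor(c("f", "m", "m", "f")),
  wt  = c(60, 82, 75, 68)
)

# Dummy coding: dropping the intercept keeps one indicator per level.
dummies <- model.matrix(~ sex - 1, data = df)

# Polynomial terms for the numeric columns only.
num     <- df[sapply(df, is.numeric)]
squares <- setNames(num^2, paste0(names(num), "_squared"))  # assumed suffix
cubes   <- setNames(num^3, paste0(names(num), "_cubed"))    # assumed suffix

# Pairwise interaction between the two numeric columns.
inter <- model.matrix(~ (age + wt)^2 - 1, data = df)

# Combine everything into one expanded covariate table.
expanded <- cbind(num, dummies, squares, cubes, inter[, "age:wt", drop = FALSE])
names(expanded)
```

Each column of a table like `expanded` would then be treated as its own variable when balance statistics are computed.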
Value
A tibble with columns:
- variable
Character. The variable name being analyzed (NA for the multivariate "energy" metric).
- group_level
Character. The non-reference group level (NA for the "energy" metric).
- method
Character. The weighting method ("observed" or weight variable name).
- metric
Character. The balance metric computed ("smd", "vr", "ks", "correlation", or "energy").
- estimate
Numeric. The computed balance statistic.
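Because the result has one row per variable, method, and metric, it can be filtered or pivoted with ordinary data frame tools. The sketch below rebuilds the `age` rows from the first example on this page by hand (as a plain data frame rather than a tibble) and uses base R only. The 0.1 cutoff is the rule of thumb mentioned in Details, not a value enforced by the function.

```r
# Hand-copied rows for `age` from the first example's printed output.
res <- data.frame(
  variable    = "age",
  group_level = "0",
  method      = rep(c("observed", "w_ate", "w_att"), times = 3),
  metric      = rep(c("ks", "smd", "vr"), each = 3),
  estimate    = c(0.130, 0.0293, 0.0362,
                  0.282, 0.00585, 0.0122,
                  1.07, 1.01, 1.01)
)

# Keep only SMD rows whose absolute value exceeds the 0.1 rule of thumb.
flagged <- subset(res, metric == "smd" & abs(estimate) > 0.1)

# Pivot to a method-by-metric table for side-by-side comparison.
wide <- with(res, tapply(estimate, list(method, metric), mean))
wide
```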
Details
This function serves as a comprehensive balance assessment tool by computing multiple balance metrics simultaneously. It automatically handles different variable types and can optionally transform variables (dummy coding, polynomial terms, interactions) before computing balance statistics.
The function supports several balance metrics:
- SMD (Standardized Mean Difference): Measures effect size between groups, with absolute values of about 0.1 or smaller generally indicating good balance
- Variance Ratio: Compares group variances, with values near 1.0 indicating similar variability between groups
- Kolmogorov-Smirnov: Measures the largest gap between the groups' empirical distribution functions, with smaller values indicating better balance
- Correlation: For continuous exposures, measures linear association between covariate and exposure
- Energy Distance: Multivariate measure comparing entire covariate distributions between groups, with smaller values indicating better balance
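For intuition, the metrics above can be sketched in a few lines of base R. These are textbook-style formulas under stated assumptions, not the package's implementation: the helper names (`w_mean`, `w_var`, `w_ecdf`, `balance_sketch`, `w_cor`, `energy_sketch`) are hypothetical, the variance divides by the sum of weights, the SMD is taken as non-reference minus reference, and the energy distance is the plain unscaled V-statistic form. The actual bal_*() functions may differ in any of these choices.

```r
# Weighted helpers; w = 1 for every row reproduces the unweighted case.
w_mean <- function(x, w) sum(w * x) / sum(w)
w_var  <- function(x, w) sum(w * (x - w_mean(x, w))^2) / sum(w)
w_ecdf <- function(x, w, at) {
  vapply(at, function(t) sum(w[x <= t]) / sum(w), numeric(1))
}

# SMD, variance ratio, and KS statistic for one covariate and a binary group.
balance_sketch <- function(x, g, w = rep(1, length(x)), ref = 0) {
  r  <- g == ref
  m1 <- w_mean(x[!r], w[!r]); m0 <- w_mean(x[r], w[r])
  v1 <- w_var(x[!r], w[!r]);  v0 <- w_var(x[r], w[r])
  pts <- sort(unique(x))
  c(
    smd = (m1 - m0) / sqrt((v1 + v0) / 2),  # assumed sign: non-reference minus reference
    vr  = v1 / v0,
    ks  = max(abs(w_ecdf(x[!r], w[!r], pts) - w_ecdf(x[r], w[r], pts)))
  )
}

# Weighted Pearson correlation between a covariate and a continuous exposure.
w_cor <- function(x, z, w = rep(1, length(x))) {
  cov.wt(cbind(x, z), wt = w, cor = TRUE)$cor[1, 2]
}

# Unweighted energy distance between two groups (V-statistic form).
energy_sketch <- function(X, g, ref = 0) {
  D <- as.matrix(dist(X))
  r <- g == ref
  2 * mean(D[!r, r]) - mean(D[!r, !r]) - mean(D[r, r])
}

# Small worked input: two groups with equal spread, shifted by 2 units.
x <- c(1, 2, 3, 4, 3, 4, 5, 6)
g <- rep(c(0, 1), each = 4)
out    <- balance_sketch(x, g)
energy <- energy_sketch(x, g)
r_mpg  <- w_cor(mtcars$mpg, mtcars$disp)
```

With equal weights `w_cor()` reduces to `cor()`, which is why `r_mpg` agrees with the unweighted mpg/disp correlation shown in the Examples.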
When multiple weighting schemes are provided, the function computes balance for each method, enabling comparison of different approaches (e.g., ATE vs ATT weights).
The include_observed parameter controls whether unweighted ("observed") balance is included in the results.
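As a rough illustration of what comparing weighting schemes means, the sketch below simulates a single confounder, builds ATE and ATT weights from a logistic propensity model, and compares a hand-rolled weighted SMD across the observed, ATE, and ATT cases. Everything here is an assumption for illustration: the data are simulated, the weight formulas are the standard inverse probability forms, and the `smd_for()` helper (which standardizes by the unweighted group variances) is hypothetical and not part of the package.

```r
set.seed(1)

# Simulated data: x confounds a binary exposure z.
n  <- 2000
x  <- rnorm(n)
z  <- rbinom(n, 1, plogis(0.8 * x))

# Propensity scores from a logistic regression.
ps <- fitted(glm(z ~ x, family = binomial()))

# Standard inverse probability weights.
w_ate <- z / ps + (1 - z) / (1 - ps)   # targets the whole population
w_att <- z + (1 - z) * ps / (1 - ps)   # targets the exposed group

# Weighted SMD for x; the denominator uses unweighted group variances.
smd_for <- function(w) {
  m1 <- weighted.mean(x[z == 1], w[z == 1])
  m0 <- weighted.mean(x[z == 0], w[z == 0])
  (m1 - m0) / sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
}

schemes <- list(observed = rep(1, n), w_ate = w_ate, w_att = w_att)
smds <- sapply(schemes, smd_for)
smds
```

The `observed` entry plays the role of the rows that include_observed = TRUE adds, giving a baseline against which each weighting scheme can be judged.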
See also
bal_smd(), bal_vr(), bal_ks(), bal_corr(), bal_energy() for individual metric functions
Other balance functions: bal_corr(), bal_ks(), bal_smd(), bal_vr(), check_auc()
Examples
# Basic usage with all metrics
check_balance(nhefs_weights, c(age, wt71), qsmk, .wts = c(w_ate, w_att))
#> # A tibble: 21 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age 0 observed ks 0.130
#> 2 age 0 w_ate ks 0.0293
#> 3 age 0 w_att ks 0.0362
#> 4 age 0 observed smd 0.282
#> 5 age 0 w_ate smd 0.00585
#> 6 age 0 w_att smd 0.0122
#> 7 age 0 observed vr 1.07
#> 8 age 0 w_ate vr 1.01
#> 9 age 0 w_att vr 1.01
#> 10 wt71 0 observed ks 0.0700
#> # ℹ 11 more rows
# With specific metrics only
check_balance(nhefs_weights, c(age, wt71), qsmk, .metrics = c("smd", "energy"))
#> # A tibble: 3 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age 0 observed smd 0.282
#> 2 wt71 0 observed smd 0.133
#> 3 NA NA observed energy 0.0503
# Exclude observed results
check_balance(nhefs_weights, c(age, wt71), qsmk, .wts = w_ate, include_observed = FALSE)
#> # A tibble: 7 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age 0 w_ate ks 0.0293
#> 2 age 0 w_ate smd 0.00585
#> 3 age 0 w_ate vr 1.01
#> 4 wt71 0 w_ate ks 0.0358
#> 5 wt71 0 w_ate smd -0.00903
#> 6 wt71 0 w_ate vr 1.00
#> 7 NA NA w_ate energy 0.00217
# Use correlation for continuous exposure
check_balance(mtcars, c(mpg, hp), disp, .metrics = c("correlation", "energy"))
#> # A tibble: 3 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 hp disp observed correlation 0.791
#> 2 mpg disp observed correlation -0.848
#> 3 NA NA observed energy 0.882
# With dummy variables for categorical variables (default behavior)
check_balance(nhefs_weights, c(age, sex, race), qsmk)
#> # A tibble: 10 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age 0 observed ks 0.130
#> 2 age 0 observed smd 0.282
#> 3 age 0 observed vr 1.07
#> 4 race 0 observed ks 0.0568
#> 5 race 0 observed smd -0.177
#> 6 race 0 observed vr 0.652
#> 7 sex 0 observed ks 0.0799
#> 8 sex 0 observed smd -0.160
#> 9 sex 0 observed vr 0.996
#> 10 NA NA observed energy 0.0641
# Without dummy variables for categorical variables
check_balance(nhefs_weights, c(age, sex, race), qsmk, make_dummy_vars = FALSE)
#> # A tibble: 10 × 5
#> variable group_level method metric estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age 0 observed ks 0.130
#> 2 age 0 observed smd 0.282
#> 3 age 0 observed vr 1.07
#> 4 race 0 observed ks NA
#> 5 race 0 observed smd NA
#> 6 race 0 observed vr NA
#> 7 sex 0 observed ks NA
#> 8 sex 0 observed smd NA
#> 9 sex 0 observed vr NA
#> 10 NA NA observed energy 0.0641