Computes the two-sample Kolmogorov-Smirnov (KS) statistic, which compares the empirical cumulative distribution functions (CDFs) of two groups. For binary variables, returns the absolute difference in proportions. For continuous variables, computes the maximum absolute difference between the empirical CDFs.
Arguments
- covariate
A numeric vector containing the covariate values to compare.
- group
A vector (factor or numeric) indicating group membership. Must have exactly two unique levels.
- weights
An optional numeric vector of case weights. If provided, must have the same length as other input vectors. All weights must be non-negative.
- reference_group
The reference group level for comparisons. Can be either a group level value or a numeric index. If NULL (default), uses the first level.
- na.rm
A logical value indicating whether to remove missing values before computation. If FALSE (default), missing values in the input will produce NA in the output.
Value
A numeric value representing the KS statistic. Values range from 0 to 1, with 0 indicating identical distributions and 1 indicating completely separate distributions.
Details
The Kolmogorov-Smirnov statistic measures the maximum absolute difference between the empirical cumulative distribution functions of two groups: $$KS = \max_x |F_1(x) - F_0(x)|$$ where \(F_1(x)\) and \(F_0(x)\) are the empirical CDFs of the treatment and control groups, respectively.
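The formula can be illustrated with a minimal base R sketch. The helper `ks_sketch()` is hypothetical and is not part of this package; it ignores weights and missing values, and it assumes the first sorted group level plays the role of the control group. Because empirical CDFs are step functions, it is enough to evaluate the difference at the observed covariate values.

```r
# Hypothetical helper for illustration only (not this package's implementation):
# unweighted two-sample KS statistic computed from empirical CDFs.
ks_sketch <- function(covariate, group) {
  lv <- sort(unique(group))
  stopifnot(length(lv) == 2)
  x0 <- covariate[group == lv[1]]  # assumed control group
  x1 <- covariate[group == lv[2]]  # assumed treatment group
  # Empirical CDFs only jump at observed values, so the maximum
  # absolute gap is attained on this grid.
  grid <- sort(unique(covariate))
  max(abs(stats::ecdf(x1)(grid) - stats::ecdf(x0)(grid)))
}

set.seed(1)
group <- rep(c(0, 1), each = 50)
covariate <- c(rnorm(50), rnorm(50, mean = 0.5))

ks_sketch(covariate, group)
# For comparison: the D statistic reported by stats::ks.test()
unname(stats::ks.test(covariate[group == 1], covariate[group == 0])$statistic)
```

The two printed values should agree, since `stats::ks.test()` reports the same maximum CDF difference as its two-sided test statistic.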
For binary variables, this reduces to the absolute difference in proportions. For continuous variables, the statistic captures differences in the entire distribution shape, not just means or variances.
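The binary case can be worked out directly from the definition. As a short derivation, assume the covariate takes values in \(\{0, 1\}\) and write \(p_g\) for the proportion of ones in group \(g\) (notation introduced here for illustration). The empirical CDF of group \(g\) is then $$F_g(x) = \begin{cases} 0 & x < 0 \\ 1 - p_g & 0 \le x < 1 \\ 1 & x \ge 1 \end{cases}$$ so the two CDFs differ only on \(0 \le x < 1\), where $$|F_1(x) - F_0(x)| = |(1 - p_1) - (1 - p_0)| = |p_1 - p_0|$$ and therefore \(KS = |p_1 - p_0|\).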
The KS statistic ranges from 0 (identical distributions) to 1 (completely separate distributions). Smaller values indicate better distributional balance between groups.
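When case weights are supplied, one common way to extend the statistic is to replace each empirical CDF with a weighted version, \(F_g(x) = \sum_{i \in g} w_i 1(x_i \le x) / \sum_{i \in g} w_i\). The sketch below illustrates that idea in base R. Both helpers are hypothetical, and the weighting scheme is an assumption; the package's internal computation may differ.

```r
# Hypothetical helper: weighted empirical CDF of x, evaluated on a grid.
weighted_ecdf <- function(x, w, grid) {
  vapply(grid, function(g) sum(w[x <= g]) / sum(w), numeric(1))
}

# Hypothetical helper: KS statistic between two weighted empirical CDFs.
# Assumes non-negative weights with a positive sum in each group.
weighted_ks_sketch <- function(covariate, group, weights) {
  stopifnot(length(weights) == length(covariate), all(weights >= 0))
  lv <- sort(unique(group))
  stopifnot(length(lv) == 2)
  in0 <- group == lv[1]
  in1 <- group == lv[2]
  grid <- sort(unique(covariate))
  f0 <- weighted_ecdf(covariate[in0], weights[in0], grid)
  f1 <- weighted_ecdf(covariate[in1], weights[in1], grid)
  max(abs(f1 - f0))
}

covariate <- c(1, 2, 3, 4, 2, 3, 4, 5)
group <- rep(c(0, 1), each = 4)

# Equal weights reproduce the unweighted statistic.
weighted_ks_sketch(covariate, group, weights = rep(1, 8))
# Upweighting the smallest observation in group 0 widens the gap.
weighted_ks_sketch(covariate, group, weights = c(3, 1, 1, 1, 1, 1, 1, 1))
```

Rescaling all weights within a group by a constant leaves the result unchanged, because each weighted CDF is normalized by its own group's total weight.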
See also
check_balance() for computing multiple balance metrics at once
Other balance functions: bal_corr(), bal_smd(), bal_vr(), check_auc(), check_balance()