surveyCV
This package implements cross-validation (CV) for complex survey
data by accounting for strata, clusters, FPCs, and survey weights when
creating CV folds and when calculating test-set loss estimates
(currently either mean squared error (MSE) for linear models or binary
cross-entropy for logistic models). It is meant to work smoothly with the survey
package.
In addition, the function folds.svy() (which makes the design-based CV
folds) can be used to code up custom CV loops to evaluate other models
besides linear and logistic regression.
To understand why we believe it’s important to account for survey
design when carrying out CV, please see our paper: Wieczorek, Guerin,
and McMahon (2022), “K-Fold Cross-Validation for Complex Sample
Surveys,” Stat <doi:10.1002/sta4.454>.
Briefly, the problem is that standard CV assumes exchangeable data,
which is not true of most complex sampling designs. We argue (1) that CV
folds should mimic the sampling design (so that we are honest about how
much the model fits will vary across “samples like this one”), and (2)
that test-set MSEs should be averaged and weighted in ways that
generalize to the full population from which the sample was drawn,
assuming that is the population to which the models will be
applied.
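Concretely, with survey weights $w_i$, this means the reported test-set MSE is a survey-weighted average of squared prediction errors rather than an unweighted one (a sketch of the idea; the custom CV loop at the end of this vignette computes this same weighted average, aggregated across folds):

$$\widehat{\mathrm{MSE}} = \frac{\sum_{i \in \mathrm{test}} w_i \,(y_i - \hat{y}_i)^2}{\sum_{i \in \mathrm{test}} w_i}$$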
This vignette briefly illustrates how to use surveyCV with several
common types of sampling design. We provide three equivalent ways to
carry out CV: direct control with cv.svy(), use of an existing survey
design object with cv.svydesign(), or use of an existing survey GLM
object with cv.svyglm(). We also illustrate how to write your own
design-based CV loop, with a design-consistent random forest example
from the rpms package.
cv.svy, cv.svydesign, and cv.svyglm
cv.svy
First, we borrow some examples from the documentation for
survey::svyglm(), based on the api data (the Academic Performance Index
for California schools in the year 2000).
Say that we want to carry out complex survey CV for several linear
models, and we are working with the stratified sample in apistrat.
Using cv.svy(), we specify the dataset; the formulas for the models
being compared; the number of folds; and any necessary information
about the sampling design. For this particular stratified design, we
must specify the strata variable, the weights variable, and the FPC
variable.
# stratified sample
cv.svy(apistrat, c("api00~ell",
                   "api00~ell+meals",
                   "api00~ell+meals+mobility"),
       nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
#> mean SE
#> .Model_1 9101.0 872.52
#> .Model_2 5333.9 504.76
#> .Model_3 5390.0 511.97
The 2nd model (api00~ell+meals) appears to have the lowest average test
MSE (in the mean column), although its performance does not differ much
from the 3rd model's, and the difference between them is much smaller
than either of their standard errors (in the SE column).
If we were to use the single-stage cluster sample in apiclus1 instead,
the command would be similar, but we would need to specify clusters
instead of strata:
# one-stage cluster sample
cv.svy(apiclus1, c("api00~ell",
                   "api00~ell+meals",
                   "api00~ell+meals+mobility"),
       nfolds = 5, clusterID = "dnum", weightsID = "pw", fpcID = "fpc")
#> mean SE
#> .Model_1 8541.8 2127.89
#> .Model_2 3363.1 707.12
#> .Model_3 3473.8 726.64
On the other hand, if we are working with a simple random sample as in
apisrs, we could just use standard CV instead of complex survey CV to
generate the folds. However, complex survey CV may still be useful with
an SRS if we want to account for the finite population correction (FPC)
when estimating the SEs of our CV MSEs.
# simple random sample
cv.svy(apisrs, c("api00~ell",
                 "api00~ell+meals",
                 "api00~ell+meals+mobility"),
       nfolds = 5, fpcID = "fpc")
#> mean SE
#> .Model_1 9880.3 994.73
#> .Model_2 6720.3 1151.04
#> .Model_3 6869.2 1178.15
Note that our intent in this vignette is not to compare the stratified sample results vs. the cluster sample results vs. the SRS results. We show 3 approaches only to illustrate how the code depends on the sampling design. Each of these samples was taken in a different way, and so it should be analyzed in a different way. In most cases, the data analyst will have only one dataset, and they ought to choose the right form of CV based on that one dataset’s sampling design.
(But if we do have different datasets taken under different sampling designs from the same population, it’s plausible that the “best model” might differ across sampling designs. For instance, a stratified sample will likely have more precision than a cluster sample with the same number of ultimate observation units—so it’s possible that the stratified-sample’s CV would choose a larger model than the cluster-sample’s CV. That is, the stratified sample might have enough precision to fit a larger model without overfitting, while CV might show us that the cluster sample is overfitting when it tries to fit a larger model.)
Finally, we show an example from a different dataset, the 2015-2017 National Survey of Family Growth (NSFG), which uses clusters nested within strata as well as unequal sample weights. In this case we are using 4 folds, not 5 as above, because each stratum only has 4 clusters.
(Note: We use nest = TRUE only if cluster IDs are nested within strata,
i.e., if clusters in different strata might reuse the same names.)
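As a quick toy illustration of this convention (hypothetical data, not from the package; it assumes the survey package is loaded), the following svydesign() call uses nest = TRUE because cluster IDs 1 and 2 are reused in both strata:
# Hypothetical toy data: clusters in each stratum reuse the IDs 1 and 2
toy <- data.frame(strat = rep(c("A", "B"), each = 4),
                  clus  = rep(c(1, 1, 2, 2), times = 2),
                  w     = 1)
# nest = TRUE treats cluster 1 in stratum A as distinct from cluster 1 in stratum B
toy_des <- svydesign(id = ~clus, strata = ~strat, weights = ~w,
                     data = toy, nest = TRUE)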
# complex sample from NSFG
library(splines)
cv.svy(NSFG_data, c("income ~ ns(age, df = 1)",
                    "income ~ ns(age, df = 2)",
                    "income ~ ns(age, df = 3)",
                    "income ~ ns(age, df = 4)",
                    "income ~ ns(age, df = 5)",
                    "income ~ ns(age, df = 6)"),
       nfolds = 4,
       strataID = "strata", clusterID = "SECU",
       nest = TRUE, weightsID = "wgt")
#> mean SE
#> .Model_1 22674 730.26
#> .Model_2 22450 733.46
#> .Model_3 22416 734.37
#> .Model_4 22468 745.27
#> .Model_5 22497 749.42
#> .Model_6 22517 739.71
Here we tend to see that the spline with 3 degrees of freedom fits a little better than the others, but the differences between models are mostly smaller than the SEs.
cv.svydesign
When using the survey package, we will often create a svydesign object
in order to calculate means and totals, fit GLMs, etc. In that case,
instead of using cv.svy(), it is more convenient to use cv.svydesign(),
which will read the relevant information out of the svydesign object
and internally pass it along to cv.svy() for us.
We repeat the api stratified, cluster, and SRS examples above using
this cv.svydesign() approach.
# stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dstrat, nfolds = 5)
#> mean SE
#> .Model_1 9138.9 879.50
#> .Model_2 5332.5 498.04
#> .Model_3 5416.8 502.89
# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dclus1, nfolds = 5)
#> mean SE
#> .Model_1 8483.7 1924.91
#> .Model_2 3321.3 668.37
#> .Model_3 3654.3 673.98
# simple random sample
dsrs <- svydesign(id = ~1, data = apisrs, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dsrs, nfolds = 5)
#> mean SE
#> .Model_1 9866.8 993.03
#> .Model_2 6654.5 1125.67
#> .Model_3 6781.4 1143.83
Next, we repeat the NSFG example using this approach.
NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
cv.svydesign(formulae = c("income ~ ns(age, df = 1)",
                          "income ~ ns(age, df = 2)",
                          "income ~ ns(age, df = 3)",
                          "income ~ ns(age, df = 4)",
                          "income ~ ns(age, df = 5)",
                          "income ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4)
#> mean SE
#> .Model_1 22668 733.45
#> .Model_2 22464 740.10
#> .Model_3 22364 748.72
#> .Model_4 22374 731.03
#> .Model_5 22452 753.67
#> .Model_6 22523 735.79
cv.svyglm
Finally, if we have already fit a svyglm object, we can use
cv.svyglm(), which will read the formula as well as the survey design
information and internally pass it along to cv.svy() for us. However,
this approach only works for one GLM at a time, rather than comparing
several models in a single function call.
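(If we do want to cross-validate several already-fitted GLMs, one workaround, sketched here using the dstrat design from above, is to loop over a list of svyglm objects. Note that each cv.svyglm() call draws its own random folds, so for comparing models on a shared set of folds, cv.svydesign() with several formulas is the better tool.)
# Sketch: apply cv.svyglm() to each fitted model in a list
fits <- list(svyglm(api00 ~ ell, design = dstrat),
             svyglm(api00 ~ ell + meals, design = dstrat))
lapply(fits, cv.svyglm, nfolds = 5)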
We repeat the api stratified, cluster, and SRS examples above using
this cv.svyglm() approach, with the help of the svydesign objects
already created above.
# stratified sample
glmstrat <- svyglm(api00 ~ ell+meals+mobility, design = dstrat)
cv.svyglm(glmstrat, nfolds = 5)
#> mean SE
#> .Model_1 5632.6 548.47
# one-stage cluster sample
glmclus1 <- svyglm(api00 ~ ell+meals+mobility, design = dclus1)
cv.svyglm(glmclus1, nfolds = 5)
#> mean SE
#> .Model_1 3630.4 742.13
# simple random sample
glmsrs <- svyglm(api00 ~ ell+meals+mobility, design = dsrs)
cv.svyglm(glmsrs, nfolds = 5)
#> mean SE
#> .Model_1 7067.2 1254.3
Next, we return to the NSFG data. If we have a binary response variable
and wish to fit a logistic regression instead, we can use
family = quasibinomial() in the call to svyglm() and feed that directly
into cv.svyglm():
NSFG.svyglm.logistic <- svyglm(LBW ~ ns(age, df = 3), design = NSFG.svydes,
                               family = quasibinomial())
cv.svyglm(glm_object = NSFG.svyglm.logistic, nfolds = 4)
#> mean SE
#> .Model_1 0.34523 0.0177
In this case, the mean column shows the average of the binary
cross-entropy loss
$-\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
rather than MSE. As with MSE, lower values of this loss indicate
better-fitting models.
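To make the loss concrete, here is a minimal hand-rolled sketch of a survey-weighted binary cross-entropy for one test fold (the function name and arguments here are ours for illustration, not part of the package; y is the 0/1 response, yhat the predicted probabilities, and w the survey weights of the held-out rows):
# Sketch: survey-weighted binary cross-entropy, mirroring the formula above
binary_cross_entropy <- function(y, yhat, w) {
  -sum(w * (y * log(yhat) + (1 - y) * log(1 - yhat))) / sum(w)
}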
Or we can set method = "logistic" in cv.svydesign() or in cv.svy():
cv.svydesign(formulae = c("LBW ~ ns(age, df = 1)",
                          "LBW ~ ns(age, df = 2)",
                          "LBW ~ ns(age, df = 3)",
                          "LBW ~ ns(age, df = 4)",
                          "LBW ~ ns(age, df = 5)",
                          "LBW ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4,
             method = "logistic")
#> mean SE
#> .Model_1 0.33997 0.0173
#> .Model_2 0.33973 0.0171
#> .Model_3 0.34199 0.0173
#> .Model_4 0.34470 0.0178
#> .Model_5 0.34635 0.0179
#> .Model_6 0.34689 0.0180
cv.svy(NSFG_data, c("LBW ~ ns(age, df = 1)",
                    "LBW ~ ns(age, df = 2)",
                    "LBW ~ ns(age, df = 3)",
                    "LBW ~ ns(age, df = 4)",
                    "LBW ~ ns(age, df = 5)",
                    "LBW ~ ns(age, df = 6)"),
       nfolds = 4,
       strataID = "strata", clusterID = "SECU",
       nest = TRUE, weightsID = "wgt",
       method = "logistic")
#> Warning in ns(age, df = 6): shoving 'interior' knots matching boundary knots to
#> inside
#> mean SE
#> .Model_1 0.34134 0.0175
#> .Model_2 0.34081 0.0174
#> .Model_3 0.34073 0.0174
#> .Model_4 0.34242 0.0174
#> .Model_5 0.34441 0.0176
#> .Model_6 0.34451 0.0175
These two examples are carrying out the same CV over the same data and same models; they have slightly different results only because of randomness in forming the CV folds for each example.
In these examples, model 2 or model 3 has the lowest loss, though any differences between models are well within one SE.
folds.svy and folds.svydesign
The function folds.svy() generates design-based fold IDs for K-fold CV,
using any specified strata and clusters. (Briefly: for a stratified
sample, each fold will contain data from each stratum. For a cluster
sample, a given cluster's rows will all be assigned to the same fold.
See our Stat paper for details.)
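As a quick sanity check (optional, and assuming the api data from the survey package are loaded), we can confirm on apiclus1 that every cluster's rows land in a single fold:
# Each cluster (school district dnum) should appear in exactly one fold
fold_ids <- folds.svy(apiclus1, nfolds = 5, clusterID = "dnum")
all(colSums(table(fold_ids, apiclus1$dnum) > 0) == 1)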
Using these fold IDs, you can write your own CV loop for models that
our package does not currently handle. These might be other models from
the survey package (Poisson regression, Cox proportional hazards model,
etc.) or models from other design-based modeling packages such as rpms,
mase, or others in the CRAN Task View on Official Statistics & Survey
Statistics.
Here is an example of tuning the bin size parameter for a design-based
random forest, using the rpms_forest() function from the rpms package.
(We have also set the forest size to 50 trees instead of the default
500, purely to make the example run faster.)
Note that while folds.svy() accounts for the clustering in this survey
design, we need to do a bit of extra work to ensure that the model
training and testing steps also account for the survey design. In this
particular example, we must pass the cluster IDs and survey weights to
rpms_forest() for design-consistent model-fitting, and we must use the
survey weights in the MSE calculations.
# Based on example("rpms"):
# model the mean of retirement account value `IRAX` among households with
# reported retirement account values > 0,
# predicted from householder education, age, and urban/rural location
library(rpms)
data(CE)

# Generate fold IDs that account for clustering in the survey design
# for the IRAX>0 subset of the CE dataset
nfolds <- 5
CEsubset <- CE[which(CE$IRAX > 0), ]
CEsubset$.foldID <- folds.svy(CEsubset, nfolds = nfolds, clusterID = "CID")

# Use CV to tune the bin_size parameter of rpms_forest()
bin_sizes <- c(10, 20, 50, 100, 250, 500)

# Create placeholder for weighted Sums of Squared Errors
SSEs <- rep(0, length(bin_sizes))

for(ff in 1:nfolds) { # For every fold...
  # Use .foldID to split data into training and testing sets
  train <- subset(CEsubset, .foldID != ff)
  test  <- subset(CEsubset, .foldID == ff)

  for(bb in 1:length(bin_sizes)) { # For every value of the tuning parameter...
    # Fit a new model
    rf <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                      data = train,
                      weights = ~FINLWT21, clusters = ~CID,
                      bin_size = bin_sizes[bb], f_size = 50)
    # Get predictions and squared errors
    yhat <- predict(rf, newdata = test)
    res2 <- (yhat - test$IRAX)^2
    # Sum up weighted SSEs, not MSEs yet,
    # b/c cluster sizes may differ across folds and b/c of survey weights
    SSEs[bb] <- SSEs[bb] + sum(res2 * test$FINLWT21)
  }
}

# Divide entire weighted sum by the sum of weights
MSEs <- SSEs / sum(CEsubset$FINLWT21)

# Show results
cbind(bin_sizes, MSEs)
#> bin_sizes MSEs
#> [1,] 10 204246617270
#> [2,] 20 202870633392
#> [3,] 50 201393921358
#> [4,] 100 201085838446
#> [5,] 250 201825549231
#> [6,] 500 204155844501
Bin size 100 had the lowest survey-weighted CV MSE estimate, although sizes 50 and 250 were quite similar.
Using this chosen tuning parameter value, the next step could be to fit
a random forest with bin size 100 on the full CEsubset dataset.
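That final fit might look like the following sketch (reusing the rpms_forest() call from the CV loop above, now on all of CEsubset and with the forest size left at its default):
# Sketch: final design-consistent forest at the chosen bin size
rf_final <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                        data = CEsubset,
                        weights = ~FINLWT21, clusters = ~CID,
                        bin_size = 100)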
In this case, rpms_forest() does not work with the survey package, so
there was no need to create a svydesign object for this analysis. But
if we were working with a different model that expects us to create a
svydesign object from CEsubset, we could have used folds.svydesign()
instead, which would read out the cluster variable from the svydesign
object automatically:
CE.svydes <- svydesign(id = ~CID, weights = ~FINLWT21, data = CEsubset)
# Use update() to add variables to a svydesign object
CE.svydes <- update(CE.svydes,
                    .foldID = folds.svydesign(CE.svydes, nfolds = nfolds))
# Now the fold IDs are available as CE.svydes$variables$.foldID
table(CE.svydes$variables$.foldID)
#> 1 2 3 4 5
#> 813 923 928 885 960
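From here, a custom CV loop could form the training and test designs for each fold with subset() on the design object, as in this sketch for fold 1:
# Sketch: train/test split of the svydesign object for one fold
train_des <- subset(CE.svydes, .foldID != 1)
test_des  <- subset(CE.svydes, .foldID == 1)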