CIPerm
with ‘naive’
approachWe run simple timing comparisons to show how our package’s approach with Nguyen (2009) compares against a “naive” grid-based search approach to confidence intervals from permutation methods.
We can use CIPerm
’s cint(dset(x, y))
directly, and it already wins against the naive for-loop or array-based
approaches. But since dset
and cint
both have
extra cruft built into them, for an even “fairer” comparison we recreate
the core of cint(dset())
here without all the extra
variables and without passing copies of dataframes.
First we re-define the internal function
roundOrCeiling()
, then we define a streamlined
cint.nguyen()
:
#### Define a streamlined function for Nguyen approach ####
# 1. use round(siglevel*num), if siglevel*num is basically an integer,
# to within the default numerical tolerance of all.equal();
# 2. or otherwise use ceiling(siglevel*num) instead.
roundOrCeiling <- function(x) {
ifelse(isTRUE(all.equal(round(x), x)), # is x==round(x) to numerical tolerance?
round(x),
ceiling(x))
}
cint.nguyen <- function(x, y, nmc = 10000, conf.level = 0.95) {
# Two-tailed CIs only, for now
sig <- 1 - conf.level
# Code copied/modified from within CIPerm::dset()
n <- length(x)
m <- length(y)
N <- n + m
num <- choose(N, n) # number of possible combinations
# Form a matrix where each column contains indices in new "group1" for that comb or perm
if(nmc == 0 | num <= nmc) { # take all possible combinations
dcombn1 <- utils::combn(1:N, n)
} else { # use Monte Carlo sample of permutations, not all possible combinations
dcombn1 <- replicate(nmc, sample(N, n))
dcombn1[,1] <- 1:n # force the 1st "combination" to be original data order
num <- nmc
}
# Form the equivalent matrix for indices in new "group2"
dcombn2 <- apply(dcombn1, 2, function(x) setdiff(1:N, x))
# Form the corresponding matrices of data values, not data indices
combined <- c(x, y)
group1_perm <- matrix(combined[dcombn1], nrow = n)
group2_perm <- matrix(combined[dcombn2], nrow = m)
# For each comb or perm, compute difference in group means, k, and w_{k,d}
diffmean <- colMeans(group1_perm) - colMeans(group2_perm)
k <- colSums(matrix(dcombn1 %in% ((n+1):N),
nrow = n))
wkd <- (diffmean[1] - diffmean) / (k * (1/n + 1/m))
# Code copied/modified from within CIPerm::cint()
# Sort wkd values and find desired quantiles
w.i <- sort(wkd, decreasing = FALSE, na.last = FALSE)
siglevel <- (1 - conf.level)/2
index <- roundOrCeiling(siglevel*num) - 1
# When dset's nmc leads us to use Monte Carlo sims,
# we may get some permutations equivalent to orig data
# i.e. we may get SEVERAL k=0 and therefore several w.i=NaN.
nk0 <- sum(k == 0)
# Start counting from (1+nk0)'th element of w.i
# (not the 1st, which will always be 'NaN' since k[1] is 0)
LB <- w.i[1 + nk0 + index]
UB <- w.i[(num - index)]
CI <- c(LB, UB)
conf.achieved <- 1 - (2*(index+1) / num)
message(paste0("Achieved conf. level: 1-2*(", index+1, "/", num, ")"))
return(list(conf.int = CI,
conf.level.achieved = conf.achieved))
}
Next, we define an equivalent function to implement the “naive” approach:
#### Define a function for "naive" approach with for-loop ####
cint.naive.forloop <- function(x, y, deltas,
nmc = 10000,
conf.level = 0.95) {
# Two-tailed CIs only, for now
sig <- 1 - conf.level
pvalmeans <- rep(1, length(deltas))
# Code copied/modified from within CIPerm::dset()
n <- length(x)
m <- length(y)
N <- n + m
num <- choose(N, n) # number of possible combinations
# Form a matrix where each column contains indices in new "group1" for that comb or perm
if(nmc == 0 | num <= nmc) { # take all possible combinations
dcombn1 <- utils::combn(1:N, n)
} else { # use Monte Carlo sample of permutations, not all possible combinations
dcombn1 <- replicate(nmc, sample(N, n))
dcombn1[,1] <- 1:n # force the 1st "combination" to be original data order
num <- nmc
}
# Form the equivalent matrix for indices in new "group2"
dcombn2 <- apply(dcombn1, 2, function(x) setdiff(1:N, x))
for(dd in 1:length(deltas)) {
xtmp <- x - deltas[dd]
# Code copied/modified from within CIPerm::dset()
combined <- c(xtmp, y)
# Form the corresponding matrices of data values, not data indices
group1_perm <- matrix(combined[dcombn1], nrow = n)
group2_perm <- matrix(combined[dcombn2], nrow = m)
# For each comb or perm, compute difference in group means
diffmean <- colMeans(group1_perm) - colMeans(group2_perm)
# Code copied/modified from within CIPerm::pval()
pvalmeans[dd] <- sum(abs(diffmean - mean(diffmean)) >= abs(diffmean[1] - mean(diffmean)))/length(diffmean)
}
print( range(deltas[which(pvalmeans >= sig)]) )
return(list(cint = range(deltas[which(pvalmeans >= sig)]),
pvalmeans = pvalmeans,
deltas = deltas))
}
We also wrote another “naive” function that avoids for-loops and
takes a “vectorized” approach instead. We created large arrays: each
permutation requires a matrix, and the 3rd array dimension is over
permutations. Then apply
and colSums
work very
quickly to carry out all the per-permutations steps on this array at
once. However, we found that unless datasets were trivially small,
creating and storing the array took a lot of time and memory, and it
ended up far slower than the for-loop approach. Consequently we do not
report this function or its results here, although curious readers can
find it hidden in the vignette’s .Rmd file.
When we compare the timings, cint.naive()
will need to
have a reasonable search grid for values of delta
. On this
problem, we happen to know the correct CI endpoints are the integers
(-21, 3)
, so we “cheat” by using (-22):4
as
the grid for cint.naive()
. Of course in practice, if you
had only the naive approach, you would probably have to try a wider
grid, since you wouldn’t already know the answer.
#### Speed tests on Nguyen's tiny dataset ####
# Use 1st tiny dataset from Nguyen's paper
library(CIPerm)
x <- c(19, 22, 25, 26)
y <- c(23, 33, 40)
# Actual CIPerm package's approach:
system.time({
print( cint(dset(x, y), conf.level = 0.95, tail = "Two") )
})
#> Achieved conf. level: 1-2*(1/35)
#> $conf.int
#> [1] -21 3
#>
#> $conf.level.achieved
#> [1] 0.9428571
#> user system elapsed
#> 0.003 0.000 0.003
# Streamlined version of CIPerm approach:
system.time({
print( cint.nguyen(x, y, conf.level = 0.95) )
})
#> Achieved conf. level: 1-2*(1/35)
#> $conf.int
#> [1] -21 3
#>
#> $conf.level.achieved
#> [1] 0.9428571
#> user system elapsed
#> 0.024 0.000 0.025
# Naive approach with for-loops:
deltas <- ((-22):4)
system.time({
pvalmeans <- cint.naive.forloop(x, y, deltas)$pvalmeans
})
#> [1] -21 3
#> user system elapsed
#> 0.018 0.000 0.019
# Sanity check to debug `cint.naive`:
# are the p-vals always higher when closer to middle of CI?
# Are they always above 0.05 inside and below it outside CI?
cbind(deltas, pvalmeans)
#> deltas pvalmeans
#> [1,] -22 0.02857143
#> [2,] -21 0.05714286
#> [3,] -20 0.05714286
#> [4,] -19 0.05714286
#> [5,] -18 0.14285714
#> [6,] -17 0.14285714
#> [7,] -16 0.20000000
#> [8,] -15 0.25714286
#> [9,] -14 0.37142857
#> [10,] -13 0.54285714
#> [11,] -12 0.60000000
#> [12,] -11 0.80000000
#> [13,] -10 0.85714286
#> [14,] -9 1.00000000
#> [15,] -8 0.88571429
#> [16,] -7 0.74285714
#> [17,] -6 0.57142857
#> [18,] -5 0.45714286
#> [19,] -4 0.34285714
#> [20,] -3 0.25714286
#> [21,] -2 0.22857143
#> [22,] -1 0.14285714
#> [23,] 0 0.11428571
#> [24,] 1 0.08571429
#> [25,] 2 0.08571429
#> [26,] 3 0.05714286
#> [27,] 4 0.02857143
plot(deltas, pvalmeans)
abline(h = 0.05)
abline(v = c(-21, 3), lty = 2)
Above we see timings for the original
CIPerm::cint(dset))
, its streamlined
versioncint.nguyen()
, and the naive equivalent
cint.naive()
.
In the case of a tiny dataset like this, all 3 approaches take almost no
time. So it can be a tossup as to which one is fastest on this
example.
Try a slightly larger dataset, where the total number of permutations
is above the default nmc=10000
but still manageable on a
laptop.
Again, cint.naive()
will need to have a reasonable
search grid for values of delta
. This time again we will
set up a grid that just barely covers the correct endpoints from
cint.nguyen()
, and again we’ll try to keep it at around 20
to 25 grid points in all. Then we’ll try timing it again with a slightly
finer grid.
choose(18, 9) ## 48620
#> [1] 48620
set.seed(20220528)
x <- rnorm(9, mean = 0, sd = 1)
y <- rnorm(9, mean = 1, sd = 1)
# Actual CIPerm approach
# (with nmc = 0,
# so that we use ALL of the choose(N,n) combinations
# instead of a MC sample of permutations)
system.time({
print( cint(dset(x, y, nmc = 0),
conf.level = 0.95, tail = "Two") )
})
#> Achieved conf. level: 1-2*(1216/48620)
#> $conf.int
#> [1] -2.06178174 0.07679298
#>
#> $conf.level.achieved
#> [1] 0.9499794
#> user system elapsed
#> 0.446 0.011 0.458
# Streamlined version of CIPerm approach:
system.time({
print( cint.nguyen(x, y, nmc = 0, conf.level = 0.95) )
})
#> Achieved conf. level: 1-2*(1216/48620)
#> $conf.int
#> [1] -2.06178174 0.07679298
#>
#> $conf.level.achieved
#> [1] 0.9499794
#> user system elapsed
#> 0.426 0.013 0.439
# Coarser grid
deltas <- ((-21):(1))/10 # grid steps of 0.1
# Naive with for-loops:
system.time({
pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 0)$pvalmeans
})
#> [1] -2 0
#> user system elapsed
#> 0.570 0.016 0.586
# Finer grid
deltas <- ((-21*2):(1*2))/20 # grid steps of 0.05
# Naive with for-loops:
system.time({
pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 0)$pvalmeans
})
#> [1] -2.05 0.05
#> user system elapsed
#> 0.656 0.024 0.679
Now, CIPerm::cint(dset())
and the
cint.nguyen
approach take about the same amount of time,
and they are noticeably faster than the cint.naive
approach. (And even more so when we use a finer search grid.) That
happens even with the “cheat” of using CIPerm
first to find
the right CI limits so that our naive search grid isn’t too wide, just
stepping over the minimal required range with a reasonable
grid-coarseness.
(For the original data where CI = (-21, 3), we stepped from -22 to 4 in integers. For latest data where CI = (-2.06, 0.08), we stepped from -2.1 to 0.1 in units of 0.1 and then 0.05.)
In practice there are probably cleverer ways for the naive method to choose grid discreteness and endpoints… but still, this seems like a more-than-fair chance for the naive approach.
Try an even larger example, where we’re OK with a smallish total
number of permutations (eg nmc=10000
), but the dataset
itself is “huge”, so that each individual permutation takes a longer
time.
We still don’t make it all that large, since we want this vignette to knit in a reasonable time. But we’ve seen similar results when we made even-larger datasets that took longer to run.
Once again, for cint.naive()
we will set up a search
grid that just barely covers the correct endpoints from
cint.nguyen()
, and again we’ll try to keep it at around 20
to 25 grid points in all.
#### Speed tests on much larger dataset ####
set.seed(20220528)
x <- rnorm(5e3, mean = 0, sd = 1)
y <- rnorm(5e3, mean = 1, sd = 1)
# Actual CIPerm approach
# (with nmc = 2000 << choose(N,n),
# so it only takes a MC sample of permutations
# instead of running all possible combinations)
system.time({
print( cint(dset(x, y, nmc = 2e3),
conf.level = 0.95, tail = "Two") )
})
#> Achieved conf. level: 1-2*(50/2000)
#> $conf.int
#> [1] -1.0227345 -0.9437299
#>
#> $conf.level.achieved
#> [1] 0.95
#> user system elapsed
#> 2.154 0.096 2.249
# Streamlined version of CIPerm approach:
system.time({
print( cint.nguyen(x, y, nmc = 2e3, conf.level = 0.95) )
})
#> Achieved conf. level: 1-2*(50/2000)
#> $conf.int
#> [1] -1.0232198 -0.9435962
#>
#> $conf.level.achieved
#> [1] 0.95
#> user system elapsed
#> 1.664 0.108 1.771
# Grid of around 20ish steps
deltas <- ((-11*10):(-9*10))/100 # grid steps of 0.01
# Naive with for-loops:
system.time({
pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 2e3)$pvalmeans
})
#> [1] -1.02 -0.95
#> user system elapsed
#> 3.443 0.780 4.225
Here we finally see that the overhead built into
CIPerm::cint(dset())
makes it slightly slower than the
streamlined cint.nguyen()
. However, both are
substantially faster than cint.naive()
—they run in
less than half the time of the naive approach.
bench::mark()
and profvis
The results above are from one run per method on each example dataset.
Using the last (“largest”) dataset, we also tried using
bench::mark()
to summarize the distribution of runtimes
over many runs, as well as memory usage.
(Not actually run when the vignette knits, in order to avoid needing
bench
as a dependency.)
# bench::mark(cint.nguyen(x, y, nmc = 2e3),
# cint.naive.forloop(x, y, deltas, nmc = 2e3),
# check = FALSE, min_iterations = 10)
#> # A tibble: 2 × 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#> 1 cint.nguyen(x, y, nmc = 2000) 2.5s 2.71s 0.364 1.58GB 2.92 10 80 27.44s <NULL> <Rprofmem [32,057 × 3]> <bench_tm [10]> <tibble [10 × 3]>
#> 2 cint.naive.forloop(x, y, deltas, nmc = 2000) 7.87s 8.07s 0.120 7.4GB 4.64 10 388 1.39m <NULL> <Rprofmem [32,256 × 3]> <bench_tm [10]> <tibble [10 × 3]>
Finally, we also tried using the profvis
package to
check what steps are taking the longest time.
(Again, not actually run when the vignette knits, in order to avoid
needing profvis
as a dependency; but screenshots are
included below.)
# profvis::profvis(cint.nguyen(x, y, nmc = 2e3))
# profvis::profvis(cint.naive.forloop(x, y, deltas, nmc = 2e3))
For Nguyen’s method, the initial setup with combn()
or
sample()
followed by apply(setdiff())
takes
about 80% of the time, and the per-permutation calculations only take
about 20% of the time:
profvis::profvis(cint.nguyen(x, y, nmc = 2e3))
For the naive method, the initial setup is identical, and each round
of per-permutation calculations is slightly faster than in Nguyen’s
method… but because you have to repeat them many times (unless you
already know the CI endpoints, in which case you wouldn’t need to run
this at all!), they can add up to substantially more total time
than for Nguyen’s method.
profvis::profvis(cint.naive.forloop(x, y, deltas, nmc = 2e3))