---
title: "Comparing `CIPerm` with 'naive' approach"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{naive}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

We run simple timing comparisons to show how our package's implementation of Nguyen's (2009) approach compares against a "naive" grid-based search approach to confidence intervals from permutation methods.

# Defining streamlined functions

We can use `CIPerm`'s `cint(dset(x, y))` directly, and it already wins against the naive for-loop or array-based approaches. But since `dset` and `cint` both have extra cruft built into them, for an even "fairer" comparison we recreate the core of `cint(dset())` here without all the extra variables and without passing copies of data frames.

First we re-define the internal function `roundOrCeiling()`, then we define a streamlined `cint.nguyen()`:

```{r function-nguyen}
#### Define a streamlined function for Nguyen approach ####

# 1. use round(siglevel*num), if siglevel*num is basically an integer,
#    to within the default numerical tolerance of all.equal();
# 2. or otherwise use ceiling(siglevel*num) instead.
roundOrCeiling <- function(x) {
  ifelse(isTRUE(all.equal(round(x), x)), # is x==round(x) to numerical tolerance?
         round(x),
         ceiling(x))
}

cint.nguyen <- function(x, y, nmc = 10000, conf.level = 0.95) {
  # Two-tailed CIs only, for now
  sig <- 1 - conf.level

  # Code copied/modified from within CIPerm::dset()
  n <- length(x)
  m <- length(y)
  N <- n + m
  num <- choose(N, n) # number of possible combinations

  # Form a matrix where each column contains indices in new "group1" for that comb or perm
  if(nmc == 0 | num <= nmc) { # take all possible combinations
    dcombn1 <- utils::combn(1:N, n)
  } else { # use Monte Carlo sample of permutations, not all possible combinations
    dcombn1 <- replicate(nmc, sample(N, n))
    dcombn1[, 1] <- 1:n # force the 1st "combination" to be original data order
    num <- nmc
  }

  # Form the equivalent matrix for indices in new "group2"
  dcombn2 <- apply(dcombn1, 2, function(x) setdiff(1:N, x))

  # Form the corresponding matrices of data values, not data indices
  combined <- c(x, y)
  group1_perm <- matrix(combined[dcombn1], nrow = n)
  group2_perm <- matrix(combined[dcombn2], nrow = m)

  # For each comb or perm, compute difference in group means, k, and w_{k,d}
  diffmean <- colMeans(group1_perm) - colMeans(group2_perm)
  k <- colSums(matrix(dcombn1 %in% ((n+1):N), nrow = n))
  wkd <- (diffmean[1] - diffmean) / (k * (1/n + 1/m))

  # Code copied/modified from within CIPerm::cint()
  # Sort wkd values and find desired quantiles
  w.i <- sort(wkd, decreasing = FALSE, na.last = FALSE)
  siglevel <- (1 - conf.level)/2
  index <- roundOrCeiling(siglevel*num) - 1

  # When dset's nmc leads us to use Monte Carlo sims,
  # we may get some permutations equivalent to orig data,
  # i.e. we may get SEVERAL k=0 and therefore several w.i=NaN.
  nk0 <- sum(k == 0)

  # Start counting from (1+nk0)'th element of w.i
  # (not the 1st, which will always be 'NaN' since k[1] is 0)
  LB <- w.i[1 + nk0 + index]
  UB <- w.i[(num - index)]
  CI <- c(LB, UB)

  conf.achieved <- 1 - (2*(index+1) / num)
  message(paste0("Achieved conf. level: 1-2*(", index+1, "/", num, ")"))

  return(list(conf.int = CI,
              conf.level.achieved = conf.achieved))
}
```
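As a quick summary of what the chunk above computes (this simply restates the code in notation; see Nguyen (2009) for the full derivation): for permutation $d$, let $k_d$ be the number of observations that the permutation moves from the original group 2 into the new "group 1", and let $\bar{D}_d$ be the resulting difference in group means, with $\bar{D}_1$ the observed difference for the original data order. The statistic that gets sorted is

$$
w_{k,d} \;=\; \frac{\bar{D}_1 - \bar{D}_d}{k_d \left(\frac{1}{n} + \frac{1}{m}\right)},
$$

and the confidence limits are lower and upper order statistics of the sorted $w_{k,d}$ values (skipping the $k_d = 0$ cases, which give `NaN`), with achieved confidence level $1 - 2(\mathrm{index}+1)/\mathrm{num}$ as reported by the `message()` call.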
Next, we define an equivalent function to implement the "naive" approach:

* Choose a grid of possible values of $\delta = \mu_X - \mu_Y$ to try
* For each $\delta$...
    - subtract $\delta$ off of the x's,
    - and carry out a test of $H_0: \mu_X = \mu_Y$ on the resulting data
* Our confidence interval consists of the range of $\delta$ values for which $H_0$ is NOT rejected

```{r function-forloop}
#### Define a function for "naive" approach with for-loop ####

cint.naive.forloop <- function(x, y, deltas, nmc = 10000, conf.level = 0.95) {
  # Two-tailed CIs only, for now
  sig <- 1 - conf.level

  pvalmeans <- rep(1, length(deltas))

  # Code copied/modified from within CIPerm::dset()
  n <- length(x)
  m <- length(y)
  N <- n + m
  num <- choose(N, n) # number of possible combinations

  # Form a matrix where each column contains indices in new "group1" for that comb or perm
  if(nmc == 0 | num <= nmc) { # take all possible combinations
    dcombn1 <- utils::combn(1:N, n)
  } else { # use Monte Carlo sample of permutations, not all possible combinations
    dcombn1 <- replicate(nmc, sample(N, n))
    dcombn1[, 1] <- 1:n # force the 1st "combination" to be original data order
    num <- nmc
  }

  # Form the equivalent matrix for indices in new "group2"
  dcombn2 <- apply(dcombn1, 2, function(x) setdiff(1:N, x))

  for(dd in 1:length(deltas)) {
    xtmp <- x - deltas[dd]

    # Code copied/modified from within CIPerm::dset()
    combined <- c(xtmp, y)

    # Form the corresponding matrices of data values, not data indices
    group1_perm <- matrix(combined[dcombn1], nrow = n)
    group2_perm <- matrix(combined[dcombn2], nrow = m)

    # For each comb or perm, compute difference in group means
    diffmean <- colMeans(group1_perm) - colMeans(group2_perm)

    # Code copied/modified from within CIPerm::pval()
    pvalmeans[dd] <- sum(abs(diffmean - mean(diffmean)) >= abs(diffmean[1] - mean(diffmean))) / length(diffmean)
  }

  print( range(deltas[which(pvalmeans >= sig)]) )
  return(list(cint = range(deltas[which(pvalmeans >= sig)]),
              pvalmeans = pvalmeans,
              deltas = deltas))
}
```
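Restating the loop above in one line of notation (nothing new here): if $p(\delta)$ denotes the two-sided permutation p-value after shifting the `x` values by $\delta$ (the quantity stored in `pvalmeans[dd]`), then the reported interval is

$$
\widehat{\mathrm{CI}} \;=\; \operatorname{range}\{\delta \in \texttt{deltas} : p(\delta) \ge \alpha\}, \qquad \alpha = 1 - \texttt{conf.level},
$$

which is exactly the `range(deltas[which(pvalmeans >= sig)])` line in the code.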
We also wrote another "naive" function that avoids for-loops and takes a "vectorized" approach instead. It creates large arrays: each value of `delta` gets its own matrix of permuted datasets, and the 3rd array dimension indexes over the `delta` values. Then `apply` and `colSums` work very quickly to carry out all the per-permutation steps on this array at once. However, we found that unless datasets were trivially small, creating and storing the array took a lot of time and memory, and it ended up far slower than the for-loop approach. Consequently we do not report this function or its results here, although curious readers can find it hidden in the vignette's .Rmd file.

```{r function-array, echo=FALSE, eval=FALSE}
## THIS FUNCTION ALSO WORKS,
## BUT REQUIRES SUBSTANTIALLY MORE MEMORY THAN THE PREVIOUS ONE
## AND IS SLOWER FOR LARGE DATASETS,
## SO WE LEFT IT OUT OF THE COMPARISONS

# Possible speedup:
# Could we replace for-loop with something like manipulating a 3D array?
# e.g., where we say
###   group1_perm <- matrix(combined[dcombn1], nrow = n)
# could we first make `combined` into a 2D matrix
#   where each col is like it is now,
#   but each row is for a different value of delta;
# and then group1_perm would be a 3D array,
#   where each 1st+2nd dims matrix is like it is now,
#   but 3rd dim indexes over deltas;
# and then `diffmean` and `pvalmeans`
#   would be colMeans or similar over an array
#   so the output would be right

cint.naive.array <- function(x, y, deltas, nmc = 10000, conf.level = 0.95) {
  # Two-tailed CIs only, for now
  sig <- 1 - conf.level

  # New version where xtmp is a matrix, not a vector;
  # and some stuff below will be arrays, not matrices...
  xtmp <- matrix(x, nrow = length(x), ncol = length(deltas))
  xtmp <- xtmp - deltas[col(xtmp)]
  combined <- rbind(xtmp, matrix(y, nrow = length(y), ncol = length(deltas)))

  # Code copied/modified from within CIPerm::dset()
  n <- length(x)
  m <- length(y)
  N <- n + m
  num <- choose(N, n) # number of possible combinations

  # Form a matrix where each column contains indices in new "group1" for that comb or perm
  if(nmc == 0 | num <= nmc) { # take all possible combinations
    dcombn1 <- utils::combn(1:N, n)
  } else { # use Monte Carlo sample of permutations, not all possible combinations
    dcombn1 <- replicate(nmc, sample(N, n))
    dcombn1[, 1] <- 1:n # force the 1st "combination" to be original data order
    num <- nmc
  }

  # Form the equivalent matrix for indices in new "group2"
  dcombn2 <- apply(dcombn1, 2, function(x) setdiff(1:N, x))

  # ARRAYS of data indices, where 3rd dim indexes over deltas
  dcombn1_arr <- outer(dcombn1, N * (0:(length(deltas)-1)), FUN = "+")
  dcombn2_arr <- outer(dcombn2, N * (0:(length(deltas)-1)), FUN = "+")

  # Form the corresponding ARRAYS of data values, not data indices
  group1_perm <- array(combined[dcombn1_arr], dim = dim(dcombn1_arr))
  group2_perm <- array(combined[dcombn2_arr], dim = dim(dcombn2_arr))

  # For each comb or perm, compute difference in group means
  # diffmean <- matrix(colMeans(group1_perm, dim=1), nrow = num) -
  #   matrix(colMeans(group2_perm, dim=1), nrow = num)
  diffmean <- colMeans(group1_perm, dim = 1) - colMeans(group2_perm, dim = 1)

  # Code copied/modified from within CIPerm::pval()
  pvalmeans <- colSums(abs(diffmean - colMeans(diffmean)[col(diffmean)]) >=
                         abs(diffmean[1, ] - colMeans(diffmean))[col(diffmean)]) / nrow(diffmean)

  # plot(deltas, pvalmeans)
  print( range(deltas[which(pvalmeans >= sig)]) )
  return(list(cint = range(deltas[which(pvalmeans >= sig)]),
              pvalmeans = pvalmeans,
              deltas = deltas))
}
```

# Speed tests

## Tiny dataset

When we compare the timings, `cint.naive.forloop()` will need a reasonable search grid for values of `delta`. On this problem, we happen to know the correct CI endpoints are the integers `(-21, 3)`, so we "cheat" by using `(-22):4` as the grid. Of course in practice, if you had only the naive approach, you would probably have to try a wider grid, since you wouldn't already know the answer.

```{r tests-tiny}
#### Speed tests on Nguyen's tiny dataset ####

# Use 1st tiny dataset from Nguyen's paper
library(CIPerm)
x <- c(19, 22, 25, 26)
y <- c(23, 33, 40)

# Actual CIPerm package's approach:
system.time({
  print( cint(dset(x, y), conf.level = 0.95, tail = "Two") )
})

# Streamlined version of CIPerm approach:
system.time({
  print( cint.nguyen(x, y, conf.level = 0.95) )
})

# Naive approach with for-loops:
deltas <- ((-22):4)
system.time({
  pvalmeans <- cint.naive.forloop(x, y, deltas)$pvalmeans
})

# Sanity check to debug `cint.naive.forloop`:
# are the p-vals always higher when closer to middle of CI?
# Are they always above 0.05 inside the CI and below it outside?
cbind(deltas, pvalmeans)
plot(deltas, pvalmeans)
abline(h = 0.05)
abline(v = c(-21, 3), lty = 2)
# Yes, it's as it should be :)
```

Above we see timings for the original `CIPerm::cint(dset())`, its streamlined version `cint.nguyen()`, and the naive equivalent `cint.naive.forloop()`. On a tiny dataset like this, all three approaches take almost no time, so it can be a tossup which one is fastest on this example.
```{r test-tiny-array, echo=FALSE, eval=FALSE}
# Naive approach with arrays:
system.time({
  pvalmeans <- cint.naive.array(x, y, deltas)$pvalmeans
})
# Sanity check again:
cbind(deltas, pvalmeans)
plot(deltas, pvalmeans)
abline(h = 0.05)
abline(v = c(-21, 3), lty = 2)
# Yep, still same results. OK.

# So this "array" version IS faster,
# and nearly as fast as Nguyen
# (though harder to debug...
#  but the same could be said of Nguyen's approach...)

# WAIT WAIT WAIT
# I changed the for-loop approach to stop creating dcombn1 & dcombn2
#   inside each step of the loop,
#   since we can reuse the same ones for every value of deltas...
# AND NOW it's FASTER than the array version,
#   even for tiny datasets!
# And later we'll see that the array version is unbearably slow
#   for large datasets, where it takes forever
#   to create & store these massive arrays...
# So it's not worth reporting in the final vignette :(
```

## Larger dataset

Try a slightly larger dataset, where the total number of permutations is above the default `nmc = 10000` but still manageable on a laptop.

Again, `cint.naive.forloop()` will need a reasonable search grid for values of `delta`. This time again we will set up a grid that just barely covers the correct endpoints from `cint.nguyen()`, and again we'll try to keep it at around 20 to 25 grid points in all. Then we'll try timing it again with a slightly finer grid.

```{r tests-larger}
choose(18, 9) ## 48620

set.seed(20220528)
x <- rnorm(9, mean = 0, sd = 1)
y <- rnorm(9, mean = 1, sd = 1)

# Actual CIPerm approach
# (with nmc = 0,
#  so that we use ALL of the choose(N,n) combinations
#  instead of a MC sample of permutations)
system.time({
  print( cint(dset(x, y, nmc = 0), conf.level = 0.95, tail = "Two") )
})

# Streamlined version of CIPerm approach:
system.time({
  print( cint.nguyen(x, y, nmc = 0, conf.level = 0.95) )
})

# Coarser grid
deltas <- ((-21):(1))/10 # grid steps of 0.1
# Naive with for-loops:
system.time({
  pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 0)$pvalmeans
})

# Finer grid
deltas <- ((-21*2):(1*2))/20 # grid steps of 0.05
# Naive with for-loops:
system.time({
  pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 0)$pvalmeans
})
```

Now, `CIPerm::cint(dset())` and the `cint.nguyen()` approach take about the same amount of time, and they are noticeably faster than the `cint.naive.forloop()` approach (and even more so when we use a finer search grid).

That happens even with the "cheat" of using `CIPerm` first to find the right CI limits, so that our naive search grid isn't too wide and just steps over the minimal required range with a reasonable grid coarseness. (For the original data, where the CI was (-21, 3), we stepped from -22 to 4 in integers. For the latest data, where the CI is (-2.06, 0.08), we stepped from -2.1 to 0.1 in steps of 0.1 and then 0.05.) In practice there are probably cleverer ways for the naive method to choose grid discreteness and endpoints... but still, this seems like a more-than-fair chance for the naive approach.
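As one illustration of such a "cleverer" refinement (not used in any of the timings in this vignette, and only a rough sketch under the stated assumptions), the fixed grid could be replaced with a bisection search for each endpoint, assuming the p-value crosses $\alpha$ just once near that endpoint. Here `pval.at()` is a hypothetical helper that would compute the same per-`delta` permutation p-value as the body of the loop in `cint.naive.forloop()`:

```{r function-bisect, eval=FALSE}
# Rough sketch only: bisection search for one CI endpoint.
# Assumes pval.at(delta) is a (hypothetical) function returning the two-sided
# permutation p-value after shifting x by delta, as inside cint.naive.forloop()'s
# loop; `inside` is a delta known to lie inside the CI (p >= alpha),
# while `outside` lies outside it (p < alpha).
find.endpoint <- function(pval.at, inside, outside, alpha = 0.05, tol = 1e-3) {
  while (abs(inside - outside) > tol) {
    mid <- (inside + outside) / 2
    if (pval.at(mid) >= alpha) inside <- mid else outside <- mid
  }
  inside # last delta found to lie inside the CI, to within tol
}
```

Even so, every call to `pval.at()` would still require a full pass over all the permutations, so a refinement like this changes the constant factor but not the basic comparison above.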
```{r tests-larger-array, echo=FALSE, eval=FALSE}
# Coarser grid
deltas <- ((-21):(1))/10 # grid steps of 0.1
# Naive with arrays:
system.time({
  pvalmeans <- cint.naive.array(x, y, deltas, nmc = 1e6)$pvalmeans
})
# !!! Indeed it looks like naive.forloop
#     is FASTER than naive.array
#     now that I've removed the slow dcombn1 & dcombn2 creation from the loop.

# Finer grid
deltas <- ((-21*2):(1*2))/20 # grid steps of 0.05
# Naive with arrays:
system.time({
  pvalmeans <- cint.naive.array(x, y, deltas, nmc = 1e6)$pvalmeans
})
```

## Largest dataset

Try an even larger example, where we're OK with a smallish total number of permutations (e.g. `nmc = 10000`), but the dataset itself is "huge", so that each individual permutation takes longer. We still don't make it all that large, since we want this vignette to knit in a reasonable time, but we've seen similar results when we made even-larger datasets that took longer to run.

Once again, for `cint.naive.forloop()` we will set up a search grid that just barely covers the correct endpoints from `cint.nguyen()`, and again we'll try to keep it at around 20 to 25 grid points in all.

```{r tests-largest}
#### Speed tests on much larger dataset ####

set.seed(20220528)
x <- rnorm(5e3, mean = 0, sd = 1)
y <- rnorm(5e3, mean = 1, sd = 1)

# Actual CIPerm approach
# (with nmc = 2000 << choose(N,n),
#  so it only takes a MC sample of permutations
#  instead of running all possible combinations)
system.time({
  print( cint(dset(x, y, nmc = 2e3), conf.level = 0.95, tail = "Two") )
})

# Streamlined version of CIPerm approach:
system.time({
  print( cint.nguyen(x, y, nmc = 2e3, conf.level = 0.95) )
})

# Grid of around 20ish steps
deltas <- ((-11*10):(-9*10))/100 # grid steps of 0.01
# Naive with for-loops:
system.time({
  pvalmeans <- cint.naive.forloop(x, y, deltas, nmc = 2e3)$pvalmeans
})
```

```{r tests-largest-array, echo=FALSE, eval=FALSE}
# Naive with arrays:
system.time({
  pvalmeans <- cint.naive.array(x, y, deltas, nmc = 2e3)$pvalmeans
})
```

Here we finally see that the overhead built into `CIPerm::cint(dset())` makes it slightly slower than the streamlined `cint.nguyen()`. However, both are *substantially* faster than `cint.naive.forloop()`: they run in less than half the time of the naive approach.

# `bench::mark()` and `profvis`

The results above are from one run per method on each example dataset. Using the last ("largest") dataset, we also tried using `bench::mark()` to summarize the distribution of runtimes over many runs, as well as memory usage. *(Not actually run when the vignette knits, in order to avoid needing `bench` as a dependency.)*

```{r tests-largest-benchmark}
# bench::mark(cint.nguyen(x, y, nmc = 2e3),
#             cint.naive.forloop(x, y, deltas, nmc = 2e3),
#             check = FALSE, min_iterations = 10)
#> # A tibble: 2 × 13
#>   expression                                        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time  gc
#>   <bch:expr>                                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list> <list>
#> 1 cint.nguyen(x, y, nmc = 2000)                    2.5s    2.71s     0.364    1.58GB     2.92    10    80     27.44s
#> 2 cint.naive.forloop(x, y, deltas, nmc = 2000)    7.87s    8.07s     0.120     7.4GB     4.64    10   388      1.39m
```

Finally, we also tried using the `profvis` package to check which steps take the longest. *(Again, not actually run when the vignette knits, in order to avoid needing `profvis` as a dependency; but screenshots are included below.)*

```{r tests-largest-profvis}
# profvis::profvis(cint.nguyen(x, y, nmc = 2e3))
# profvis::profvis(cint.naive.forloop(x, y, deltas, nmc = 2e3))
```

For Nguyen's method, the initial setup with `combn()` or `sample()` followed by `apply(setdiff())` takes about 80% of the time, and the per-permutation calculations only take about 20% of the time:

`profvis::profvis(cint.nguyen(x, y, nmc = 2e3))`

![](profvis_nguyen.png)
For the naive method, the initial setup is identical, and each round of per-permutation calculations is slightly faster than in Nguyen's method... but because you have to repeat them many times (*unless you already know the CI endpoints, in which case you wouldn't need to run this at all!*), they add up to substantially more total time than for Nguyen's method.

`profvis::profvis(cint.naive.forloop(x, y, deltas, nmc = 2e3))`

![](profvis_naive.png)