Workhorse for simulation studies — eval

Generates data according to all provided constellations in data_grid and applies all provided constellations in proc_grid to them.

Usage

eval_tibbles(
  data_grid,
  proc_grid = expand_tibble(proc = "length"),
  replications = 1,
  discard_generated_data = FALSE,
  post_analyze = identity,
  summary_fun = NULL,
  group_for_summary = NULL,
  ncpus = 1L,
  cluster = NULL,
  cluster_seed = rep(12345, 6),
  cluster_libraries = NULL,
  cluster_global_objects = NULL,
  envir = globalenv(),
  simplify = TRUE
)

Arguments

data_grid: a data.frame or tibble where the first column is a character vector with function names. The other columns contain parameters for the functions specified in the first column. Parameters with NA are ignored. If a column named .truth exists, then the corresponding entry is passed to functions generated from proc_grid and the function specified in post_analyze.
proc_grid: similar to data_grid, the first column must contain function names. The other columns contain parameters for the functions specified in the first column. The data generated according to data_grid will always be passed to the first unspecified argument of the functions specified in the first column of proc_grid. If a function specified in proc_grid has an argument .truth, then the corresponding entry in the .truth column from data_grid is passed to the .truth parameter; if no column .truth exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.
replications: number of replications for the simulation
discard_generated_data: if TRUE the generated data is deleted after all function constellations in proc_grid have been applied. Otherwise, ALL generated data sets will be part of the returned object.
post_analyze: a convenience function that is applied directly after the data analyzing function. If this function has an argument .truth, then the corresponding entry in the .truth column from data_grid is passed to the .truth parameter; if no column .truth exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.
summary_fun: named list of univariate functions used to summarize the results (numeric or logical) over the replications, e.g. list(mean = mean, sd = sd).
group_for_summary: if the result returned by the data analyzing function or post_analyze is a data.frame with more than one row, one is usually interested in summarizing the results grouped by some variables. These grouping variables can be passed as a character vector to group_for_summary.
ncpus: a cluster of ncpus workers (R processes) is created on the local machine to conduct the simulation. If ncpus equals one, no cluster is created and the simulation is conducted by the current R process.
cluster: a cluster generated by the parallel package that will be used to conduct the simulation. If cluster is specified, then ncpus will be ignored.
cluster_seed: if the simulation is run in parallel, then the combined multiple-recursive generator from L'Ecuyer (1999) is used to generate random numbers. Thus cluster_seed must be a (signed) integer vector of length 6. The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.
cluster_libraries: a character vector specifying the packages that should be loaded by the workers.
cluster_global_objects: a character vector specifying the names of R objects in the global environment that should be exported to the global environment of every worker.
envir: must be provided if the functions specified in data_grid or proc_grid are not part of the global environment.
simplify: the result column is usually nested; by default, eval_tibbles tries to unnest it.
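The constraints on cluster_seed listed above can be checked before a long parallel run is started. The following is only a sketch: is_valid_cluster_seed() is a hypothetical helper and not part of simTool, and it assumes that "regarded as 32-bit unsigned integers" means a signed value is reinterpreted modulo 2^32.

```r
# Hypothetical helper, not part of simTool: checks a candidate seed against
# the constraints described for cluster_seed.
is_valid_cluster_seed <- function(seed) {
  m1 <- 4294967087 # upper bound for the first three elements
  m2 <- 4294944443 # upper bound for the last three elements
  if (!is.numeric(seed) || length(seed) != 6 || anyNA(seed)) {
    return(FALSE)
  }
  if (any(seed != round(seed))) {
    return(FALSE)
  }
  # Assumption: signed values are reinterpreted as unsigned 32-bit integers.
  u <- as.numeric(seed) %% 2^32
  first <- u[1:3]
  last <- u[4:6]
  any(first != 0) && any(last != 0) && all(first < m1) && all(last < m2)
}

is_valid_cluster_seed(rep(12345, 6))       # TRUE, the default seed
is_valid_cluster_seed(c(0, 0, 0, 1, 2, 3)) # FALSE, first three are all zero
```

A seed that passes this check can then be supplied as cluster_seed.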

Value

The returned object is a list of class eval_tibbles, where the element simulation contains the results of the simulation.
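As a minimal sketch of working with the returned object, assuming simTool is installed and attached, the results can be pulled out of the list and handled like any other data frame. The variable names res and sim are chosen here for illustration only.

```r
library(simTool)

# Two data constellations and two analyzing procedures, two replications each.
dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("median", "length"))
res <- eval_tibbles(dg, pg, replications = 2)

class(res) # includes "eval_tibbles"

# The simulation results themselves are stored in the element simulation.
sim <- res$simulation
sim[sim$proc == "median", ]
```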

Note

If cluster is provided by the user, eval_tibbles will NOT stop the cluster; this has to be done by the user. Conducting parallel simulations by specifying ncpus will internally create a cluster and stop it after the simulation is done.
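One way to make sure a user-supplied cluster is always stopped is to wrap the call in tryCatch() with a finally clause. This is a sketch under the assumptions that simTool and parallel are installed and that two local workers can be started; it is not the only possible pattern.

```r
library(simTool)
library(parallel)

# The user creates the cluster, so the user is responsible for stopping it.
cl <- makeCluster(2L)

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("median", "length"))

res <- tryCatch(
  eval_tibbles(dg, pg, replications = 2, cluster = cl),
  # Runs whether or not the simulation succeeds.
  finally = stopCluster(cl)
)
```

With ncpus = 2 instead of cluster = cl, no manual cleanup would be needed, because the cluster is then created and stopped internally.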

Author

Marsel Scheer

Examples

rng <- function(data, ...) {
  ret <- range(data)
  names(ret) <- c("min", "max")
  ret
}

### The following line is only necessary
### if the examples are not executed in the global
### environment, which for instance is the case when
### the online documentation
### http://marselscheer.github.io/simTool/reference/eval_tibbles.html
### is built. In such a case eval_tibbles() would search for the
### above defined function rng() in the global environment, where
### it does not exist!
eval_tibbles <- purrr::partial(eval_tibbles, envir = environment())

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("rng", "median", "length"))

eval_tibbles(dg, pg, rep = 2, simplify = FALSE)
#> # A tibble: 12 × 5
#>    fun       n replications proc   results  
#>    <chr> <int>        <int> <chr>  <list>   
#>  1 rnorm     5            1 rng    <dbl [2]>
#>  2 rnorm     5            1 median <dbl [1]>
#>  3 rnorm     5            1 length <int [1]>
#>  4 rnorm     5            2 rng    <dbl [2]>
#>  5 rnorm     5            2 median <dbl [1]>
#>  6 rnorm     5            2 length <int [1]>
#>  7 rnorm    10            1 rng    <dbl [2]>
#>  8 rnorm    10            1 median <dbl [1]>
#>  9 rnorm    10            1 length <int [1]>
#> 10 rnorm    10            2 rng    <dbl [2]>
#> 11 rnorm    10            2 median <dbl [1]>
#> 12 rnorm    10            2 length <int [1]>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 13833709
#> Start of the simulation: 2025-04-10 18:38:39.315084
#> End of the simulation: 2025-04-10 18:38:39.315604
eval_tibbles(dg, pg, rep = 2)
#> # A tibble: 16 × 5
#>    fun       n replications proc   results
#>    <chr> <int>        <int> <chr>    <dbl>
#>  1 rnorm     5            1 rng     0.112 
#>  2 rnorm     5            1 rng     1.62  
#>  3 rnorm     5            1 median  0.244 
#>  4 rnorm     5            1 length  5     
#>  5 rnorm     5            2 rng    -1.91  
#>  6 rnorm     5            2 rng     1.07  
#>  7 rnorm     5            2 median -0.279 
#>  8 rnorm     5            2 length  5     
#>  9 rnorm    10            1 rng    -1.91  
#> 10 rnorm    10            1 rng     2.76  
#> 11 rnorm    10            1 median  0.0583
#> 12 rnorm    10            1 length 10     
#> 13 rnorm    10            2 rng    -2.27  
#> 14 rnorm    10            2 rng     2.68  
#> 15 rnorm    10            2 median  0.0244
#> 16 rnorm    10            2 length 10     
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 21663550
#> Start of the simulation: 2025-04-10 18:38:39.355151
#> End of the simulation: 2025-04-10 18:38:39.355483
eval_tibbles(dg, pg,
  rep = 2,
  post_analyze = purrr::compose(as.data.frame, t)
)
#> # A tibble: 12 × 7
#>    fun       n replications proc     min   max     V1
#>    <chr> <int>        <int> <chr>  <dbl> <dbl>  <dbl>
#>  1 rnorm     5            1 rng    -1.18  1.11 NA    
#>  2 rnorm     5            1 median NA    NA    -0.246
#>  3 rnorm     5            1 length NA    NA     5    
#>  4 rnorm     5            2 rng    -1.70  1.07 NA    
#>  5 rnorm     5            2 median NA    NA     0.132
#>  6 rnorm     5            2 length NA    NA     5    
#>  7 rnorm    10            1 rng    -1.47  1.34 NA    
#>  8 rnorm    10            1 median NA    NA     0.260
#>  9 rnorm    10            1 length NA    NA    10    
#> 10 rnorm    10            2 rng    -2.61  1.92 NA    
#> 11 rnorm    10            2 median NA    NA     0.495
#> 12 rnorm    10            2 length NA    NA    10    
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 964854
#> Start of the simulation: 2025-04-10 18:38:39.393957
#> End of the simulation: 2025-04-10 18:38:39.40142
eval_tibbles(dg, pg, rep = 2, summary_fun = list(mean = mean, sd = sd))
#> # A tibble: 12 × 8
#>    fun       n replications summary_fun proc       min    max  value
#>    <chr> <int>        <int> <chr>       <chr>    <dbl>  <dbl>  <dbl>
#>  1 rnorm     5            1 mean        rng    -0.196   1.37  NA    
#>  2 rnorm     5            1 mean        median NA      NA      0.716
#>  3 rnorm     5            1 mean        length NA      NA      5    
#>  4 rnorm     5            1 sd          rng     0.224   0.431 NA    
#>  5 rnorm     5            1 sd          median NA      NA      0.325
#>  6 rnorm     5            1 sd          length NA      NA      0    
#>  7 rnorm    10            1 mean        rng    -1.72    1.55  NA    
#>  8 rnorm    10            1 mean        median NA      NA     -0.185
#>  9 rnorm    10            1 mean        length NA      NA     10    
#> 10 rnorm    10            1 sd          rng     0.0509  0.812 NA    
#> 11 rnorm    10            1 sd          median NA      NA      0.621
#> 12 rnorm    10            1 sd          length NA      NA      0    
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 238203
#> Start of the simulation: 2025-04-10 18:38:39.439408
#> End of the simulation: 2025-04-10 18:38:39.469634

regData <- function(n, SD) {
  data.frame(
    x = seq(0, 1, length = n),
    y = rnorm(n, sd = SD)
  )
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 3
)
eg
#> # A tibble: 12 × 7
#>    fun         n    SD replications proc  formula  results
#>    <chr>   <int> <int>        <int> <chr> <chr>    <list> 
#>  1 regData     5     1            1 lm    y~x      <lm>   
#>  2 regData     5     1            1 lm    y~I(x^2) <lm>   
#>  3 regData     5     1            2 lm    y~x      <lm>   
#>  4 regData     5     1            2 lm    y~I(x^2) <lm>   
#>  5 regData     5     1            3 lm    y~x      <lm>   
#>  6 regData     5     1            3 lm    y~I(x^2) <lm>   
#>  7 regData     5     2            1 lm    y~x      <lm>   
#>  8 regData     5     2            1 lm    y~I(x^2) <lm>   
#>  9 regData     5     2            2 lm    y~x      <lm>   
#> 10 regData     5     2            2 lm    y~I(x^2) <lm>   
#> 11 regData     5     2            3 lm    y~x      <lm>   
#> 12 regData     5     2            3 lm    y~I(x^2) <lm>   
#> Number of data generating functions: 2
#> Number of analyzing procedures: 2
#> Number of replications: 3
#> Estimated replications per hour: 823953
#> Start of the simulation: 2025-04-10 18:38:39.507479
#> End of the simulation: 2025-04-10 18:38:39.520586

preserve_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(preserve_rownames, coef, summary),
  # post_analyze = broom::tidy, # is a nice out of the box alternative
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
eg$simulation
#> # A tibble: 16 × 12
#>    fun         n    SD replications summary_fun proc  formula  term     Estimate
#>    <chr>   <int> <int>        <int> <chr>       <chr> <chr>    <chr>       <dbl>
#>  1 regData     5     1            1 mean        lm    y~x      (Interc…   -0.137
#>  2 regData     5     1            1 mean        lm    y~x      x           0.148
#>  3 regData     5     1            1 mean        lm    y~I(x^2) (Interc…   -0.121
#>  4 regData     5     1            1 mean        lm    y~I(x^2) I(x^2)      0.154
#>  5 regData     5     1            1 sd          lm    y~x      (Interc…    0.330
#>  6 regData     5     1            1 sd          lm    y~x      x           0.298
#>  7 regData     5     1            1 sd          lm    y~I(x^2) (Interc…    0.240
#>  8 regData     5     1            1 sd          lm    y~I(x^2) I(x^2)      0.913
#>  9 regData     5     2            1 mean        lm    y~x      (Interc…   -1.05 
#> 10 regData     5     2            1 mean        lm    y~x      x           2.58 
#> 11 regData     5     2            1 mean        lm    y~I(x^2) (Interc…   -0.851
#> 12 regData     5     2            1 mean        lm    y~I(x^2) I(x^2)      2.91 
#> 13 regData     5     2            1 sd          lm    y~x      (Interc…    0.754
#> 14 regData     5     2            1 sd          lm    y~x      x           0.492
#> 15 regData     5     2            1 sd          lm    y~I(x^2) (Interc…    0.667
#> 16 regData     5     2            1 sd          lm    y~I(x^2) I(x^2)      0.655
#> # ℹ 3 more variables: `Std. Error` <dbl>, `t value` <dbl>, `Pr(>|t|)` <dbl>

dg <- expand_tibble(fun = "rexp", rate = c(10, 100), n = c(50L, 100L))
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    mu <- 1 / .truth$rate
    ttest$conf.int[1] <= mu && mu <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 8
#>    fun    rate     n replications summary_fun proc   conf.level value
#>    <chr> <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    50            1 mean        t.test       0.8  0.9  
#>  2 rexp     10    50            1 mean        t.test       0.9  0.9  
#>  3 rexp     10    50            1 mean        t.test       0.95 0.9  
#>  4 rexp     10    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    50            1 sd          t.test       0.95 0.316
#>  7 rexp    100    50            1 mean        t.test       0.8  0.6  
#>  8 rexp    100    50            1 mean        t.test       0.9  0.7  
#>  9 rexp    100    50            1 mean        t.test       0.95 0.7  
#> 10 rexp    100    50            1 sd          t.test       0.8  0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 253289
#> Start of the simulation: 2025-04-10 18:38:39.719946
#> End of the simulation: 2025-04-10 18:38:39.862076

dg <- dplyr::bind_rows(
  expand_tibble(fun = "rexp", rate = 10, .truth = 1 / 10, n = c(50L, 100L)),
  expand_tibble(fun = "rnorm", .truth = 0, n = c(50L, 100L))
)
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    ttest$conf.int[1] <= .truth && .truth <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 9
#>    fun    rate .truth     n replications summary_fun proc   conf.level value
#>    <chr> <dbl>  <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    0.1    50            1 mean        t.test       0.8  0.9  
#>  2 rexp     10    0.1    50            1 mean        t.test       0.9  0.9  
#>  3 rexp     10    0.1    50            1 mean        t.test       0.95 1    
#>  4 rexp     10    0.1    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    0.1    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    0.1    50            1 sd          t.test       0.95 0    
#>  7 rexp     10    0.1   100            1 mean        t.test       0.8  0.6  
#>  8 rexp     10    0.1   100            1 mean        t.test       0.9  0.7  
#>  9 rexp     10    0.1   100            1 mean        t.test       0.95 0.8  
#> 10 rexp     10    0.1   100            1 sd          t.test       0.8  0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 499936
#> Start of the simulation: 2025-04-10 18:38:39.905763
#> End of the simulation: 2025-04-10 18:38:39.977772
### need to remove the locally adapted eval_tibbles()
### otherwise executing the examples would mask
### eval_tibbles from simTool-namespace.
rm(eval_tibbles)