Generates data according to all constellations provided in data_grid and applies all constellations provided in proc_grid to them.

Usage

eval_tibbles(
  data_grid,
  proc_grid = expand_tibble(proc = "length"),
  replications = 1,
  discard_generated_data = FALSE,
  post_analyze = identity,
  summary_fun = NULL,
  group_for_summary = NULL,
  ncpus = 1L,
  cluster = NULL,
  cluster_seed = rep(12345, 6),
  cluster_libraries = NULL,
  cluster_global_objects = NULL,
  envir = globalenv(),
  simplify = TRUE
)

Arguments

data_grid

a data.frame or tibble where the first column is a character vector with function names. The other columns contain parameters for the functions specified in the first column. Parameters with NA are ignored. If a column named .truth exists, the corresponding entry is passed to the functions generated from proc_grid and to the function specified in post_analyze.
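
The layout described above can be illustrated with a hand-written grid in base R. This is only a sketch: `call_row()` is a hypothetical helper, not part of simTool, and it merely mimics the rule that NA parameters are dropped before the data generating function is called.

```r
# A data grid written by hand: the first column holds function names,
# the remaining columns hold parameters (NA = "not used in this row").
data_grid <- data.frame(
  fun  = c("rnorm", "rnorm", "rexp"),
  n    = c(5L, 10L, 5L),
  mean = c(0, 1, NA),
  rate = c(NA, NA, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not simTool code): turn one grid row
# into a function call, dropping every parameter that is NA.
call_row <- function(grid, i) {
  args <- as.list(grid[i, -1, drop = FALSE])
  args <- args[!vapply(args, is.na, logical(1))]
  do.call(grid[[1]][i], args)
}

set.seed(1)
x <- call_row(data_grid, 3)  # same as rexp(n = 5L, rate = 2)
y <- call_row(data_grid, 2)  # same as rnorm(n = 10L, mean = 1)
```

Since data_grid may be a data.frame, a grid built this way should be usable in place of one created with expand_tibble().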

proc_grid

similar to data_grid: the first column must contain function names, and the other columns contain parameters for those functions. The data generated according to data_grid is always passed to the first unspecified argument of the functions specified in the first column of proc_grid. If a function specified in proc_grid has an argument .truth, the corresponding entry in the .truth column of data_grid is passed to it; if data_grid has no .truth column, all parameters used for the data generation are passed to the .truth parameter instead.
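
To make the "first unspecified argument" rule concrete, here is a small base R sketch. The helpers `first_unspecified()` and `analyze()` are hypothetical and only approximate the idea; simTool's internal implementation may differ.

```r
# Hypothetical helper: name of the first formal argument of `f`
# that is not already supplied through the grid parameters.
first_unspecified <- function(f, supplied) {
  setdiff(names(formals(f)), supplied)[1]
}

# Hypothetical helper: call `f` with the grid parameters and put the
# generated data into the first argument that is still free.
analyze <- function(f, data, params) {
  slot <- first_unspecified(f, names(params))
  do.call(f, c(params, stats::setNames(list(data), slot)))
}

first_unspecified(stats::lm, "formula")         # "data"
first_unspecified(stats::t.test, "conf.level")  # "x"

d <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- analyze(stats::lm, d, list(formula = y ~ x))
```

This is why a proc_grid row with proc = "lm" and a formula column receives the generated data.frame as its data argument.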

replications

number of replications for the simulation

discard_generated_data

if TRUE the generated data is deleted after all function constellations in proc_grid have been applied. Otherwise, ALL generated data sets will be part of the returned object.

post_analyze

a convenience function that is applied directly after the data analyzing function. If this function has an argument .truth, the corresponding entry in the .truth column of data_grid is passed to it; if data_grid has no .truth column, all parameters used for the data generation are passed to the .truth parameter instead.

summary_fun

named list of univariate functions used to summarize the results (numeric or logical) over the replications, e.g. list(mean = mean, sd = sd).
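
As a rough illustration of what summarizing over the replications means, the following base R sketch applies each function in a named list to a vector of per-replication results. The result vectors are made-up values, and the sketch does not reproduce how simTool arranges its output.

```r
summary_fun <- list(mean = mean, sd = sd)

# Hypothetical numeric results, one value per replication.
numeric_results <- c(0.2, 0.4, 0.6)
num_summary <- vapply(summary_fun, function(f) f(numeric_results), numeric(1))

# Hypothetical logical results, e.g. "did the interval cover the truth?".
# The mean of a logical vector is the observed proportion of TRUE values.
logical_results <- c(TRUE, FALSE, TRUE, TRUE)
lgl_summary <- vapply(summary_fun, function(f) f(logical_results), numeric(1))

num_summary
lgl_summary
```

The names of the list (here mean and sd) are what label the summaries, which is why the list must be named.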

group_for_summary

if the result returned by the data analyzing function or post_analyze is a data.frame with more than one row, one is usually interested in summarizing the results while grouping by some variables. These grouping variables can be passed as a character vector to group_for_summary.

ncpus

a cluster of ncpus workers (R-processes) is created on the local machine to conduct the simulation. If ncpus equals one no cluster is created and the simulation is conducted by the current R-process.

cluster

a cluster generated by the parallel package that will be used to conduct the simulation. If cluster is specified, then ncpus will be ignored.

cluster_seed

if the simulation is run in parallel, the combined multiple-recursive generator from L'Ecuyer (1999) is used to generate random numbers. Thus cluster_seed must be a (signed) integer vector of length 6. The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.
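
The constraints on cluster_seed can be checked before starting a long simulation. The function below is a hypothetical helper written for illustration only (it is not provided by simTool); it encodes the rules stated above and reinterprets negative values as 32-bit unsigned integers.

```r
# Hypothetical validity check for a L'Ecuyer-CMRG seed (not part of simTool).
is_valid_cmrg_seed <- function(seed) {
  if (length(seed) != 6L || anyNA(seed)) {
    return(FALSE)
  }
  u <- as.numeric(seed)
  u[u < 0] <- u[u < 0] + 2^32  # view signed values as unsigned 32-bit
  first <- u[1:3]
  last <- u[4:6]
  !all(first == 0) && !all(last == 0) &&
    all(first < 4294967087) && all(last < 4294944443)
}

is_valid_cmrg_seed(rep(12345, 6))           # TRUE (the default seed)
is_valid_cmrg_seed(c(0, 0, 0, 1, 2, 3))     # FALSE: first three are all zero
is_valid_cmrg_seed(c(-1, 1, 1, 1, 1, 1))    # FALSE: -1 is 4294967295 unsigned
```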

cluster_libraries

a character vector specifying the packages that should be loaded by the workers.

cluster_global_objects

a character vector specifying the names of R objects in the global environment that should be exported to the global environment of every worker.

envir

must be provided if the functions specified in data_grid or proc_grid are not part of the global environment.
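
A typical case where envir is needed is a simulation wrapped in a function. The sketch below assumes the simTool package is installed; `run_sim()`, `unif_data()` and `spread()` are hypothetical names chosen for this illustration.

```r
library(simTool)

run_sim <- function() {
  # Both functions exist only inside run_sim(), not in the global environment.
  unif_data <- function(n) runif(n)
  spread <- function(x) max(x) - min(x)

  eval_tibbles(
    expand_tibble(fun = "unif_data", n = c(5L, 10L)),
    expand_tibble(proc = "spread"),
    replications = 2,
    envir = environment()  # tell eval_tibbles() where to look them up
  )
}

res <- run_sim()
```

Without envir = environment(), the lookup would start in the global environment, where neither function is defined.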

simplify

usually the results column is nested; by default, an attempt is made to unnest it.

Value

The returned object is a list of class eval_tibbles, whose element simulation contains the results of the simulation.

Note

If cluster is provided by the user, eval_tibbles will NOT stop the cluster; this has to be done by the user. Conducting parallel simulations by specifying ncpus instead creates a cluster internally and stops it after the simulation is done.
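
A minimal sketch of the user-managed cluster workflow, assuming the simTool and parallel packages are available. Wrapping the call in tryCatch() with a finally clause is just one way to make sure the workers are released even if the simulation fails.

```r
library(simTool)

# Create the cluster ourselves; eval_tibbles() will not stop it for us.
cl <- parallel::makeCluster(2)

res <- tryCatch(
  eval_tibbles(
    expand_tibble(fun = "rnorm", n = 10L),
    expand_tibble(proc = "mean"),
    replications = 4,
    cluster = cl  # ncpus is ignored when a cluster is supplied
  ),
  finally = parallel::stopCluster(cl)  # always release the workers
)
```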

Author

Marsel Scheer

Examples

rng <- function(data, ...) {
  ret <- range(data)
  names(ret) <- c("min", "max")
  ret
}

### The following line is only necessary
### if the examples are not executed in the global
### environment, which for instance is the case when
### the online documentation
### http://marselscheer.github.io/simTool/reference/eval_tibbles.html
### is built. In such a case eval_tibbles() would search for the
### above defined function rng() in the global environment, where
### it does not exist!
eval_tibbles <- purrr::partial(eval_tibbles, envir = environment())

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("rng", "median", "length"))

eval_tibbles(dg, pg, rep = 2, simplify = FALSE)
#> # A tibble: 12 × 5
#>    fun       n replications proc   results  
#>    <chr> <int>        <int> <chr>  <list>   
#>  1 rnorm     5            1 rng    <dbl [2]>
#>  2 rnorm     5            1 median <dbl [1]>
#>  3 rnorm     5            1 length <int [1]>
#>  4 rnorm     5            2 rng    <dbl [2]>
#>  5 rnorm     5            2 median <dbl [1]>
#>  6 rnorm     5            2 length <int [1]>
#>  7 rnorm    10            1 rng    <dbl [2]>
#>  8 rnorm    10            1 median <dbl [1]>
#>  9 rnorm    10            1 length <int [1]>
#> 10 rnorm    10            2 rng    <dbl [2]>
#> 11 rnorm    10            2 median <dbl [1]>
#> 12 rnorm    10            2 length <int [1]>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 13833709
#> Start of the simulation: 2025-04-10 18:38:39.315084
#> End of the simulation: 2025-04-10 18:38:39.315604
eval_tibbles(dg, pg, rep = 2)
#> # A tibble: 16 × 5
#>    fun       n replications proc   results
#>    <chr> <int>        <int> <chr>    <dbl>
#>  1 rnorm     5            1 rng     0.112 
#>  2 rnorm     5            1 rng     1.62  
#>  3 rnorm     5            1 median  0.244 
#>  4 rnorm     5            1 length  5     
#>  5 rnorm     5            2 rng    -1.91  
#>  6 rnorm     5            2 rng     1.07  
#>  7 rnorm     5            2 median -0.279 
#>  8 rnorm     5            2 length  5     
#>  9 rnorm    10            1 rng    -1.91  
#> 10 rnorm    10            1 rng     2.76  
#> 11 rnorm    10            1 median  0.0583
#> 12 rnorm    10            1 length 10     
#> 13 rnorm    10            2 rng    -2.27  
#> 14 rnorm    10            2 rng     2.68  
#> 15 rnorm    10            2 median  0.0244
#> 16 rnorm    10            2 length 10     
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 21663550
#> Start of the simulation: 2025-04-10 18:38:39.355151
#> End of the simulation: 2025-04-10 18:38:39.355483
eval_tibbles(dg, pg,
  rep = 2,
  post_analyze = purrr::compose(as.data.frame, t)
)
#> # A tibble: 12 × 7
#>    fun       n replications proc     min   max     V1
#>    <chr> <int>        <int> <chr>  <dbl> <dbl>  <dbl>
#>  1 rnorm     5            1 rng    -1.18  1.11 NA    
#>  2 rnorm     5            1 median NA    NA    -0.246
#>  3 rnorm     5            1 length NA    NA     5    
#>  4 rnorm     5            2 rng    -1.70  1.07 NA    
#>  5 rnorm     5            2 median NA    NA     0.132
#>  6 rnorm     5            2 length NA    NA     5    
#>  7 rnorm    10            1 rng    -1.47  1.34 NA    
#>  8 rnorm    10            1 median NA    NA     0.260
#>  9 rnorm    10            1 length NA    NA    10    
#> 10 rnorm    10            2 rng    -2.61  1.92 NA    
#> 11 rnorm    10            2 median NA    NA     0.495
#> 12 rnorm    10            2 length NA    NA    10    
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 964854
#> Start of the simulation: 2025-04-10 18:38:39.393957
#> End of the simulation: 2025-04-10 18:38:39.40142
eval_tibbles(dg, pg, rep = 2, summary_fun = list(mean = mean, sd = sd))
#> # A tibble: 12 × 8
#>    fun       n replications summary_fun proc       min    max  value
#>    <chr> <int>        <int> <chr>       <chr>    <dbl>  <dbl>  <dbl>
#>  1 rnorm     5            1 mean        rng    -0.196   1.37  NA    
#>  2 rnorm     5            1 mean        median NA      NA      0.716
#>  3 rnorm     5            1 mean        length NA      NA      5    
#>  4 rnorm     5            1 sd          rng     0.224   0.431 NA    
#>  5 rnorm     5            1 sd          median NA      NA      0.325
#>  6 rnorm     5            1 sd          length NA      NA      0    
#>  7 rnorm    10            1 mean        rng    -1.72    1.55  NA    
#>  8 rnorm    10            1 mean        median NA      NA     -0.185
#>  9 rnorm    10            1 mean        length NA      NA     10    
#> 10 rnorm    10            1 sd          rng     0.0509  0.812 NA    
#> 11 rnorm    10            1 sd          median NA      NA      0.621
#> 12 rnorm    10            1 sd          length NA      NA      0    
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 238203
#> Start of the simulation: 2025-04-10 18:38:39.439408
#> End of the simulation: 2025-04-10 18:38:39.469634

regData <- function(n, SD) {
  data.frame(
    x = seq(0, 1, length = n),
    y = rnorm(n, sd = SD)
  )
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 3
)
eg
#> # A tibble: 12 × 7
#>    fun         n    SD replications proc  formula  results
#>    <chr>   <int> <int>        <int> <chr> <chr>    <list> 
#>  1 regData     5     1            1 lm    y~x      <lm>   
#>  2 regData     5     1            1 lm    y~I(x^2) <lm>   
#>  3 regData     5     1            2 lm    y~x      <lm>   
#>  4 regData     5     1            2 lm    y~I(x^2) <lm>   
#>  5 regData     5     1            3 lm    y~x      <lm>   
#>  6 regData     5     1            3 lm    y~I(x^2) <lm>   
#>  7 regData     5     2            1 lm    y~x      <lm>   
#>  8 regData     5     2            1 lm    y~I(x^2) <lm>   
#>  9 regData     5     2            2 lm    y~x      <lm>   
#> 10 regData     5     2            2 lm    y~I(x^2) <lm>   
#> 11 regData     5     2            3 lm    y~x      <lm>   
#> 12 regData     5     2            3 lm    y~I(x^2) <lm>   
#> Number of data generating functions: 2
#> Number of analyzing procedures: 2
#> Number of replications: 3
#> Estimated replications per hour: 823953
#> Start of the simulation: 2025-04-10 18:38:39.507479
#> End of the simulation: 2025-04-10 18:38:39.520586

### as_tibble() drops row names, so keep the coefficient names
### in an explicit column called term
preserve_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(preserve_rownames, coef, summary),
  # post_analyze = broom::tidy, # is a nice out of the box alternative
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#>  The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
eg$simulation
#> # A tibble: 16 × 12
#>    fun         n    SD replications summary_fun proc  formula  term     Estimate
#>    <chr>   <int> <int>        <int> <chr>       <chr> <chr>    <chr>       <dbl>
#>  1 regData     5     1            1 mean        lm    y~x      (Interc…   -0.137
#>  2 regData     5     1            1 mean        lm    y~x      x           0.148
#>  3 regData     5     1            1 mean        lm    y~I(x^2) (Interc…   -0.121
#>  4 regData     5     1            1 mean        lm    y~I(x^2) I(x^2)      0.154
#>  5 regData     5     1            1 sd          lm    y~x      (Interc…    0.330
#>  6 regData     5     1            1 sd          lm    y~x      x           0.298
#>  7 regData     5     1            1 sd          lm    y~I(x^2) (Interc…    0.240
#>  8 regData     5     1            1 sd          lm    y~I(x^2) I(x^2)      0.913
#>  9 regData     5     2            1 mean        lm    y~x      (Interc…   -1.05 
#> 10 regData     5     2            1 mean        lm    y~x      x           2.58 
#> 11 regData     5     2            1 mean        lm    y~I(x^2) (Interc…   -0.851
#> 12 regData     5     2            1 mean        lm    y~I(x^2) I(x^2)      2.91 
#> 13 regData     5     2            1 sd          lm    y~x      (Interc…    0.754
#> 14 regData     5     2            1 sd          lm    y~x      x           0.492
#> 15 regData     5     2            1 sd          lm    y~I(x^2) (Interc…    0.667
#> 16 regData     5     2            1 sd          lm    y~I(x^2) I(x^2)      0.655
#> # ℹ 3 more variables: `Std. Error` <dbl>, `t value` <dbl>, `Pr(>|t|)` <dbl>

dg <- expand_tibble(fun = "rexp", rate = c(10, 100), n = c(50L, 100L))
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    mu <- 1 / .truth$rate
    ttest$conf.int[1] <= mu && mu <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 8
#>    fun    rate     n replications summary_fun proc   conf.level value
#>    <chr> <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    50            1 mean        t.test       0.8  0.9  
#>  2 rexp     10    50            1 mean        t.test       0.9  0.9  
#>  3 rexp     10    50            1 mean        t.test       0.95 0.9  
#>  4 rexp     10    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    50            1 sd          t.test       0.95 0.316
#>  7 rexp    100    50            1 mean        t.test       0.8  0.6  
#>  8 rexp    100    50            1 mean        t.test       0.9  0.7  
#>  9 rexp    100    50            1 mean        t.test       0.95 0.7  
#> 10 rexp    100    50            1 sd          t.test       0.8  0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 253289
#> Start of the simulation: 2025-04-10 18:38:39.719946
#> End of the simulation: 2025-04-10 18:38:39.862076

dg <- dplyr::bind_rows(
  expand_tibble(fun = "rexp", rate = 10, .truth = 1 / 10, n = c(50L, 100L)),
  expand_tibble(fun = "rnorm", .truth = 0, n = c(50L, 100L))
)
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    ttest$conf.int[1] <= .truth && .truth <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 9
#>    fun    rate .truth     n replications summary_fun proc   conf.level value
#>    <chr> <dbl>  <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    0.1    50            1 mean        t.test       0.8  0.9  
#>  2 rexp     10    0.1    50            1 mean        t.test       0.9  0.9  
#>  3 rexp     10    0.1    50            1 mean        t.test       0.95 1    
#>  4 rexp     10    0.1    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    0.1    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    0.1    50            1 sd          t.test       0.95 0    
#>  7 rexp     10    0.1   100            1 mean        t.test       0.8  0.6  
#>  8 rexp     10    0.1   100            1 mean        t.test       0.9  0.7  
#>  9 rexp     10    0.1   100            1 mean        t.test       0.95 0.8  
#> 10 rexp     10    0.1   100            1 sd          t.test       0.8  0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 499936
#> Start of the simulation: 2025-04-10 18:38:39.905763
#> End of the simulation: 2025-04-10 18:38:39.977772
### need to remove the locally adapted eval_tibbles()
### otherwise executing the examples would mask
### eval_tibbles from simTool-namespace.
rm(eval_tibbles)