Generates data according to all constellations provided in data_grid and applies all constellations provided in proc_grid to the generated data.

eval_tibbles(
  data_grid,
  proc_grid = expand_tibble(proc = "length"),
  replications = 1,
  discard_generated_data = FALSE,
  post_analyze = identity,
  summary_fun = NULL,
  group_for_summary = NULL,
  ncpus = 1L,
  cluster = NULL,
  cluster_seed = rep(12345, 6),
  cluster_libraries = NULL,
  cluster_global_objects = NULL,
  envir = globalenv(),
  simplify = TRUE
)

Arguments

data_grid

a data.frame or tibble where the first column is a character vector with function names. The other columns contain parameters for the functions specified in the first column. Parameters with NA are ignored. If a column named .truth exists, then the corresponding entry is passed to the functions generated from proc_grid and to the function specified in post_analyze.

proc_grid

similar to data_grid: the first column must contain function names. The other columns contain parameters for the functions specified in the first column. The data generated according to data_grid will always be passed to the first unspecified argument of the functions specified in the first column of proc_grid. If a function specified in proc_grid has an argument .truth, then the corresponding entry in the .truth column of data_grid is passed to the .truth parameter. If no .truth column exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.
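The calling convention described for data_grid and proc_grid can be pictured with plain base R. This is only a sketch of the semantics, not simTool's actual implementation; call_row is a hypothetical helper introduced for illustration.

```r
# Hypothetical helper (not part of simTool): call the function named in the
# first element of `row`, dropping parameters that are NA. Arguments in
# `extra` are passed unnamed, so they fill the first unspecified argument.
call_row <- function(row, extra = list()) {
  params <- row[-1]
  keep <- !vapply(params, function(p) length(p) == 1L && is.na(p), logical(1))
  do.call(row[[1]], c(extra, params[keep]))
}

set.seed(1)

# A data_grid-like row: rate is NA and is therefore not passed to rnorm().
x <- call_row(list(fun = "rnorm", n = 5L, rate = NA))

# A proc_grid-like row: probs is specified, so the generated data ends up
# in quantile()'s first free argument.
q <- call_row(list(proc = "quantile", probs = 0.5), extra = list(x))
```

Here q holds the 50% quantile of the five generated values, the same result as calling quantile(x, probs = 0.5) directly.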

replications

number of replications for the simulation

discard_generated_data

if TRUE the generated data is deleted after all function constellations in proc_grid have been applied. Otherwise, ALL generated data sets will be part of the returned object.

post_analyze

a convenience function that is applied directly after the data analyzing function. If this function has an argument .truth, then the corresponding entry in the .truth column of data_grid is passed to the .truth parameter. If no .truth column exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.

summary_fun

named list of univariate functions used to summarize the results (numeric or logical) over the replications, e.g. list(mean = mean, sd = sd).
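To make the idea of summarizing over replications concrete, the following base R sketch applies such a named list to a vector of per-replication results. The vector covered is made-up illustration data, and the sketch does not claim to mirror how simTool applies the functions internally.

```r
# Made-up per-replication results, e.g. whether a confidence interval
# covered the true parameter in each of four replications.
covered <- c(TRUE, TRUE, FALSE, TRUE)

summary_fun <- list(mean = mean, sd = sd)

# Apply every summary function to the replication results; the names of
# the list become the names of the summary values.
summaries <- vapply(summary_fun, function(f) f(covered), numeric(1))
summaries
```

The result is a named numeric vector with one entry per summary function, here a mean of 0.75 and a standard deviation of 0.5.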

group_for_summary

if the result returned by the data analyzing function or post_analyze is a data.frame with more than one row, one is usually interested in summarizing the results grouped by some variables. These grouping variables can be passed as a character vector to group_for_summary.

ncpus

a cluster of ncpus workers (R-processes) is created on the local machine to conduct the simulation. If ncpus equals one no cluster is created and the simulation is conducted by the current R-process.

cluster

a cluster generated by the parallel package that will be used to conduct the simulation. If cluster is specified, then ncpus will be ignored.

cluster_seed

if the simulation is run in parallel, the combined multiple-recursive generator from L'Ecuyer (1999) is used to generate random numbers. Thus cluster_seed must be a (signed) integer vector of length 6. The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.
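The constraints on cluster_seed can be checked before starting a long simulation. The helper below is a hypothetical sketch in base R (is_valid_cluster_seed is not a simTool function); it assumes that negative values are reinterpreted as unsigned 32-bit integers by adding 2^32.

```r
# Hypothetical helper (not part of simTool): check a candidate seed against
# the constraints stated for the L'Ecuyer generator.
is_valid_cluster_seed <- function(seed) {
  if (length(seed) != 6L || anyNA(seed) || any(seed != round(seed))) {
    return(FALSE)
  }
  # Assumption: signed values map to unsigned 32-bit values via + 2^32.
  u <- ifelse(seed < 0, seed + 2^32, seed)
  first <- u[1:3]
  last <- u[4:6]
  !all(first == 0) && !all(last == 0) &&
    all(first < 4294967087) && all(last < 4294944443)
}

ok_default <- is_valid_cluster_seed(rep(12345, 6))     # the default seed
ok_zeros <- is_valid_cluster_seed(c(0, 0, 0, 1, 2, 3)) # first three all zero
ok_short <- is_valid_cluster_seed(1:5)                 # wrong length
```

With these inputs only the default seed passes the check.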

cluster_libraries

a character vector specifying the packages that should be loaded by the workers.

cluster_global_objects

a character vector specifying the names of R objects in the global environment that should be exported to the global environment of every worker.

envir

must be provided if the functions specified in data_grid or proc_grid are not part of the global environment.
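A typical case is a simulation that is wrapped in a function, so that the analyzing function lives in that function's environment rather than in the global environment. The sketch below assumes the simTool package is installed; run_sim and rng are names chosen for illustration.

```r
library(simTool)

run_sim <- function() {
  # rng() exists only inside run_sim(), not in the global environment.
  rng <- function(data, ...) range(data)

  dg <- expand_tibble(fun = "rnorm", n = 5L)
  pg <- expand_tibble(proc = "rng")

  # environment() is evaluated here, so it refers to the environment of
  # this call to run_sim(), where rng() can be found.
  eval_tibbles(dg, pg, replications = 2, envir = environment())
}

res <- run_sim()
```

Without the envir argument, the lookup of rng() would take place in the global environment, where the function does not exist.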

simplify

the results column is usually nested; by default, an attempt is made to unnest it.

Value

The returned object is a list of class eval_tibbles, where the element simulation contains the results of the simulation.

Note

If cluster is provided by the user, the function eval_tibbles will NOT stop the cluster. This has to be done by the user. Conducting parallel simulations by specifying ncpus will internally create a cluster and stop it after the simulation is done.
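One way to make sure a user-created cluster is always stopped, even if the simulation fails, is to wrap the call in tryCatch() with a finally clause. This is a minimal sketch that assumes the simTool package is installed and that two local worker processes can be started.

```r
library(parallel)
library(simTool)

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("median", "length"))

cl <- makeCluster(2L) # cluster created by the user

et <- tryCatch(
  eval_tibbles(dg, pg, replications = 2, cluster = cl),
  # eval_tibbles() leaves a user-supplied cluster running, so it is
  # stopped here regardless of whether the simulation succeeded.
  finally = stopCluster(cl)
)
```

When ncpus is used instead of cluster, this cleanup is not needed because the internally created cluster is stopped automatically.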

Author

Marsel Scheer

Examples

rng <- function(data, ...) {
  ret <- range(data)
  names(ret) <- c("min", "max")
  ret
}

### The following line is only necessary
### if the examples are not executed in the global
### environment, which for instance is the case when
### the online documentation
### http://marselscheer.github.io/simTool/reference/eval_tibbles.html
### is built. In such a case eval_tibbles() would search for the
### above defined function rng() in the global environment where
### it does not exist!
eval_tibbles <- purrr::partial(eval_tibbles, envir = environment())

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("rng", "median", "length"))
eval_tibbles(dg, pg, rep = 2, simplify = FALSE)
#> # A tibble: 12 × 5
#>    fun       n replications proc   results
#>    <chr> <int>        <int> <chr>  <list>
#>  1 rnorm     5            1 rng    <dbl [2]>
#>  2 rnorm     5            1 median <dbl [1]>
#>  3 rnorm     5            1 length <int [1]>
#>  4 rnorm     5            2 rng    <dbl [2]>
#>  5 rnorm     5            2 median <dbl [1]>
#>  6 rnorm     5            2 length <int [1]>
#>  7 rnorm    10            1 rng    <dbl [2]>
#>  8 rnorm    10            1 median <dbl [1]>
#>  9 rnorm    10            1 length <int [1]>
#> 10 rnorm    10            2 rng    <dbl [2]>
#> 11 rnorm    10            2 median <dbl [1]>
#> 12 rnorm    10            2 length <int [1]>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 8858606
#> Start of the simulation: 2021-09-06 18:48:45
#> End of the simulation: 2021-09-06 18:48:45
eval_tibbles(dg, pg, rep = 2)
#> # A tibble: 16 × 5
#>    fun       n replications proc   results
#>    <chr> <int>        <int> <chr>    <dbl>
#>  1 rnorm     5            1 rng     0.112
#>  2 rnorm     5            1 rng     1.62
#>  3 rnorm     5            1 median  0.244
#>  4 rnorm     5            1 length  5
#>  5 rnorm     5            2 rng    -1.91
#>  6 rnorm     5            2 rng     1.07
#>  7 rnorm     5            2 median -0.279
#>  8 rnorm     5            2 length  5
#>  9 rnorm    10            1 rng    -1.91
#> 10 rnorm    10            1 rng     2.76
#> 11 rnorm    10            1 median  0.0583
#> 12 rnorm    10            1 length 10
#> 13 rnorm    10            2 rng    -2.27
#> 14 rnorm    10            2 rng     2.68
#> 15 rnorm    10            2 median  0.0244
#> 16 rnorm    10            2 length 10
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 16314958
#> Start of the simulation: 2021-09-06 18:48:45
#> End of the simulation: 2021-09-06 18:48:45
eval_tibbles(dg, pg,
  rep = 2,
  post_analyze = purrr::compose(as.data.frame, t)
)
#> # A tibble: 12 × 7
#>    fun       n replications proc     min   max     V1
#>    <chr> <int>        <int> <chr>  <dbl> <dbl>  <dbl>
#>  1 rnorm     5            1 rng    -1.18  1.11 NA
#>  2 rnorm     5            1 median NA    NA    -0.246
#>  3 rnorm     5            1 length NA    NA     5
#>  4 rnorm     5            2 rng    -1.70  1.07 NA
#>  5 rnorm     5            2 median NA    NA     0.132
#>  6 rnorm     5            2 length NA    NA     5
#>  7 rnorm    10            1 rng    -1.47  1.34 NA
#>  8 rnorm    10            1 median NA    NA     0.260
#>  9 rnorm    10            1 length NA    NA    10
#> 10 rnorm    10            2 rng    -2.61  1.92 NA
#> 11 rnorm    10            2 median NA    NA     0.495
#> 12 rnorm    10            2 length NA    NA    10
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 841455
#> Start of the simulation: 2021-09-06 18:48:45
#> End of the simulation: 2021-09-06 18:48:45
eval_tibbles(dg, pg, rep = 2, summary_fun = list(mean = mean, sd = sd))
#> # A tibble: 12 × 8
#>    fun       n replications summary_fun proc       min    max  value
#>    <chr> <int>        <int> <chr>       <chr>    <dbl>  <dbl>  <dbl>
#>  1 rnorm     5            1 mean        rng    -0.196   1.37  NA
#>  2 rnorm     5            1 mean        median NA      NA      0.716
#>  3 rnorm     5            1 mean        length NA      NA      5
#>  4 rnorm     5            1 sd          rng     0.224   0.431 NA
#>  5 rnorm     5            1 sd          median NA      NA      0.325
#>  6 rnorm     5            1 sd          length NA      NA      0
#>  7 rnorm    10            1 mean        rng    -1.72    1.55  NA
#>  8 rnorm    10            1 mean        median NA      NA     -0.185
#>  9 rnorm    10            1 mean        length NA      NA     10
#> 10 rnorm    10            1 sd          rng     0.0509  0.812 NA
#> 11 rnorm    10            1 sd          median NA      NA      0.621
#> 12 rnorm    10            1 sd          length NA      NA      0
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 99252
#> Start of the simulation: 2021-09-06 18:48:45
#> End of the simulation: 2021-09-06 18:48:45
regData <- function(n, SD) {
  data.frame(
    x = seq(0, 1, length = n),
    y = rnorm(n, sd = SD)
  )
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 3
)
eg
#> # A tibble: 12 × 7
#>    fun         n    SD replications proc  formula  results
#>    <chr>   <int> <int>        <int> <chr> <chr>    <list>
#>  1 regData     5     1            1 lm    y~x      <lm>
#>  2 regData     5     1            1 lm    y~I(x^2) <lm>
#>  3 regData     5     1            2 lm    y~x      <lm>
#>  4 regData     5     1            2 lm    y~I(x^2) <lm>
#>  5 regData     5     1            3 lm    y~x      <lm>
#>  6 regData     5     1            3 lm    y~I(x^2) <lm>
#>  7 regData     5     2            1 lm    y~x      <lm>
#>  8 regData     5     2            1 lm    y~I(x^2) <lm>
#>  9 regData     5     2            2 lm    y~x      <lm>
#> 10 regData     5     2            2 lm    y~I(x^2) <lm>
#> 11 regData     5     2            3 lm    y~x      <lm>
#> 12 regData     5     2            3 lm    y~I(x^2) <lm>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 2
#> Number of replications: 3
#> Estimated replications per hour: 374120
#> Start of the simulation: 2021-09-06 18:48:45
#> End of the simulation: 2021-09-06 18:48:46
presever_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}

eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(presever_rownames, coef, summary),
  # post_analyze = broom::tidy, # is a nice out of the box alternative
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
eg$simulation
#> # A tibble: 16 × 12
#>    fun         n    SD replications summary_fun proc  formula  term        Estimate
#>    <chr>   <int> <int>        <int> <chr>       <chr> <chr>    <chr>          <dbl>
#>  1 regData     5     1            1 mean        lm    y~x      (Intercept)   -0.137
#>  2 regData     5     1            1 mean        lm    y~x      x              0.148
#>  3 regData     5     1            1 mean        lm    y~I(x^2) (Intercept)   -0.121
#>  4 regData     5     1            1 mean        lm    y~I(x^2) I(x^2)         0.154
#>  5 regData     5     1            1 sd          lm    y~x      (Intercept)    0.330
#>  6 regData     5     1            1 sd          lm    y~x      x              0.298
#>  7 regData     5     1            1 sd          lm    y~I(x^2) (Intercept)    0.240
#>  8 regData     5     1            1 sd          lm    y~I(x^2) I(x^2)         0.913
#>  9 regData     5     2            1 mean        lm    y~x      (Intercept)   -1.05
#> 10 regData     5     2            1 mean        lm    y~x      x              2.58
#> 11 regData     5     2            1 mean        lm    y~I(x^2) (Intercept)   -0.851
#> 12 regData     5     2            1 mean        lm    y~I(x^2) I(x^2)         2.91
#> 13 regData     5     2            1 sd          lm    y~x      (Intercept)    0.754
#> 14 regData     5     2            1 sd          lm    y~x      x              0.492
#> 15 regData     5     2            1 sd          lm    y~I(x^2) (Intercept)    0.667
#> 16 regData     5     2            1 sd          lm    y~I(x^2) I(x^2)         0.655
#> # … with 3 more variables: Std. Error <dbl>, t value <dbl>, Pr(>|t|) <dbl>
dg <- expand_tibble(fun = "rexp", rate = c(10, 100), n = c(50L, 100L))
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    mu <- 1 / .truth$rate
    ttest$conf.int[1] <= mu && mu <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 8
#>    fun    rate     n replications summary_fun proc   conf.level value
#>    <chr> <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    50            1 mean        t.test       0.8  0.9
#>  2 rexp     10    50            1 mean        t.test       0.9  0.9
#>  3 rexp     10    50            1 mean        t.test       0.95 0.9
#>  4 rexp     10    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    50            1 sd          t.test       0.95 0.316
#>  7 rexp    100    50            1 mean        t.test       0.8  0.6
#>  8 rexp    100    50            1 mean        t.test       0.9  0.7
#>  9 rexp    100    50            1 mean        t.test       0.95 0.7
#> 10 rexp    100    50            1 sd          t.test       0.8  0.516
#> # … with 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 216361
#> Start of the simulation: 2021-09-06 18:48:46
#> End of the simulation: 2021-09-06 18:48:46
dg <- dplyr::bind_rows(
  expand_tibble(fun = "rexp", rate = 10, .truth = 1 / 10, n = c(50L, 100L)),
  expand_tibble(fun = "rnorm", .truth = 0, n = c(50L, 100L))
)
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    ttest$conf.int[1] <= .truth && .truth <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 9
#>    fun    rate .truth     n replications summary_fun proc   conf.level value
#>    <chr> <dbl>  <dbl> <int>        <int> <chr>       <chr>       <dbl> <dbl>
#>  1 rexp     10    0.1    50            1 mean        t.test       0.8  0.9
#>  2 rexp     10    0.1    50            1 mean        t.test       0.9  0.9
#>  3 rexp     10    0.1    50            1 mean        t.test       0.95 1
#>  4 rexp     10    0.1    50            1 sd          t.test       0.8  0.316
#>  5 rexp     10    0.1    50            1 sd          t.test       0.9  0.316
#>  6 rexp     10    0.1    50            1 sd          t.test       0.95 0
#>  7 rexp     10    0.1   100            1 mean        t.test       0.8  0.6
#>  8 rexp     10    0.1   100            1 mean        t.test       0.9  0.7
#>  9 rexp     10    0.1   100            1 mean        t.test       0.95 0.8
#> 10 rexp     10    0.1   100            1 sd          t.test       0.8  0.516
#> # … with 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 204119
#> Start of the simulation: 2021-09-06 18:48:46
#> End of the simulation: 2021-09-06 18:48:46
### need to remove the locally adapted eval_tibbles()
### otherwise executing the examples would mask
### eval_tibbles from simTool-namespace.
rm(eval_tibbles)