Generates data according to all provided constellations in data_grid and applies all provided constellations in proc_grid to them.
Usage
eval_tibbles(
  data_grid,
  proc_grid = expand_tibble(proc = "length"),
  replications = 1,
  discard_generated_data = FALSE,
  post_analyze = identity,
  summary_fun = NULL,
  group_for_summary = NULL,
  ncpus = 1L,
  cluster = NULL,
  cluster_seed = rep(12345, 6),
  cluster_libraries = NULL,
  cluster_global_objects = NULL,
  envir = globalenv(),
  simplify = TRUE
)
Arguments
- data_grid
a data.frame or tibble where the first column is a character vector with function names. The other columns contain parameters for the functions specified in the first column. Parameters with NA are ignored. If a column with the name .truth exists, then the corresponding entry is passed to the functions generated from proc_grid and to the function specified in post_analyze.
- proc_grid
similar to data_grid: the first column must contain function names. The other columns contain parameters for the functions specified in the first column. The data generated according to data_grid will always be passed to the first unspecified argument of the functions specified in the first column of proc_grid. If a function specified in proc_grid has an argument .truth, then the corresponding entry in the .truth column of data_grid is passed to the .truth parameter; if no column .truth exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.
- replications
number of replications for the simulation
- discard_generated_data
if TRUE, the generated data is deleted after all function constellations in proc_grid have been applied. Otherwise, ALL generated data sets will be part of the returned object.
- post_analyze
a convenience function that is applied directly after the data analyzing function. If this function has an argument .truth, then the corresponding entry in the .truth column of data_grid is passed to the .truth parameter; if no column .truth exists in data_grid, then all parameters used for the data generation are passed to the .truth parameter.
- summary_fun
named list of univariate functions used to summarize the results (numeric or logical) over the replications, e.g. list(mean = mean, sd = sd).
- group_for_summary
if the result returned by the data analyzing function or post_analyze is a data.frame with more than one row, one is usually interested in summarizing the results while grouping by some variables. These grouping variables can be passed as a character vector to group_for_summary.
- ncpus
a cluster of ncpus workers (R processes) is created on the local machine to conduct the simulation. If ncpus equals one, no cluster is created and the simulation is conducted by the current R process.
- cluster
a cluster generated by the parallel package that will be used to conduct the simulation. If cluster is specified, then ncpus will be ignored.
- cluster_seed
if the simulation is done in parallel, then the combined multiple-recursive generator from L'Ecuyer (1999) is used to generate random numbers. Thus cluster_seed must be a (signed) integer vector of length 6. The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.
- cluster_libraries
a character vector specifying the packages that should be loaded by the workers.
- cluster_global_objects
a character vector specifying the names of R objects in the global environment that should be exported to the global environment of every worker.
- envir
must be provided if the functions specified in data_grid or proc_grid are not part of the global environment.
- simplify
usually the result column is nested; by default, an attempt is made to unnest it.
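The constraints on cluster_seed described in the argument list can be checked before a long parallel run is started. The following is a minimal sketch in base R. is_valid_cluster_seed() is a hypothetical helper and not part of simTool, and mapping negative values to unsigned 32-bit integers by adding 2^32 is an assumption about how a signed seed is reinterpreted.

```r
# Hypothetical helper (not part of simTool): checks the documented
# constraints on a seed for the L'Ecuyer combined multiple-recursive generator.
is_valid_cluster_seed <- function(seed) {
  # The seed must consist of six non-missing whole numbers.
  if (!is.numeric(seed) || length(seed) != 6L || anyNA(seed)) {
    return(FALSE)
  }
  if (any(seed != round(seed))) {
    return(FALSE)
  }
  # Assumption: negative (signed) values are reinterpreted as unsigned
  # 32-bit integers by adding 2^32.
  u <- as.numeric(seed)
  neg <- u < 0
  u[neg] <- u[neg] + 2^32
  if (any(u < 0)) {
    return(FALSE)
  }
  first <- u[1:3]
  last <- u[4:6]
  # Neither triple may be all zero, and each triple has its own upper bound.
  !all(first == 0) && !all(last == 0) &&
    all(first < 4294967087) && all(last < 4294944443)
}

is_valid_cluster_seed(rep(12345, 6))        # TRUE (the default seed)
is_valid_cluster_seed(c(0, 0, 0, 1, 2, 3))  # FALSE (first three all zero)
```

A seed that fails such a check is best replaced before it is passed to eval_tibbles().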
Value
The returned object is a list of class eval_tibbles, where the element simulation contains the results of the simulation.
Note
If cluster is provided by the user, the function eval_tibbles will NOT stop the cluster. This has to be done by the user. Conducting parallel simulations by specifying ncpus will internally create a cluster and stop it after the simulation is done.
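The following sketch illustrates the case where the cluster is supplied by the user and therefore has to be stopped by the user. It assumes that simTool is installed and that two local workers are available. The grids are arbitrary illustrations, and wrapping the call in tryCatch() is one possible way to make sure the cluster is stopped even if the simulation fails.

```r
library(simTool)

# Cluster created by the user, so the user is responsible for stopping it.
# Two workers is an arbitrary choice for this illustration.
cl <- parallel::makeCluster(2L)

dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("median", "length"))

res <- tryCatch(
  # ncpus is ignored here because cluster is specified.
  eval_tibbles(dg, pg, replications = 2, cluster = cl),
  # eval_tibbles() does not stop a user-supplied cluster, so do it here.
  finally = parallel::stopCluster(cl)
)
```

When ncpus is used instead of cluster, no such cleanup is needed, because the internally created cluster is stopped after the simulation.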
Examples
rng <- function(data, ...) {
  ret <- range(data)
  names(ret) <- c("min", "max")
  ret
}
### The following line is only necessary
### if the examples are not executed in the global
### environment, which for instance is the case when
### the online documentation
### http://marselscheer.github.io/simTool/reference/eval_tibbles.html
### is built. In that case eval_tibbles() would search for the
### above defined function rng() in the global environment, where
### it does not exist!
eval_tibbles <- purrr::partial(eval_tibbles, envir = environment())
dg <- expand_tibble(fun = "rnorm", n = c(5L, 10L))
pg <- expand_tibble(proc = c("rng", "median", "length"))
eval_tibbles(dg, pg, replications = 2, simplify = FALSE)
#> # A tibble: 12 × 5
#> fun n replications proc results
#> <chr> <int> <int> <chr> <list>
#> 1 rnorm 5 1 rng <dbl [2]>
#> 2 rnorm 5 1 median <dbl [1]>
#> 3 rnorm 5 1 length <int [1]>
#> 4 rnorm 5 2 rng <dbl [2]>
#> 5 rnorm 5 2 median <dbl [1]>
#> 6 rnorm 5 2 length <int [1]>
#> 7 rnorm 10 1 rng <dbl [2]>
#> 8 rnorm 10 1 median <dbl [1]>
#> 9 rnorm 10 1 length <int [1]>
#> 10 rnorm 10 2 rng <dbl [2]>
#> 11 rnorm 10 2 median <dbl [1]>
#> 12 rnorm 10 2 length <int [1]>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 13833709
#> Start of the simulation: 2025-04-10 18:38:39.315084
#> End of the simulation: 2025-04-10 18:38:39.315604
eval_tibbles(dg, pg, replications = 2)
#> # A tibble: 16 × 5
#> fun n replications proc results
#> <chr> <int> <int> <chr> <dbl>
#> 1 rnorm 5 1 rng 0.112
#> 2 rnorm 5 1 rng 1.62
#> 3 rnorm 5 1 median 0.244
#> 4 rnorm 5 1 length 5
#> 5 rnorm 5 2 rng -1.91
#> 6 rnorm 5 2 rng 1.07
#> 7 rnorm 5 2 median -0.279
#> 8 rnorm 5 2 length 5
#> 9 rnorm 10 1 rng -1.91
#> 10 rnorm 10 1 rng 2.76
#> 11 rnorm 10 1 median 0.0583
#> 12 rnorm 10 1 length 10
#> 13 rnorm 10 2 rng -2.27
#> 14 rnorm 10 2 rng 2.68
#> 15 rnorm 10 2 median 0.0244
#> 16 rnorm 10 2 length 10
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 21663550
#> Start of the simulation: 2025-04-10 18:38:39.355151
#> End of the simulation: 2025-04-10 18:38:39.355483
eval_tibbles(dg, pg,
  replications = 2,
  post_analyze = purrr::compose(as.data.frame, t)
)
#> # A tibble: 12 × 7
#> fun n replications proc min max V1
#> <chr> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 rnorm 5 1 rng -1.18 1.11 NA
#> 2 rnorm 5 1 median NA NA -0.246
#> 3 rnorm 5 1 length NA NA 5
#> 4 rnorm 5 2 rng -1.70 1.07 NA
#> 5 rnorm 5 2 median NA NA 0.132
#> 6 rnorm 5 2 length NA NA 5
#> 7 rnorm 10 1 rng -1.47 1.34 NA
#> 8 rnorm 10 1 median NA NA 0.260
#> 9 rnorm 10 1 length NA NA 10
#> 10 rnorm 10 2 rng -2.61 1.92 NA
#> 11 rnorm 10 2 median NA NA 0.495
#> 12 rnorm 10 2 length NA NA 10
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 964854
#> Start of the simulation: 2025-04-10 18:38:39.393957
#> End of the simulation: 2025-04-10 18:38:39.40142
eval_tibbles(dg, pg, replications = 2, summary_fun = list(mean = mean, sd = sd))
#> # A tibble: 12 × 8
#> fun n replications summary_fun proc min max value
#> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 rnorm 5 1 mean rng -0.196 1.37 NA
#> 2 rnorm 5 1 mean median NA NA 0.716
#> 3 rnorm 5 1 mean length NA NA 5
#> 4 rnorm 5 1 sd rng 0.224 0.431 NA
#> 5 rnorm 5 1 sd median NA NA 0.325
#> 6 rnorm 5 1 sd length NA NA 0
#> 7 rnorm 10 1 mean rng -1.72 1.55 NA
#> 8 rnorm 10 1 mean median NA NA -0.185
#> 9 rnorm 10 1 mean length NA NA 10
#> 10 rnorm 10 1 sd rng 0.0509 0.812 NA
#> 11 rnorm 10 1 sd median NA NA 0.621
#> 12 rnorm 10 1 sd length NA NA 0
#> Number of data generating functions: 2
#> Number of analyzing procedures: 3
#> Number of replications: 2
#> Estimated replications per hour: 238203
#> Start of the simulation: 2025-04-10 18:38:39.439408
#> End of the simulation: 2025-04-10 18:38:39.469634
regData <- function(n, SD) {
  data.frame(
    x = seq(0, 1, length.out = n),
    y = rnorm(n, sd = SD)
  )
}
eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 3
)
eg
#> # A tibble: 12 × 7
#> fun n SD replications proc formula results
#> <chr> <int> <int> <int> <chr> <chr> <list>
#> 1 regData 5 1 1 lm y~x <lm>
#> 2 regData 5 1 1 lm y~I(x^2) <lm>
#> 3 regData 5 1 2 lm y~x <lm>
#> 4 regData 5 1 2 lm y~I(x^2) <lm>
#> 5 regData 5 1 3 lm y~x <lm>
#> 6 regData 5 1 3 lm y~I(x^2) <lm>
#> 7 regData 5 2 1 lm y~x <lm>
#> 8 regData 5 2 1 lm y~I(x^2) <lm>
#> 9 regData 5 2 2 lm y~x <lm>
#> 10 regData 5 2 2 lm y~I(x^2) <lm>
#> 11 regData 5 2 3 lm y~x <lm>
#> 12 regData 5 2 3 lm y~I(x^2) <lm>
#> Number of data generating functions: 2
#> Number of analyzing procedures: 2
#> Number of replications: 3
#> Estimated replications per hour: 823953
#> Start of the simulation: 2025-04-10 18:38:39.507479
#> End of the simulation: 2025-04-10 18:38:39.520586
### keep the row names of the coefficient matrix as a column named term
preserve_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}
eg <- eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(preserve_rownames, coef, summary),
  # post_analyze = broom::tidy, # is a nice out-of-the-box alternative
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
#> Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the dplyr package.
#> Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
eg$simulation
#> # A tibble: 16 × 12
#> fun n SD replications summary_fun proc formula term Estimate
#> <chr> <int> <int> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 regData 5 1 1 mean lm y~x (Interc… -0.137
#> 2 regData 5 1 1 mean lm y~x x 0.148
#> 3 regData 5 1 1 mean lm y~I(x^2) (Interc… -0.121
#> 4 regData 5 1 1 mean lm y~I(x^2) I(x^2) 0.154
#> 5 regData 5 1 1 sd lm y~x (Interc… 0.330
#> 6 regData 5 1 1 sd lm y~x x 0.298
#> 7 regData 5 1 1 sd lm y~I(x^2) (Interc… 0.240
#> 8 regData 5 1 1 sd lm y~I(x^2) I(x^2) 0.913
#> 9 regData 5 2 1 mean lm y~x (Interc… -1.05
#> 10 regData 5 2 1 mean lm y~x x 2.58
#> 11 regData 5 2 1 mean lm y~I(x^2) (Interc… -0.851
#> 12 regData 5 2 1 mean lm y~I(x^2) I(x^2) 2.91
#> 13 regData 5 2 1 sd lm y~x (Interc… 0.754
#> 14 regData 5 2 1 sd lm y~x x 0.492
#> 15 regData 5 2 1 sd lm y~I(x^2) (Interc… 0.667
#> 16 regData 5 2 1 sd lm y~I(x^2) I(x^2) 0.655
#> # ℹ 3 more variables: `Std. Error` <dbl>, `t value` <dbl>, `Pr(>|t|)` <dbl>
dg <- expand_tibble(fun = "rexp", rate = c(10, 100), n = c(50L, 100L))
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    mu <- 1 / .truth$rate
    ttest$conf.int[1] <= mu && mu <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 8
#> fun rate n replications summary_fun proc conf.level value
#> <chr> <dbl> <int> <int> <chr> <chr> <dbl> <dbl>
#> 1 rexp 10 50 1 mean t.test 0.8 0.9
#> 2 rexp 10 50 1 mean t.test 0.9 0.9
#> 3 rexp 10 50 1 mean t.test 0.95 0.9
#> 4 rexp 10 50 1 sd t.test 0.8 0.316
#> 5 rexp 10 50 1 sd t.test 0.9 0.316
#> 6 rexp 10 50 1 sd t.test 0.95 0.316
#> 7 rexp 100 50 1 mean t.test 0.8 0.6
#> 8 rexp 100 50 1 mean t.test 0.9 0.7
#> 9 rexp 100 50 1 mean t.test 0.95 0.7
#> 10 rexp 100 50 1 sd t.test 0.8 0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 253289
#> Start of the simulation: 2025-04-10 18:38:39.719946
#> End of the simulation: 2025-04-10 18:38:39.862076
dg <- dplyr::bind_rows(
  expand_tibble(fun = "rexp", rate = 10, .truth = 1 / 10, n = c(50L, 100L)),
  expand_tibble(fun = "rnorm", .truth = 0, n = c(50L, 100L))
)
pg <- expand_tibble(proc = c("t.test"), conf.level = c(0.8, 0.9, 0.95))
et <- eval_tibbles(dg, pg,
  ncpus = 1,
  replications = 10^1,
  post_analyze = function(ttest, .truth) {
    ttest$conf.int[1] <= .truth && .truth <= ttest$conf.int[2]
  },
  summary_fun = list(mean = mean, sd = sd)
)
et
#> # A tibble: 24 × 9
#> fun rate .truth n replications summary_fun proc conf.level value
#> <chr> <dbl> <dbl> <int> <int> <chr> <chr> <dbl> <dbl>
#> 1 rexp 10 0.1 50 1 mean t.test 0.8 0.9
#> 2 rexp 10 0.1 50 1 mean t.test 0.9 0.9
#> 3 rexp 10 0.1 50 1 mean t.test 0.95 1
#> 4 rexp 10 0.1 50 1 sd t.test 0.8 0.316
#> 5 rexp 10 0.1 50 1 sd t.test 0.9 0.316
#> 6 rexp 10 0.1 50 1 sd t.test 0.95 0
#> 7 rexp 10 0.1 100 1 mean t.test 0.8 0.6
#> 8 rexp 10 0.1 100 1 mean t.test 0.9 0.7
#> 9 rexp 10 0.1 100 1 mean t.test 0.95 0.8
#> 10 rexp 10 0.1 100 1 sd t.test 0.8 0.516
#> # ℹ 14 more rows
#> Number of data generating functions: 4
#> Number of analyzing procedures: 3
#> Number of replications: 10
#> Estimated replications per hour: 499936
#> Start of the simulation: 2025-04-10 18:38:39.905763
#> End of the simulation: 2025-04-10 18:38:39.977772
### The locally adapted eval_tibbles() has to be removed,
### otherwise it would keep masking
### eval_tibbles() from the simTool namespace.
rm(eval_tibbles)