In this vignette we use synthetic data to model the data generating process of longitudinal datasets. We will evaluate how the effect size and variability of the data affects the obtained estimates.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(rstanarm)
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
library(simstudy)
library(tidybayes)
library(biogrowleR)
ggplot2::theme_set(ggplot2::theme_bw(base_size = 15))
An example with synthetic data
We follow the tutorial at https://kgoldfeld.github.io/simstudy/articles/longitudinal.html to simulate data and understand the minimum number of samples necessary for capturing any effect depending on its size.
To understand the possible distributions the documentation at https://cran.r-project.org/web/packages/simstudy/vignettes/simstudy.html explains it.
set.seed(123)
# we define the starting measurement as 7 with a variance of 3, that
# is usually the prior
def <- simstudy::defData(
varname = "xbase",
dist = "normal",
formula = 7,
variance = 3
)
# we want 5 or 6 measurements for each subject, in the case, each gland.
def <- simstudy::defData(
def,
varname = "nCount",
dist = "uniformInt",
formula = "5;6"
)
# the average interval time between each measurement is 5 units with very
# low variance, we want this to be constant as the measurements are done
# together at the same day
def <- simstudy::defData(
def,
varname = "mInterval",
dist = "gamma",
formula = 5,
variance = 1e-100
)
# we assign the treatment groups using a 60%/40% ratio to control and
# treatment respectively
def <- simstudy::defData(
def,
varname = "treatment",
dist = "trtAssign",
formula = "0.6;0.4"
)
dt <- simstudy::genData(10, def)
dtPeriod <- simstudy::addPeriods(dt)
def2 <- simstudy::defDataAdd(
varname = "total_flux",
dist = "normal",
formula = "xbase + .1 * time + .2 * treatment + .1 * treatment * time",
variance = 3
)
dtPeriod <- simstudy::addColumns(def2, dtPeriod) %>%
dplyr::mutate(treatment = dplyr::case_when(
treatment == 0 ~ "ctrl",
TRUE ~ "treat"
))
biogrowleR::plot_summary_growth_curves(
dtPeriod %>%
dplyr::mutate(treatment = factor(treatment)),
column_days = "time",
column_conditions = "treatment",
columns_vals = "total_flux",
ylabs = expression("log"[10] * "(Total flux (photons/sec))")
)$total_flux +
ggplot2::labs(
color = "Treatment",
linetype = "Treatment"
) +
ggplot2::theme(legend.position.inside = c(0.2, 0.8))
And we can visualize each individual curve stratified by treatment.
dtPeriod %>%
ggplot2::ggplot(
aes(
x = time,
y = total_flux,
group = id
)
) +
ggplot2::geom_line(alpha = 0.6) +
ggplot2::stat_summary(
mapping = aes(group = treatment),
fun = mean,
geom = "line",
colour = "red",
linewidth = 2
) +
ggplot2::labs(
x = "Days after treatment",
y = expression("log"[10] * "(Total flux (photons/sec))")
) +
ggplot2::facet_wrap(~treatment, nrow = 1) +
ggplot2::theme_bw(base_size = 20)
The behavior is similar to the other tutorials we wrote on IVIS data. The curves they all have different starting points and also some have datapoints missing. In some of the curves there are measurements that would be considered outliers and can be further removed in the analysis.
For now we keep the whole dataset, as we know the data generating process. We calculate the bayesian effect size as described in the IVIS tutorial.
fit_01 <- rstanarm::stan_glmer(
total_flux ~ time * treatment + (1| id),
data = dtPeriod,
cores = 4,
iter = 4000,
prior_intercept = normal(7, 3),
prior = normal(0, 4),
refresh = 0
)
It takes around 3s running rstanarm
to obtain the
estimates using 4 parallel chains, i.e., 4 cores.
new_df_predict <- biogrowleR::get_df_predict(
dtPeriod,
column_days = "time",
column_conditions = "treatment"
)
results_estimates_global <- tidybayes::add_linpred_draws(
object = fit_01,
newdata = new_df_predict,
re_formula = NA,
value = "total_flux"
)
stats_from_data <- biogrowleR::get_stats_from_data(
dtPeriod,
column_days = "time",
column_conditions = "treatment",
column_vals = "total_flux"
)
plot_estimates <- biogrowleR::plot_posterior_estimates(
results_estimates_global,
stats_from_data,
column_days = "time",
column_conditions = "treatment",
column_vals = "total_flux",
title = NULL
)
plot_estimates +
ggplot2::labs(
color = "Treatment",
fill = "Treatment",
linetype = "Treatment"
) +
ggplot2::theme(legend.position.inside = c(0.2, 0.8))
And the estimates are shown below. Remember that the formula was . The time coefficient should be close to 0.07, the treatment close to 0.07 and the interaction term close to 0.1. The treatment coefficient is overestimated, but the two other coefficients are well calculated. The credible interval is very wide for the treatment estimate, meaning that there is not enough data to analyse the magnitude of the treatment coefficient alone.
fit_01 %>%
broom.mixed::tidy(conf.int = TRUE) %>%
dplyr::select(term, estimate, std.error, conf.low, conf.high, group) %>%
knitr::kable()
term | estimate | std.error | conf.low | conf.high | group |
---|---|---|---|---|---|
(Intercept) | 7.4645902 | 0.9245096 | 5.8910404 | 9.0069670 | NA |
time | 0.0790137 | 0.0321092 | 0.0229535 | 0.1337337 | NA |
treatmenttreat | -0.9523488 | 1.3791624 | -3.3302571 | 1.4150896 | NA |
time:treatmenttreat | 0.1309009 | 0.0553425 | 0.0388106 | 0.2234487 | NA |
sd_(Intercept).id | 2.1199471 | NA | NA | NA | id |
sd_Observation.Residual | 1.6960398 | NA | NA | NA | Residual |
And the figure below shows the effect size.
# first get the pooled sds that will be used in the next function.
pooled_sds <- biogrowleR::get_pooled_sds(
df = dtPeriod,
column_conditions = "treatment",
column_days = "time",
column_vals = "total_flux"
)
# days correspond to the first and last days of treatment in the case
# of using linear regression to model the data
days <- c(
min(results_estimates_global$time),
max(results_estimates_global$time)
)
# combination_conditions is the same as used before for the frequentist
# approach.
results_differences <- biogrowleR::get_slope_differences(
results_estimates = results_estimates_global,
combination_conditions = combn(unique(dtPeriod$treatment), m = 2),
days = days,
pooled_sds = pooled_sds,
column_conditions = "treatment",
column_vals = "total_flux",
column_days = "time"
)
biogrowleR::plot_eff_size_dist(
results_differences,
days
) +
ggplot2::labs(title = paste0("Day ", days[2], " - ", "Day ", days[1]))
The treatment conditions grows faster than the control condition, which is indicated by a positive effect size.
Changing variance and magnitude
We now try to understand the variance to sample size relationship given a fixed effect size. We will change the variance of total flux in the simulation from 1 to 5 and the number of experimental units per group from 3 to 20. The magnitude of the interaction term between time and treatment groups will vary from 0 to 1 with step increments of 0.1.
The data generating process is described by the following formula:
where initial measurement is a randomly generated value for each individual experimental unit following a normal distribution with mean 7 and variance 3, the usual found in in vivo experiments with mice. For the sake of simplicity in all models we will fix and . What will change is . We only use non-negative numbers as we are modeling the data linearly, so the sign of the true interaction term should not matter in this simulation study. For each experimental unit we simulate 5 to 6 measurements, where some of them will be missing the last measurement.
Next, we plot the estimated average posterior distribution of the interaction term by sample size and variance. We ignore on purpose any correlation structure among the experimental units, we are assuming only repeated measurements within each experimental unit. In practice it is different, mice are housed in the same cage or position in a well might affect the measurements. The sample sizes will vary from 2 to 15 for each treatment arm.
# simulate the datasets based on the parameters given. this will be used
# to analyze what is the best sample size given a variance and magnitude
# of the interaction between time and treatment
get_data_simulation <- function(
sample_size,
variance_magnitude,
magnitude_interaction
){
def <- simstudy::defData(
varname = "xbase",
dist = "normal",
formula = 7,
variance = 3
)
# we want 5 or 6 measurements for each subject, in the case, each gland.
def <- simstudy::defData(
def,
varname = "nCount",
dist = "uniformInt",
formula = "5;6"
)
# the average interval time between each measurement is 5 units with very
# low variance, we want this to be constant as the measurements are done
# together at the same day
def <- simstudy::defData(
def,
varname = "mInterval",
dist = "gamma",
formula = 5,
variance = 1e-100
)
# we assigne the treatment groups using a 60%/40% ratio to control and
# treatment respectively
def <- simstudy::defData(
def,
varname = "treatment",
dist = "trtAssign",
formula = "0.5;0.5"
)
dt <- simstudy::genData(sample_size*2, def)
dtPeriod <- simstudy::addPeriods(dt)
def2 <- simstudy::defDataAdd(
varname = "total_flux",
dist = "normal",
formula = paste0(
"xbase + .07 * time + .07 * treatment +",
magnitude_interaction, " * treatment * time"
),
variance = variance_magnitude
)
dtPeriod <- simstudy::addColumns(def2, dtPeriod) %>%
dplyr::mutate(treatment = dplyr::case_when(
treatment == 0 ~ "ctrl",
TRUE ~ "treat"
))
dtPeriod
}
# use rstanarm to model the data generating process
calculate_magnitude <- function(data_sim){
start.time <- proc.time()[3]
fit <- rstanarm::stan_glmer(
formula = total_flux ~ time * treatment + (1| id),
data = data_sim,
chains = 4,
iter = 2000,
refresh = 0
)
end.time <- proc.time()[3]
list(
fit = fit,
diff.time = end.time-start.time
)
}
# extract the estimates for the interaction term so it can be plotted
# easily afterwards
extract_magnitude_and_ci <- function(fit_and_time, conf.level = 0.5){
fit_and_time$fit %>%
broom.mixed::tidy(conf.int = TRUE, conf.level = conf.level) %>%
dplyr::select(term, estimate, std.error, conf.low, conf.high) %>%
dplyr::filter(term == "time:treatmenttreat") %>%
dplyr::mutate(running_time = fit_and_time$diff.time)
}
The code chunk belows generates the datasets and estimates given the parameters.
# set the parameters
sample_sizes <- seq(2, 15)
variance_magnitude <- seq(1, 5)
magnitude_interaction <- seq(0, 0.2, by = 0.05)
# calculate all combinations of the above parameters
params_simulation <- expand.grid(
sample_sizes,
variance_magnitude,
magnitude_interaction
) %>%
`colnames<-`(
c("sample_size", "variance_magnitude", "magnitude_interaction")
)
# simulate the data given the parameters
data_simulation <- apply(
params_simulation,
1,
\(x) get_data_simulation(x[1], x[2], x[3]),
simplify = FALSE
)
# estimate the data generating process
fits_magnitudes <- parallel::mclapply(
data_simulation,
calculate_magnitude,
mc.cores = 15
)
# obtain the estimates so we can plot it later
magnitude_cis <- lapply(
fits_magnitudes,
extract_magnitude_and_ci,
conf.level = 0.95
) %>%
dplyr::bind_rows() %>%
dplyr::mutate(id = 1:nrow(params_simulation)) %>%
dplyr::inner_join(
.,
params_simulation %>%
data.frame %>%
dplyr::mutate(id = 1:nrow(params_simulation)),
by = "id"
) %>%
dplyr::mutate(
diff_estimate_true_param = estimate - magnitude_interaction,
diff_conf_high = conf.high - magnitude_interaction,
width_ci = conf.high - conf.low
)
The figure below shows the difference in the extremes of the 95% credible intervals versus sample size per group stratified by the magnitude of the interaction term and variance used for simulating the data. The dashed line in each subplot corresponds to the true estimate that is indicated in the right column.
magnitude_cis %>%
ggplot2::ggplot(aes(
x = sample_size,
y = width_ci
)) +
ggplot2::geom_point() +
ggplot2::geom_smooth(method = 'loess', formula = 'y ~ x') +
ggplot2::geom_point(
mapping = aes(y = estimate, x = sample_size),
color = "red"
) +
ggplot2::geom_hline(
mapping = aes(yintercept = magnitude_interaction),
linetype = "dashed",
color = "red"
) +
ggplot2::facet_grid(
magnitude_interaction~variance_magnitude,
#scale = "free",
labeller = as_labeller(c(
"1" = "Var = 1",
"2" = "Var = 2",
"3" = "Var = 3",
"4" = "Var = 4",
"5" = "Var = 5",
"0" = "Magnitude = 0",
"0.05" = "Magnitude = 0.05",
"0.1" = "Magnitude = 0.1",
"0.15" = "Magnitude = 0.15",
"0.2" = "Magnitude = 0.2"
))
) +
ggplot2::labs(
x = "Sample size (per group)",
y = paste0(
"C.I. high - C.I. low (95% credible interval)",
"<br><i style='color:red'>Estimate of interaction term</i>"
),
title = paste0(
"<i style='color:red'>Dashed line</i> corresponds to the",
" original interaction term value"
)
) +
ggplot2::scale_x_continuous(
breaks = sample_sizes,
guide = guide_axis(angle = 75)
) +
ggplot2::theme_bw(base_size = 25) +
ggplot2::theme(
axis.title.y = ggtext::element_markdown(),
title = ggtext::element_markdown(),
axis.text.x = element_text(size = 15)
)
For bigger magnitudes it is not necessary to have larger sample sizes per group (over 8 samples). As the magnitude decreases, such as 0 or 0.05, it is imperative to have larger sample sizes. If variance is too large, such as 5, the estimate will still be noisy and even 15 samples are not enough. When variance is 4 and magnitude is 0.05, to have more accurate estimates with proper credible intervals, around 11~12 per group are necessary.
Running time
We calculated the running time in seconds depending on the sample size and variance. In average the running time scales linearly with the increase of samples in each group.
For each extra sample there is an increase of
as.numeric(round(time_avg[2, 2], digits = 2))
seconds in
average (figure below).
magnitude_cis %>%
ggplot2::ggplot(aes(
x = factor(sample_size),
y = running_time,
color = factor(variance_magnitude)
)) +
ggplot2::geom_boxplot(color = "black") +
ggplot2::geom_jitter() +
ggplot2::labs(
x = "Sample size (per group)",
y = "Running time (seconds)",
color = "Variance"
) +
ggplot2::theme_bw(base_size = 20) +
ggplot2::guides(color = guide_legend(override.aes = list(size = 5)))