Randomization of experimental units is necessary when are comparing different conditions. They ensure that we have less biased measures for the comparisons.
In this vignette we show how to use the package in order to perform the randomization before doing any experiment.
Randomization strategy
The idea is to use pre-treatment longitudinal measurements to ensure that we have balanced groups when randomizing. For this we calculate a linear regression for each mouse individually and get the slopes, they should be used to randomize the groups.
Each mouse has multiple glands and associated measurements, we use all the glands to calculate the linear regression.
Data description
We use real IVIS data for the next sections. The data is restricted to only pre-treatment measurements.
library(dplyr)
library(tidyr)
library(kableExtra)
library(ggplot2)
library(biogrowleR)
pdx_treat_ivis_long <- pdx_treat_ivis_long %>%
dplyr::filter(days_after_treatment <= 0) %>%
tidyr::separate(
id,
into = c("pdx", "mouse", "gland", "cage", "treatment"),
sep = "_",
remove = FALSE
) %>%
dplyr::mutate(treatment = NULL, color = "no_assignment")
The number of mouse and glands per mouse is shown below. Each row corresponds to one mouse and each column to one gland with injected human cells.
pdx_treat_ivis_long %>%
janitor::tabyl(mouse, gland) %>%
kableExtra::kbl() %>%
kableExtra::kable_classic(full_width = FALSE)
mouse | g3l | g3r | g4l | g4r |
---|---|---|---|---|
m02 | 6 | 6 | 6 | 6 |
m03 | 6 | 6 | 6 | 6 |
m05 | 6 | 6 | 6 | 6 |
m06 | 6 | 6 | 6 | 6 |
m08 | 6 | 6 | 6 | 6 |
m09 | 6 | 6 | 6 | 6 |
m10 | 6 | 6 | 6 | 6 |
m11 | 6 | 6 | 6 | 6 |
m13 | 6 | 6 | 6 | 6 |
All mice have six measurements in third and fourth glands of both sides. Moreover there are in total 9 mice in this experiment.
Exploring the data
As always, it is important to explore the data, in this case we use the functions in biogrowler to give us an idea of how the measurements look like.
biogrowleR::plot_outcome_by_id(
pdx_treat_ivis_long,
x = "days_after_treatment",
y = "log10tf",
wrap_on = "id",
color = "color",
legend_lab = NULL
) + ggplot2::theme(legend.position = "none")
We see that mouses 2 and 3 have noisy measurements, but the cells have a tendency to grow. On the other hand, mouse 13 seems to be all over the place. It could have been cells that were not well injected. All other mice have very nice growth curves as we expect.
Mouse 13 is not useful for downstream analysis, as the measurements are too noisy to be used for an experiment. Just be careful when looking at the y axis, as they vary for each subplot, this makes it easier to see the peculiarities of each gland measurements.
In total we have 8 mice for randomization, excluding mouse 13, which gives 4 mice for each condition.
Calculating the slopes
We start by getting the slopes for each mouse individually along with the standard deviation of the estimate. They both will be used for randomization, we want the samples to be split in a way that similar slopes with similar standard deviations are in different groups. Make sure that the last timepoint is 0, so all other values are actually negative. This assumption is used when calculating the linear regression.
slopes_pdx <- get_slopes(
data.frame(pdx_treat_ivis_long) %>%
dplyr::filter(mouse != "m13")
)
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#> always returns an ungrouped data frame and adjust accordingly.
#> ℹ The deprecated feature was likely used in the biogrowleR package.
#> Please report the issue at <https://gitlab.com/upbri/biogrowleR/-/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
We can now visualize the slopes and associated standard errors.
slopes_pdx %>%
dplyr::filter(term == "days_after_treatment") %>%
ggplot2::ggplot(aes(x = estimate, y = std.error, label = mouse)) +
ggplot2::geom_point(size = 3) +
ggrepel::geom_label_repel() +
ggplot2::theme_bw(base_size = 15)
The term (Intercept)
shows the estimated
of the total flux and the associated standard error. Ideally we want
those samples that have similar estimated
total flux and slope to be in different groups, to have a balanced
design.
slopes_pdx %>%
dplyr::filter(term == "(Intercept)") %>%
ggplot2::ggplot(aes(x = estimate, y = std.error, label = mouse)) +
ggplot2::geom_point(size = 3) +
ggrepel::geom_label_repel() +
ggplot2::theme_bw(base_size = 15)
And the last plot we show the estimates of the last day of measurement before any randomization and the estimates of the slope for each mouse.
slopes_pdx %>%
tidyr::pivot_wider(
names_from = c(term),
values_from = c(estimate, std.error),
id_cols = mouse
) %>%
janitor::clean_names() %>%
ggplot2::ggplot(aes(
x = estimate_intercept,
y = estimate_days_after_treatment,
label = mouse)
) +
ggplot2::geom_point(size = 3) +
ggrepel::geom_label_repel() +
ggplot2::theme_bw(base_size = 15)
When using the functions it is important to have the last measurement as day 0, since the regression framework uses this fact to infer the average measurement in the last day.
Defining the treatment groups
We use the function get_best_splitting
to get the
randomization between the two groups and then we plot the pre-treatment
curves based on the group assignment. We use the default parameters in
the package. The initial_seed
parameter can be changed to
use another initial splitting of the data. This is important because it
assures we can control the starting point at least.
df_long <- pdx_treat_ivis_long %>%
dplyr::filter(mouse != "m13")
# number of future treatments
nb_conditions <- 2
final_splitting <- get_best_splitting(
df_long,
nb_conditions,
block_factor = NULL,
initial_seed = 123
)
#> Iteration #50
#> Iteration #100
df_long <- df_long %>%
dplyr::left_join(., final_splitting$splitting, by = "mouse")
biogrowleR::plot_summary_growth_curves(
df_long,
column_conditions = "cluster"
)$log10tf
We see how the curves are close to each other, which is a good sign of the randomization. Since the groups were defined without any preference, you can assign control to the first group and treated to the second group. Or even better, we can sample using R the groups.
Group 1 should be control and group 2 should be treatment.
df_long <- df_long %>%
dplyr::mutate(treatment = dplyr::case_when(
cluster == 1 ~ "ctrl",
TRUE ~ "treat"
))
df_long %>% head
#> # A tibble: 6 × 11
#> days_after_treatment id pdx mouse gland cage tf log10tf color
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 -46 pdx_m05_g3l… pdx m05 g3l cxx 7.31e6 6.86 no_a…
#> 2 -46 pdx_m05_g3r… pdx m05 g3r cxx 1.20e6 6.08 no_a…
#> 3 -46 pdx_m05_g4l… pdx m05 g4l cxx 1.84e6 6.27 no_a…
#> 4 -46 pdx_m05_g4r… pdx m05 g4r cxx 2.54e6 6.41 no_a…
#> 5 -46 pdx_m06_g3l… pdx m06 g3l cxx 1.6 e6 6.20 no_a…
#> 6 -46 pdx_m06_g3r… pdx m06 g3r cxx 1.27e7 7.10 no_a…
#> # ℹ 2 more variables: cluster <chr>, treatment <chr>
The graph below shows that it was necessary to only swap a few times the labels of the clusters among the samples. This makes sense for such a small dataset, the number of combinations is small and overall the samples were similar to each other.
if (nrow(final_splitting$vals) > 1){
final_splitting$vals %>%
data.frame %>%
tidyr::pivot_longer(
cols = c(
diff_intercepts,
diff_slopes,
std_errors_intercepts,
std_errors_slopes
),
values_to = "diff",
names_to = "which_diff"
) %>%
ggplot2::ggplot(aes(x = iter, y = diff)) +
ggplot2::geom_line() +
ggplot2::facet_wrap(~which_diff, scales = "free_y", ncol = 1) +
ggplot2::theme_bw()
}
Another thing we can check is the cluster assignment by mouse.
Blocking by an external factor
Suppose that in an experiment there were different cages housing the mice. To avoid any possible confounding between the treatment and cage, we can block the cage while doing the randomization.
As an example we use the previous dataset and assume that the mice are separated by cages in the following order: 2 and 3; 5, 6 and 8; 9-11.
df_long <- pdx_treat_ivis_long %>%
dplyr::filter(mouse != "m13") %>%
dplyr::mutate(cage = dplyr::case_when(
mouse %in% paste0("m0", 1:4) ~ "c1",
mouse %in% paste0("m0", 5:8) ~ "c2",
mouse %in% c("m09", paste0("m1", 0:2)) ~ "c3"
))
The only thing that changes now in the randomization process is that
we use the function get_best_splitting
and specify the
cage
column in the block_factor
argument.
# number of future treatments
nb_conditions <- 2
final_splitting <- get_best_splitting(
df_long,
nb_conditions,
block_factor = "cage",
initial_seed = 413
)
#> Iteration #50
#> Iteration #100
# join the best splitting with the original dataframe so we can color
# the growth curves by the cluster assignment and randomize the treatments
# later
df_long <- df_long %>%
dplyr::left_join(., final_splitting$splitting, by = "mouse")
biogrowleR::plot_summary_growth_curves(
df_long,
column_conditions = "cluster"
)$log10tf
We now check the number of conditions per cage.
df_long %>%
dplyr::distinct(mouse, .keep_all = TRUE) %>%
janitor::tabyl(cage, cluster)
#> cage 1 2
#> c1 1 1
#> c2 1 2
#> c3 2 1
There are two mice in the first cage and each one is assigned to one of the clusters. In cage 2 there are 3 mice with one assigned to cluster 2 and the other two mice assigned to cluster 2. Whereas, for cage 3 the opposite happens.
Simulated Annealing
The procedure used for randomization is a special condition of the
simulated annealing. We use a geometric temperature:
,
where
by default. This can be changed in the function
get_best_splitting
. The rule used is based on the
differences between intercepts and the differences between slopes.
The energy function on the other hand uses only the difference in intercepts to explore the neighborhood of a splitting. The function is given by:
where corresponds to the sum of the pairwise differences of the intercepts from the assigned groups.