Introduction
Current drug trials involve screening dozens of drugs and their potential combinations. This type of high-throughput drug screening provides more data and poses a challenge for their analysis, with longitudinal data making it more difficult to analyze the results. Here we propose a pipeline using Bayesian inference to interpret this type of data. We performed a high-throughput drug screening on a 96-well plate with the estrogen receptor (ER)-positive breast cancer cell line T47D. Cells were infected with an RFP-Luciferase virus. The cell line was treated with a single dose or a combination of drugs for 4 days in a 2D confluency assay, and fluorescence measurements were taken every 6 hours. A total of 25 treatments were used.
In our example, the negative control is the empty well and the positive control is fulvestrant. All the comparisons will be made to the control condtion DMSO 0.5%.
This workflow is only a recommendation. Always take into account the structure of your dataset and take sound decisions based on that.
Loading necessary packages
These packages are essential for the analysis, install them using the
function install.packages
.
Formatting the data
Different from the IVIS imaging example, we start with the formated data having the following structure:
df_confluency_htds %>%
head
#> # A tibble: 6 × 6
#> time_hr treatment value plate id well
#> <dbl> <chr> <dbl> <fct> <chr> <chr>
#> 1 0 E2 10 nM 18.1 1 E2 10 nM (B1) 1 (B1)
#> 2 0 E2 10 nM 17.8 1 E2 10 nM (B2) 1 (B2)
#> 3 0 E2 10 nM 21.2 1 E2 10 nM (B3) 1 (B3)
#> 4 0 E2 10 nM 22.0 1 E2 10 nM (B4) 1 (B4)
#> 5 0 E2 10 nM 21.5 1 E2 10 nM (B5) 1 (B5)
#> 6 0 E2 10 nM 25.8 1 E2 10 nM (B6) 1 (B6)
There are in total 7 columns, including the time (in hours), treatment, the original confluency measurement, ID, well in the plate and log2 signal. For the analysis in general we use the log2 signal. This dataset is available with the package so this tutorial can be followed.
Initial sanity checks
Once the data is loaded and formatted, we can start our exploration
journey. We first check the number of treatments per plate using the
tabyl
function from the package janitor
.
df_confluency_htds %>%
janitor::tabyl(treatment, plate)
#> treatment 1 2 3
#> 1230 100 nM 102 0 0
#> Abem 100 nM 0 102 0
#> ARV110 30 nM 102 0 0
#> ARV471 100 nM 102 0 0
#> DHT 10 nM 102 0 0
#> DHT 10 nM ARV110 30 nM 102 0 0
#> DMSO 0.5% 0 0 170
#> E2 10 nM 102 0 0
#> E2 10 nM ARV471 100 nM 102 0 0
#> E2 10 nM DHT 10 nM 102 0 0
#> E2 10 nM P4 10 nM 102 0 0
#> E2 10 nM P4 10 nM DHT 10 nM 102 0 0
#> Empty 0 102 170
#> Fulv 100 nM 0 0 170
#> Fulv 100 nM Abem 100 nM 0 102 0
#> Fulv 100 nM E2 10 nM 0 0 170
#> Fulv 100 nM Olap 1 µM 0 102 0
#> Fulv 100 nM Palb 100 nM 0 102 0
#> HC 1 µM 0 102 0
#> Mif 1 µM 0 102 0
#> Olap 1 µM 0 102 0
#> P4 10 nM 102 0 0
#> P4 10 nM 1230 100 nM 102 0 0
#> P4 10 nM DHT 10 nM 102 0 0
#> Palb 100 nM 0 102 0
Visualizing the data
Data is in the right format already, so we can proceed and visualize
it. This involves the use of the package ggplot2
. We
formatted the data already thinking in using ggplot2
, so
these plots are very easy to come with.
First plot is the raw signal from the control over time. Instead of
displaying the dots on top of each other for each time point, we used
geom_jitter
, so the points are not overlapping too much and
we can see the trends better on the data.
df_confluency_htds %>%
# select only the control samples at first
dplyr::filter(treatment %in% c("DMSO 0.5%")) %>%
dplyr::arrange(plate) %>%
dplyr::mutate(well = factor(well, unique(well))) %>%
ggplot2::ggplot(aes(x = time_hr, y = value, color = plate)) +
ggplot2::geom_jitter() +
# calculates the average for each timepoint individually
ggplot2::geom_smooth(method = "loess", formula = "y~x") +
ggplot2::stat_summary(
geom = "line",
fun = "mean",
linewidth = 2,
linetype="dashed"
) +
ggplot2::facet_wrap(~well) +
ggplot2::theme_bw(base_size = 20)
In general confluency is growing over time. We now plot some extra conditions together, such as DMSO, Empty, estradiol (E2), Fulvestrant (Fulv) and E2+Fulv.
df_confluency_htds %>%
dplyr::filter(treatment %in% c(
"DMSO 0.5%",
"Empty",
"E2 10 nM",
"Fulv 100 nM E2 10 nM",
"Fulv 100 nM"
)) %>%
ggplot2::ggplot(aes(x = time_hr, y = log2(value), color = treatment)) +
# calculates the average for each timepoint individually
ggplot2::geom_smooth(method = "loess", formula = "y~x") +
ggplot2::stat_summary(
geom = "line",
fun = "mean",
linewidth = 2,
linetype="dashed"
) +
ggplot2::facet_wrap(~plate) +
ggplot2::theme_bw(base_size = 20)
We see already in the last plate that there is some indication that the drug screening worked. Fulvestrant did reduce confluency over time and the empty had a slightly higher confluency compared to the control DSMO 0.5%.
Next we plot all the samples together. But as we can see, it is very hard to distinguish all the conditions at the same time. Later we will perform the analysis and show the best way to summarise the information of this drug screening.
df_confluency_htds %>%
ggplot2::ggplot(aes(x = time_hr, y = log2(value), color = treatment)) +
ggplot2::geom_jitter() +
# calculates the average for each timepoint individually
ggplot2::geom_smooth(method = "loess", formula = "y~x") +
ggplot2::stat_summary(
geom = "line",
fun = "mean",
linewidth = 2,
linetype="dashed"
) +
ggplot2::labs(
x = "Time (hours)",
y = "log2(signal)",
title = "Confluency measurements over time",
color = "Treatment"
) +
ggplot2::guides(color = guide_legend(ncol = 1)) +
ggplot2::theme_bw(base_size = 20)
If you want to add more treatments to this plot you can, but it
starts getting cluttered. We can instead use facet_wrap
and
plot them individually sharing the same y and x-axis as below by adding
an extra line. We can change the width and height of the figure by
specifying it in the chunk options. Here we use
fig.width = 10
, fig.height = 8
.
df_confluency_htds %>%
ggplot2::ggplot(aes(x = time_hr, y = log2(value))) +
ggplot2::geom_jitter() +
ggplot2::stat_summary(
geom = "line",
fun = "mean",
linewidth = 2,
linetype="dashed"
) +
ggplot2::geom_smooth(method = "loess", formula = "y~x") +
# here is the extra line
ggplot2::facet_wrap(~treatment) +
ggplot2::theme_bw()
This has the disadvantage that it is more difficult to see the differences between the conditions when they are small, but we can at least analyse them individually for quality checking purposes.
Overall the data seems fine, no big major outliers that we need to remove or address.
Data-generating process
The next step is to describe how the data was generated by using statistical models. This step is called modelling the data-generating process. It is crucial to do this because we can better understand the uncertainty in the data and in the data-acquiring process.
To do so, we consider time as a continuous variable and we will fit a linear splines with an interaction term between treatment and time.
Based on previous plots we can use either use a simple linear regression or a linear spline with an inflection point at 25h. Some of the treatments show an inflection at 25h.
fit_stan_linear <- rstanarm::stan_glm(
formula = log2_signal ~ time_hr * treatment,
data = df_confluency_htds,
chains = 4,
iter = 4000,
cores = 4
)
fit_stan_linear_2 <- rstanarm::stan_glmer(
formula = log2_signal ~ time_hr * treatment + (1 | id),
data = df_confluency_htds,
chains = 4,
iter = 4000,
cores = 4
)
fit_stan_splines <- rstanarm::stan_glmer(
formula = log2_signal ~
bs(time_hr, knots = c(25), degree = 1) * treatment + (1 | id),
data = df_confluency_htds,
chains = 4,
iter = 4000,
cores = 4
)
# we now add the loo so we can compare the models
fit_01 <- loo::loo(fit_stan_linear)
fit_02 <- loo::loo(fit_stan_linear_2)
fit_03 <- loo::loo(fit_stan_splines)
The values below show the ELPD (expected log pointwise predictive density) differences between the models. The top model is the first in the list. If the difference is big between the models, it means that the top model describes the best the data generating process among all the options fitted.
In our case we see that using splines better describes the data generating process. We choose this model for the downstream analysis.
loo::loo_compare(
fit_01,
fit_02,
fit_03
)
#> elpd_diff se_diff
#> fit_stan_splines 0.0 0.0
#> fit_stan_linear_2 -361.4 42.7
#> fit_stan_linear -2832.9 70.0
The next chunk is just data formatting so we can use it to do our visualization with the model. Below is the figure with the fitted log2(signal) given by the model.
# this gets a dataframe to be used when prediction the new log2_signal
# values for plotting later on
df_to_predict <- biogrowleR::get_df_predict(
df = df_confluency_htds,
column_days = "time_hr",
spline_degree = 1,
spline_knots = c(25)
)
# get the fitted log2_signal
results_estimates_global <- tidybayes::add_linpred_draws(
object = fit_stan_splines,
newdata = df_to_predict,
re_formula = NA,
value = "log2_signal",
ndraws = 2000
)
# and finally plot the fitted data.
results_estimates_global %>%
ggplot2::ggplot(aes(
x = time_hr,
y = log2_signal,
fill = treatment,
color = treatment
)) +
ggdist::stat_lineribbon(alpha = 1/4, .width = c(.25, .5, .8)) +
ggplot2::labs(
title = "T47D: 4 days Incucyte Confluency",
subtitle = "Uncertainty intervals are on the 25%, 50% and 80% levels",
x = "Time (hours)",
y = "log2(signal)",
color = "Treatment",
fill = "Treatment"
) +
ggplot2::theme_bw(base_size = 20) +
ggplot2::guides(
color = guide_legend(ncol = 1),
fill = guide_legend(ncol = 1)
)
Since we have many conditions it is very difficult to see the differences. What is the best way to show the differences?
And now we plot the difference between control and the other conditions.
reference_column <- "DMSO 0.5%"
columns_to_subtract <- setdiff(
unique(results_estimates_global$treatment),
c(reference_column)
)
df_difference <- results_estimates_global %>%
# we need to pivot the estimated log2 signal into several columns
# one for each treatment, so we can then select the reference column
# and subtract
tidyr::pivot_wider(
id_cols = c(.draw, time_hr),
names_from = "treatment",
values_from = "log2_signal"
) %>%
dplyr::mutate(
dplyr::across(
dplyr::all_of(columns_to_subtract),
~ . - !!sym(reference_column),
.names = paste0("{col}_vs_", reference_column)
)
) %>%
tidyr::pivot_longer(
cols = dplyr::all_of(paste0(
columns_to_subtract, "_vs_", reference_column
)),
values_to = "difference",
names_to = "condition"
)
df_difference %>%
ggplot2::ggplot(aes(
x = time_hr,
y = difference,
fill = condition,
color = condition
)) +
ggdist::stat_lineribbon(alpha = 1/4, .width = c(.25, .5, .8)) +
ggplot2::labs(
title = "T47D: 4 days Incucyte",
subtitle =
"Intervals correspond to 25%, 50% and 80% uncertainty levels",
x = "Time (hours)",
y = "Difference in log2(signal) to DMSO 0.5%",
fill = "Treatment",
color = "Treatment"
) +
ggplot2::guides(
fill = guide_legend(ncol = 1),
color = guide_legend(ncol = 1)
) +
ggplot2::theme_bw(base_size = 20)
It is difficult to appreciate the differences for all the conditions
at the same time. We can plot each comparison individually by using
facet_wrap
once again. The uncertainty intervals correspond
to the following levels: 25%, 50% and 80%. You can change this numbers,
either include more or remove them on the .width
parameter.
df_difference %>%
ggplot2::ggplot(aes(
x = time_hr,
y = difference,
fill = condition
)) +
ggdist::stat_lineribbon(
alpha = .5,
.width = c(.25, .5, .8)
) +
ggplot2::facet_wrap(~ condition) +
ggplot2::theme_bw(base_size = 20) +
ggplot2::labs(
title = "T47D: 4 days Incucyte",
subtitle = "Uncertainty intervals are on the 25%, 50% and 80% levels",
x = "Time (hours)",
y = "log2(signal)"
) +
ggplot2::theme_bw(base_size = 20) +
ggplot2::theme(legend.position = "None")
Forest plot
One aspect of interest is the signal difference at a specific
timepoint. We can have a forest plot-like figure. For this we use the
function ggdist::gradientinterval
so we display the
uncertainty intervals for specific timepoints. Here we compare at 96h
and 0h.
Since the starting points are different we should actually look at
the differences of the estimates while taking the time 0 into
consideration. It is similar to a difference in the slopes, but here we
used splines to calculate the estimates. So what we do now is to
calculate a difference of differences, which essentially amounts to
calculate a difference in the slopes. To get the data and obtain the
forest plot we use the function get_difference_summarized
,
that returns a dataframe with the difference draws given the timepoints
of interest.
time_hr_interest <- c(0,96)
df_difference_summarized <- biogrowleR::get_difference_summarized(
results_estimates_global,
time_hr_interest,
"DMSO 0.5%"
)
p1 <- df_difference_summarized$differences %>%
ggplot2::ggplot(aes(
x = difference,
y = condition,
fill = condition
)) +
tidybayes::stat_gradientinterval(position = "dodge") +
ggplot2::geom_vline(xintercept = 0, linetype = "dashed") +
ggplot2::geom_vline(
xintercept = df_difference_summarized$mean_differences %>%
dplyr::filter(condition == "Fulv 100 nM vs DMSO 0.5%") %>%
dplyr::pull(mean_value),
linetype = 6
) +
ggplot2::geom_vline(
xintercept = df_difference_summarized$mean_differences %>%
dplyr::filter(condition == "Empty vs DMSO 0.5%") %>%
dplyr::pull(mean_value),
linetype = 6
) +
ggplot2::labs(
x = "Difference in log2 signal",
y = NULL,
title = paste0(
"Difference between ", time_hr_interest[2], "h\nand ",
time_hr_interest[1], "h"
)
) +
ggplot2::scale_fill_viridis_d() +
ggplot2::theme_bw(base_size = 20) +
ggplot2::theme(legend.position = "none")
p1
Interestingly DHT + ARV110 work better together than either drugs alone. ARV110 is an androgen receptor(AR) PROTAC.
Next we can plot the dynamics for specific conditions of interest.
df_difference_sub <- df_difference %>%
dplyr::filter(condition %in% c(
"E2 10 nM_vs_DMSO 0.5%",
"DHT 10 nM ARV110 30 nM_vs_DMSO 0.5%",
"ARV110 30 nM_vs_DMSO 0.5%",
"Fulv 100 nM E2 10 nM_vs_DMSO 0.5%",
"Fulv 100 nM_vs_DMSO 0.5%",
"Empty_vs_DMSO 0.5%"
))
df_difference_sub %>%
ggplot2::ggplot(aes(
x = time_hr,
y = difference,
fill = condition
)) +
ggdist::stat_lineribbon(
alpha = .5,
.width = c(.25, .5, .8)
) +
ggplot2::facet_wrap(
~ condition,
nrow = 2,
labeller = as_labeller(
gsub("_vs_DMSO 0.5%", "", unique(df_difference_sub$condition)) %>%
`names<-`(unique(df_difference_sub$condition))
)
) +
ggplot2::labs(
x = "Time (hours)",
y = "Difference vs DMSO 0.5"
) +
ggplot2::theme_bw(base_size = 20) +
ggplot2::theme(legend.position = "None")
We see that E2 + Fulvestrant or Fulvestrant alone takes some time to really start making the confluency decrease.
Heatmap
Instead of the forest plot we can have a heatmap. I don’t recommend to use only this type of plot since you are throwing away all the uncertainty. If you wish to use this plot, I would advise to use it with care and always include another summary of the data highlighting the uncertainty.
We plot a heatmap that is colored by the average difference in the specific timepoint. On the y axis we have the comparisons and the x axis correspond to the time points. We color code by the average difference.
df_difference %>%
dplyr::mutate(condition = stringr::str_replace_all(condition, "_", " "),) %>%
dplyr::group_by(time_hr, condition) %>%
dplyr::summarise(
mean_difference = mean(difference),
.groups = "keep"
) %>%
dplyr::mutate(
condition = factor(
condition,
levels = df_difference_summarized$differences$condition %>% levels
)
) %>%
ggplot2::ggplot(aes(
x = time_hr,
y = condition,
fill = mean_difference
)) +
ggplot2::geom_tile(color = "white") +
ggplot2::scale_fill_viridis_c() +
ggplot2::labs(
x = "Time (hours)",
y = NULL,
fill = "Mean\ndifference",
title = stringr::str_wrap(
"Negative values correspond to a decrease in signal",
width = 35
)
) +
ggplot2::theme_bw(base_size = 20)
The plot is equivalent to the figure below. This is the same plot as the one provided earlier with the uncertainty bands. Without the bands it is easier to see the differences for several comparisons at the same time.
df_difference %>%
dplyr::mutate(condition = stringr::str_replace_all(condition, "_", " "),) %>%
dplyr::group_by(time_hr, condition) %>%
dplyr::summarise(
mean_difference = mean(difference),
.groups = "keep"
) %>%
ggplot2::ggplot(aes(
x = time_hr,
color = condition,
y = mean_difference
)) +
ggplot2::geom_line() +
ggplot2::labs(
x = "Time (hours)",
color = "Conditions",
y = "Mean difference"
) +
ggplot2::guides(color = guide_legend(ncol = 1)) +
ggplot2::theme_bw(base_size = 15)
The advantage of doing the statistical modelling is that you can extrapolate the points in between, highlighting the possible true differences between the treatments and improving the visualization process. Otherwise the heatmap might have been too sparse.