Skip to contents

Spline Regression using R

In this part of the workflow, we will use R for spline regression.

A spline is a special function defined piecewise by polynomials. The term ‘spline’ refers to a broad class of functions used in applications that require interpolation and/or smoothing of data. The data may be one-dimensional or multi-dimensional.

Spline regression is one of the non-parametric regression techniques in which the data set is divided into intervals bounded by points that we call nodes or knots.

Splines provide a means of smooth interpolation between fixed points, called knots. Polynomial regression is calculated between knots. In other words, splines are series of polynomial segments that meet at knots.

We will use the spline package in R, which includes the bs function to create a b-spline term in a regression model.

library(dplyr)
library(ggplot2)
library(cowplot)

# packages to perform the calculations and tests
library(splines) # spline models
library(lme4) # frequentist
library(lmtest)

# load the support package biogrowleR to help in the analysis
library(biogrowleR)

Let’s first have a quick look at our data, by plotting the average log 10 total flux for all conditions in the same graph.

t110_plots <- biogrowleR::plot_summary_growth_curves(t110_ivis_long)
cowplot::plot_grid(plotlist = t110_plots)

As previously observed, curves have different starting points and there is high variability in the measurements, especially in the control condition.

We want to apply the spline model to compare the growth rates between the conditions.

Calculating the models

Similarly to the linear model approach, we need to calculate all possible pairwise combinations, by means of a loop through all possible comparisons.

Since the growth rate is almost linear for most conditions, we decided to build s simple linear spline model.

When fitting a spline model, it is necessary to decide how many knots to use and where to place them. As we have said, knots are the internal breakpoints that define the spline.

In our case, looking at the growth curves, we decided to place a knot at the 22-day time point where we estimated that the growth rate varied most rapidly for most conditions. We could have inserted a second knot on day 29, but this could have led to overfitting of the data.

# Calculate all possible pairwise combinations. 
combination_treatments <- combn(
    t110_ivis_long$treatment %>% unique %>% as.vector,
    m = 2
)
# Calculate the tests.
# We apply the bs function (B-Spline Basis for Polynomial Splines) 
# to generate the B-spline basis matrix for a polynomial 
# spline. Missing values are allowed.
results_spline_models <- apply(
    combination_treatments,
    2,
    function(treatments, df){
        # calculate the baseline model, which corresponds to a model
        # without the interaction. whenever comparing two models, one needs
        # to have a baseline and the one with the interaction term on top
        # of the baseline. Note to calculate the lr-test, we need to use
        # the maximum likelihood estimation, so be sure to set REML to FALSE
        # when fitting the model. Note also that the knot parameter for the 
        # spline model is specified as knots=c(22) and degree is 1, since
        # are dealing with linear splines
        fit_01 <- lme4::lmer(
            log10tf ~ bs(days_after_treatment, knots=c(22), degree=1) + 
                treatment + (1 | id),
            data = df %>% dplyr::filter(treatment %in% treatments),
            REML = FALSE
        )
        
        # now with the interaction term
        fit_02 <- lme4::lmer(
            log10tf ~ bs(days_after_treatment, knots=c(22), degree=1) * 
                treatment + (1 | id),
            data = df %>% dplyr::filter(treatment %in% treatments),
            REML = FALSE
        )
        
        test_effect <- lmtest::lrtest(fit_02, fit_01) %>% 
            broom::tidy()
        
        list(
            results = test_effect,
            fit_01 = fit_01,
            fit_02 = fit_02
        )
    },
    df = t110_ivis_long
)
names(results_spline_models) <- apply(
    X = combination_treatments,
    MARGIN = 2,
    function(x) paste0(x[1], "_", x[2])
)
tables_only_spline <- lapply(
    results_spline_models, 
    function(x) x$results[2, ]
) %>%
    dplyr::bind_rows(., .id = "comparison") %>%
    dplyr::select(comparison, p.value, statistic)

Results

We can now inspect the results. The table below shows the comparison, p-value and the test statistic.

tables_only_spline
#> # A tibble: 6 × 3
#>   comparison  p.value statistic
#>   <chr>         <dbl>     <dbl>
#> 1 ctrl_e2    0.0532       5.87 
#> 2 ctrl_p4    0.701        0.710
#> 3 ctrl_e2p4  0.946        0.111
#> 4 e2_p4      0.0278       7.16 
#> 5 e2_e2p4    0.000409    15.6  
#> 6 p4_e2p4    0.582        1.08

In this case, under a significance level of 0.05, we can reject the null hypothesis for E2 vs E2P4 and E2 vs P4. The results for E2 vs CTRL show a p-value close to 0.053.

Visualisation of spline fit to average measurements

We can also visualize the spline fit using the function plot_summary_with_splines.

biogrowleR::plot_summary_with_splines(
    t110_ivis_long, 
    knots = c(22), 
    degree = 1
) + 
    ggplot2::theme_bw(base_size = 20)

This is a ggplot2 object, so after the plot, it can be modified as the user pleases.