When using this package, data should be in a specific format so the functions can be used to their fullest. In this short tutorial we show how the data should be structured.

Reading the data

To read the data into R, it is recommended to call one of the functions below from the command line (the R console), instead of using RStudio’s “Import Dataset” add-in. Depending on the format of your data, use read.csv, readxl::read_excel or read.table. The first is used for csv files, the second for Excel files (xls and xlsx) and the last for tab-delimited files (with sep = "\t").
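As a minimal sketch of the reading step, the block below writes a small made-up table to temporary files and reads it back. The file contents and the object names (example, csv_path, tsv_path) are hypothetical and only serve the illustration; the readxl call is left as a comment because it needs a real Excel file.

```r
# Hypothetical example data, written to temporary files so the sketch runs on its own
example <- data.frame(days = 1:3, sample_1 = c(0.9, 1.0, 1.1))

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")
write.csv(example, csv_path, row.names = FALSE)
write.table(example, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated file: read.csv assumes a header row and "," as separator
from_csv <- read.csv(csv_path)

# Tab-delimited file: read.table needs the header and the separator spelled out
from_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")

# Excel file (requires the readxl package and a real file path):
# from_xls <- readxl::read_excel("path/to/file.xlsx")

# Inspect column names and types after reading
str(from_csv)
```

Checking the result with str right after reading is a quick way to confirm that numeric columns were not read as text.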

The wide format

This is the most common format for tables. Each column corresponds to a sample and each row holds the values measured for those samples. In the example below, the different rows correspond to different days. This format is very useful if you want to compare the values of two samples, calculate correlations, run statistical tests or make plots.

For example, consider the following dataset:

df <- data.frame(
    days = 1:10,
    sample_1 = rnorm(10, mean = 1, sd = 0.1), 
    sample_2 = rnorm(10, mean = 2, sd = 0.1), 
    sample_3 = rnorm(10, mean = 3, sd = 0.1)
)
head(df)
#>   days  sample_1 sample_2 sample_3
#> 1    1 0.8599956 1.944630 3.046815
#> 2    2 1.0255317 2.062898 3.036295
#> 3    3 0.7562736 2.206502 2.869546
#> 4    4 0.9994429 1.836901 3.073778
#> 5    5 1.0621553 2.051243 3.188850
#> 6    6 1.1148412 1.813699 2.990255

If we simply want a scatter plot, it is as easy as passing the two columns to ggplot as the x and y aesthetics.

ggplot2::ggplot(df, ggplot2::aes(x = sample_1, y = sample_2)) + 
    ggplot2::geom_point()
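Correlations and tests between two samples are just as direct in the wide format, because each sample is its own column. The sketch below rebuilds a two-sample version of the data with a fixed seed (the seed is an assumption added for reproducibility, not part of the example above) and uses base R functions only.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
    days = 1:10,
    sample_1 = rnorm(10, mean = 1, sd = 0.1),
    sample_2 = rnorm(10, mean = 2, sd = 0.1)
)

# Pearson correlation between the two samples
r <- cor(df$sample_1, df$sample_2)
r

# Welch two-sample t-test comparing the means of the two samples
tt <- t.test(df$sample_1, df$sample_2)
tt
```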

The wide format is convenient for quick comparisons like this one. However, to use the functions of this package and to perform statistical modeling, the data needs to be in the long format.

The long format

The long format is different. Now each row corresponds to a single measurement from a single sample. This makes the dataset longer, i.e., it has more rows. The idea is that each combination of sample column and measurement becomes its own row. To convert from wide format to long format, the function pivot_longer from tidyr can be used.

df_long <- tidyr::pivot_longer(
    df,
    cols = c("sample_1", "sample_2", "sample_3"),
    names_to = "sample",
    values_to = "value"
)

# this displays the first 6 rows of the data frame
head(df_long)
#> # A tibble: 6 × 3
#>    days sample   value
#>   <int> <chr>    <dbl>
#> 1     1 sample_1 0.860
#> 2     1 sample_2 1.94 
#> 3     1 sample_3 3.05 
#> 4     2 sample_1 1.03 
#> 5     2 sample_2 2.06 
#> 6     2 sample_3 3.04

Comparing the two tables, we see that there are only three columns now (days, sample and value) instead of one column per sample. In the call to pivot_longer we need to specify the columns that we want to convert to the long format, in this case the samples. Each row then holds one value from one sample on a specific day.
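To make the change in shape concrete, the following sketch compares the dimensions before and after pivoting and then reverses the operation with tidyr::pivot_wider. It rebuilds the example data so it runs on its own; selecting the columns with !days (everything except days) is an alternative to listing the sample columns by name, and df_wide is a name introduced here for illustration.

```r
df <- data.frame(
    days = 1:10,
    sample_1 = rnorm(10, mean = 1, sd = 0.1),
    sample_2 = rnorm(10, mean = 2, sd = 0.1),
    sample_3 = rnorm(10, mean = 3, sd = 0.1)
)

# Pivot every column except days into sample/value pairs
df_long <- tidyr::pivot_longer(
    df,
    cols = !days,
    names_to = "sample",
    values_to = "value"
)

dim(df)       # 10 rows, 4 columns
dim(df_long)  # 30 rows, 3 columns (10 days x 3 samples)

# pivot_wider goes back from long to wide
df_wide <- tidyr::pivot_wider(
    df_long,
    names_from = "sample",
    values_from = "value"
)
dim(df_wide)  # 10 rows, 4 columns again
```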

Data in this format can now be used to plot curves very easily: just specify the days column as your x axis and the value column as your y axis. Moreover, to compare the samples, they can be grouped in ggplot2, for example by mapping the sample column to the color aesthetic.

ggplot2::ggplot(df_long, ggplot2::aes(x = days, y = value, color = sample)) +
    ggplot2::geom_line()
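The long format is also what base R modeling functions expect: one column for the response and one column per explanatory variable. As a hedged illustration (a plain linear model chosen for this sketch, not necessarily the model used by the functions of this package), the block below simulates long-format data directly with a fixed seed and fits value against days and sample.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Simulated long-format data: 10 days for each of 3 samples
df_long <- data.frame(
    days = rep(1:10, times = 3),
    sample = rep(c("sample_1", "sample_2", "sample_3"), each = 10),
    value = rnorm(30, mean = rep(1:3, each = 10), sd = 0.1)
)

# The sample column enters the formula as a single categorical term
fit <- lm(value ~ days + sample, data = df_long)
summary(fit)
```

In the wide format the same model would require reshaping first, because the response values are spread over several columns.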