To get the most out of the functions in this package, the data should be in a specific format. This short tutorial shows how the data should be structured.
Reading the data
To read the data into R, it is recommended to use one of the functions below from the command line, instead of RStudio’s “Import Dataset” add-in. Depending on the format of your data, use read.csv, readxl::read_excel or read.table. The first is used for csv files, the second for Excel (xls and xlsx) files and the last for tab-delimited files.
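As a minimal sketch of the reading step, the example below writes a small table to temporary files and reads it back. The file paths and the `example` data frame are hypothetical and exist only for illustration; replace them with the path to your own file.

```r
# Hypothetical example data, written to temporary files for illustration only
example <- data.frame(days = 1:3, sample_1 = c(0.9, 1.0, 1.1))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(example, csv_path, row.names = FALSE)
write.table(example, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated file
from_csv <- read.csv(csv_path)

# Tab-delimited file: state the header row and the separator explicitly
from_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")

# Excel file (requires the readxl package; "my_data.xlsx" is a placeholder)
# from_excel <- readxl::read_excel("my_data.xlsx")

str(from_csv)
```

Calling str() right after reading is a quick way to confirm that each column was read with the expected type.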
The wide format
This is the most common format for tables. Each column corresponds to a sample and each row holds the values measured at one time point; in the example below, the different rows correspond to different days. This format is very useful if you want to compare the values of two samples, calculate correlations, run statistical tests or make plots.
For example, consider the following dataset:
df <- data.frame(
  days = 1:10,
  sample_1 = rnorm(10, mean = 1, sd = 0.1),
  sample_2 = rnorm(10, mean = 2, sd = 0.1),
  sample_3 = rnorm(10, mean = 3, sd = 0.1)
)
head(df)
#>   days  sample_1 sample_2 sample_3
#> 1    1 0.8599956 1.944630 3.046815
#> 2    2 1.0255317 2.062898 3.036295
#> 3    3 0.7562736 2.206502 2.869546
#> 4    4 0.9994429 1.836901 3.073778
#> 5    5 1.0621553 2.051243 3.188850
#> 6    6 1.1148412 1.813699 2.990255
If we simply want a scatter plot, it is as easy as passing the two columns to ggplot as the x and y parameters.
ggplot2::ggplot(df, ggplot2::aes(x = sample_1, y = sample_2)) +
  ggplot2::geom_point()
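The comparisons mentioned above (correlations and statistical tests between two samples) can also be run directly on the wide table. The sketch below rebuilds a dataset like the one above; the fixed seed and the choice of a Welch t-test are assumptions made for illustration, not something this package prescribes.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

df <- data.frame(
  days = 1:10,
  sample_1 = rnorm(10, mean = 1, sd = 0.1),
  sample_2 = rnorm(10, mean = 2, sd = 0.1),
  sample_3 = rnorm(10, mean = 3, sd = 0.1)
)

# Pearson correlation between two samples (each sample is one column)
r <- cor(df$sample_1, df$sample_2)

# Welch two-sample t-test comparing the means of the same two samples
tt <- t.test(df$sample_1, df$sample_2)

r
tt$p.value
```

Because each sample lives in its own column, both calls only need the two column vectors.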
The wide format is convenient for this kind of pairwise comparison. However, to use the functions in this package and to perform statistical modeling, the data needs to be in the long format.
The long format
In the long format, each row corresponds to a single measurement of a single sample. This makes the dataset longer, i.e., it has more rows: every combination of sample column and day becomes its own row. To convert from wide format to long format, the function pivot_longer from tidyr can be used.
df_long <- tidyr::pivot_longer(
  df,
  cols = c("sample_1", "sample_2", "sample_3"),
  names_to = "sample",
  values_to = "value"
)
# this displays the first 6 rows of the data frame
head(df_long)
#> # A tibble: 6 × 3
#>    days sample   value
#>   <int> <chr>    <dbl>
#> 1     1 sample_1 0.860
#> 2     1 sample_2 1.94
#> 3     1 sample_3 3.05
#> 4     2 sample_1 1.03
#> 5     2 sample_2 2.06
#> 6     2 sample_3 3.04
Comparing the two tables, we see that there are only 3 columns now (days, sample and value) and that each row corresponds to a specific sample. Note that we need to specify which columns to convert to the long format, in this case the samples. Each row then holds the value of one sample on a specific day.
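A quick way to check the conversion is to compare the dimensions of the two tables. The sketch below is self-contained; the fixed seed and the selection `cols = -days` (every column except days) are choices made here for illustration and are equivalent to listing the three sample columns by name.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

df <- data.frame(
  days = 1:10,
  sample_1 = rnorm(10, mean = 1, sd = 0.1),
  sample_2 = rnorm(10, mean = 2, sd = 0.1),
  sample_3 = rnorm(10, mean = 3, sd = 0.1)
)

# Pivot every column except `days` into sample/value pairs
df_long <- tidyr::pivot_longer(
  df,
  cols = -days,
  names_to = "sample",
  values_to = "value"
)

dim(df)       # 10 rows, 4 columns
dim(df_long)  # 30 rows, 3 columns: 10 days x 3 samples
```

The number of rows in the long table is the number of days times the number of pivoted columns, which is a useful sanity check on real data.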
Data in this format can now be used to plot curves very easily: just specify the days column as your x axis and value as your y axis. Moreover, to compare the samples, they can be grouped in ggplot2.
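The plot described above can be sketched as follows. Mapping sample to both colour and group is one possible way to separate the curves and is an assumption of this example, as is the fixed seed.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

df <- data.frame(
  days = 1:10,
  sample_1 = rnorm(10, mean = 1, sd = 0.1),
  sample_2 = rnorm(10, mean = 2, sd = 0.1),
  sample_3 = rnorm(10, mean = 3, sd = 0.1)
)

df_long <- tidyr::pivot_longer(
  df,
  cols = c("sample_1", "sample_2", "sample_3"),
  names_to = "sample",
  values_to = "value"
)

# One curve per sample: days on the x axis, value on the y axis
p <- ggplot2::ggplot(
  df_long,
  ggplot2::aes(x = days, y = value, colour = sample, group = sample)
) +
  ggplot2::geom_line() +
  ggplot2::geom_point()

p
```

With the data in long format, adding another sample to the plot only requires more rows in the table, not another layer in the plotting code.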