What is tidy data?
Tidy data have the following characteristics:
- Observations are in rows
- Variables are in columns
- Contained in a single dataset
An example of tidy data
Participant | Gender | Trial | Value |
---|---|---|---|
01 | M | 1 | 100 |
01 | M | 2 | 210 |
02 | F | 1 | 50 |
02 | F | 2 | 75 |
An example of messy data
Participant | Trial1 | Trial2 |
---|---|---|
M01 | 100 | 210 |
F02 | 50 | 75 |
R Packages for tidy data
library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Workflow for data wrangling
- Import data
- Change from wide data to long
- Split columns
- Fix typos or irregulaties
- Create new variables as needed
Import messy data
data.messy <- read_csv("messy_data.csv")
## Parsed with column specification:
## cols(
## participant = col_character(),
## vowel = col_character(),
## word = col_character(),
## f2_trial1 = col_integer(),
## f2_trial2 = col_integer(),
## f2_trial3 = col_integer(),
## f2_trial4 = col_integer(),
## f1_trial1 = col_integer(),
## f1_trial2 = col_integer(),
## f1_trial3 = col_integer(),
## f1_trial4 = col_integer()
## )
Find the problems
participant | vowel | word | f2_trial1 | f2_trial2 | f2_trial3 |
---|---|---|---|---|---|
f.29 | a | apple | 1736 | 1742 | 1730 |
f.30 | a | apple | 1766 | 1783 | 1744 |
m.27 | a | apple | 1780 | 1803 | 1763 |
m.18 | a | apple | 1701 | 1707 | 1687 |
m.29 | a | apple | 1693 | 1697 | 1689 |
f.34 | a | apple | 1768 | 1779 | 1741 |
f.19 | a | apple | 1758 | 1776 | 1755 |
Explore the data more
f2_trial4 | f1_trial1 | f1_trial2 | f1_trial3 | f1_trial4 |
---|---|---|---|---|
1737 | 700 | 693 | 713 | 684 |
1768 | 710 | 713 | 747 | 694 |
1780 | 682 | 685 | 708 | 656 |
1704 | 715 | 714 | 735 | 701 |
1696 | 667 | 684 | 706 | 657 |
1773 | 683 | 679 | 693 | 659 |
1760 | 758 | 748 | 777 | 734 |
What makes that dataset messy?
Why is that data messy? How does it violate the tidy data rules?
Problem 1: The columns contain observations
The first problem is the f2_trial1, f2_trial2, etc., represent observations instead of variables. We want to convert all of these columns to just three columns: trial (1 or 2), measurement type (F1 or F2), and value.
To do this we use gather()
from the package tidyr
, which converts from a wide format to long format.
data.long <- gather(data.messy,
key=measurement,
value=value,
-participant,-vowel,-word)
Long data
participant | vowel | word | measurement | value |
---|---|---|---|---|
f.29 | a | apple | f2_trial1 | 1736 |
f.19 | u | boot | f2_trial1 | 737 |
f.29 | a | apple | f2_trial2 | 1742 |
m.30 | u | boot | f2_trial2 | 642 |
f.21 | a | apple | f2_trial3 | 1636 |
f.48 | u | boot | f1_trial2 | 279 |
f.34 | u | Hoop | f1_trial4 | 224 |
Problem 2: Two variables in one column
The next problem that we have to deal with is that two of our columns have two variables in them. Participant has both gender and the participant number. Measurment has both the trial and the measurement type (F1 or F2). To separate the columns we use separate()
from tidyr
.
Separating columns
# separate measurement
data.long <- separate(data.long,
measurement,
c("measurement","trial"),
sep="_")
# separate participant (note: for . need regex \.
# but need to escape the \ with a \, so \\. )
# convert=T allows the new columns to have different data types
data.long <- separate(data.long,
participant,
c("gender","participant"),
sep="\\.",
convert=T)
Almost tidy data
gender | participant | vowel | word | measurement | trial | value |
---|---|---|---|---|---|---|
f | 29 | a | apple | f2 | trial1 | 1736 |
m | 27 | u | hoop | f2 | trial1 | 731 |
f | 30 | u | hoop | f2 | trial2 | 614 |
f | 37 | u | boot | f2 | trial3 | 390 |
m | 18 | u | boot | f2 | trial4 | 588 |
f | 37 | u | hoop | f1 | trial1 | 238 |
f | 21 | a | apple | f1 | trial4 | 727 |
Problem 3: Irregular spellings
When entering data by hand, there are sometime irregularities in spelling, such a variation in capitalizations, typos, or extra spaces.
To check for irregulaties, we convert all of our string variables to factors and then check what levels of the factors exist.
data.long$gender <- as.factor(data.long$gender)
data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
data.long$vowel <- as.factor(data.long$vowel)
data.long$measurement <- as.factor(data.long$measurement)
data.long$trial <- as.factor(data.long$trial)
Checking factors
To check the levels of each factor, we can use the summary()
command. What problems in spelling are there?
summary(data.long[,1:6])
## gender participant vowel word measurement trial
## f:256 29 : 96 a:192 apple:112 f1:224 trial1:112
## m:192 30 : 64 A: 32 at :112 f2:224 trial2:112
## 18 : 32 u:224 boot :104 trial3:112
## 19 : 32 Boot : 8 trial4:112
## 21 : 32 hoop :104
## 27 : 32 Hoop : 8
## (Other):160
Using stringr to fix strings
Now that we have identified the errors, we can fix them with str_replace()
from stringr
.
library(stringr)
data.long$word <- str_replace(data.long$word,"Hoop","hoop")
data.long$word <- str_replace(data.long$word,"Boot","boot")
data.long$vowel <- str_replace(data.long$vowel,"A","a")
Clean data
Let’s check to see if there are other errors.
data.long$gender <- as.factor(data.long$gender)
data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
data.long$vowel <- as.factor(data.long$vowel)
data.long$measurement <- as.factor(data.long$measurement)
data.long$trial <- as.factor(data.long$trial)
summary(data.long[,1:6])
## gender participant vowel word measurement trial
## f:256 29 : 96 a:224 apple:112 f1:224 trial1:112
## m:192 30 : 64 u:224 at :112 f2:224 trial2:112
## 18 : 32 boot :112 trial3:112
## 19 : 32 hoop :112 trial4:112
## 21 : 32
## 27 : 32
## (Other):160
Exploring dplyr functions
Now that we have clean data we can process and explore our data with the dplyr
functions filter()
, arrange()
, select()
, summarise()
, group_by()
, and mutate()
. We will also use the pipe %>%
from the magrittr
package to make using these (and other) functions easier.
The pipe
The pipe %>%
allows us to nest code very easily is read as then.
For example, let’s say we want to select only a few columns of a dataframe, then select row values that are greater than a certain value, then mean for each variable by participant.
The pseudo-code would like this:
new.df <- old.df %>% select columns that we want %>%
filter the row values that we want %>%
group by participant %>%
calculate the mean for each variable
Selecting columns with select()
The function select()
from dplyr
is used to select certain columns from a dataframe. Columns can selected individually, excluded individually by using -
before the name, or selected as a range with :
, as in first_column:last_column.
Let’s select the columns participant, vowel, measurement, trial, and value.
data.selected <- data.long %>%
select(participant, vowel, measurement, trial, value)
# or can select those by excluding the other columns with -
data.selected2 <- data.long %>%
select(-gender,-word)
Arranging values
The arrange()
function from dplyr
is in a sort function. You can use it to sort by one column or by multiple columns. It can also sort in descending order by using desc()
arround the column name.
Let’s sort the data.selected
by participant, then vowel, then descending order of trial.
data.sorted <- data.selected %>%
arrange(participant, vowel, desc(trial))
Filtering rows
The function filter()
from dplyr
allows us to subset the data by selecting rows based on whatever logical rule we want.
For this example let’s keep on the rows where the value is greater than or equal to 500 and the participant is not 19.
data.filtered <- data.sorted %>%
filter(participant != "19" & value >= 500)
Logical values
The logical values that can be used are
== | equal to |
!= | not equal to |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal too |
& | AND |
| | OR |
Adding new columns
The mutate()
function of dplyr
allows us to add columns to our dataframe based on calculations run on other columns. It is a 1:1 calculation. As an example let’s say we want to calculate the z-score for the values we measured. To do this we have to first use a helper function called group_by()
. This function groups the data in groups first before doing other calculations. Let’s calculate the z-score for each value grouped by participant and measurement type (F1 or F2).
data.zscore <- data.filtered %>%
group_by(participant, measurement) %>%
mutate(z.score = (value - mean(value))/sd(value))
Z score
participant | vowel | measurement | trial | value | z.score |
---|---|---|---|---|---|
18 | a | f2 | trial4 | 1704 | 0.9467597 |
18 | a | f2 | trial4 | 1734 | 0.9975243 |
18 | a | f1 | trial4 | 701 | -1.2665945 |
18 | a | f1 | trial4 | 731 | -0.2533189 |
18 | a | f2 | trial3 | 1687 | 0.9179931 |
18 | a | f2 | trial3 | 1707 | 0.9518362 |
Summarizing data
The summarise()
function from dplyr
also creates new columns; however, it is not a 1:1 function. The function instead creates a reduced output that summarises the data. It also only returns as many columns as is necessary to display the results.
An example application would be to show the means, standard deviations, or similar statistics.
data.global.means <- data.filtered %>%
summarise(global.means = mean(value))
data.global.means
## # A tibble: 1 x 1
## global.means
## <dbl>
## 1 1051
Grouped central tendencies
data.grouped.ct <- data.filtered %>%
group_by(participant, vowel, measurement) %>%
summarise(means = mean(value), sds = sd(value),
medians = median(value))
head(data.grouped.ct)
## # A tibble: 6 x 6
## # Groups: participant, vowel [4]
## participant vowel measurement means sds medians
## <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 18 a f1 738 29.6 733
## 2 18 a f2 1716 23.0 1707
## 3 18 u f2 573 17.7 578
## 4 21 a f1 752 18.5 747
## 5 21 a f2 1677 23.5 1674
## 6 21 u f2 744 16.7 745