Tidy Data

Mar 7, 2017 8 min read R

What is tidy data?

Tidy data have the following characteristics:

Observations are in rows
Variables are in columns
Contained in a single dataset

An example of tidy data

Participant	Gender	Trial	Value
01	M	1	100
01	M	2	210
02	F	1	50
02	F	2	75

An example of messy data

Participant	Trial1	Trial2
M01	100	210
F02	50	75

R Packages for tidy data

library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Workflow for data wrangling

Import data
Change from wide data to long
Split columns
Fix typos or irregulaties
Create new variables as needed

Import messy data

data.messy <- read_csv("messy_data.csv")
## Parsed with column specification:
## cols(
##   participant = col_character(),
##   vowel = col_character(),
##   word = col_character(),
##   f2_trial1 = col_integer(),
##   f2_trial2 = col_integer(),
##   f2_trial3 = col_integer(),
##   f2_trial4 = col_integer(),
##   f1_trial1 = col_integer(),
##   f1_trial2 = col_integer(),
##   f1_trial3 = col_integer(),
##   f1_trial4 = col_integer()
## )

Find the problems

participant	vowel	word	f2_trial1	f2_trial2	f2_trial3
f.29	a	apple	1736	1742	1730
f.30	a	apple	1766	1783	1744
m.27	a	apple	1780	1803	1763
m.18	a	apple	1701	1707	1687
m.29	a	apple	1693	1697	1689
f.34	a	apple	1768	1779	1741
f.19	a	apple	1758	1776	1755

Explore the data more

f2_trial4	f1_trial1	f1_trial2	f1_trial3	f1_trial4
1737	700	693	713	684
1768	710	713	747	694
1780	682	685	708	656
1704	715	714	735	701
1696	667	684	706	657
1773	683	679	693	659
1760	758	748	777	734

What makes that dataset messy?

Why is that data messy? How does it violate the tidy data rules?

Problem 1: The columns contain observations

The first problem is the f2_trial1, f2_trial2, etc., represent observations instead of variables. We want to convert all of these columns to just three columns: trial (1 or 2), measurement type (F1 or F2), and value.

To do this we use gather() from the package tidyr, which converts from a wide format to long format.

data.long <- gather(data.messy,
                  key=measurement,
                  value=value,
                  -participant,-vowel,-word)

Long data

participant	vowel	word	measurement	value
f.29	a	apple	f2_trial1	1736
f.19	u	boot	f2_trial1	737
f.29	a	apple	f2_trial2	1742
m.30	u	boot	f2_trial2	642
f.21	a	apple	f2_trial3	1636
f.48	u	boot	f1_trial2	279
f.34	u	Hoop	f1_trial4	224

Problem 2: Two variables in one column

The next problem that we have to deal with is that two of our columns have two variables in them. Participant has both gender and the participant number. Measurment has both the trial and the measurement type (F1 or F2). To separate the columns we use separate() from tidyr.

Separating columns

# separate measurement
data.long <- separate(data.long,
                      measurement,
                      c("measurement","trial"),
                      sep="_")
# separate participant (note: for . need regex \. 
# but need to escape the \ with a \, so \\. )
# convert=T allows the new columns to have different data types
data.long <- separate(data.long,
                      participant,
                      c("gender","participant"),
                      sep="\\.",
                      convert=T)

Almost tidy data

gender	participant	vowel	word	measurement	trial	value
f	29	a	apple	f2	trial1	1736
m	27	u	hoop	f2	trial1	731
f	30	u	hoop	f2	trial2	614
f	37	u	boot	f2	trial3	390
m	18	u	boot	f2	trial4	588
f	37	u	hoop	f1	trial1	238
f	21	a	apple	f1	trial4	727

Problem 3: Irregular spellings

When entering data by hand, there are sometime irregularities in spelling, such a variation in capitalizations, typos, or extra spaces.

To check for irregulaties, we convert all of our string variables to factors and then check what levels of the factors exist.

data.long$gender <- as.factor(data.long$gender)
data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
data.long$vowel <- as.factor(data.long$vowel)
data.long$measurement <- as.factor(data.long$measurement)
data.long$trial <- as.factor(data.long$trial)

Checking factors

To check the levels of each factor, we can use the summary() command. What problems in spelling are there?

summary(data.long[,1:6])
##  gender   participant  vowel      word     measurement    trial    
##  f:256   29     : 96   a:192   apple:112   f1:224      trial1:112  
##  m:192   30     : 64   A: 32   at   :112   f2:224      trial2:112  
##          18     : 32   u:224   boot :104               trial3:112  
##          19     : 32           Boot :  8               trial4:112  
##          21     : 32           hoop :104                           
##          27     : 32           Hoop :  8                           
##          (Other):160

Using stringr to fix strings

Now that we have identified the errors, we can fix them with str_replace() from stringr.

library(stringr)
data.long$word <- str_replace(data.long$word,"Hoop","hoop")
data.long$word <- str_replace(data.long$word,"Boot","boot")
data.long$vowel <- str_replace(data.long$vowel,"A","a")

Clean data

Let’s check to see if there are other errors.

data.long$gender <- as.factor(data.long$gender)
data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
data.long$vowel <- as.factor(data.long$vowel)
data.long$measurement <- as.factor(data.long$measurement)
data.long$trial <- as.factor(data.long$trial)
summary(data.long[,1:6])
##  gender   participant  vowel      word     measurement    trial    
##  f:256   29     : 96   a:224   apple:112   f1:224      trial1:112  
##  m:192   30     : 64   u:224   at   :112   f2:224      trial2:112  
##          18     : 32           boot :112               trial3:112  
##          19     : 32           hoop :112               trial4:112  
##          21     : 32                                               
##          27     : 32                                               
##          (Other):160

Exploring dplyr functions

Now that we have clean data we can process and explore our data with the dplyr functions filter(), arrange(), select(), summarise(), group_by(), and mutate(). We will also use the pipe %>% from the magrittr package to make using these (and other) functions easier.

The pipe

The pipe %>% allows us to nest code very easily is read as then.

For example, let’s say we want to select only a few columns of a dataframe, then select row values that are greater than a certain value, then mean for each variable by participant.

The pseudo-code would like this:

new.df <- old.df %>% select columns that we want %>%
  filter the row values that we want %>%
  group by participant %>%
  calculate the mean for each variable

Selecting columns with select()

The function select() from dplyr is used to select certain columns from a dataframe. Columns can selected individually, excluded individually by using - before the name, or selected as a range with :, as in first_column:last_column.

Let’s select the columns participant, vowel, measurement, trial, and value.

data.selected <- data.long %>%
  select(participant, vowel, measurement, trial, value)
# or can select those by excluding the other columns with -
data.selected2 <- data.long %>%
  select(-gender,-word)

Arranging values

The arrange() function from dplyr is in a sort function. You can use it to sort by one column or by multiple columns. It can also sort in descending order by using desc() arround the column name.

Let’s sort the data.selected by participant, then vowel, then descending order of trial.

data.sorted <- data.selected %>% 
  arrange(participant, vowel, desc(trial))

Filtering rows

The function filter() from dplyr allows us to subset the data by selecting rows based on whatever logical rule we want.

For this example let’s keep on the rows where the value is greater than or equal to 500 and the participant is not 19.

data.filtered <- data.sorted %>%
  filter(participant != "19" & value >= 500)

Logical values

The logical values that can be used are

==	equal to
!=	not equal to
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal too
&	AND
\|	OR

Adding new columns

The mutate() function of dplyr allows us to add columns to our dataframe based on calculations run on other columns. It is a 1:1 calculation. As an example let’s say we want to calculate the z-score for the values we measured. To do this we have to first use a helper function called group_by(). This function groups the data in groups first before doing other calculations. Let’s calculate the z-score for each value grouped by participant and measurement type (F1 or F2).

data.zscore <- data.filtered %>%
  group_by(participant, measurement) %>%
  mutate(z.score = (value - mean(value))/sd(value))

Z score

participant	vowel	measurement	trial	value	z.score
18	a	f2	trial4	1704	0.9467597
18	a	f2	trial4	1734	0.9975243
18	a	f1	trial4	701	-1.2665945
18	a	f1	trial4	731	-0.2533189
18	a	f2	trial3	1687	0.9179931
18	a	f2	trial3	1707	0.9518362

Summarizing data

The summarise() function from dplyr also creates new columns; however, it is not a 1:1 function. The function instead creates a reduced output that summarises the data. It also only returns as many columns as is necessary to display the results.

An example application would be to show the means, standard deviations, or similar statistics.

data.global.means <- data.filtered %>%
  summarise(global.means = mean(value))
data.global.means
## # A tibble: 1 x 1
##   global.means
##          <dbl>
## 1         1051

Grouped central tendencies

data.grouped.ct <- data.filtered %>%
  group_by(participant, vowel, measurement) %>%
  summarise(means = mean(value), sds = sd(value), 
            medians = median(value))
head(data.grouped.ct)
## # A tibble: 6 x 6
## # Groups:   participant, vowel [4]
##   participant vowel measurement means   sds medians
##   <fct>       <fct> <fct>       <dbl> <dbl>   <dbl>
## 1 18          a     f1            738  29.6     733
## 2 18          a     f2           1716  23.0    1707
## 3 18          u     f2            573  17.7     578
## 4 21          a     f1            752  18.5     747
## 5 21          a     f2           1677  23.5    1674
## 6 21          u     f2            744  16.7     745

data-science