Correspondence Analysis

What to do with categorical data?

  • Categorical data can be challenging to analyze quantitatively
  • In language research we often have data that are purely categorical
  • In today’s presentation we will deal with a specific type of categorical data found in questionnaires

Questionnaire Data

  • Questionnaires are frequently used in a variety of language research scenarios
  • They often ask people to rate something (Likert scales) or select the most appropriate response
  • There are often multiple questions that have the same answer scale (same choices)
  • Example: select the language that is most appropriate to use in a given domain
  • Example: rate level of agreement with several statements

Research Question

  • Often we are interested in two things: how the questions relate to each other based on the answers they received, and how the answers relate to each other based on the questions they were chosen for

Correspondence Analysis

  • CA is a statistical technique that shows how the questions and answers of multiple questions relate to each other
  • Requires the data to have the same scale (all questions must have the same possible answers)
  • CA is a descriptive tool and doesn’t give p-values per se

How does CA work?

  • CA takes a series of categorical variables and converts them to a contingency table (a count of the number of responses at each level for each variable)
  • It then uses chi-squared distances (how far each cell count is from what independence would predict) to decompose the table into a series of factors (you can specify the number of factors)
  • These factors are designed to be minimally correlated with each other
  • The first factor represents the most variation in the data and the following ones less so
  • It takes a large number of variables and maps them on to a smaller number of variables
  • Very similar to Principal Component Analysis (PCA)
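
The steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code that CA() runs internally: it uses five rows copied from the Step 1 contingency table, and the object names (N, P, S, decomp) are chosen here for illustration only.

```r
# Five rows copied from the Step 1 contingency table (a subset, for illustration)
languages <- c("Chuukese", "English", "Kosraean", "Mokilese",
               "Mortlockese", "Other", "Pingelapese", "Pohnpeian")
N <- rbind(funerals     = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv           = c(1, 206, 3, 1, 0, 8, 0,  82),
           church       = c(2,  43, 8, 6, 6, 5, 6, 225),
           facebook     = c(1, 138, 4, 1, 0, 4, 5, 148),
           us_relatives = c(5, 107, 1, 4, 4, 7, 6, 167))
colnames(N) <- languages

P <- N / sum(N)            # proportions instead of counts
row_mass <- rowSums(P)     # share of all responses in each domain
col_mass <- colSums(P)     # share of all responses for each language

# Standardized residuals: observed minus expected under independence,
# scaled the same way as the terms of a chi-squared statistic
expected <- outer(row_mass, col_mass)
S <- (P - expected) / sqrt(expected)

# The singular value decomposition yields the factors (axes)
decomp <- svd(S)
eigenvalues <- decomp$d^2                          # largest first
pct_explained <- 100 * eigenvalues / sum(eigenvalues)
round(pct_explained, 1)
```

The eigenvalues sum to the chi-squared statistic of the table divided by the total number of responses, which is where the chi-squared connection comes from.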

Example Data

  • For the sample CA, we will be using data from a language attitudes questionnaire used on Pohnpei (PNI) in the Federated States of Micronesia
  • The selected data come from 25 questions where the respondents were asked to select the 1 language (out of 8 possible answers) that is the most important for that specific domain
  • The answers for all 25 questions were the same 8 language choices
domains <- read.csv("domains.csv")

Explore the data

making_friends being_successful good_education happy_relationships getting_money reading writing radio tv accepted_pni talking_teachers talking_villages funerals kamadipw sakau facebook talking_kolonia talking_chief talking_gov good_job friends_school church store talking_neighbors us_relatives
English English English English English English English English English English English English English English Other English English English English English English English English English Other
Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pingelapese Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian
Pohnpeian English English Pohnpeian English English English English English Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian English Pohnpeian Pohnpeian English English Pohnpeian Pohnpeian Pohnpeian Pohnpeian English
Pohnpeian English English English English English English English English Pohnpeian English English Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian Pohnpeian English English English Pohnpeian Pohnpeian Pohnpeian Pohnpeian
English English English English English English English English English Pohnpeian English Pohnpeian Pohnpeian Pohnpeian Pohnpeian English Pohnpeian Pohnpeian Pohnpeian English English Pohnpeian Pohnpeian Pohnpeian English
Pohnpeian English English Pohnpeian English English English English English Pohnpeian English Pohnpeian Pohnpeian Pohnpeian Pohnpeian English Pohnpeian Pohnpeian Pohnpeian English English Pohnpeian Pohnpeian Pohnpeian English

Step 1: Create contingency table

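library(tidyverse)
# Reshape from wide (one column per domain) to long (one row per answer)
domains.gathered <- domains %>% gather(key=domain,value=language)
domains.gathered$domain <- as.factor(domains.gathered$domain)
domains.gathered$language <- as.factor(domains.gathered$language)
# Count how often each language was chosen for each domain
dt <- with(domains.gathered,table(domain,language))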
domain              Chuukese English Kosraean Mokilese Mortlockese Other Pingelapese Pohnpeian
accepted_pni               0      93        2        0           0     4           1       201
being_successful           0     179        1        0           1     3           0       117
church                     2      43        8        6           6     5           6       225
facebook                   1     138        4        1           0     4           5       148
friends_school             1     100        1        0           0     5           1       193
funerals                   2      20        0        2           1     3           2       271
getting_money              0     176        2        1           1     4           1       116
good_education             0     194        0        2           1     5           0        99
good_job                   1     178        2        0           1     7           1       111
happy_relationships        3      91        4        6           7     9           3       178
kamadipw                   2      42        1        2           1     3           2       248
making_friends             0     101        1        0           4     3           2       190
radio                      1     124        6        0           1     2           0       167
reading                    0     175        1        0           2     0           2       121
sakau                      0      23        1        1           1     2           3       270
store                      1      63        2        1           2     4           3       225
talking_chief              0      16        1        0           2     2           3       277
talking_gov                0     134        0        0           1     4           0       162
talking_kolonia            0      48        1        0           0     4           3       245
talking_neighbors          4      70        3        3           5     4           5       207
talking_teachers           0     183        2        0           2     3           0       111
talking_villages           0      22        1        1           1     2           1       273
tv                         1     206        3        1           0     8           0        82
us_relatives               5     107        1        4           4     7           6       167
writing                    0     179        2        0           1     6           2       111
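
The reshape-and-count step can also be sketched in base R, which may help show what the contingency table contains. This is an illustrative alternative, not a replacement for the code above: the data frame toy is a hypothetical name, and it holds only the first three respondents' answers for three of the domains shown under "Explore the data".

```r
# First three respondents, three domains (copied from "Explore the data")
toy <- data.frame(
  making_friends = c("English", "Pohnpeian", "Pohnpeian"),
  reading        = c("English", "Pingelapese", "English"),
  sakau          = c("Other", "Pohnpeian", "Pohnpeian"),
  stringsAsFactors = FALSE
)

# stack() converts wide to long: `values` holds the answers,
# `ind` holds the name of the column (domain) each answer came from
long <- stack(toy)

# Cross-tabulate domain by language to get the contingency table
toy_table <- table(domain = long$ind, language = long$values)
toy_table
```

Each row of the result is a domain and each cell counts how many respondents chose that language for it, so every row sums to the number of respondents.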

Plot raw data
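
One possible way to look at the raw counts before running the CA is a stacked bar chart of row percentages. The sketch below is a stand-in, not the plot from the original slide: it uses base graphics and only five rows copied from the Step 1 table, and the names N and row_pct are illustrative.

```r
# Five rows copied from the Step 1 contingency table (a subset, for illustration)
N <- rbind(funerals     = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv           = c(1, 206, 3, 1, 0, 8, 0,  82),
           church       = c(2,  43, 8, 6, 6, 5, 6, 225),
           facebook     = c(1, 138, 4, 1, 0, 4, 5, 148),
           us_relatives = c(5, 107, 1, 4, 4, 7, 6, 167))
colnames(N) <- c("Chuukese", "English", "Kosraean", "Mokilese",
                 "Mortlockese", "Other", "Pingelapese", "Pohnpeian")

# Percentage of responses per language within each domain
row_pct <- 100 * prop.table(N, margin = 1)
round(row_pct, 1)

# One horizontal bar per domain, split by language
barplot(t(row_pct), horiz = TRUE, las = 1, legend.text = TRUE,
        args.legend = list(x = "topright", cex = 0.6),
        xlab = "% of responses")
```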

Step 2: Run the CA

library(FactoMineR)
library(factoextra)
domains.CA <- CA(dt, graph=F)

Step 3: Visualize the CA (symmetric)

fviz_ca_biplot(domains.CA,repel=T)

Symmetric vs asymmetric plots

  • In the symmetric plot both columns and rows are plotted in the same space
  • The problem is that only row-to-row distances and column-to-column distances are directly interpretable, NOT the distances between rows and columns
  • To interpret the distance between columns and rows, we have to create an asymmetric plot where either the columns are mapped onto row space or the rows are mapped onto column space
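
Why the asymmetric plot makes row-to-column positions meaningful can be sketched in base R. Assuming a three-row subset of the Step 1 table and illustrative object names, the code below checks that in a row-principal map each domain sits at the weighted average of the language points, with weights given by that domain's share of answers per language.

```r
# Three rows copied from the Step 1 contingency table (a subset, for illustration)
N <- rbind(funerals = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv       = c(1, 206, 3, 1, 0, 8, 0,  82),
           church   = c(2,  43, 8, 6, 6, 5, 6, 225))

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- outer(row_mass, col_mass)
decomp <- svd((P - expected) / sqrt(expected))

k <- 1:2  # a 3-row table has at most 2 informative axes

# Rows in principal coordinates, columns in standard coordinates:
# the combination used in a row-principal asymmetric map
row_principal <- (decomp$u[, k] / sqrt(row_mass)) %*% diag(decomp$d[k])
col_standard  <- decomp$v[, k] / sqrt(col_mass)

# Row profiles: each domain's proportion of answers per language
row_profiles <- P / row_mass

# Weighted average of the language points for each domain
barycenters <- row_profiles %*% col_standard

all.equal(unname(row_principal), unname(barycenters))
```

In the symmetric plot both sets of points use principal coordinates, so this averaging relationship no longer holds exactly and cross-set distances lose their direct meaning.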

Asymmetric plot (row space)

fviz_ca_biplot(domains.CA,repel=T,map="rowprincipal")

Asymmetric plot (column space)

fviz_ca_biplot(domains.CA,repel=T,map="colprincipal")

Step 4: How many axes do we keep?

  • The CA created many possible axes, but only some are helpful
  • Each axis has an eigenvalue that tells us how much information it explains (larger eigenvalue, more info)
  • If the data were completely random, we’d expect each axis to explain 1/(nrow(dt)-1), or \(1/(25-1) = 1/24 = 4.2\%\), of the variance for rows and 1/(ncol(dt)-1), or \(1/(8-1) = 1/7 = 14.3\%\), for columns
  • If an axis explains more than the larger of these two values, it should be kept; otherwise it can be ignored (you can still keep it, but it will add little)
  • For us, we look for eigenvalues 14.3% or greater
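
The threshold arithmetic can be written out in R. This small sketch hard-codes the table dimensions from this example (25 domains, 8 languages); the variable names are illustrative.

```r
n_rows <- 25  # domains (rows of the contingency table)
n_cols <- 8   # languages (columns of the contingency table)

# Percentage each axis would explain if the data were random
row_threshold <- 100 / (n_rows - 1)
col_threshold <- 100 / (n_cols - 1)

round(c(rows = row_threshold, columns = col_threshold), 1)
# rows 4.2, columns 14.3

# Keep axes that explain more than the larger of the two
threshold <- max(row_threshold, col_threshold)
```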

Exploring Eigenvalues

fviz_screeplot(domains.CA,addlabels=T) + 
  geom_hline(yintercept=14.3,linetype=2,color="red")

Step 5: Interpreting axes

  • The axes of the CA are not always immediately interpretable
  • To better understand them, we look at which columns and which rows contributed the most to the creation of that axis
  • We only need to explore axis 1, but will show axis 2 for sake of explanation
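
What a contribution measures can be sketched in base R. The code below assumes a five-row subset of the Step 1 table, so its numbers will differ from the full analysis; the object names are illustrative. A column's contribution to an axis is its mass times its squared coordinate on that axis, as a percentage of the axis eigenvalue.

```r
# Five rows copied from the Step 1 contingency table (a subset, for illustration)
N <- rbind(funerals     = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv           = c(1, 206, 3, 1, 0, 8, 0,  82),
           church       = c(2,  43, 8, 6, 6, 5, 6, 225),
           facebook     = c(1, 138, 4, 1, 0, 4, 5, 148),
           us_relatives = c(5, 107, 1, 4, 4, 7, 6, 167))
colnames(N) <- c("Chuukese", "English", "Kosraean", "Mokilese",
                 "Mortlockese", "Other", "Pingelapese", "Pohnpeian")

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- outer(row_mass, col_mass)
decomp <- svd((P - expected) / sqrt(expected))
eigenvalues <- decomp$d^2

# Column principal coordinates on every axis
col_principal <- (decomp$v / sqrt(col_mass)) %*% diag(decomp$d)

# Percentage contribution of each language to axis 1
contrib_axis1 <- 100 * col_mass * col_principal[, 1]^2 / eigenvalues[1]
round(sort(contrib_axis1, decreasing = TRUE), 1)
```

The contributions to one axis always sum to 100%, so with 8 languages a column that contributes more than 100/8 = 12.5% is doing more than an equal share of the work.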

Columns contributing to axis 1

fviz_contrib(domains.CA, choice="col",axes=1)

Rows contributing to axis 1

fviz_contrib(domains.CA, choice="row",axes=1)

Meaning of axis 1

  • English and Pohnpeian contributed strongly to axis 1
  • English and Pohnpeian are also on opposite sides
  • Positive values of axis 1 correlate with more English and negative values with more Pohnpeian

Columns contributing to axis 2

fviz_contrib(domains.CA, choice="col",axes=2)

Rows contributing to axis 2

fviz_contrib(domains.CA, choice="row",axes=2)

Meaning of axis 2

  • Mokilese, Mortlockese, and Chuukese contributed strongly to axis 2
  • They are all on the same side of axis 2
  • Positive values of axis 2 correlate with more values of Mokilese, Mortlockese, and Chuukese

Step 6: Evaluating quality of fit

  • After determining the number of axes to retain and what they mean, we need to see how well each row and column item is represented by the retained number of dimensions
  • The squared cosine (cos2) is a measure of the quality of representation
  • cos2 values range from 0 to 1 and values close to 1 indicate a good representation and 0 a very bad representation
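
The cos2 calculation can be sketched in base R. As before, this assumes a five-row subset of the Step 1 table and illustrative object names, so the values will not match the full analysis. For each row, cos2 on an axis is the squared coordinate on that axis divided by the sum of squared coordinates over all axes.

```r
# Five rows copied from the Step 1 contingency table (a subset, for illustration)
N <- rbind(funerals     = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv           = c(1, 206, 3, 1, 0, 8, 0,  82),
           church       = c(2,  43, 8, 6, 6, 5, 6, 225),
           facebook     = c(1, 138, 4, 1, 0, 4, 5, 148),
           us_relatives = c(5, 107, 1, 4, 4, 7, 6, 167))

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- outer(row_mass, col_mass)
decomp <- svd((P - expected) / sqrt(expected))

# Row principal coordinates on every axis
row_principal <- (decomp$u / sqrt(row_mass)) %*% diag(decomp$d)

# Share of each row's squared distance from the origin captured by each axis
cos2 <- row_principal^2 / rowSums(row_principal^2)
rownames(cos2) <- rownames(N)

# Quality of representation on the first two axes
quality_2d <- rowSums(cos2[, 1:2])
round(quality_2d, 3)
```

A row whose quality is well below 1 has most of its variation on axes that are not shown in the two-dimensional plot.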

Quality of fit for rows

fviz_ca_row(domains.CA, col.row = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

Quality of fit for columns

fviz_ca_col(domains.CA, col.col = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

Quality of fit

  • Rows making_friends and radio have the lowest quality of fit for rows
  • Columns Kosraean and Other have the lowest quality of fit for columns
  • That means that extra axes would better explain them

Step 7: Clustering the results

  • This optional step allows you to cluster the rows together using hierarchical clustering
  • By specifying -1 clusters, the algorithm automatically chooses the number of clusters for you based on the data
domains.CA.cluster <- HCPC(domains.CA,nb.clust=-1,graph=F)
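
The general idea of clustering rows on their CA coordinates can be sketched with base R. This is a simplified stand-in, not a reimplementation of HCPC: it assumes a five-row subset of the Step 1 table, uses hclust() with Ward linkage on the first two axes, and fixes the number of clusters at 2 by hand instead of choosing it automatically.

```r
# Five rows copied from the Step 1 contingency table (a subset, for illustration)
N <- rbind(funerals     = c(2,  20, 0, 2, 1, 3, 2, 271),
           tv           = c(1, 206, 3, 1, 0, 8, 0,  82),
           church       = c(2,  43, 8, 6, 6, 5, 6, 225),
           facebook     = c(1, 138, 4, 1, 0, 4, 5, 148),
           us_relatives = c(5, 107, 1, 4, 4, 7, 6, 167))

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- outer(row_mass, col_mass)
decomp <- svd((P - expected) / sqrt(expected))

# Row principal coordinates on the first two axes
row_coords <- (decomp$u[, 1:2] / sqrt(row_mass)) %*% diag(decomp$d[1:2])
rownames(row_coords) <- rownames(N)

# Ward hierarchical clustering on Euclidean distances between domains
tree <- hclust(dist(row_coords), method = "ward.D2")

# Cut the tree into a fixed number of clusters (an assumption for this sketch)
clusters <- cutree(tree, k = 2)
clusters
```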

Visualizing clusters

fviz_cluster(domains.CA.cluster,
             repel = TRUE,            
             show.clust.cent = TRUE, 
             palette = "jco",         
             ggtheme = theme_minimal(),
             main = "Factor map"
             )

Another visualization

fviz_dend(domains.CA.cluster, 
          cex = 0.7,                     
          palette = "jco",               
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "jco",          
          labels_track_height = 0.8      
          )

Step 8: Interpretation of results

  • Overall axis 1 indicates domains with more English selections (positive values) or more Pohnpeian selections (negative values)
  • 5 of the 8 languages occur on the negative side of axis 1 so co-occur often with Pohnpeian domains
  • No other languages are close to English on axis 1
  • Cluster 1 includes domains that have high levels of Pohnpeian and other languages
  • Cluster 2 includes domains with mixed English and Pohnpeian (as well as other languages)
  • Cluster 3 includes domains that are mostly English

Benefit of CA

  • For this data, allows us to see (1) what languages pattern in similar ways, (2) what domains pattern in similar ways, and (3) how languages and domains interact
  • Would be harder to see these patterns without CA

Automating CA

  • The CA analysis can be automated somewhat (doesn’t always work)
  • Just input the CA object and the function can create a sample analysis
  • Outputs an html, Word doc, or pdf file
library(FactoInvestigate)
Investigate(domains.CA,document="html_document")

Now try it yourself

  • Try doing a CA with data(housetasks) from the factoextra package
  • It is already a contingency table, so you can skip the first step
