Cluster Analysis for categorical data

Feb 7, 2018 7 min read R

What to do with categorical data?

Categorical data can be challenging to analyze quantitatively
In language research we often have data that are purely categorical
In today’s presentation we will deal with a specific type of categorical data found in questionnaires

Questionnaire Data

Questionnaires are frequently used in a variety of language research scenarios
They often ask people to rate something (likert scales) or select the most appropriate response
Example: select the language that is most appropriate to use in a given domain
Example: rate level of agreement with several statements

Research Question

How do the questionnaire respondents relate to each other based on their responses?
In other words: what groups of respodents emerge based on their responses?

Cluster Analysis

Cluster analysis is a family of statistical techniques that shows groups of respondents based on their responses
Cluster analysis is a descriptive tool and doesn’t give p-values per se, though there are some helpful diagnostics

Common cluster analyses

k-means clustering
k-medoids clustering or partitioning around medoids (PAM)
DBSCAN (density based clustering)
hierarchical clusters

PAM

This presentation will focus on the use of PAM
Works in general by finding a pre-determined number of groups in the data and iteratively tries to find an ideal solution based on assigning a center point for each cluster and all other data points to the appropriate center point
Is more robust to outliers than k-means since uses medians rather than means

Example Data

For the sample cluster analysis we will be using data from a questionnaire used on Pohnpei
There are 25 questions where the respondents were asked to select 1 language that is the most important for that specific domain
The answers for all 25 questions were the same 8 language choices
301 respondents

domains <- read.csv("domains.csv")

Explore the data

making_friends	being_successful	good_education	happy_relationships	getting_money	reading	writing	radio	tv	accepted_pni	talking_teachers	talking_villages	funerals	kamadipw	sakau	facebook	talking_kolonia	talking_chief	talking_gov	good_job	friends_school	church	store	talking_neighbors	us_relatives
English	English	English	English	English	English	English	English	English	English	English	English	English	English	Other	English	English	English	English	English	English	English	English	English	Other
Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pingelapese	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian
Pohnpeian	English	English	Pohnpeian	English	English	English	English	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	English	Pohnpeian	Pohnpeian	English	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	English
Pohnpeian	English	English	English	English	English	English	English	English	Pohnpeian	English	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	English	English	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian
English	English	English	English	English	English	English	English	English	Pohnpeian	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	English	Pohnpeian	Pohnpeian	Pohnpeian	English	English	Pohnpeian	Pohnpeian	Pohnpeian	English
Pohnpeian	English	English	Pohnpeian	English	English	English	English	English	Pohnpeian	English	Pohnpeian	Pohnpeian	Pohnpeian	Pohnpeian	English	Pohnpeian	Pohnpeian	Pohnpeian	English	English	Pohnpeian	Pohnpeian	Pohnpeian	English

Libraries we’ll use

library(tidyverse)
library(hrbrthemes)
library(cluster)
library(NbClust)
library(factoextra)

Step 1: Create a dissimilarity matrix

In order to cluster respondents, we need to calculate how dissimilar each respondent is from each other respondent
To calculate a dissimilarity matrix, we use the Gower dissimilarity calculation that works for categorical data, instead of the Euclidean method for numerical data
We use the daisy() function from the cluster package
Responses range from 0–1
Values of 1 are completely dissimilar and values of 0 means identical responses

Calculate the dissimilarity matrix

domains.dist <- daisy(domains,metric="gower")

Visualize matrix

gradient.color <- list(low = "steelblue",  high = "white")
fviz_dist(domains.dist, 
   gradient = gradient.color,order=T)

Step 2: Determine number of clusters

Clustering algorithms such as PAM require the number of clusters to be pre-specified
You can determine the number of clusters based on theory or research expectations
Or more commonly through some sort of diagnostic that allows the number of clusters to emerge from the data
For categorical data, one common way is the silhouette method (numerical data have many other possible diagonstics)

Silhouette Method

The silhouette method calculates for a range of cluster sizes how similar values in a particular cluster are to each other versus how similar they are to values outside their cluster
For this method, an ideal arrangement would have values being very similar to other members of its cluster and very dissimilar with those values outside its cluster
The method gives score overall from -1 to 1 for each number of clusters, where 1 means very well clustered and -1 very poorly clustered
The number of clusters with the best score is selected

Calculating silhouette method

number_clusters <- NbClust(diss=domains.dist,distance=NULL,
                          min.nc = 2, max.nc = 10,
                          method = "median",
                          index="silhouette")

## 
##  Only frey, mcclain, cindex, sihouette and dunn can be computed. To compute the other indices, data matrix is needed

number_clusters$Best.nc # best solution

## Number_clusters     Value_Index 
##          2.0000          0.3413

Step 3: Run PAM

Based on the silhouette method, we will use 2 clusters

domains.pam <- pam(domains.dist,2)

Visualizing the PAM

To visual the PAM, we can reduce the complexity of the dissimilarity matrix to 2 dimensions via multidimensional scaling
Then can add color for each cluster

domains.mds <- as.data.frame(cmdscale(domains.dist,2))
domains.mds$domains_cluster <- as.factor(domains.pam$clustering)
ggplot(domains.mds,aes(x=V1,y=V2,color=domains_cluster)) + 
  geom_point() + theme_ipsum() +
  labs(title="MDS plot for domain questions",
       subtitle="Colored by PAM cluster") +
  scale_color_brewer(palette="Set1")

Clusters plotted

Step 4: Interpret the clusters

Now that the clusters are created, we have to evaluate whether or not they tell a meaningful story about the data, since it could just be a random grouping of the data
Need to determine: (a) what pattern(s)/behavior(s) each cluster represents and (b) who is in each group
Remember: Clusters are descriptive/exploratory, rather than a statistical test
For (a) can subset data by cluster and compare how each group answered the different questionnaire questions
For (b) can subset data by cluster, then compare each cluster by known demographic variables

Subsetting

domains$domains_cluster <- domains.mds$domains_cluster
language_domains_social_solidarity <- domains %>%
  dplyr::select(making_friends,happy_relationships,accepted_pni,
                talking_villages,talking_kolonia,talking_neighbors,
                us_relatives,domains_cluster)
names(language_domains_social_solidarity) <-c(
  "Making friends",
  "Feeling happy in your relationships",
  "Being accepted in Pohnpei",
  "Talking with people in the sections of Pohnpei",
  "Talking with people in Kolonia",
  "Talking with your neighbors",
  "Speaking with relatives who live in the US",
  "Domain PAM cluster")
language_domains_social_solidarity_gathered <- language_domains_social_solidarity %>%
  gather(key="domain",value="language",-"Domain PAM cluster")

Plot of some answers by cluster

ggplot(language_domains_social_solidarity_gathered,
       aes(x=domain,fill=language)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Language importance by PAM cluster", 
       subtitle="Social solidarity domains") + 
  xlab("Domain")+ theme(axis.text.x =
                          element_text(angle = 20,hjust=1,size=8),
                        legend.position="bottom",
                        legend.text=element_text(size=10))+ 
  ylab("Number of respondents") + facet_grid(~`Domain PAM cluster`)

Interpretation: what

For (a), based on the plots, you can describe different patterns and trends that occur, such as group 1 has more English selections as well as languages other than Pohnpeian, while group 2 tends toward only Pohnpeian selections

Determining who is in each cluster

To determine who is in each cluster, we import the demographic data and add the cluster information to it
Then plot the clusters by several demograhpic variables

demos <- read.csv("demos.csv")
demos$cluster <- as.factor(domains.mds$domains_cluster)

Age and PAM cluster

ggplot(demos,aes(x=age,fill=cluster)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Age and PAM cluster", 
       subtitle="for language importance") + 
  xlab("Age")+ 
  ylab("Number of respondents")

Gender and PAM cluster

ggplot(demos,aes(x=sex,fill=cluster)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Gender and PAM cluster", 
       subtitle="for language importance") + 
  xlab("Gender")+  
  ylab("Number of respondents")

Gender and Age by PAM cluster

ggplot(demos,aes(x=age,fill=cluster)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Gender and Age by PAM cluster", 
       subtitle="for language importance") + theme(axis.text.x =
                          element_text(angle = 20,hjust=1,size=5),
                        legend.position="bottom",
                        legend.text=element_text(size=8)) +
  xlab("Age")+
  ylab("Number of respondents") + facet_grid(~sex)

Eduation and PAM cluster

ggplot(demos,aes(x=education,fill=cluster)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Education and PAM cluster", 
       subtitle="for language importance") + theme(axis.text.x =
                          element_text(angle = 20,hjust=1,size=5),
                        legend.position="bottom",
                        legend.text=element_text(size=8))+
  xlab("Highest education level")+  
  ylab("Number of respondents")

Current municipality and PAM cluster

ggplot(demos,aes(x=current_muni,fill=cluster)) +
  geom_bar(stat="count",position="dodge",width=0.7)  +
  theme_ipsum(grid="Y") + scale_fill_brewer(palette="Set2") + 
  labs(title="Current municipality and PAM cluster", 
       subtitle="for language importance") + 
  xlab("Current municipality")+  
  ylab("Number of respondents")

Interpretation: who

People aged 18–24 have much greater numbers in cluster 1, ages 25–44 are fairly equally distributed, and ages 45+ have more respondents in cluster 2
Gender is more evenly distributed with somewhat more women in cluster 1
Both women and men aged 18–24 tend to be in cluster 1, but somewhat more women aged 25–34 in cluster 1, and slightly more men in that age group in cluster 2
Those who completed high school tend be in cluster 1 and those who did not are more likely in cluster 2
All municipalities except Sokehs have fairly similar distributions, which tends to be in cluster 1

Conclusion

Cluster analysis can help find emergent patterns in the data
These patterns can be similar to what is found with other statistical models such as regression
But more importantly can help find patterns and global trends across your own coded groups (such as demographic variables) that may be missed by other methods
Can also show more complex (aka non-linear) trends than regression modeling
Cluster analysis can be a very helpful data exploration and analysis technique
Though lacks probabilities of group membership and significance testing

Cluster Analysis for categorical data

What to do with categorical data?

Questionnaire Data

Research Question

Cluster Analysis

Common cluster analyses

PAM

Example Data

Explore the data

Libraries we’ll use

Step 1: Create a dissimilarity matrix

Calculate the dissimilarity matrix

Visualize matrix

Step 2: Determine number of clusters

Silhouette Method

Calculating silhouette method

Step 3: Run PAM

Visualizing the PAM

Clusters plotted

Step 4: Interpret the clusters

Subsetting

Plot of some answers by cluster

Interpretation: what

Determining who is in each cluster

Age and PAM cluster

Gender and PAM cluster

Gender and Age by PAM cluster

Eduation and PAM cluster

Current municipality and PAM cluster

Interpretation: who

Conclusion

Further reading

Related