Data-science

Cluster Analysis for categorical data

What to do with categorical data? Categorical data can be challenging to analyze quantitatively. In language research we often have data that are purely categorical. In today’s presentation we will deal with a specific type of categorical data found in questionnaires.
Questionnaire Data: Questionnaires are frequently used in a variety of language research scenarios. They often ask people to rate something (Likert scales) or select the most appropriate response. Example: select the language that is most appropriate to use in a given domain. Example: rate level of agreement with several statements.
Research Question: How do the questionnaire respondents relate to each other based on their responses?
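One way to approach the research question is to measure how often two respondents gave different answers and then cluster on that distance. The following is a minimal base R sketch under stated assumptions: the respondents, questions, and answers are invented for illustration, and simple matching distance with average-linkage hierarchical clustering is one possible choice, not necessarily the method used in the presentation.

```r
# Hypothetical questionnaire: 4 respondents, 5 questions,
# each answered with language "A" or "B" (invented data).
responses <- matrix(
  c("A", "A", "A", "B", "A",
    "A", "A", "A", "A", "A",
    "B", "B", "B", "B", "A",
    "B", "B", "B", "B", "B"),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("r1", "r2", "r3", "r4"), paste0("q", 1:5))
)

# Simple matching distance: share of questions answered differently.
n <- nrow(responses)
d <- matrix(0, n, n, dimnames = list(rownames(responses), rownames(responses)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(responses[i, ] != responses[j, ])
  }
}

# Hierarchical clustering on the distance matrix, cut into two groups.
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Respondents who agree on most questions end up in the same group; `plot(hc)` would draw the corresponding dendrogram.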

Correspondence Analysis

What to do with categorical data? Categorical data can be challenging to analyze quantitatively. In language research we often have data that are purely categorical. In today’s presentation we will deal with a specific type of categorical data found in questionnaires.
Questionnaire Data: Questionnaires are frequently used in a variety of language research scenarios. They often ask people to rate something (Likert scales) or select the most appropriate response. There are often multiple questions that have the same answer scale (same choices). Example: select the language that is most appropriate to use in a given domain. Example: rate level of agreement with several statements.
Research Question: Often we are interested in how the questions relate to each other in terms of their answers, as well as how the answers relate to each other based on the questions they were most used with.
Correspondence Analysis: CA is a statistical technique that shows how the questions and answers of multiple questions relate to each other. It requires the data to have the same scale (all questions must have the same possible answers). CA is a descriptive tool and doesn’t give p-values per se.
How does CA work?
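As a rough illustration of the computation behind CA, the sketch below runs a simple correspondence analysis in base R through a singular value decomposition of the standardized residuals of a question-by-answer count table. The counts, domain names, and language labels are invented for illustration, and the presentation itself may rely on a dedicated package rather than this hand-rolled version.

```r
# Hypothetical counts: rows are questions (domains), columns are answers.
counts <- matrix(
  c(20,  5,  3,
     4, 18,  6,
     6,  7, 15,
    15,  8,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    question = c("home", "school", "market", "church"),
    answer   = c("LangA", "LangB", "LangC")
  )
)

P        <- counts / sum(counts)      # correspondence matrix (proportions)
row_mass <- rowSums(P)                # row masses
col_mass <- colSums(P)                # column masses
E        <- outer(row_mass, col_mass) # expected proportions under independence
S        <- (P - E) / sqrt(E)         # standardized residuals

dec <- svd(S)

# Principal coordinates: rescale singular vectors by masses and singular values.
row_coords <- sweep(dec$u, 1, sqrt(row_mass), "/") %*% diag(dec$d)
col_coords <- sweep(dec$v, 1, sqrt(col_mass), "/") %*% diag(dec$d)
rownames(row_coords) <- rownames(counts)
rownames(col_coords) <- colnames(counts)

# Share of total inertia carried by each dimension.
inertia_share <- dec$d^2 / sum(dec$d^2)

print(round(row_coords[, 1:2], 3))
print(round(col_coords[, 1:2], 3))
print(round(inertia_share, 3))
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual CA map, where questions and answers that go together sit close to each other. The total inertia, `sum(dec$d^2)`, equals the chi-squared statistic of the table divided by the total count, which is why CA is descriptive rather than a test in itself.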

Data Science

Advancing linguistic research techniques through quantitative methods in R

Introduction to R Markdown

This document comes from a UH-Mānoa data science group for linguists presentation. This is a top-level section.
R Markdown: This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. When you click the Knit button, a document will be generated that includes both content and the output of any embedded R code chunks within the document.

Intro to Bayesian Linear Mixed Effect Models with rstanarm

Overview of Presentation: Bayesian vs. frequentist inference; Bayes’ Theorem; an example of Bayesian inference; Bayesian LMERs with rstanarm (how to code models, selecting priors, displaying and interpreting the posterior distribution, model diagnostics and comparisons).
Bayesian vs. Frequentist Inference
Frequentist Inference: Uses only the data and compares it to an idealized model to make inferences about the data. Example Problem: You lose your cellphone in your house. You have a friend call it and you listen for the sound to find it.
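To make the Bayes’ Theorem item in the overview concrete, here is a small base R sketch that reuses the lost-cellphone scenario. All rooms and probabilities are made-up assumptions for illustration: a prior belief about where the phone usually ends up is combined with how likely the heard ring is for each room, giving a posterior over rooms.

```r
# Prior belief about where the phone is, before anyone calls it
# (hypothetical rooms and made-up probabilities).
prior <- c(bedroom = 0.5, kitchen = 0.3, living_room = 0.2)

# Likelihood: probability of hearing the ring the way we heard it,
# given that the phone is in each room (also made up).
likelihood <- c(bedroom = 0.2, kitchen = 0.7, living_room = 0.4)

# Bayes' Theorem: posterior is proportional to prior times likelihood.
unnormalized <- prior * likelihood
posterior    <- unnormalized / sum(unnormalized)

print(round(posterior, 3))
```

With these numbers the bedroom starts out as the most likely room, but the sound shifts most of the belief to the kitchen. A purely data-driven search would use only the sound and ignore the prior.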

Intro to Git

What is Git? Git is a distributed version control system. It is used to allow multiple people to collaborate on the same code, and it is also useful for managing your own code and larger projects.
Applications for language researchers? Coauthoring papers; working on R scripts or other code; sharing code with others (via GitHub or Bitbucket); managing large projects like a dissertation, thesis, or article.
What does Git do?

Tidy Data

What is tidy data? Tidy data have the following characteristics: observations are in rows; variables are in columns; contained in a single dataset.
An example of tidy data:
Participant  Gender  Trial  Value
01           M       1      100
01           M       2      210
02           F       1      50
02           F       2      75
An example of messy data:
Participant  Trial1  Trial2
M01          100     210
F02          50      75
R Packages for tidy data:
library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.
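The messy example above can be reshaped into the tidy one with a few lines of R. This is a minimal sketch, assuming the `tidyr` package (part of the tidyverse) is installed and that the first character of the messy `Participant` code encodes gender; it is one possible route, not necessarily the one taken in the presentation.

```r
library(tidyr)

# The messy table from the example: one column per trial,
# gender fused into the participant code.
messy <- data.frame(
  Participant = c("M01", "F02"),
  Trial1 = c(100, 50),
  Trial2 = c(210, 75)
)

# One row per observation: stack the trial columns into Trial / Value.
tidy <- pivot_longer(
  messy,
  cols = c(Trial1, Trial2),
  names_to = "Trial",
  names_prefix = "Trial",
  values_to = "Value"
)

# One column per variable: split gender out of the participant code.
tidy$Gender      <- substr(tidy$Participant, 1, 1)
tidy$Participant <- substr(tidy$Participant, 2, 3)
tidy$Trial       <- as.integer(tidy$Trial)
tidy <- tidy[, c("Participant", "Gender", "Trial", "Value")]

print(tidy)
```

Each row now holds a single observation, and each of the four variables has its own column, matching the tidy example.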