Quantitative Text Analysis Using R

Kenneth Benoit (LSE)

16–18 June 2021

This three-day workshop will cover natural language processing and quantitative text analysis using the R statistical environment.  The main tool will be the quanteda package, which we developed as a comprehensive, flexible, and open framework for powerful and scalable natural language processing and quantitative analysis of textual data.

The workshop is intended for students with basic R experience who are willing and able to learn quickly, and for intermediate R users interested in text analysis or in learning how to use quanteda and R for this purpose.  Additional tutorial materials will be available after the workshop so that participants can continue learning.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. It includes fast tools for manipulating the texts in a corpus and for performing the most common natural language processing tasks, such as tokenizing, stemming, or forming n-grams. Its functions for tokenizing texts and for forming multiple tokenized documents into a document-feature matrix are both fast and simple to use. quanteda can also segment texts by words, paragraphs, sentences, or even user-supplied delimiters and tags.
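As a rough illustration of the corpus, tokens, and document-feature matrix workflow described above, the following minimal sketch uses quanteda on two toy documents. The texts, the document names, and the "party" document variable are invented for this example and are not workshop materials.

```r
library(quanteda)

# Two toy documents (invented for illustration)
txt <- c(doc1 = "Taxes should be lower.",
         doc2 = "Public spending should be higher.")

# Build a corpus with one hypothetical document-level variable, "party"
corp <- corpus(txt, docvars = data.frame(party = c("A", "B")))

# Tokenize, dropping punctuation, then lowercase the tokens
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Stem the tokens and, separately, form bigrams from the unstemmed tokens
toks_stem <- tokens_wordstem(toks)
toks_bi <- tokens_ngrams(toks, n = 2)

# Form the document-feature matrix; document variables carry over from the corpus
dfmat <- dfm(toks)

print(dfmat)
print(docvars(dfmat, "party"))
```

Each step returns a new object (corpus, tokens, dfm), so intermediate results such as toks_stem or toks_bi can be inspected or passed to dfm() in the same way.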

quanteda also offers functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, visually representing text and text analyses, and more.

We will also cover companion R packages for text analysis, namely spacyr for part-of-speech tagging and dependency parsing, and readtext for reading texts into R and converting from common formats.  Additional packages will be used for analysis, such as the stm package for fitting structural topic models.

Topics covered will include:

  • Fundamentals of text analysis
  • Getting started with R and quanteda
  • Working with a corpus
  • Keywords-in-context
  • Importing text files
  • Tokenization
  • Creating a document-feature matrix
  • Dictionary (sentiment) analysis
  • Textual statistics
  • Text scaling
  • Document classification and clustering
  • Topic modelling
  • Word embedding models

Extensive material is already available from:

https://quanteda.io
https://tutorials.quanteda.io
https://quanteda.org

Additional material will be available by May 2020 from the book Quantitative Text Analysis Using R, currently in progress, by Kenneth Benoit (the instructor).

Kenneth Benoit is Professor of Computational Social Science at the London School of Economics and Political Science.