Introduction to Topic Modelling in R and Python

Workshops for Ukraine

Dr. Christian S. Czymara

10/19/23

Before we begin

  • Open this Colab and run the code for “Install packages”
  • Then, open this Colab and run the code for “Install BERTopic library”

Agenda

  • What are Topic Models and what do you use them for?
  • Pre-processing
  • How do Topic Models work?
  • Advantages and limitations of Topic Models
  • Exercise: Topics in news articles

Basics

Logic of topic models

  • Topic Modeling is a family of algorithms for finding the most important themes (topics) in large text collections (corpora).
  • Little prior knowledge about the content needed
  • What (i.e. which topics) is written about?
  • Which topics are addressed particularly frequently?
  • Do the texts differ systematically in their content (between different persons, newspapers, over time, …)?

But first: Data preparation (preprocessing)

National anthem of Ukraine

  • First of all, the texts have to be broken down into their components (words, punctuation marks, etc.).
  • Example: The national anthem of Ukraine (translated lyrics according to Wikipedia)
Ukraine's glory and freedom have not yet perished,
Still upon us, young brothers, fate shall smile.
Our enemies shall vanish, like dew in the sun.
We too shall rule, brothers, in our country.
  • Let’s assume that each line is a single document

  • Corpus (text collection) with four documents

Vocabulary

  • The texts are divided into their components
library(quanteda)

dok1 <- "Ukraine's glory and freedom have not yet perished,"

dok2 <- "Still upon us, young brothers, fate shall smile."

dok3 <- "Our enemies shall vanish, like dew in the sun."

dok4 <- "We too shall rule, brothers, in our country."

UA_anthem <- c(dok1, dok2, dok3, dok4)

corp_UA_anthem <- corpus(UA_anthem)
toks_UA_anthem <- tokens(corp_UA_anthem)

Vocabulary

toks_UA_anthem
Tokens consisting of 4 documents.
text1 :
[1] "Ukraine's" "glory"     "and"       "freedom"   "have"      "not"      
[7] "yet"       "perished"  ","        

text2 :
 [1] "Still"    "upon"     "us"       ","        "young"    "brothers"
 [7] ","        "fate"     "shall"    "smile"    "."       

text3 :
 [1] "Our"     "enemies" "shall"   "vanish"  ","       "like"    "dew"    
 [8] "in"      "the"     "sun"     "."      

text4 :
 [1] "We"       "too"      "shall"    "rule"     ","        "brothers"
 [7] ","        "in"       "our"      "country"  "."       

Document-Feature-Matrix

  • What we need is a table (matrix)
  • … in which the texts (documents) are in the rows
  • … the words, characters, etc. (features) are in the columns
  • … and the cells count how often each feature occurs in the respective text
dfm_UA_anthem <- dfm(toks_UA_anthem)
dfm_UA_anthem
Document-feature matrix of: 4 documents, 30 features (66.67% sparse) and 0 docvars.
       features
docs    ukraine's glory and freedom have not yet perished , still
  text1         1     1   1       1    1   1   1        1 1     0
  text2         0     0   0       0    0   0   0        0 2     1
  text3         0     0   0       0    0   0   0        0 1     0
  text4         0     0   0       0    0   0   0        0 2     0
[ reached max_nfeat ... 20 more features ]

Almost done

  • Better: Keep only actual words as features (no numbers, punctuation, etc.)
toks_UA_anthem_2 <- tokens(corp_UA_anthem,
                           remove_punct = TRUE,
                           remove_numbers = TRUE,
                           remove_symbols = TRUE,
                           remove_separators = TRUE
                           )
  • Remove stop words (common words without real meaning)
toks_UA_anthem_2 <-  tokens_remove(toks_UA_anthem_2,
                                   stopwords(),
                                   case_insensitive = TRUE
                                   )

Almost done

  • Reduce words to their stem (stemming), e.g. “programming”, “programs”, and “programmed” all become “program”
toks_UA_anthem_2 <- tokens_wordstem(toks_UA_anthem_2)
  • Create DFM
dfm_UA_anthem_2 <- dfm(toks_UA_anthem_2)

dfm_UA_anthem_2
Document-feature matrix of: 4 documents, 20 features (71.25% sparse) and 0 docvars.
       features
docs    ukrain glori freedom yet perish still upon us young brother
  text1      1     1       1   1      1     0    0  0     0       0
  text2      0     0       0   0      0     1    1  1     1       1
  text3      0     0       0   0      0     0    0  0     0       0
  text4      0     0       0   0      0     0    0  0     0       1
[ reached max_nfeat ... 10 more features ]

Topic Models

Assumptions

  • The order of words in a text does not matter, only how often each word occurs (bag-of-words assumption)
  • Mixed membership models: Each text consists of a mixture of different topics (with different proportions)
  • Texts that discuss similar topics use similar words
  • In other words, certain topics contain some words more than others

The algorithm

  • Two steps:
  • Finding out which words occur together
  • Checking how these words are distributed among the texts
  • Unsupervised machine learning: no labeled training data or other prior information is needed

The algorithm

  • Crucial question: How many topics should be found?
  • Iterative process designed to maximize two goals simultaneously (see the toy example after this list):
  • Words that occur together frequently are more likely to belong to the same topic
  • Words in the same document are more likely to belong to the same topic
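  • A minimal sketch of these two outputs, fitting a plain LDA on the tiny anthem DFM from above via the topicmodels package (one of several implementations; the number of topics is chosen arbitrarily, and a real corpus would be far larger):
library(quanteda)
library(topicmodels)

# Toy example: fit an LDA with two topics on the anthem DFM from above
lda_input <- convert(dfm_UA_anthem_2, to = "topicmodels")
lda_fit <- LDA(lda_input, k = 2)

terms(lda_fit, 5)            # which words characterize each topic
posterior(lda_fit)$topics    # how the topics are distributed across the documents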

Topic Models in action

Kling (2016): Topic Modelling Portal

Implementations

Structural Topic Models (STM)

  • Arguably most convenient: stm package for R by Roberts, Stewart, and Tingley (2019)
  • Big advantage of stm: adding document-level properties (e.g., newspaper, time of tweet, gender of respondent)
  • Second step (after topic model):
  • Regression with documents as units of analysis
  • Topic frequencies as dependent variables
  • Text properties as predictors (see ?estimateEffect and the sketch below)
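  • A minimal sketch following the stm vignette, using the gadarian example data shipped with the package (the number of topics and the covariates are chosen for illustration):
library(stm)

# Example data shipped with stm: open-ended survey responses plus metadata
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit an STM with document-level covariates in the prevalence formula
stm_fit <- stm(documents = prepped$documents,
               vocab = prepped$vocab,
               K = 3,
               prevalence = ~ treatment + s(pid_rep),
               data = prepped$meta,
               verbose = FALSE)

# Second step: regression of topic proportions on document properties
eff <- estimateEffect(1:3 ~ treatment + s(pid_rep), stm_fit, metadata = prepped$meta)
summary(eff)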

Biterm Topic Models (BTM)

  • “Biterms” are unordered pairs of words that co-occur in a text
  • BTM models the biterm occurrences in the whole corpus (unlike LDA or STM, which model the word occurrences in a document)
  • Ideally suited for short texts
  • Automatically creates “garbage” topic (less preprocessing needed)
  • BTM package for R by Wijffels (2023); see the sketch below
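  • A minimal sketch on the toy anthem tokens from above (a real application would use many more short texts; k and the iteration settings are purely illustrative):
library(BTM)

# BTM expects a tokenised data frame: one row per token,
# first column the document id, second column the token
tok_list <- as.list(toks_UA_anthem_2)
tokens_df <- data.frame(doc_id = rep(names(tok_list), lengths(tok_list)),
                        token = unlist(tok_list),
                        stringsAsFactors = FALSE)

set.seed(123)
btm_fit <- BTM(tokens_df, k = 2, background = TRUE, iter = 500)

terms(btm_fit, top_n = 5)   # top terms per topic (topic 1 is the background topic)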

Keyword Assisted Topic Models (keyATM)

  • Similar to stm, but allows adding keywords to label topics prior to fitting the model
  • Semisupervised model that combines a small amount of information with a large amount of unlabeled data
  • keyATM package for R by Eshima et al. (2023); see the sketch below
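  • A minimal sketch on the toy anthem DFM from above (the keyword lists and the number of topics are purely illustrative):
library(keyATM)

# Convert the quanteda DFM from above into keyATM's input format
keyatm_docs <- keyATM_read(texts = dfm_UA_anthem_2)

# Keywords that label topics before the model is fitted
keywords <- list(freedom = c("freedom", "glori"),
                 nature = c("dew", "sun"))

keyatm_fit <- keyATM(docs = keyatm_docs,
                     no_keyword_topics = 1,   # additional topics without keywords
                     keywords = keywords,
                     model = "base",
                     options = list(seed = 250))

top_words(keyatm_fit)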

Bidirectional Encoder Representations from Transformers (BERTopic)

  • Uses pre-trained transformer-based language models to create word clusters
  • Based on class-based Term Frequency - Inverse Document Frequency (c-TF-IDF): Which words are typical for topic 1 and not so much for all other topics? (see the formula below)
  • Single membership models (each text belongs to only one topic)
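  • As a rough sketch (following the BERTopic documentation), the c-TF-IDF weight of word x in topic class c is
W_{x,c} = \lVert \mathrm{tf}_{x,c} \rVert \cdot \log\left(1 + \frac{A}{f_x}\right)
  • where tf_{x,c} is the frequency of word x in class c, f_x its frequency across all classes, and A the average number of words per class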

BERTopic

  • Automatically chooses number of topics, including a garbage topic
  • Number of topics can be changed with nr_topics, which merges topics after they have been created
  • Usually no preprocessing is needed, since all parts of a document are used for the document embeddings
  • It might still make sense to remove heavy noise (e.g., HTML tags)
  • At the moment, the BERTopic library by Grootendorst (2023) is only available in Python (not R)
  • Some best practices recommended by the author

BERTopic models

  • BERTopic supports many variants of topic modeling (e.g., guided, hierarchical, or dynamic topic models)
  • Check out the documentation and the GitHub repository to find the right settings for your case

Summing up

Advantages of Topic Models

  • The entire corpus is decomposed into its thematic components
  • Little prior knowledge needed; the models themselves require few decisions (only the number of topics, and sometimes not even that)
  • Often intuitive results
  • Unsupervised: No training data needed, no hand coding
  • Exploration and description of large text data collections

Limitations of Topic Models

  • Results depend on pre-processing decisions (at least for classical approaches)
  • Naming/labeling topics is subjective
  • What to do with meaningless topics?
  • Analysis of all content can be overwhelming
  • Content of topics can be influenced by the number of topics (at least for classical approaches), especially if the number of topics is small
  • If the categories are known in advance, it is better to classify texts into them directly (supervised ML)

Conclusion

  • Topic Models find out which words occur together
  • … and thus which topics are discussed in which texts
  • Very helpful for general overviews, no prior theoretical knowledge necessary
  • For more specific research questions, supervised classification is often the better choice

Exercise

Exercise

  • What topics were discussed in more than 2000 news sources contained in the AG corpus?
  • We will work with Google Colab for this exercise
  • stm and other R packages are in this Colab
  • BERTopic and other approaches in Python are in this Colab