
Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)

UC Berkeley

Import Appropriate Packages

import spacy

# download spaCy's small English pipeline (skip if already installed)
spacy.cli.download("en_core_web_sm")
Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
# make a path for the outputs
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis.gensim_models

Load Data

nlp = spacy.load("en_core_web_sm")
sou = pd.read_csv("data/SOTU.csv")

LDA

def preprocess_text(text):
    doc = nlp(text)
    # keep lowercased lemmas, dropping stop words, punctuation,
    # whitespace, and lemmas of three characters or fewer
    return [token.lemma_.lower() for token in doc
            if not token.is_stop and not token.is_punct
            and not token.is_space and len(token.lemma_) > 3]
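The filtering rules above can be illustrated without spaCy. The sketch below is a toy stand-in, not the spaCy pipeline: the stop-word set is a hypothetical stub, and `isalpha()` crudely approximates the `is_punct`/`is_space` checks.

```python
# Toy re-implementation of the filtering logic in preprocess_text.
# STOP_WORDS is an illustrative stub, not spaCy's real stop-word list.
STOP_WORDS = {"the", "of", "and", "to", "a", "is", "are"}

def toy_preprocess(tokens):
    kept = []
    for tok in tokens:
        tok = tok.lower().strip()
        if not tok or tok in STOP_WORDS:   # drop empties and stop words
            continue
        if not tok.isalpha():              # crude stand-in for is_punct/is_space
            continue
        if len(tok) <= 3:                  # mirrors len(token.lemma_) > 3
            continue
        kept.append(tok)
    return kept

print(toy_preprocess("The state of the Union is strong .".split()))
# → ['state', 'union', 'strong']
```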
# Process all texts - note this takes ~ 5 minutes to run
processed_docs = sou['Text'].apply(preprocess_text)
# Build dictionary
dictionary = Dictionary(processed_docs) 
dictionary.filter_extremes(no_below=5, no_above=0.5) # Filter rare/common words 
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
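`doc2bow` turns each document into a sparse bag-of-words: a list of `(token_id, count)` pairs, with tokens absent from the dictionary silently dropped. A minimal sketch of that format, using hypothetical token ids rather than a real gensim `Dictionary`:

```python
from collections import Counter

# Illustrative id assignments; a real gensim Dictionary builds these itself.
token2id = {"union": 0, "state": 1, "congress": 2}

def toy_doc2bow(doc):
    # count only in-vocabulary tokens, mirroring Dictionary.doc2bow
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

print(toy_doc2bow(["state", "union", "state", "treaty"]))
# → [(0, 1), (1, 2)]  ("treaty" is out of vocabulary, so it is dropped)
```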
# train LDA model with 18 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=18, random_state=42, passes=10)
# print the top 10 words for each topic
print("\n--- LDA Topics ---") 
for idx, topic in lda_model.print_topics(-1): 
    print(f"Topic: {idx} \nWords: {topic}\n")

--- LDA Topics ---
Topic: 0 
Words: 0.008*"canal" + 0.005*"tariff" + 0.004*"panama" + 0.004*"statute" + 0.004*"company" + 0.004*"method" + 0.004*"convention" + 0.003*"board" + 0.003*"cent" + 0.003*"china"

Topic: 1 
Words: 0.003*"mexico" + 0.001*"texas" + 0.001*"mexican" + 0.001*"convention" + 0.001*"americans" + 0.001*"minister" + 0.001*"program" + 0.001*"article" + 0.001*"cent" + 0.001*"loan"

Topic: 2 
Words: 0.006*"method" + 0.005*"board" + 0.005*"agricultural" + 0.005*"farmer" + 0.005*"cent" + 0.004*"farm" + 0.004*"project" + 0.004*"veteran" + 0.004*"depression" + 0.004*"committee"

Topic: 3 
Words: 0.004*"cent" + 0.004*"gold" + 0.004*"silver" + 0.003*"indian" + 0.003*"june" + 0.003*"bond" + 0.003*"method" + 0.003*"island" + 0.002*"conference" + 0.002*"tariff"

Topic: 4 
Words: 0.019*"spain" + 0.009*"article" + 0.007*"minister" + 0.006*"likewise" + 0.005*"manufacture" + 0.005*"port" + 0.005*"tribe" + 0.005*"intercourse" + 0.004*"presume" + 0.004*"colony"

Topic: 5 
Words: 0.009*"tariff" + 0.008*"corporation" + 0.007*"evil" + 0.006*"cable" + 0.005*"company" + 0.004*"industrial" + 0.004*"instance" + 0.003*"indian" + 0.003*"canal" + 0.003*"earnestly"

Topic: 6 
Words: 0.030*"thank" + 0.011*"tonight" + 0.010*"border" + 0.008*"u.s.a." + 0.008*"americans" + 0.007*"illegal" + 0.007*"drug" + 0.007*"incredible" + 0.006*"love" + 0.006*"criminal"

Topic: 7 
Words: 0.014*"americans" + 0.011*"tonight" + 0.007*"program" + 0.006*"budget" + 0.006*"today" + 0.006*"percent" + 0.005*"billion" + 0.005*"worker" + 0.005*"thank" + 0.005*"challenge"

Topic: 8 
Words: 0.001*"indians" + 0.001*"mexico" + 0.001*"june" + 0.001*"island" + 0.001*"import" + 0.001*"program" + 0.001*"indian" + 0.001*"convention" + 0.001*"article" + 0.001*"british"

Topic: 9 
Words: 0.012*"democracy" + 0.008*"task" + 0.005*"thought" + 0.005*"modern" + 0.004*"impossible" + 0.004*"undertake" + 0.004*"railway" + 0.003*"billion" + 0.003*"recovery" + 0.003*"railroad"

Topic: 10 
Words: 0.006*"isthmus" + 0.006*"slavery" + 0.005*"kansas" + 0.005*"june" + 0.004*"whilst" + 0.004*"majority" + 0.004*"convention" + 0.004*"1857" + 0.004*"route" + 0.003*"july"

Topic: 11 
Words: 0.002*"americans" + 0.001*"program" + 0.001*"budget" + 0.001*"tonight" + 0.001*"mexico" + 0.001*"challenge" + 0.001*"today" + 0.001*"billion" + 0.001*"cent" + 0.001*"indian"

Topic: 12 
Words: 0.023*"mexico" + 0.009*"texas" + 0.007*"mexican" + 0.005*"article" + 0.004*"convention" + 0.004*"minister" + 0.004*"california" + 0.003*"port" + 0.003*"deem" + 0.003*"oregon"

Topic: 13 
Words: 0.007*"friendship" + 0.006*"useful" + 0.005*"acquisition" + 0.004*"wrong" + 0.004*"tribe" + 0.003*"execution" + 0.003*"disposition" + 0.003*"contemplate" + 0.003*"blessing" + 0.003*"neutral"

Topic: 14 
Words: 0.008*"currency" + 0.005*"specie" + 0.005*"gold" + 0.004*"mail" + 0.004*"june" + 0.004*"convention" + 0.004*"herewith" + 0.003*"bond" + 0.003*"commissioner" + 0.003*"cuba"

Topic: 15 
Words: 0.019*"program" + 0.007*"billion" + 0.006*"soviet" + 0.006*"budget" + 0.005*"area" + 0.005*"major" + 0.004*"today" + 0.004*"level" + 0.004*"farm" + 0.004*"inflation"

Topic: 16 
Words: 0.004*"minister" + 0.004*"british" + 0.003*"intercourse" + 0.003*"france" + 0.003*"convention" + 0.003*"article" + 0.002*"spain" + 0.002*"port" + 0.002*"deem" + 0.002*"indians"

Topic: 17 
Words: 0.008*"indians" + 0.006*"gentleman" + 0.006*"resolution" + 0.006*"subscription" + 0.005*"satisfaction" + 0.005*"naturally" + 0.005*"announce" + 0.005*"pleasing" + 0.004*"pursuant" + 0.004*"sale"

# print the topic distribution for the first speech
first_speech_topics = lda_model.get_document_topics(corpus[0])
print(first_speech_topics)
[(7, 0.99942523)]
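Note that `get_document_topics` only returns topics above a probability cutoff (its `minimum_probability` argument), which is why a single pair appears here. Picking the dominant topic from such a distribution is a one-liner; the values below are illustrative, not model output:

```python
# A gensim document-topic distribution is a list of (topic_id, prob) pairs.
def dominant_topic(dist):
    return max(dist, key=lambda pair: pair[1])

example_dist = [(7, 0.9994), (15, 0.0004)]  # illustrative values
topic_id, prob = dominant_topic(example_dist)
print(f"Topic {topic_id}: {prob:.2%}")
# → Topic 7: 99.94%
```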

The first speech is dominated by topic 7, at 99.94%. Its top words ("americans", "tonight", "program", "budget") relate to national identity and contemporary politics.

# make an interactive visualization using pyLDAvis
pyLDAvis.enable_notebook()
lda_intertopic_map = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

pyLDAvis.save_html(lda_intertopic_map, str(OUTPUT_DIR / "lda_intertopic_distance_map.html"))
lda_intertopic_map

BERTopic

docs = sou['Text'].to_list()
# train the model - this takes about 30 seconds
# remove stop words from the topics
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(min_topic_size=3)

topics, probs = topic_model.fit_transform(docs)

topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
# output the top 10 words for each topic
topic_model.get_topic_info()
# output the topic distribution for the first speech
topic_dist = topic_model.approximate_distribution(docs)
topic_dist_fig = topic_model.visualize_distribution(topic_dist[0][0])

topic_dist_fig.write_html(str(OUTPUT_DIR / "topic_visualization.html"))
topic_dist_fig.show()

The most prominent topics in the first speech's BERTopic distribution appear to center on America and national government.

# run this cell to visualize the topics
intertopic_dist_fig = topic_model.visualize_topics()
intertopic_dist_fig.write_html(str(OUTPUT_DIR / "intertopic_distance_map.html"))
intertopic_dist_fig.show()