Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)
Authors
Affiliations
UC Berkeley
Resources:
LDA:
https://medium.com/sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06
https://www.kaggle.com/code/faressayah/text-analysis-topic-modeling-with-spacy-gensim#📚-Topic-Modeling (code for the previous post)
https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf/
BERTopic:
Import Appropriate Packages¶
import spacy
spacy.cli.download("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
# make a path for the outputs
from pathlib import Path
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis.gensim_models
Load Data¶
nlp = spacy.load("en_core_web_sm")
sou = pd.read_csv("data/SOTU.csv")
LDA¶
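The token filtering performed in preprocess_text below can be previewed on a short sentence. This sketch uses a blank English pipeline (no model download) and lowercased surface forms instead of lemmas, so it only approximates the lemma-based filter applied to the speeches:

```python
import spacy

# Blank pipeline: tokenization plus lexical attributes (is_stop, is_punct),
# but no lemmatizer, so we filter on t.text.lower() rather than token.lemma_
nlp = spacy.blank("en")
doc = nlp("The Union is strong, and the economy grows.")
tokens = [t.text.lower() for t in doc
          if not t.is_stop and not t.is_punct and not t.is_space and len(t.text) > 3]
# Stop words ("the", "is", "and"), punctuation, and short tokens are dropped
```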
def preprocess_text(text):
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc
            if not token.is_stop and not token.is_punct and not token.is_space
            and len(token.lemma_) > 3]

# Process all texts - note this takes ~5 minutes to run
processed_docs = sou['Text'].apply(preprocess_text)

# Build dictionary
dictionary = Dictionary(processed_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5) # Filter rare/common words
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# train LDA model with 18 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=18, random_state=42, passes=10)

# print the top 10 words for each topic
print("\n--- LDA Topics ---")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx}\nWords: {topic}\n")
--- LDA Topics ---
Topic: 0
Words: 0.008*"canal" + 0.005*"tariff" + 0.004*"panama" + 0.004*"statute" + 0.004*"company" + 0.004*"method" + 0.004*"convention" + 0.003*"board" + 0.003*"cent" + 0.003*"china"
Topic: 1
Words: 0.003*"mexico" + 0.001*"texas" + 0.001*"mexican" + 0.001*"convention" + 0.001*"americans" + 0.001*"minister" + 0.001*"program" + 0.001*"article" + 0.001*"cent" + 0.001*"loan"
Topic: 2
Words: 0.006*"method" + 0.005*"board" + 0.005*"agricultural" + 0.005*"farmer" + 0.005*"cent" + 0.004*"farm" + 0.004*"project" + 0.004*"veteran" + 0.004*"depression" + 0.004*"committee"
Topic: 3
Words: 0.004*"cent" + 0.004*"gold" + 0.004*"silver" + 0.003*"indian" + 0.003*"june" + 0.003*"bond" + 0.003*"method" + 0.003*"island" + 0.002*"conference" + 0.002*"tariff"
Topic: 4
Words: 0.019*"spain" + 0.009*"article" + 0.007*"minister" + 0.006*"likewise" + 0.005*"manufacture" + 0.005*"port" + 0.005*"tribe" + 0.005*"intercourse" + 0.004*"presume" + 0.004*"colony"
Topic: 5
Words: 0.009*"tariff" + 0.008*"corporation" + 0.007*"evil" + 0.006*"cable" + 0.005*"company" + 0.004*"industrial" + 0.004*"instance" + 0.003*"indian" + 0.003*"canal" + 0.003*"earnestly"
Topic: 6
Words: 0.030*"thank" + 0.011*"tonight" + 0.010*"border" + 0.008*"u.s.a." + 0.008*"americans" + 0.007*"illegal" + 0.007*"drug" + 0.007*"incredible" + 0.006*"love" + 0.006*"criminal"
Topic: 7
Words: 0.014*"americans" + 0.011*"tonight" + 0.007*"program" + 0.006*"budget" + 0.006*"today" + 0.006*"percent" + 0.005*"billion" + 0.005*"worker" + 0.005*"thank" + 0.005*"challenge"
Topic: 8
Words: 0.001*"indians" + 0.001*"mexico" + 0.001*"june" + 0.001*"island" + 0.001*"import" + 0.001*"program" + 0.001*"indian" + 0.001*"convention" + 0.001*"article" + 0.001*"british"
Topic: 9
Words: 0.012*"democracy" + 0.008*"task" + 0.005*"thought" + 0.005*"modern" + 0.004*"impossible" + 0.004*"undertake" + 0.004*"railway" + 0.003*"billion" + 0.003*"recovery" + 0.003*"railroad"
Topic: 10
Words: 0.006*"isthmus" + 0.006*"slavery" + 0.005*"kansas" + 0.005*"june" + 0.004*"whilst" + 0.004*"majority" + 0.004*"convention" + 0.004*"1857" + 0.004*"route" + 0.003*"july"
Topic: 11
Words: 0.002*"americans" + 0.001*"program" + 0.001*"budget" + 0.001*"tonight" + 0.001*"mexico" + 0.001*"challenge" + 0.001*"today" + 0.001*"billion" + 0.001*"cent" + 0.001*"indian"
Topic: 12
Words: 0.023*"mexico" + 0.009*"texas" + 0.007*"mexican" + 0.005*"article" + 0.004*"convention" + 0.004*"minister" + 0.004*"california" + 0.003*"port" + 0.003*"deem" + 0.003*"oregon"
Topic: 13
Words: 0.007*"friendship" + 0.006*"useful" + 0.005*"acquisition" + 0.004*"wrong" + 0.004*"tribe" + 0.003*"execution" + 0.003*"disposition" + 0.003*"contemplate" + 0.003*"blessing" + 0.003*"neutral"
Topic: 14
Words: 0.008*"currency" + 0.005*"specie" + 0.005*"gold" + 0.004*"mail" + 0.004*"june" + 0.004*"convention" + 0.004*"herewith" + 0.003*"bond" + 0.003*"commissioner" + 0.003*"cuba"
Topic: 15
Words: 0.019*"program" + 0.007*"billion" + 0.006*"soviet" + 0.006*"budget" + 0.005*"area" + 0.005*"major" + 0.004*"today" + 0.004*"level" + 0.004*"farm" + 0.004*"inflation"
Topic: 16
Words: 0.004*"minister" + 0.004*"british" + 0.003*"intercourse" + 0.003*"france" + 0.003*"convention" + 0.003*"article" + 0.002*"spain" + 0.002*"port" + 0.002*"deem" + 0.002*"indians"
Topic: 17
Words: 0.008*"indians" + 0.006*"gentleman" + 0.006*"resolution" + 0.006*"subscription" + 0.005*"satisfaction" + 0.005*"naturally" + 0.005*"announce" + 0.005*"pleasing" + 0.004*"pursuant" + 0.004*"sale"
# print the topic distribution for the first speech
first_speech_topics = lda_model.get_document_topics(corpus[0])
print(first_speech_topics)

[(7, 0.99942523)]

The most prominent topic is topic 7, with a weight of 99.94%. Its top words (americans, tonight, program, budget, today) relate to national identity and domestic policy.
# make an interactive visualization using pyLDAvis
pyLDAvis.enable_notebook()
lda_intertopic_map = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(lda_intertopic_map, str(OUTPUT_DIR / "lda_intertopic_distance_map.html"))
lda_intertopic_map
BERTopic¶
docs = sou['Text'].to_list()

# train the model - this takes about 30 seconds
# remove stop words from the topics
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
# output the top 10 words for each topic
topic_model.get_topic_info()
# output the topic distribution for the first speech
topic_dist = topic_model.approximate_distribution(docs)
topic_dist_fig = topic_model.visualize_distribution(topic_dist[0][0])
topic_dist_fig.write_html(str(OUTPUT_DIR / "topic_visualization.html"))
topic_dist_fig.show()
For the first speech, the most prominent topics are strongly centered on America.
# run this cell to visualize the topics
intertopic_dist_fig = topic_model.visualize_topics()
intertopic_dist_fig.write_html(str(OUTPUT_DIR / "intertopic_distance_map.html"))
intertopic_dist_fig.show()